Metadata-Version: 2.4
Name: pytest-self-healer
Version: 0.2.0
Summary: Auto-heal broken Playwright selectors using a local or cloud LLM
Project-URL: Homepage, https://github.com/athrvrne/Self-healing-Playwright-Tests
Project-URL: Issues, https://github.com/athrvrne/Self-healing-Playwright-Tests/issues
License: MIT
License-File: LICENSE
Keywords: llm,ollama,playwright,pytest,self-healing,test-automation
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: httpx>=0.27
Requires-Dist: lxml>=5.0
Requires-Dist: playwright>=1.44
Requires-Dist: pytest-asyncio>=0.23
Requires-Dist: pytest>=7.0
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.28; extra == 'anthropic'
Description-Content-Type: text/markdown

# 🛠 Self-Healing Test Automation Framework

> A Playwright wrapper that uses a **local or cloud LLM** to automatically fix broken CSS selectors — no flaky CI pipelines, no manual triaging.

---

## The Problem

UI changes break test selectors constantly:

```
TimeoutError: page.click: Timeout 30000ms exceeded.
  waiting for selector "#submit-btn"
```

The button still exists — it's just `[data-testid="login-submit"]` now. A human would fix it in 10 seconds. But at 3 AM in CI, it blocks your entire pipeline.

---

## How It Works

```
Test runs selector  →  TimeoutError  →  DOM snapshot captured
        ↓
  DOM compressed (scripts/styles stripped, ~8KB)
        ↓
  Prompt sent to LLM (local Ollama or Anthropic Claude)
        ↓
  LLM returns: { "selector": "#new-id", "confidence": "high" }
        ↓
  New selector validated in Playwright
        ↓                          ↘ invalid? retry with feedback
  Test continues ✅                  ("that selector didn't match — try another")
        ↓
  Result cached to disk, keyed by (url, selector) — reused across runs
```
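
The retry-with-feedback step in the flow above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: `heal`, `ask_llm`, and `matches` are hypothetical names, the LLM and the Playwright validation are replaced by stubs so the example runs without a browser, and `max_attempts` is assumed to count total attempts.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional

# Hypothetical callables for this sketch (not the package's real API):
#   ask_llm(dom, stale_selector, purpose, feedback) -> JSON string from the model
#   matches(selector) -> True if the selector resolves on the page
AskLLM = Callable[[str, str, str, Optional[str]], Awaitable[str]]
Matches = Callable[[str], Awaitable[bool]]


async def heal(dom: str, stale: str, purpose: str,
               ask_llm: AskLLM, matches: Matches,
               max_attempts: int = 2) -> Optional[str]:
    """Return a validated replacement selector, or None if every attempt fails."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        reply = json.loads(await ask_llm(dom, stale, purpose, feedback))
        candidate = reply["selector"]
        if await matches(candidate):
            return candidate
        # Name the failed selector so the next prompt can ask for a different one.
        feedback = f"{candidate!r} did not match any element; try another"
    return None


async def _demo() -> Optional[str]:
    # Stub LLM: the first answer is wrong, the second one is right.
    answers = iter([
        '{"selector": "#nope", "confidence": "low"}',
        '{"selector": "[data-testid=\\"login-submit\\"]", "confidence": "high"}',
    ])

    async def fake_llm(dom, stale, purpose, feedback):
        return next(answers)

    async def fake_matches(selector):
        return selector == '[data-testid="login-submit"]'

    return await heal("<button data-testid='login-submit'>Log in</button>",
                      "#submit-btn", "login submit button",
                      fake_llm, fake_matches)


healed = asyncio.run(_demo())
print(healed)
```

In a real run, `matches` would query the live page and `ask_llm` would call Ollama or Anthropic; both are stubbed here so the control flow is visible in isolation.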

---

## Project Structure

```
pytest-self-healer/
├── src/
│   ├── pytest_self_healer/            # Installable package (pip install pytest-self-healer)
│   │   ├── __init__.py
│   │   ├── plugin.py                  # pytest entry point (fixtures + CLI options)
│   │   ├── healing_engine.py          # Core: LLM clients, DOM compression, healing logic
│   │   └── page_wrapper.py            # SelfHealingPage: drop-in Playwright Page replacement
│   ├── evals/
│   │   ├── selector_evalset.json      # Ground-truth dataset for LLM accuracy benchmarking
│   │   ├── run_eval.py                # Standalone eval runner (scores + saves report)
│   │   └── compare_models.py          # Diff two eval reports side by side
│   ├── tests/
│   │   ├── test_healing_examples.py   # Integration tests with intentionally stale selectors
│   │   ├── test_evalset.py            # pytest integration for the evalset
│   │   ├── test_accuracy.py           # LLM accuracy benchmarks (3 tiers)
│   │   └── test_unit.py               # Unit tests (no browser/LLM required)
│   └── conftest.py                    # pytest fixtures, CLI options, report hook
├── docker/
│   ├── Dockerfile                     # Test runner image (Playwright + Python)
│   └── docker-compose.yml             # Ollama + test runner, health-checked
├── reports/
│   ├── healing_report_<ts>.json       # Per-run healing reports
│   └── evals/
│       └── eval_<provider>_<ts>.json  # Per-run eval reports
├── requirements.txt
├── pytest.ini
└── README.md
```

---

## Quickstart

### Option 1: Unit tests only (no browser or LLM needed)

```bash
pip install -r requirements.txt
playwright install chromium
PYTHONPATH=src pytest src/tests/test_unit.py -v
```

### Option 2: Full integration tests (requires Ollama running locally)

```bash
# Install and start Ollama
brew install ollama        # or: curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:3b

# Run the tests
PYTHONPATH=src pytest src/tests/ -v \
  --ollama-url=http://localhost:11434 \
  --ollama-model=qwen2.5-coder:3b
```

### Option 3: Use Anthropic Claude instead of Ollama

```bash
export ANTHROPIC_API_KEY=sk-ant-...

PYTHONPATH=src pytest src/tests/ -v \
  --llm-provider=anthropic \
  --anthropic-model=claude-haiku-4-5-20251001
```

### Option 4: Docker (everything bundled)

```bash
docker compose -f docker/docker-compose.yml up --build
# Reports land in ./reports/healing_report_<timestamp>.json
```

---

## Writing Your Own Healing Tests

Use the `healing_page` fixture (a `SelfHealingPage`) wherever you would use Playwright's `page`, and add a `purpose` string to every interaction:

```python
# After: pip install pytest-self-healer
# No import needed — healing_page fixture is auto-available

async def test_checkout(healing_page):
    await healing_page.goto("https://myapp.com/checkout")

    # Selector is stale — LLM will find the real one
    await healing_page.click(
        selector="button#old-checkout-id",
        purpose="checkout submit button in the cart summary",
    )

    await healing_page.fill(
        selector="input.card-num",
        value="4242424242424242",
        purpose="credit card number input in payment form",
    )
```

**Tips for better healing:**
- Be specific in `purpose`: *"blue submit button in the login modal"* > *"button"*
- Use `data-testid` attributes in your app for stable baseline selectors
- The LLM favors `data-testid` > `aria-label` > `id` > semantic CSS

**Healing-aware actions:** `click`, `dblclick`, `hover`, `fill`, `type`, `press`,
`check`, `uncheck`, `select_option`, `focus`, `tap`, `set_input_files`,
`text_content`, `inner_text`, `input_value`, `get_attribute`, `is_visible`,
`is_enabled`, `wait_for_selector`, and `drag_and_drop` (which heals both the
source and the target). Any other Playwright `Page` method (`goto`, `keyboard`,
`mouse`, `wait_for_load_state`, …) is transparently delegated to the underlying
page — `SelfHealingPage` is a true drop-in.
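
The passthrough behaviour described above can be illustrated with a small delegation sketch. `DelegatingPage` and `FakePage` are hypothetical stand-ins written for this example, not the package's classes; the point is only that `__getattr__` is consulted when normal attribute lookup fails, so overridden methods keep their healing logic while everything else reaches the wrapped page.

```python
class FakePage:
    """Stand-in for a Playwright Page so the sketch runs without a browser."""

    url = "https://example.test/login"

    def title(self) -> str:
        return "Login"


class DelegatingPage:
    """Hypothetical wrapper: overrides some methods, forwards everything else."""

    def __init__(self, page):
        self._page = page

    def click(self, selector: str, purpose: str = "") -> str:
        # A real wrapper would attempt the click here and heal on timeout.
        return f"click {selector} ({purpose})"

    def __getattr__(self, name: str):
        # Only reached when normal lookup fails, so click() is never shadowed.
        return getattr(self._page, name)


wrapped = DelegatingPage(FakePage())
print(wrapped.title())                    # forwarded to FakePage.title
print(wrapped.url)                        # forwarded attribute access
print(wrapped.click("#go", "go button"))  # handled by the wrapper itself
```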

---

## CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--llm-provider` | `ollama` | `ollama` \| `anthropic` \| `auto` |
| `--ollama-url` | `http://localhost:11434` | Ollama server endpoint |
| `--ollama-model` | `qwen2.5-coder:3b` | Model name (also works with `llama3`, `mistral`) |
| `--anthropic-model` | `claude-haiku-4-5-20251001` | Any Claude model ID |
| `--anthropic-api-key` | `None` | Falls back to `ANTHROPIC_API_KEY` env var |
| `--healing-report-dir` | `reports` | Where to write JSON healing reports |
| `--screenshot-dir` | `reports/screenshots` | Where to write BEFORE/AFTER screenshots |
| `--healing-max-attempts` | `2` | How many times the LLM may retry a heal with feedback |
| `--selector-cache-file` | `reports/selector_cache.json` | Persistent selector cache, reused across runs |
| `--no-selector-cache` | `false` | Disable the persistent cache for this run |
| `--headless` | `true` | Run browser headless |

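As a rough illustration of how the `auto` value of `--llm-provider` could be resolved, the sketch below picks Anthropic when `ANTHROPIC_API_KEY` is set and Ollama otherwise, as described in this README. `resolve_provider` is a hypothetical helper name, and the plugin's real resolution logic may differ.

```python
import os
from typing import Mapping


def resolve_provider(requested: str, env: Mapping[str, str] = os.environ) -> str:
    """Map the --llm-provider value to a concrete backend name."""
    if requested != "auto":
        return requested  # an explicit choice always wins
    # "auto": prefer Claude when a key is available, otherwise use local Ollama.
    return "anthropic" if env.get("ANTHROPIC_API_KEY") else "ollama"


print(resolve_provider("auto", {}))
print(resolve_provider("auto", {"ANTHROPIC_API_KEY": "sk-ant-example"}))
print(resolve_provider("ollama", {"ANTHROPIC_API_KEY": "sk-ant-example"}))
```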
---

## Healing Report

After each run, a JSON report named `healing_report_<timestamp>.json` is written to `reports/` (or to the directory passed via `--healing-report-dir`):

```json
{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {
      "original_selector": "#user-name",
      "element_purpose": "username input field on login form",
      "suggested_selector": "#username",
      "success": true,
      "timestamp": "2024-01-15T10:23:45.123456",
      "model_response_time_ms": 1840.5,
      "dom_size_chars": 4231,
      "provider": "ollama"
    }
  ]
}
```
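
A report in this shape is straightforward to post-process. The sketch below computes a success rate and an average model response time from the fields shown in the example; `summarize` is a hypothetical helper, and the inline sample is trimmed to the fields it reads.

```python
import json

# Trimmed copy of the example report above; a real script would read the file instead.
SAMPLE = """
{
  "total_healings_attempted": 3,
  "successful_healings": 3,
  "failed_healings": 0,
  "attempts": [
    {"original_selector": "#user-name", "suggested_selector": "#username",
     "success": true, "model_response_time_ms": 1840.5, "provider": "ollama"}
  ]
}
"""


def summarize(report: dict) -> dict:
    """Derive a success rate and the mean response time of the listed attempts."""
    total = report["total_healings_attempted"]
    attempts = report["attempts"]
    rate = report["successful_healings"] / total if total else 0.0
    avg_ms = (sum(a["model_response_time_ms"] for a in attempts) / len(attempts)
              if attempts else 0.0)
    return {"success_rate": rate, "avg_response_ms": avg_ms}


summary = summarize(json.loads(SAMPLE))
print(summary)
```

To run it against a real report, replace `json.loads(SAMPLE)` with `json.loads(pathlib.Path(report_path).read_text())`.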

---

## Evalset — Benchmarking LLM Accuracy

The evalset is a structured ground-truth dataset (`src/evals/selector_evalset.json`) used to measure how accurately the LLM finds correct selectors. It is independent of the healing tests — no browser required.

### What's in the evalset

12 cases across 7 categories and 3 difficulty levels:

| Category | Cases | Difficulty |
|----------|-------|------------|
| login | 3 | easy |
| checkout | 2 | medium |
| search | 2 | easy |
| navigation | 1 | easy |
| modal | 2 | medium |
| profile | 1 | hard |
| data-table | 1 | hard |

Each case contains a stale selector, a purpose string, a minimal HTML snippet, and a list of acceptable correct selectors.
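
One plausible way to score a case is to check whether the model's suggestion appears in the list of acceptable selectors. How the eval runner actually compares selectors is not documented here, so the sketch below is an assumption: it treats single and double quotes as equivalent and otherwise requires an exact match. `normalize`, `is_correct`, and `accuracy` are hypothetical names.

```python
def normalize(selector: str) -> str:
    """Trim whitespace and treat single and double quotes as equivalent."""
    return selector.strip().replace('"', "'")


def is_correct(suggested: str, expected_selectors: list[str]) -> bool:
    """True if the suggestion matches any acceptable selector after normalizing."""
    return normalize(suggested) in {normalize(s) for s in expected_selectors}


def accuracy(results: list[bool]) -> float:
    """Percentage of cases answered correctly."""
    return 100.0 * sum(results) / len(results) if results else 0.0


expected = ["[data-testid='login-btn']", "button[type='submit']"]
print(is_correct('[data-testid="login-btn"]', expected))  # quote style differs only
print(is_correct("#old-btn", expected))                   # the stale selector
print(accuracy([True] * 9 + [False] * 3))                 # 9 of 12 cases correct
```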

### Running the evalset

**Standalone runner** (fastest, no pytest overhead):

```bash
# Against local Ollama
PYTHONPATH=src python src/evals/run_eval.py

# Against Anthropic Claude
PYTHONPATH=src python src/evals/run_eval.py \
  --provider anthropic \
  --anthropic-model claude-haiku-4-5-20251001

# Filter to a category or difficulty
PYTHONPATH=src python src/evals/run_eval.py --category login
PYTHONPATH=src python src/evals/run_eval.py --difficulty hard
```

**Via pytest** (integrates with your existing test flags):

```bash
PYTHONPATH=src pytest src/tests/test_evalset.py -v
PYTHONPATH=src pytest src/tests/test_evalset.py -v -k "login"
```

### Comparing two models

Each eval run saves a timestamped report to `reports/evals/`. Use `compare_models.py` to diff two runs:

```bash
# Run against model A
PYTHONPATH=src python src/evals/run_eval.py --ollama-model qwen2.5-coder:3b

# Run against model B
PYTHONPATH=src python src/evals/run_eval.py --ollama-model llama3

# Compare
python src/evals/compare_models.py \
  reports/evals/eval_ollama_20260601_120000.json \
  reports/evals/eval_ollama_20260601_120500.json
```

Output:
```
  Metric                          A          B     Delta
  -------------------------------------------------------
  Accuracy                    75.0%      91.7%    +16.7%
  Avg response (ms)            2340       1820     -520.0
```

### Adding new evalset cases

Open `src/evals/selector_evalset.json` and append to the `cases` array. Each case needs:

```json
{
  "id": "unique-slug",
  "category": "login",
  "difficulty": "easy",
  "stale_selector": "#old-btn",
  "purpose": "login submit button",
  "expected_selectors": ["[data-testid='login-btn']", "button[type='submit']"],
  "html": "<minimal HTML snippet containing the target element>"
}
```

No code changes needed — the runner and pytest integration pick up new cases automatically.
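
Because new cases are picked up automatically, a quick structural check before committing can catch typos. The sketch below is an optional helper, not part of the package: `validate_cases` is a hypothetical name, and the required keys are taken from the example case above.

```python
REQUIRED_KEYS = {
    "id", "category", "difficulty", "stale_selector",
    "purpose", "expected_selectors", "html",
}


def validate_cases(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the cases look fine."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
        case_id = case.get("id")
        if case_id is not None and case_id in seen_ids:
            problems.append(f"case {index}: duplicate id {case_id!r}")
        seen_ids.add(case_id)
        # An empty list of acceptable selectors can never be satisfied.
        if "expected_selectors" in case and not case["expected_selectors"]:
            problems.append(f"case {index}: expected_selectors is empty")
    return problems


good = {
    "id": "unique-slug", "category": "login", "difficulty": "easy",
    "stale_selector": "#old-btn", "purpose": "login submit button",
    "expected_selectors": ["[data-testid='login-btn']"],
    "html": "<button data-testid='login-btn'>Log in</button>",
}
bad = {k: v for k, v in good.items() if k != "html"}
print(validate_cases([good]))
print(validate_cases([good, bad]))
```

In practice the input would be the `cases` array loaded from `src/evals/selector_evalset.json`.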

---

## Architecture Decisions

| Decision | Rationale |
|----------|-----------|
| **Local LLM first (Ollama)** | No API keys, no data leakage, works offline in CI |
| **Anthropic as opt-in cloud backend** | Higher accuracy on complex DOMs; useful when RAM is limited |
| **`auto` provider mode** | Uses Claude if `ANTHROPIC_API_KEY` is set, otherwise Ollama — same command works locally and in CI |
| **DOM compression** | Strips scripts/styles, keeps semantic attrs. Fits in small model context (~8KB) |
| **Persistent, URL-scoped cache** | Keyed by `(url, selector)` so the same selector on different pages never collides; written to disk so a selector healed once is reused across runs, not just within one |
| **Retry with feedback** | If a suggestion fails validation, the next prompt names the failed selector and asks for a different one — meaningfully lifts success rate on small models |
| **Confidence scores** | LLM self-reports certainty; useful for alerting on `low` confidence heals |
| **`purpose` string** | Natural language > brittle heuristics. Tells LLM *why* you want the element |
| **Automatic passthrough** | Non-selector `Page` APIs are delegated via `__getattr__`, so the wrapper never lags behind Playwright's API |
| **Evalset separate from tests** | Ground-truth data lives in JSON, not test code — easy to grow and compare across models |

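The DOM compression row can be made concrete with a small sketch. The package depends on BeautifulSoup and lxml, but to stay dependency-free this example uses the standard library's `html.parser` instead. The attribute allowlist, the 8000-character cap, and the names `DomCompressor` and `compress_dom` are all assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist of "semantic" attributes; the real list may differ.
KEEP_ATTRS = {"id", "class", "name", "type", "role",
              "placeholder", "aria-label", "data-testid"}
DROP_TAGS = {"script", "style"}


class DomCompressor(HTMLParser):
    """Re-emit HTML without scripts, styles, or non-allowlisted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(f' {k}="{escape(v, quote=True)}"'
                       for k, v in attrs if k in KEEP_ATTRS and v is not None)
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.out.append(text)


def compress_dom(html: str, max_chars: int = 8000) -> str:
    """Strip noise, then truncate so the prompt fits a small context window."""
    parser = DomCompressor()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)[:max_chars]


page = ('<div style="margin:0"><script>var a = 1;</script><style>.a{}</style>'
        '<button id="old" data-testid="login-submit" onclick="go()">Log in</button></div>')
compressed = compress_dom(page)
print(compressed)
```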
---

## Extending

- **Swap the LLM**: Pass a different Ollama model such as `--ollama-model=mistral`, or use `--llm-provider=anthropic` for Claude
- **Tune retries**: Raise `--healing-max-attempts` on small/local models, lower it to `1` to fail fast
- **Alert on low confidence**: Check `attempt["confidence"] == "low"` in the report and open a GitHub issue automatically
- **Grow the evalset**: Add cases to `selector_evalset.json` to cover your app's specific UI patterns
- **CI accuracy gate**: Run `run_eval.py` in CI and fail the build if accuracy drops below a threshold
- **Auto-PR on heal**: Use a high-confidence heal as the trigger to open a PR updating the stale selector at its source

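As a starting point for the low-confidence alert idea, the sketch below filters a healing report for successful heals that the model marked as `low` confidence. It assumes each attempt carries a `confidence` field as described in the list above (the example report earlier does not show one), so it reads the field defensively. `low_confidence_heals` is a hypothetical helper name.

```python
def low_confidence_heals(report: dict) -> list[dict]:
    """Return successful attempts whose self-reported confidence is 'low'."""
    return [
        attempt for attempt in report.get("attempts", [])
        if attempt.get("success") and attempt.get("confidence") == "low"
    ]


# Hand-written sample data for illustration only.
sample_report = {
    "attempts": [
        {"original_selector": "#a", "suggested_selector": "#b",
         "success": True, "confidence": "high"},
        {"original_selector": "#c", "suggested_selector": ".d",
         "success": True, "confidence": "low"},
        {"original_selector": "#e", "suggested_selector": "#f",
         "success": False, "confidence": "low"},
    ]
}

flagged = low_confidence_heals(sample_report)
for attempt in flagged:
    print(f"review: {attempt['original_selector']} -> {attempt['suggested_selector']}")
```

A CI step could exit non-zero or open an issue when `flagged` is non-empty.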