Pytifex
Automated differential testing for Python type checkers. Mine real bugs, generate targeted test cases, and establish ground truth through runtime validation.
Pipeline
Pytifex employs a four-phase pipeline (Phases 0–3) to systematically generate test cases that expose type checker bugs.
Phase 0 — Seed Mining
Query closed GitHub issues from python/mypy, facebook/pyrefly, astral-sh/ty, microsoft/pyright, and zubanls/zuban. Extract Python code blocks from confirmed bug reports. Filter for quality: exclude wontfix issues, require ≥50 character code blocks, sample up to 5 examples per repository.
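The quality filter described above can be sketched as a small predicate plus per-repository sampling. This is a minimal sketch, not Pytifex's actual `github_issues.py` API; the field names (`labels`, `code_block`, `repo`) are assumptions:

```python
import random

def is_quality_seed(issue: dict) -> bool:
    """Keep only confirmed bugs with a substantial code block."""
    if "wontfix" in issue.get("labels", []):
        return False
    # Require at least 50 characters of code, per the filter above.
    return len(issue.get("code_block", "")) >= 50

def sample_seeds(issues: list[dict], per_repo: int = 5) -> list[dict]:
    """Sample up to `per_repo` quality examples from each repository."""
    by_repo: dict[str, list[dict]] = {}
    for issue in filter(is_quality_seed, issues):
        by_repo.setdefault(issue["repo"], []).append(issue)
    sampled: list[dict] = []
    for repo_issues in by_repo.values():
        sampled.extend(random.sample(repo_issues, min(per_repo, len(repo_issues))))
    return sampled
```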
Phase 1 — LLM-Based Generation
Feed 3–5 real bug examples to Gemini with explicit mutation instructions. The LLM generates novel code variations targeting similar typing patterns — not copies of the seeds but semantically related programs that may trigger different checker failures. Each example must reference its seed issue for provenance.
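A prompt for this phase might be assembled roughly as follows. The function and seed fields here are hypothetical, not the actual `prompts.py` implementation:

```python
def build_mutation_prompt(seeds: list[dict], pattern: str) -> str:
    """Assemble an LLM prompt from up to 5 seed bugs plus mutation instructions."""
    parts = [
        f"Generate novel Python programs targeting the '{pattern}' typing pattern.",
        "Do NOT copy the seeds; produce semantically related variations.",
        "Tag each generated example with the seed issue it derives from.",
    ]
    for seed in seeds[:5]:
        # Each seed carries its repo and issue number for provenance.
        parts.append(f"# Seed from {seed['repo']}#{seed['number']}:\n{seed['code']}")
    return "\n\n".join(parts)
```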
Phase 2 — Differential Testing & Filtering
Run all four type checkers on each generated example. Keep only examples where checkers disagree — at least one reports ok while another reports error. Non-divergent examples are sent back to the LLM with real checker feedback for up to 2 refinement attempts.
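The divergence filter itself is simple to state: given each checker's normalized verdict, an example survives only if the verdict set mixes "ok" and "error". A minimal sketch (verdict strings are an assumption, not Pytifex's internal representation):

```python
def is_divergent(verdicts: dict[str, str]) -> bool:
    """Disagreement: at least one checker says 'ok' while another says 'error'."""
    outcomes = set(verdicts.values())
    return "ok" in outcomes and "error" in outcomes

def filter_divergent(results: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Keep only the examples on which the checkers disagree."""
    return {name: v for name, v in results.items() if is_divergent(v)}
```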
Phase 3 — Ground Truth Evaluation
A multi-phase oracle determines which checker is correct. Runtime crashes (Phase 1) provide irrefutable proof. AST-based PEP analysis (Phase 0), Hypothesis property testing (Phase 2), and PEP compliance matching (Phase 3) provide additional evidence. Verdicts are aggregated by confidence priority.
Evaluation Phases
The evaluation oracle uses multiple independent phases, all running on every disagreement. Evidence is aggregated by confidence priority:
Phase 0: AST PEP Oracle (confidence 0.85–0.95)
Source-level analysis identifies definitive PEP specification violations independently of any checker. Covers 18 PEPs with rule-based AST checks.
Phase 1: Runtime Crashes
Execute generated code. TypeError, KeyError, or AttributeError at runtime proves a checker that said "OK" is wrong.
Phase 2: Hypothesis Testing (confidence 0.85)
Property-based testing with auto-generated inputs. Discovers type bugs on code paths that single execution misses.
Phase 3: PEP Compliance (confidence 0.80)
Match checker error messages against regex patterns derived from 24 PEP specifications. Determines spec conformance.
Key insight: Runtime behavior is the ultimate ground truth. If code raises TypeError at runtime, any checker that reported "OK" is definitively incorrect: a false negative.
Getting Started
Installation
Prerequisites: Python 3.12+, uv (recommended)
git clone https://github.com/benedekaibas/pytifex-demo.git
cd pytifex-demo/src/tc_disagreement
export GEMINI_API_KEY=your_key # Required
export GITHUB_TOKEN=ghp_your_token # Optional (higher rate limits)
Type checkers (mypy, pyrefly, zuban, ty) are automatically installed by uv when you run the tool.
Usage
# Run the full pipeline: mine → generate → filter → evaluate
uv run main.py
# Generate until 10 disagreements are found
uv run main.py --num-examples 10
# Use a different model
uv run main.py --model gemini-2.5-pro
# Skip GitHub seed fetching (pattern-based only)
uv run main.py --no-github
# Verbose output
uv run main.py -v
| Command | Description |
|---|---|
| uv run main.py | Full pipeline (generate + evaluate) |
| uv run main.py generate | Generate disagreements only |
| uv run main.py check | Run checkers on existing examples |
| uv run main.py eval | Evaluate existing results |
| Option | Default | Description |
|---|---|---|
| --num-examples N | 5 | Target disagreement count |
| --batch-size N | 15 | Examples per LLM batch |
| --max-attempts N | 5 | Max generation attempts |
| --max-refinements N | 2 | Refinement attempts per example |
| --model MODEL | gemini-2.5-flash | Gemini model to use |
| --eval-method | comprehensive | Evaluation method |
| --no-github | — | Skip GitHub seed fetching |
| -v, --verbose | — | Show all examples |
Type Checkers
Pytifex tests these four type checkers spanning the maturity spectrum:
| Checker | Version | Command |
|---|---|---|
| mypy | 1.19.0 | mypy file.py |
| pyrefly | 0.44.2 | pyrefly check file.py |
| zuban | 0.3.0 | zuban check file.py |
| ty | 0.0.1-alpha.32 | ty check file.py |
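Invoking a checker reduces to a subprocess call plus exit-status normalization. The command table below mirrors the one above, but treat the rest as a sketch under stated assumptions rather than Pytifex's actual `run_checkers.py`/`config.py`:

```python
import subprocess

# Commands taken from the table above; everything else is illustrative.
CHECKERS = {
    "mypy": ["mypy"],
    "pyrefly": ["pyrefly", "check"],
    "zuban": ["zuban", "check"],
    "ty": ["ty", "check"],
}

def verdict_from_exit(returncode: int) -> str:
    """Normalize a checker run to 'ok' or 'error' via its exit status."""
    return "ok" if returncode == 0 else "error"

def run_checker(name: str, path: str) -> str:
    proc = subprocess.run(CHECKERS[name] + [path], capture_output=True, text=True)
    return verdict_from_exit(proc.returncode)
```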
Divergence Patterns
The tool targets known areas of type checker disagreement:
| Pattern | Description | PEPs |
|---|---|---|
| protocol-defaults | Protocol methods with different default argument values | 544 |
| typed-dict-total | TypedDict with mixed total/Required/NotRequired inheritance | 589, 655 |
| typeguard-narrowing | TypeGuard/TypeIs with generic type parameters | 647, 742 |
| param-spec-decorator | ParamSpec decorators on classmethods/staticmethods | 612 |
| self-generic | Self type in generic classes with abstract methods | 673 |
| newtype-containers | NewType in containers (covariance/contravariance) | 484 |
| overload-literals | Overloaded functions with Literal type discrimination | 484, 586 |
| final-override | Final attributes overridden by properties | 591 |
| keyword-vs-positional | Protocol callables with keyword-only parameters | 544, 570 |
| bounded-typevars | TypeVar bounds with nested generics | 484 |
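As an illustration of the protocol-defaults pattern, here is a hypothetical example in the style the tool targets (not a generated case from Pytifex itself): a Protocol method declares one default value while the implementing class uses another, and checkers differ on whether PEP 544 structural compatibility still holds.

```python
from typing import Protocol

class Greeter(Protocol):
    def greet(self, name: str = "world") -> str: ...

class LoudGreeter:
    # Same signature, but a different default value than the Protocol declares;
    # checkers may disagree on whether this breaks structural compatibility.
    def greet(self, name: str = "WORLD") -> str:
        return f"HELLO, {name}!"

def welcome(g: Greeter) -> str:
    return g.greet()

print(welcome(LoudGreeter()))
```

At runtime the code works either way, which is exactly why such cases need the ground-truth oracle rather than a majority vote among checkers.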
Architecture
src/tc_disagreement/
├── main.py # CLI entry point
├── pipeline.py # Core generation/filtering pipeline
├── github_issues.py # Mines code from GitHub issues
├── prompts.py # LLM prompt construction
├── patterns.py # Divergence pattern definitions
├── agent.py # Gemini API client
├── run_checkers.py # Type checker execution
├── comprehensive_eval.py # Multi-phase evaluation system
├── oracle.py # AST-based PEP oracle (Phase 0)
├── hypothesis_tier2.py # Hypothesis property testing (Phase 2)
├── source_analysis.py # PEP rule analysis
├── static_tier4.py # Static flow analysis (Phase 4)
├── code_metrics.py # Code complexity metrics
├── config.py # Checker commands and paths
└── generate_json.py # LLM output parsing
Important: All type checker outputs are real — they come from actual tool execution on your system, not LLM simulation. The LLM is used only for generating candidate code and resolving uncertain verdicts.