Pytifex

Automated differential testing for Python type checkers. Mine real bugs, generate targeted test cases, and establish ground truth through runtime validation.

Research tool. Pytifex was developed for a senior comprehensive project. It implements a bug-seeded mutation methodology for proactively finding type checker bugs before users encounter them.

Pipeline

Pytifex employs a four-stage pipeline to systematically generate test cases that expose type checker bugs.

Phase 0 — Seed Mining

Query closed GitHub issues from python/mypy, facebook/pyrefly, astral-sh/ty, microsoft/pyright, and zubanls/zuban. Extract Python code blocks from confirmed bug reports. Filter for quality: exclude wontfix issues, require ≥50 character code blocks, sample up to 5 examples per repository.
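The quality filter above can be sketched as a small function over issue records. The field names, `FENCE` regex, and per-repository cap below are illustrative assumptions, not Pytifex's actual `github_issues.py` API:

```python
# Sketch of the Phase 0 quality filter. Issue dicts mirror the GitHub
# REST API shape loosely; all names here are illustrative assumptions.
import re

MIN_CODE_LEN = 50   # require >= 50-character code blocks
MAX_PER_REPO = 5    # sample up to 5 examples per repository

FENCE = "`" * 3     # literal markdown code fence
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?\n(.*?)" + FENCE, re.DOTALL)

def extract_seeds(issues: list[dict]) -> list[str]:
    """Pull qualifying Python code blocks from closed GitHub bug issues."""
    seeds: list[str] = []
    for issue in issues:
        labels = {lbl["name"] for lbl in issue.get("labels", [])}
        if "wontfix" in labels:          # exclude wontfix issues
            continue
        for block in CODE_BLOCK.findall(issue.get("body", "") or ""):
            if len(block) >= MIN_CODE_LEN:
                seeds.append(block.strip())
        if len(seeds) >= MAX_PER_REPO:   # cap samples per repository
            break
    return seeds[:MAX_PER_REPO]
```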

Phase 1 — LLM-Based Generation

Feed 3–5 real bug examples to Gemini with explicit mutation instructions. The LLM generates novel code variations targeting similar typing patterns — not copies of the seeds but semantically related programs that may trigger different checker failures. Each example must reference its seed issue for provenance.
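A hypothetical sketch of how such a mutation prompt might be assembled from mined seeds; the wording and the `build_mutation_prompt` helper are assumptions, not the actual `prompts.py`:

```python
# Illustrative prompt assembly for Phase 1. The instruction text and the
# seed record shape ("issue_url", "code") are assumptions for this sketch.
def build_mutation_prompt(seeds: list[dict], pattern: str, n: int = 5) -> str:
    """Assemble a mutation prompt from 3-5 mined seed bugs."""
    parts = [
        f"Generate {n} novel Python programs that stress the "
        f"'{pattern}' typing pattern.",
        "Do NOT copy the seeds; mutate the typing constructs so each "
        "program is semantically related but distinct.",
        "Tag each program with the seed issue it derives from.",
    ]
    for seed in seeds:
        # keep provenance: every generated example must cite its seed issue
        parts.append(f"# seed: {seed['issue_url']}\n{seed['code']}")
    return "\n\n".join(parts)
```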

Phase 2 — Differential Testing & Filtering

Run all four type checkers on each generated example. Keep only examples where checkers disagree — at least one reports ok while another reports error. Non-divergent examples are sent back to the LLM with real checker feedback for up to 2 refinement attempts.
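The keep-only-disagreements rule can be sketched as follows, assuming a simplified `checker -> "ok"/"error"` result format (an illustrative reduction of the real checker output):

```python
# Sketch of the Phase 2 divergence filter under a simplified result model.
def is_disagreement(results: dict[str, str]) -> bool:
    """True when at least one checker says ok and another says error."""
    verdicts = set(results.values())
    return "ok" in verdicts and "error" in verdicts

def filter_divergent(all_results: dict[str, dict[str, str]]) -> list[str]:
    """Keep only example IDs on which the checkers disagree."""
    return [ex for ex, res in all_results.items() if is_disagreement(res)]
```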

Phase 3 — Ground Truth Evaluation

A multi-phase oracle determines which checker is correct. Runtime crashes (Phase 1) provide irrefutable proof. AST-based PEP analysis (Phase 0), Hypothesis property testing (Phase 2), and PEP compliance matching (Phase 3) provide additional evidence. Verdicts are aggregated by confidence priority.

Evaluation Phases

The evaluation oracle uses multiple independent phases, all running on every disagreement. Evidence is aggregated by confidence priority:

Phase 0: AST PEP Oracle

Source-level analysis identifies definitive PEP specification violations independently of any checker. Covers 18 PEPs with rule-based AST checks.

Confidence: 0.85–0.95
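A toy illustration of a source-level rule in this spirit, flagging reassignment of a `Final`-annotated module-level name (PEP 591). The rule and structure are illustrative; the real `oracle.py` covers 18 PEPs with many such checks:

```python
# Toy AST-based PEP check: a name annotated Final (PEP 591) must not be
# assigned again. Detecting this needs no type checker at all.
import ast

def final_reassignments(source: str) -> list[str]:
    """Return module-level Final names that are later reassigned."""
    tree = ast.parse(source)
    finals: set[str] = set()
    violations: list[str] = []
    for node in tree.body:
        if isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            ann = ast.unparse(node.annotation)
            if ann == "Final" or ann.startswith("Final["):
                finals.add(node.target.id)   # remember Final-declared names
                continue
        if isinstance(node, ast.Assign):
            for tgt in node.targets:
                if isinstance(tgt, ast.Name) and tgt.id in finals:
                    violations.append(tgt.id)   # definitive PEP 591 violation
    return violations
```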

Phase 1: Runtime Crashes

Execute generated code. TypeError, KeyError, or AttributeError at runtime proves a checker that said "OK" is wrong.

Confidence: 0.95–1.0
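A minimal sketch of such a runtime oracle, assuming each generated example is a self-contained script; the function name and verdict strings are illustrative:

```python
# Sketch of the Phase 1 runtime oracle: run the example in a subprocess
# and treat a type-related crash as evidence against any checker that
# reported "OK". Names and verdict strings are assumptions.
import subprocess
import sys

CRASH_MARKERS = ("TypeError", "KeyError", "AttributeError")

def runtime_verdict(code: str, timeout: float = 10.0) -> str:
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    stderr = proc.stderr.strip()
    last = stderr.splitlines()[-1] if stderr else ""
    if last.startswith(CRASH_MARKERS):
        return "type-crash"   # strong evidence: "ok" checkers were wrong
    return "other-error"
```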

Phase 2: Hypothesis Testing

Property-based testing with auto-generated inputs. Discovers type bugs on code paths that single execution misses.

Confidence: 0.85
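A dependency-free sketch of the property-testing idea (the real phase uses the hypothesis library): probe a function with many generated inputs and record a type-related crash that a single execution could miss:

```python
# Dependency-free sketch of the Phase 2 idea, using random inputs instead
# of the hypothesis library. All names here are illustrative.
import random

def property_probe(fn, make_input, trials: int = 200, seed: int = 0):
    """Return the first input that makes fn raise a type-related error."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = make_input(rng)
        try:
            fn(value)
        except (TypeError, KeyError, AttributeError):
            return value   # counterexample found
    return None

def head_upper(xs):        # buggy: assumes a list of strings
    return xs[0].upper()

def mixed_list(rng):
    """Non-empty lists mixing str and int elements."""
    return [rng.choice(["a", "b", 1, 2]) for _ in range(rng.randint(1, 5))]
```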

Phase 3: PEP Compliance

Match checker error messages against regex patterns derived from 24 PEP specifications. Determines spec conformance.

Confidence: 0.80
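A sketch of the matching idea with two illustrative patterns; the real phase derives its regexes from 24 PEP specifications:

```python
# Sketch of Phase 3 message matching: map checker diagnostics to the PEP
# they enforce via regex patterns. These two patterns are illustrative,
# not Pytifex's actual pattern set.
import re

PEP_PATTERNS = {
    "PEP 589": re.compile(r"TypedDict|missing key|extra key", re.I),
    "PEP 591": re.compile(r"\bFinal\b|cannot assign to final", re.I),
}

def matched_peps(message: str) -> list[str]:
    """PEPs whose diagnostic patterns appear in a checker error message."""
    return [pep for pep, pat in PEP_PATTERNS.items() if pat.search(message)]
```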

Key insight: Runtime behavior is the ultimate ground truth. If code raises TypeError at runtime, any checker that reported "OK" is definitively incorrect — a false negative.
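The aggregation step can be sketched as picking the highest-confidence verdict across phases; the record shape and tie-breaking rule below are assumptions:

```python
# Sketch of confidence-priority aggregation over the evaluation phases.
# Each verdict record: {"phase": int, "winner": str, "confidence": float}.
def aggregate(verdicts: list[dict]) -> dict:
    """Pick the verdict from the phase with the highest confidence.

    Ties go to the earlier-listed phase (max keeps the first maximum).
    """
    if not verdicts:
        return {"winner": "unknown", "confidence": 0.0}
    return max(verdicts, key=lambda v: v["confidence"])
```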

Getting Started

Installation

Prerequisites: Python 3.12+, uv (recommended)

git clone https://github.com/benedekaibas/pytifex-demo.git
cd pytifex-demo/src/tc_disagreement

export GEMINI_API_KEY=your_key        # Required
export GITHUB_TOKEN=ghp_your_token    # Optional (higher rate limits)

Type checkers (mypy, pyrefly, zuban, ty) are automatically installed by uv when you run the tool.

Usage

# Run the full pipeline: mine → generate → filter → evaluate
uv run main.py

# Generate until 10 disagreements are found
uv run main.py --num-examples 10

# Use a different model
uv run main.py --model gemini-2.5-pro

# Skip GitHub seed fetching (pattern-based only)
uv run main.py --no-github

# Verbose output
uv run main.py -v

Command                    Description
uv run main.py             Full pipeline (generate + evaluate)
uv run main.py generate    Generate disagreements only
uv run main.py check       Run checkers on existing examples
uv run main.py eval        Evaluate existing results

Option                 Default            Description
--num-examples N       5                  Target disagreement count
--batch-size N         15                 Examples per LLM batch
--max-attempts N       5                  Max generation attempts
--max-refinements N    2                  Refinement attempts per example
--model MODEL          gemini-2.5-flash   Gemini model to use
--eval-method          comprehensive      Evaluation method
--no-github                               Skip GitHub seed fetching
-v, --verbose                             Show all examples

Type Checkers

Pytifex tests four type checkers that span the maturity spectrum, from the long-established mypy to the alpha-stage ty:

Checker    Version          Command
mypy       1.19.0           mypy file.py
pyrefly    0.44.2           pyrefly check file.py
zuban      0.3.0            zuban check file.py
ty         0.0.1-alpha.32   ty check file.py

Divergence Patterns

The tool targets known areas of type checker disagreement:

Pattern                 Description                                                   PEPs
protocol-defaults       Protocol methods with different default argument values       544
typed-dict-total        TypedDict with mixed total/Required/NotRequired inheritance   589, 655
typeguard-narrowing     TypeGuard/TypeIs with generic type parameters                 647, 742
param-spec-decorator    ParamSpec decorators on classmethods/staticmethods            612
self-generic            Self type in generic classes with abstract methods            673
newtype-containers      NewType in containers (covariance/contravariance)             484
overload-literals       Overloaded functions with Literal type discrimination         484, 586
final-override          Final attributes overridden by properties                     591
keyword-vs-positional   Protocol callables with keyword-only parameters               544, 570
bounded-typevars        TypeVar bounds with nested generics                           484
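For concreteness, a minimal program in the spirit of the protocol-defaults pattern (PEP 544): the implementation's default value differs from the Protocol's. Whether that should be an error is the kind of question on which checkers have diverged; this particular snippet is illustrative and not guaranteed to split the current checker versions. At runtime it behaves normally:

```python
# Illustrative protocol-defaults example (PEP 544). The implementing
# class declares a different default argument value than the Protocol.
from typing import Protocol

class Greeter(Protocol):
    def greet(self, name: str = "world") -> str: ...

class LoudGreeter:
    def greet(self, name: str = "WORLD") -> str:  # different default value
        return f"hello {name}"

g: Greeter = LoudGreeter()   # is this assignment type-safe? checkers may differ
print(g.greet())
```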

Architecture

src/tc_disagreement/
├── main.py              # CLI entry point
├── pipeline.py          # Core generation/filtering pipeline
├── github_issues.py     # Mines code from GitHub issues
├── prompts.py           # LLM prompt construction
├── patterns.py          # Divergence pattern definitions
├── agent.py             # Gemini API client
├── run_checkers.py      # Type checker execution
├── comprehensive_eval.py # Multi-phase evaluation system
├── oracle.py            # AST-based PEP oracle (Phase 0)
├── hypothesis_tier2.py  # Hypothesis property testing (Phase 2)
├── source_analysis.py   # PEP rule analysis
├── static_tier4.py      # Static flow analysis (Phase 4)
├── code_metrics.py      # Code complexity metrics
├── config.py            # Checker commands and paths
└── generate_json.py     # LLM output parsing

Important: All type checker outputs are real — they come from actual tool execution on your system, not LLM simulation. The LLM is used only for generating candidate code and resolving uncertain verdicts.