Last updated: December 2025
Large language models have transformed automated code generation from a research curiosity into a practical tool used by millions of developers daily. This review covers key papers, benchmarks, and emerging patterns in the field.
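Performance of major models on standard code generation benchmarks. SWE-bench figures for models from 2024 onward are on the SWE-bench Verified subset; GPT-4's figure is from the original SWE-bench paper: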
| Model | Parameters | HumanEval pass@1 | SWE-bench Resolved | License |
|---|---|---|---|---|
| Codex (2021) | 12B | 28.8% | N/A | Proprietary |
| StarCoder (2023) | 15.5B | 33.6% | N/A | Open (BigCode) |
| CodeLlama-Python (2023) | 34B | 53.7% | N/A | Open (Meta) |
| GPT-4 (2023) | Unknown | 67.0% | 1.7% | Proprietary |
| Claude 3.5 Sonnet (2024) | Unknown | 92.0% | 49.0% | Proprietary |
| DeepSeek-V3 (2024) | 671B MoE | 82.6% | 42.0% | Open |
| Claude Opus 4 (2025) | Unknown | 95.2% | 72.5% | Proprietary |
Authors: Chen et al. (OpenAI, 2021)
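The Codex paper introduced the HumanEval benchmark
(164 hand-written Python problems) and the pass@k metric, which counts a
problem as solved if any of k sampled solutions passes its unit tests.
Key finding: sampling many solutions per problem sharply raises the chance
that at least one is correct (pass@100 reached 70.2% vs 28.8% for pass@1).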
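
The pass@k numbers above are usually computed with the unbiased estimator described in the Codex paper: draw n samples per problem, count the c that pass, and estimate the probability that a random size-k subset contains at least one passing sample. A minimal sketch follows; the function name and argument checks are illustrative choices, not taken from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed the unit tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must include a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem_counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs for each problem."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
    return sum(scores) / len(scores)
```

With n = 10 samples and c = 3 passing, `pass_at_k(10, 3, 1)` is the plain pass rate 3/10, while larger k moves the estimate toward 1.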
Authors: Li et al. (DeepMind, 2022)
AlphaCode generated millions of candidate programs per competitive programming
problem, then filtered them using the example test cases. It ranked within the
top 54% of participants on average in simulated Codeforces contests.
Key insight: large-scale sampling plus filtering outperforms careful single-shot
generation for algorithmic problems.
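
The generate-then-filter idea can be illustrated with a toy sketch. This is not the pipeline from the paper, which runs at far larger scale and adds further steps. Here candidates are plain Python source strings, each is assumed to define a function named `solve` (a hypothetical convention), and any candidate that fails an example test is discarded. Running `exec` on untrusted model output is unsafe outside a sandbox; it is used here only to keep the sketch short.

```python
from typing import Iterable


def filter_candidates(
    candidates: Iterable[str],
    examples: list[tuple[tuple, object]],
    entry_point: str = "solve",  # hypothetical naming convention
) -> list[str]:
    """Keep only candidate programs that pass every example test."""
    survivors = []
    for source in candidates:
        namespace: dict = {}
        try:
            # WARNING: exec on untrusted code needs a sandbox in practice.
            exec(source, namespace)
            fn = namespace[entry_point]
            if all(fn(*args) == expected for args, expected in examples):
                survivors.append(source)
        except Exception:
            # Syntax errors, missing entry points, and runtime crashes
            # all count as a failed candidate.
            continue
    return survivors


# Stand-ins for sampled model outputs: wrong, right, malformed, wrong.
samples = [
    "def solve(a, b):\n    return a - b",
    "def solve(a, b):\n    return a + b",
    "def solve(a, b:\n    return",
    "def solve(a, b):\n    return a * b",
]
example_tests = [((1, 2), 3), ((2, 2), 4)]
kept = filter_candidates(samples, example_tests)
```

Only the addition candidate survives here. Note that the multiplication candidate passes the second example and is rejected by the first, which shows why several example tests filter better than one.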
Authors: Jimenez et al. (Princeton, 2024)
SWE-bench is a benchmark of 2,294 real GitHub issues from 12 Python repositories.
Models must produce patches that pass the repository's test suite. See
swebench.com for the leaderboard.
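
As a rough sketch of how a "resolved" decision can be made: the SWE-bench dataset lists, per issue, tests that should go from failing to passing and tests that should keep passing. The snippet below assumes a patch counts as resolving an issue only when both groups pass. The `results` mapping and the function names are hypothetical and are not the official harness API.

```python
def is_resolved(
    results: dict[str, bool],   # hypothetical: test name -> passed after patch
    fail_to_pass: list[str],    # tests the fix is expected to turn green
    pass_to_pass: list[str],    # tests that must not regress
) -> bool:
    """Return True if the patch fixes the issue without breaking other tests."""
    # A test missing from the results is treated as a failure.
    fixed = all(results.get(name, False) for name in fail_to_pass)
    unbroken = all(results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of benchmark instances resolved, as a value in [0, 1]."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under this assumption, a patch that fixes the target test but breaks an unrelated one does not count as resolved.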
| Benchmark | What It Measures | Realism | Language |
|---|---|---|---|
| HumanEval | Isolated function generation | Low | Python |
| MBPP | Simple programming problems | Low | Python |
| SWE-bench | Real GitHub issue resolution | High | Python |
| LiveCodeBench | Fresh competitive programming | Medium | Python |
| BigCodeBench | Complex function composition | Medium | Python |
| Aider polyglot | Multi-language editing | High | Multi |
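The shift from "model generates code" to "model uses tools to iteratively develop code" has produced the largest practical gains. In this loop the model proposes an edit, runs tools such as the test suite, reads the output, and revises until the task is done.
Tools like Claude Code, Cursor, and Aider implement variations of this loop.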
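
A tool-using loop of this kind might be sketched as follows. This is a simplified illustration and not how any of the named tools is implemented: `propose_edit` stands in for a model call, `run_tests` stands in for a tool invocation, and both are hypothetical callables supplied by the caller. The toy stubs at the bottom exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolResult:
    passed: bool
    output: str  # text fed back to the model on failure


def agent_loop(
    propose_edit: Callable[[str, str], str],  # (code, feedback) -> new code
    run_tests: Callable[[str], ToolResult],   # code -> test outcome
    code: str,
    max_steps: int = 5,
) -> tuple[str, bool]:
    """Alternate between running tests and revising until they pass."""
    for _ in range(max_steps):
        result = run_tests(code)
        if result.passed:
            return code, True
        # Feed the tool output back so the next edit can react to it.
        code = propose_edit(code, result.output)
    # Step budget exhausted: report whether the last edit happens to pass.
    return code, run_tests(code).passed


# Toy stubs (illustrative only): a one-function "repo" and a scripted "model".
def toy_run_tests(code: str) -> ToolResult:
    namespace: dict = {}
    try:
        exec(code, namespace)  # unsafe for untrusted code; sandbox in practice
        ok = namespace["add"](1, 2) == 3
        return ToolResult(ok, "ok" if ok else "add(1, 2) != 3")
    except Exception as exc:
        return ToolResult(False, repr(exc))


def toy_propose_edit(code: str, feedback: str) -> str:
    return code.replace("a - b", "a + b")


final_code, success = agent_loop(
    toy_propose_edit, toy_run_tests, "def add(a, b):\n    return a - b"
)
```

The step budget matters in practice: without it, a model that keeps proposing the same failing edit would loop forever.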