Benchmark · Phenotypic Screen Prediction · Genentech · 2026

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Edward De Brouwer, Carl Edwards, Alexander Wu, Jenna Collier, Graham Heimberg, Xiner Li, Meena Subramaniam, Ehsan Hajiramezanali, David Richmond, Jan-Christian Hütter, Sara Mostafavi, Gabriele Scalia

Genentech, South San Francisco, CA, USA

Building the virtual cell requires more than predicting gene expression. Can a model predict the outcome of a CRISPR screen before you run it? AssayBench frames in silico phenotypic screening as a gene-ranking task on 1,920 public CRISPR screens and gives a single, comparable yardstick (adjusted nDCG) for measuring progress across heterogeneous assays.

TL;DR. We benchmark frontier LLMs, biology-specific LLMs, agents, trainable gene-relevance predictors, and retrieval / frequency baselines. Generalist frontier LLMs lead the board, but all methods remain far below empirically estimated performance ceilings. Fine-tuning and ensembling push the frontier further. Performance partly tracks citation counts, a pattern consistent with memorization that motivates our recent-screens "LaTest" split.

Abstract

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate screen prediction as a gene-ranking task for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.

By the numbers

1,920 public CRISPR screens
Methods evaluated spanning multiple model families
5 phenotype classes
5 ranking metrics
3 cohorts (val / test / LaTest)


How the task works

Each example is a free-text description of a CRISPR screen (cell line, library, perturbation, condition, phenotype) plus the list of genes in the screen's library. The model returns a ranked list of the genes most likely to be hits. Predictions are scored against relevance labels obtained by thresholding percentile scores from the BioGRID ORCS source data, and summarized by the adjusted nDCG (AnDCG), a chance-corrected ranking metric that is comparable across screens of very different sizes and hit rates.
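To make the chance correction concrete, here is a minimal sketch of one way to compute such a metric: normalize nDCG so a random ranking scores near 0 and a perfect ranking scores 1. The (nDCG − μ) / (1 − μ) form, the Monte Carlo estimate of μ, and the binary-relevance toy example are illustrative assumptions, not necessarily the paper's exact AnDCG definition.

```python
import numpy as np

def dcg(rel: np.ndarray) -> float:
    """Discounted cumulative gain of a relevance sequence in ranked order."""
    ranks = np.arange(1, len(rel) + 1)
    return float(np.sum(rel / np.log2(ranks + 1)))

def ndcg(rel_in_predicted_order: np.ndarray) -> float:
    """Standard nDCG: DCG of the predicted ordering over the ideal DCG."""
    ideal = np.sort(rel_in_predicted_order)[::-1]
    return dcg(rel_in_predicted_order) / dcg(ideal)

def adjusted_ndcg(rel_in_predicted_order, n_random: int = 1000, seed: int = 0) -> float:
    """Chance-corrected nDCG: (nDCG - mu) / (1 - mu), where mu is the
    expected nDCG of a uniformly random ranking, estimated here by
    Monte Carlo. A random ranking scores ~0 regardless of library size
    or hit rate. Illustrative sketch; may differ from the paper's AnDCG."""
    rng = np.random.default_rng(seed)
    rel = np.asarray(rel_in_predicted_order, dtype=float)
    mu = float(np.mean([ndcg(rng.permutation(rel)) for _ in range(n_random)]))
    return (ndcg(rel) - mu) / (1.0 - mu)

# Toy screen: 10-gene library with 2 hits; the model ranked them 1st and 4th.
rel = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)
print(round(adjusted_ndcg(rel), 3))  # well above 0; a random ranking averages ~0
```

Because the correction is applied per screen, scores of this form can be averaged across assays with very different library sizes and hit rates, which is what makes a single leaderboard number meaningful.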

The benchmark ships three evaluation cohorts: validation and test sets from a year-based train / val / test split of BioGRID screens published through 2025, plus a held-out LaTest split that is refreshed regularly with screens published in the past six months. LaTest serves as an ongoing memorization probe for new frontier models.
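To make the splitting protocol concrete, here is a minimal sketch of date-based cohort assignment. The Screen fields, function name, and cutoff parameters are hypothetical illustrations; the benchmark distributes fixed splits rather than this logic.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Screen:
    screen_id: str   # hypothetical field: a BioGRID ORCS screen identifier
    published: date  # hypothetical field: publication date of the source study

def assign_cohort(screen: Screen, val_cutoff: date, test_cutoff: date,
                  latest_start: date) -> str:
    """Date-based cohort assignment (illustrative): screens in the trailing
    six-month window go to LaTest (the memorization probe); older screens
    fall into the year-based train / val / test split."""
    if screen.published >= latest_start:
        return "LaTest"
    if screen.published >= test_cutoff:
        return "test"
    if screen.published >= val_cutoff:
        return "val"
    return "train"

# Example cutoffs only (assumed, not the benchmark's actual boundaries):
s = Screen(screen_id="ORCS-example", published=date(2025, 9, 1))
print(assign_cohort(s, date(2023, 1, 1), date(2024, 1, 1), date(2025, 7, 1)))
```

The point of the sketch is that cohort membership is a pure function of publication date, which is what allows the LaTest split to be refreshed on a rolling basis as new screens are published.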