Independent Audit: “A Bitter Lesson for Data Filtering”

Christopher Mohri, John Duchi, Tatsunori Hashimoto (Stanford) — arXiv:2605.19407v1, submitted 19 May 2026
Audit scope: wrong assumptions · cheating · irreproducibility · fraud · potential for cheating in evaluation · AI slop · security vulnerabilities.
Method: full read of the paper (main text + appendices), clone & line-by-line inspection of the authors' released code, and independent verification against 12 external sources (arXiv records, GitHub, Epoch AI, xAI model card, search engines). Audit date: 12 June 2026.

Executive Summary

The paper argues that, with enough compute, “the best data filter is no data filter”: large models trained long enough on a fixed pool of raw Common Crawl (CC) eventually achieve lower validation loss than models trained on filtered versions (RefinedWeb, DCLM-Baseline, and individual heuristic filters), and even tolerate or benefit from injected “junk” data. It predicts the full 240-trillion-token DCLM-Pool would beat RefinedWeb at roughly 10^30 FLOPs.

Bottom line: this is a legitimate, internally consistent academic preprint from verified Stanford researchers; we found no evidence of fraud, fabrication, or AI-generated slop. However, the audit identified substantive methodological weaknesses that limit the headline claim: the choice of evaluation metric structurally favors the paper's conclusion; many measured benchmarks went unreported; the marquee 10^30-FLOP prediction rests on a ~4-order-of-magnitude extrapolation through a regime its own cited prior work says is unreliable; reproduction is effectively gated on ~20,000 H200 GPU-hours and a mutable Google Drive folder; and the paper's practical advice (don't filter) ignores the well-documented data-poisoning attack surface of raw web crawls.

Fraud / fabrication: NOT FOUND

Authors, code, data sources, and all spot-checked numbers and citations are real and internally consistent.

Outright cheating: NOT FOUND

No evidence of manipulated results or test-set training.

Potential for cheating in evaluation: HIGH CONCERN

Metric choice favors the conclusion; selective benchmark reporting; best-of-5-checkpoints reporting; no error bars.

Wrong assumptions: SEVERAL FOUND

Unconstrained-compute objective, loss↔capability conflation, junk≠adversarial, 100+-epoch extrapolation.

Irreproducibility: MODERATE

Configs/scripts released (good faith), but compute-gated, single-commit repo, mutable Google Drive data, no license.

AI slop: NOT FOUND

All checked references exist and are mostly accurately characterized; two sloppy citations noted.

Security vulnerabilities: SIGNIFICANT BLIND SPOT

“No filtering” advice ignores practical poisoning attacks; trust_remote_code: true in eval configs; unversioned data distribution.

1. What the Paper Claims (verified summary)

2. Wrong Assumptions: SEVERAL FOUND

2.1 “Best achievable loss regardless of computational cost” is an economically vacuous objective

The core objective L*(D) = min_{M,N} ℓ(A(D, M, N)) deliberately ignores cost. By the paper's own Pareto analysis (Figure 2), at every compute budget actually tested (≤ ~10^20 FLOPs), filtered data dominates raw CC for most of the frontier; raw CC only wins at the very end. The paper concedes (“Compute” limitation) that below ~10^30 FLOPs filtering still matters. That is, for every training run any organization can afford today (frontier ≈ 5×10^26 FLOPs) and for the next several years, the practical recommendation implied by the title is wrong by the paper's own data.
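The distance between today's frontier and the predicted crossover is easy to make concrete. The following is a minimal sketch that uses only the two approximate figures quoted above; the variable names are ours, not the paper's.

```python
import math

# Figures quoted in this audit; both are approximate.
FRONTIER_FLOPS = 5e26    # largest training runs today (third-party estimate)
CROSSOVER_FLOPS = 1e30   # compute at which the paper predicts raw CC wins

gap_factor = CROSSOVER_FLOPS / FRONTIER_FLOPS
gap_orders = math.log10(gap_factor)

print(f"crossover is {gap_factor:,.0f}x today's frontier "
      f"({gap_orders:.1f} orders of magnitude)")
```

Under these figures the crossover sits about 2,000 times beyond the current frontier, which is the sense in which the recommendation does not apply to any run affordable today.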

2.2 Validation loss on web-derived corpora is assumed to track capability

The paper asserts NLL “is known to correlate with downstream performance.” But its own cited reference, Saada et al. (arXiv:2510.00866, verified), reports that the two decouple under quality filtering: “while CQF [classifier-based quality filtering] improves downstream task performance, it does not necessarily enhance language modeling” on high-quality text. Quality filters like DCLM-Baseline were tuned to maximize benchmark accuracy (DCLM's 53-task suite, MMLU, CORE), not C4 perplexity. Judging filters by perplexity on lightly-filtered web text is therefore an assumption that bakes in the conclusion (see §4).

2.3 Extrapolating crossings through a regime their own sources call unreliable

Empirical loss crossings used to fit the scaling law occur at up to 121.6 epochs of data repetition. Muennighoff et al. (arXiv:2305.16264, verified), the paper's own anchor for the 4-epoch fit, found that repeating data beyond ~4 epochs gives rapidly diminishing returns and that with high repetition “the value of adding compute eventually decays to zero.” The authors acknowledge non-monotone validation losses at 100+ epochs and shade those regions as unreliable, yet the headline 10^30-FLOP forecast is built by fitting second-degree polynomials (in log-log space) to a handful of points, then extrapolating pool size by >4 orders of magnitude (10B → 240T tokens). An R² > 0.99 on ~4–5 fitted points is not meaningful evidence of out-of-range validity.
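The last point can be illustrated with a synthetic example. The sketch below does not use the paper's data: it assumes a hypothetical smooth, non-polynomial "true" relationship between log10(pool tokens) and log10(crossover FLOPs), samples five points up to a 10B-token pool, fits a quadratic in log-log space, and extrapolates to 240T tokens. The function true_log_flops and every constant in it are invented for illustration only.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*u + c*u**2, with u = x - mean(xs)."""
    n = len(xs)
    x0 = sum(xs) / n
    us = [x - x0 for x in xs]
    # Normal equations (V^T V) coef = V^T y, where rows of V are [1, u, u^2].
    s = [sum(u ** k for u in us) for k in range(5)]
    t = [sum(y * u ** k for u, y in zip(us, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [p - f * q for p, q in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return lambda x: a + b * (x - x0) + c * (x - x0) ** 2

def r_squared(ys, preds):
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# HYPOTHETICAL ground truth: smooth, monotone, and not a polynomial.
def true_log_flops(log_tokens):
    return 18 + 2 * math.exp(0.3 * (log_tokens - 9))

xs = [8.8, 9.1, 9.4, 9.7, 10.0]    # log10(pool tokens), five points up to 10B
ys = [true_log_flops(x) for x in xs]
model = fit_quadratic(xs, ys)

r2 = r_squared(ys, [model(x) for x in xs])
x_far = math.log10(240e12)          # the 240T-token pool, >4 orders beyond the fit
extrap_error = model(x_far) - true_log_flops(x_far)
print(f"in-range R^2 = {r2:.6f}")
print(f"extrapolation error = {extrap_error:+.2f} orders of magnitude in FLOPs")
```

In this toy setting the in-range fit is essentially perfect, yet the extrapolated crossover is off by more than an order of magnitude. A different assumed ground truth would give a different error, which is exactly the problem: the in-range R² cannot distinguish between them.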

2.4 “Junk data” is assumed to be random, not adversarial

The robustness experiments use statistically random noise (random strings, shuffled words). The security literature shows the dangerous case is adversarial data, which is the primary thing filters remove on purpose (see §7). The paper's only counterargument — a GPT5-mini study of 4 MMLU subjects finding more “supporting” than “refuting” CC documents — is an LLM-as-judge analysis with no human validation, no judge-calibration, unclear denominators (“average judgements” per keyword-matched sample), and covers only naturally occurring misinformation, not injected attacks.

2.5 Smaller technical assumptions

3. Cheating & Fraud: NOT FOUND

4. Potential for Cheating in Evaluation: HIGH CONCERN

While we found no dishonesty, the evaluation design contains three mechanisms that systematically favor the paper's thesis. Reviewers and readers should treat the headline comparison as metric-dependent.

4.1 The headline metric is biased toward the unfiltered pool

“Dataset value” is measured almost entirely by NLL on C4-English, FineWeb-Edu and Cosmopedia averages. C4 and FineWeb-Edu are themselves filtered Common Crawl — distributionally far closer to raw CC than to DCLM-Baseline's heavily classifier-selected text. A model trained on raw CC gets an automatic distribution-match advantage on web-text perplexity that says little about knowledge or reasoning. The independently verified Saada et al. result (§2.2) shows precisely this dissociation, and DCLM itself (arXiv:2406.11794, verified: 240T-token pool, model-based filtering “key” to its results) was validated on 53 downstream tasks, not web perplexity. Conclusion-flipping risk: on MMLU-style evaluations, DCLM-Baseline-trained models are known to dominate at these scales, and the paper never tests that.

4.2 Measured-but-unreported benchmarks

The authors' released eval config (llama_*_dataless.yaml) computes hellaswag, boolq, piqa, social_iqa, winogrande, openbookqa, arc_easy, arc_challenge, race, commonsense_qa, gpqa_main_n_shot, truthfulqa_mc2 on every run — and mmlu is present but commented out. The paper reports only 3 of ~12 measured benchmarks (ARC-Easy, PIQA, SocialIQA), justified as the ones “easy enough … to provide signal at our scale.” That justification is plausible, but the selection was made post hoc by the authors, and the unreported results are not provided even in an appendix. This is the single largest selective-reporting risk in the paper.
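To make the reporting gap explicit, the measured and reported task sets can be compared directly. This is a small sketch based only on the task names listed above; mapping ARC-Easy, PIQA, and SocialIQA to the identifiers arc_easy, piqa, and social_iqa is our assumption.

```python
# Tasks enabled in the released eval config, as listed above.
measured = [
    "hellaswag", "boolq", "piqa", "social_iqa", "winogrande", "openbookqa",
    "arc_easy", "arc_challenge", "race", "commonsense_qa",
    "gpqa_main_n_shot", "truthfulqa_mc2",
]
# Tasks whose results appear in the paper (ARC-Easy, PIQA, SocialIQA).
reported = {"arc_easy", "piqa", "social_iqa"}

# Preserve config order so the output is easy to compare against the YAML.
unreported = [task for task in measured if task not in reported]
print(f"{len(unreported)} of {len(measured)} measured tasks unreported:")
print(", ".join(unreported))
```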

4.3 Weak downstream evidence presented as confirmation

The three reported benchmarks sit barely above chance (SocialIQA 0.34–0.39 vs 0.333 random; ARC-Easy 0.26–0.42 vs 0.25; PIQA 0.52–0.64 vs 0.50), are described by the authors themselves as “much noisier,” and show rank reshuffling across model sizes. They cannot meaningfully confirm that “the trends are roughly the same.” Additionally, the paper reports the best of 5 checkpoints per run (a mild optimistic-selection bias) and contains no repeated-seed variance for any headline number.
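How close to chance these ranges are depends on test-set size, which the paper's reported numbers do not let us pin down. The sketch below computes a z-score against the chance rate using a binomial standard error; N_ITEMS is a hypothetical test-set size chosen for illustration, not a value from the paper or the benchmarks.

```python
import math

def z_vs_chance(accuracy, chance, n_items):
    """z-score of an observed accuracy against the chance rate (binomial SE)."""
    se = math.sqrt(chance * (1 - chance) / n_items)
    return (accuracy - chance) / se

N_ITEMS = 2000  # HYPOTHETICAL test-set size, not taken from the paper

# (benchmark, low end, high end, chance rate), ranges as quoted above.
for name, low, high, chance in [
    ("SocialIQA", 0.34, 0.39, 1 / 3),
    ("ARC-Easy", 0.26, 0.42, 0.25),
    ("PIQA", 0.52, 0.64, 0.50),
]:
    z_low = z_vs_chance(low, chance, N_ITEMS)
    z_high = z_vs_chance(high, chance, N_ITEMS)
    print(f"{name}: z = {z_low:.1f} (low end) to {z_high:.1f} (high end)")
```

At this assumed size, the low end of each range falls within two standard errors of chance while the high end is clearly above it, so any ranking that involves the weaker runs is not interpretable without per-benchmark item counts and seed variance.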

5. Irreproducibility Issues: MODERATE

The release is a genuine good-faith reproducibility effort, but several gaps remain:

Item | Status | Detail (verified by cloning github.com/chrismohrii/bitter-lesson-data-filtering)
Code | PARTIAL | Repo exists (6 stars, single commit 21b295b, 16 May 2026). Contains only sweep/config/data-gen glue; training and filtering live in external repos (Meta Lingua, DCLM) with no pinned commit hashes or version tags, so future upstream changes can silently change results. No development history to audit.
Data | FRAGILE | Pre-processed pools are distributed via a personal Google Drive folder: mutable, unversioned, no checksums, no DOI, and liable to vanish or be altered at any time. DCLM-Pool sampling seed for the original 670M/2B/10B draws is not recorded in the repo.
Seeds | MOSTLY OK | Training sweeps fix seed=1 model.seed=1 data.seed=1, but the junk-data generators default to --seed None (unseeded RNG) unless the flag is passed.
Configs | OK | Hyperparameters in repo match Table 2; !!!CHANGE_THIS!!! placeholders for cluster paths are normal practice. Note: the “tuned weight decay in [0.1, 0.5]” is in fact a 2-point grid {0.1, 0.5}.
Compute | GATED | >20,000 H200 GPU-hours (≈ low-six-figure USD) to replicate; the central 10^30-FLOP claim is, by construction, unfalsifiable for years (frontier runs are ~5×10^26 FLOPs today).
License | MISSING | The repo has no license file, so third parties technically have no right to reuse or modify the code (the paper itself is CC-BY-4.0).
LLM-judge study | NOT REPRODUCIBLE | Table 1 depends on GPT5-mini (a closed, version-drifting commercial model); prompts/temperature/document sample lists are not in the repo.
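The seeding gap noted under Seeds is cheap to close. The following is a minimal sketch of a junk-data generator whose seed is mandatory rather than defaulting to None; the function name, flags other than --seed, and the junk format are hypothetical and are not taken from the authors' repository.

```python
import argparse
import random
import string

def random_string_junk(n_docs, doc_len, seed):
    """Generate n_docs random lowercase strings from a dedicated, seeded RNG."""
    # random.Random(None) would seed from OS entropy or the clock,
    # which is the unseeded behavior described above.
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_lowercase, k=doc_len))
            for _ in range(n_docs)]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical junk-data generator")
    # required=True removes the silent unseeded default.
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--n-docs", type=int, default=3)
    parser.add_argument("--doc-len", type=int, default=16)
    return parser.parse_args(argv)

# Usage example with an explicit argument list instead of sys.argv.
args = parse_args(["--seed", "1"])
docs = random_string_junk(args.n_docs, args.doc_len, args.seed)
for doc in docs:
    print(doc)
```

Using a dedicated random.Random instance rather than the module-level functions keeps the junk stream independent of any other code that touches the global RNG.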

6. AI Slop & Citation Accuracy: NOT FOUND (2 SLOPPY CITATIONS)

The paper shows none of the hallmarks of LLM-generated content: prose is coherent and specific, the theory is correct, figures are mutually consistent, and every reference we spot-checked exists (DCLM 2406.11794; Muennighoff 2305.16264; Villalobos 2211.04325; Saada 2510.00866; Allen-Zhu & Li 2404.05405; Ru 2502.06604; Kim 2509.14786; Li 2505.04741; Fang 2503.07879; Goyal 2404.07177; Sardana 2401.00448; Sutton's Bitter Lesson; Meta Lingua repo; Epoch AI report). Two citation problems were found by checking the primary sources ourselves:

7. Security Vulnerabilities: SIGNIFICANT BLIND SPOT

7.1 “Train on everything” ignores practical data poisoning (the big one)

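The paper's recommendation, that at scale one should train directly on raw Common Crawl, addresses only random junk and never engages with adversarial data, despite two independently verified results that directly undercut the safety of that advice:

- arXiv:2510.07192 (Souly et al., Anthropic/AISI): roughly 250 planted documents can backdoor a model of any size.
- arXiv:2302.10149 (Carlini et al.): CC-style dataset poisoning is practical for about $60.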

Filtering pipelines are one of the few existing (if imperfect) defenses against exactly this. Notably, the very Grok 4 model card the paper cites states that xAI applies processes to “ensure data quality and safety prior to training.” A paper advising “no filter” should, at minimum, scope its claim to non-adversarial settings; it does not, and its only factuality audit (GPT5-mini on 4 MMLU subjects) measures naturally occurring misinformation, not attacks. Raw CC also contains PII, CSAM-adjacent material, and toxic content that filters are legally and ethically required to remove — entirely unaddressed.

7.2 Concrete issues in the released artifacts

8. Fact-Check Ledger

Claim in paper | Verdict | Evidence
DCLM-Pool is 240T tokens of pre-2023 CC | TRUE | DCLM abstract, arXiv:2406.11794 (“standardized corpus of 240T tokens extracted from Common Crawl”)
DCLM-Baseline ≈ 3.8T tokens, “~1%” of CC | ≈TRUE | 3.8T/240T = 1.6%; consistent with DCLM dataset release
Muennighoff et al.: ~4 epochs before diminishing returns | TRUE | arXiv:2305.16264 abstract (verified); same source warns repeated-data value → 0, which undercuts the paper's 121-epoch extrapolations
Internet text stock 200–500T tokens (Villalobos et al.) | CONSISTENT | arXiv:2211.04325 (Epoch AI; median ≈300T tokens)
Epoch AI forecasts 1e29-FLOP runs by 2030 | TRUE | epoch.ai “What will AI look like in 2030?” (D. Owen, Sep 2025): clusters “could support training runs of about 10^29 FLOP”
Frontier compute ≈ 5e26 FLOPs “[xAI 2025]” | NUMBER PLAUSIBLE, SOURCE WRONG | Cited Grok-4 model card PDF (downloaded) contains no FLOPs figure; 5e26 is Epoch AI's third-party estimate
600 tokens/param “following DeepSeek V4” | UNVERIFIABLE | DeepSeek V4 exists (24 Apr 2026, MoE 1.6T/49B-active) but no source for 600:1; no bibliography entry; active-vs-total ambiguity
Config files released on GitHub | TRUE | github.com/chrismohrii/bitter-lesson-data-filtering (cloned & inspected; matches Table 2)
Tokenizer NLL of uniform-random ≈ 10.8 | TRUE | repo ships tiktoken r50k_base; ln(50257) = 10.825
Batch size 2^19 tokens; wd “tuned in [0.1,0.5]” | TRUE (caveat) | Sweep script: 64×1024×8 = 524,288 ✓; weight decay is a 2-point grid {0.1, 0.5}, not a continuous tune
“NLL is known to correlate with downstream performance” | CONTESTED | Saada et al. arXiv:2510.00866 (verified): quality filtering improves benchmarks without improving NLL, a direct counterexample for this paper's setting
Benchmarks “trends are roughly the same” as loss | OVERSTATED | Appendix B values are near chance; 9 of ~12 measured benchmarks (incl. commented-out MMLU) unreported
Junk-robustness implies raw CC is safe to train on | UNSUPPORTED | arXiv:2510.07192 (250-doc backdoors, any scale); arXiv:2302.10149 ($60 CC-style poisoning); adversarial case untested
Authors are Stanford researchers | TRUE | Google Scholar (verified stanford.edu email), Duchi group page, Stanford NLP people page
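The purely arithmetic entries in the ledger can be rechecked in a few lines. This is a minimal sketch using only the numbers quoted in the rows above; the variable names are ours.

```python
import math

# Uniform-random baseline: NLL of guessing uniformly over a 50,257-token vocabulary.
vocab_size = 50257
uniform_nll = math.log(vocab_size)   # natural log, nats per token

# Batch size implied by the sweep script: 64 x 1024 x 8 tokens.
batch_tokens = 64 * 1024 * 8

# DCLM-Baseline as a share of DCLM-Pool (3.8T of 240T tokens).
baseline_share = 3.8e12 / 240e12

print(f"uniform NLL    = {uniform_nll:.3f}")
print(f"batch tokens   = {batch_tokens} (2^19 = {2 ** 19})")
print(f"baseline share = {baseline_share:.1%}")
```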

9. Overall Assessment

This is a real, competent, and honestly-limited research preprint — not fraud, not AI slop, and not deliberate cheating. Its core empirical observation (within these pools and this metric, big patient models close the gap with filtered data, and the paper's own Pareto frontier still shows filters winning at practical budgets) appears genuine and is backed by released, internally consistent code.

However, the memorable headline, “the best data filter is no data filter,” outruns the evidence in four specific, verifiable ways: (1) it holds only under a compute-unbounded objective the paper's own figures show is irrelevant below ~10^30 FLOPs; (2) it is demonstrated on a perplexity metric that structurally favors unfiltered web data, while most of the measured benchmark suite (including MMLU, where quality-filtered data is known to excel) went unreported; (3) the 10^30-FLOP forecast extrapolates 4+ orders of magnitude through an epoching regime its own cited prior work labels unreliable; and (4) “no filtering” is unsafe advice in an adversarial web where ~250 planted documents can backdoor a model of any size. Readers should treat the paper as evidence that aggressive quality filtering has compute-dependent costs on web-text loss, not as evidence that data curation is obsolete.

Audit provenance: Paper read in full (main text + Appendices A–C) from the arXiv v1 PDF. Code audit: full clone of chrismohrii/bitter-lesson-data-filtering @ 21b295b. External sources visited and verified directly (12): arXiv:2605.19407 abstract page · GitHub repo · arXiv:2406.11794 (DCLM) · arXiv:2305.16264 (Muennighoff) · arXiv:2211.04325 (Villalobos) · arXiv:2510.00866 (Saada) · arXiv:2510.07192 (Souly/Anthropic-AISI) · arXiv:2302.10149 (Carlini) · arXiv:2404.05405 (Allen-Zhu & Li) · epoch.ai AI-2030 report · DeepSeek V4 release coverage & api-docs.deepseek.com · author-identity searches (Google Scholar / Stanford pages); plus the downloaded xAI Grok-4 model-card PDF. Every external factual statement in this report was checked against these primary sources on 12 June 2026. Statements about figure values reflect the v1 PDF text layer; minor PDF-extraction imprecision is possible for graphical elements.