Executive Summary
The paper argues that, with enough compute, “the best data filter is no data filter”: large models trained long enough on a fixed pool of raw Common Crawl (CC) eventually achieve lower validation loss than models trained on filtered versions (RefinedWeb, DCLM-Baseline, and individual heuristic filters), and even tolerate or benefit from injected “junk” data. It predicts the full 240-trillion-token DCLM-Pool would beat RefinedWeb at roughly 10^30 FLOPs.
Bottom line: this is a legitimate, internally consistent academic preprint from verified Stanford researchers; we found no evidence of fraud, fabrication, or AI-generated slop. However, the audit identified substantive methodological weaknesses that limit the headline claim: the choice of evaluation metric structurally favors the paper's conclusion; many measured benchmarks went unreported; the marquee 10^30-FLOPs prediction rests on a ~4-order-of-magnitude extrapolation through a regime its own cited prior work says is unreliable; reproduction is effectively gated on ~20,000 H200 GPU-hours and a mutable Google Drive folder; and the paper's practical advice (don't filter) ignores the well-documented data-poisoning attack surface of raw web crawls.
- Authenticity (§3): authors, code, data sources, and all spot-checked numbers and citations are real and internally consistent.
- Fraud (§3): no evidence of manipulated results or test-set training.
- Evaluation (§4): metric choice favors the conclusion; selective benchmark reporting; best-of-5-checkpoints reporting; no error bars.
- Assumptions (§2): unconstrained-compute objective, loss↔capability conflation, junk≠adversarial, 100+-epoch extrapolation.
- Reproducibility (§5): configs/scripts released (good faith), but compute-gated, single-commit repo, mutable Google Drive data, no license.
- Citations (§6): all checked references exist and are mostly accurately characterized; two sloppy citations noted.
- Security (§7): “no filtering” advice ignores practical poisoning attacks; trust_remote_code: true in eval configs; unversioned data distribution.
1. What the Paper Claims (verified summary)
- Setup: random CC subsets (670M–10B tokens) from DCLM-Pool; Llama-style dense transformers (15M–7B params) trained with Meta Lingua; context 1024; batch 2^19 tokens; steps swept in powers of 2; weight decay tuned; >20,000 H200 GPU-hours.
- Headline metric: average negative log-likelihood (NLL) on C4-English, FineWeb-Edu, and Cosmopedia. Three downstream benchmarks (ARC-Easy, PIQA, SocialIQA) appear only in Appendix B.
- Result 1: at ≥330M params with enough epochs, unfiltered CC achieves lower avg NLL (best 3.37 at 1B) than all five filtered variants (DCLM-Baseline worst at 5.29 — it retains only 2.1% of the pool).
- Result 2: injecting up to +200% random-string data or +800% shuffled-word documents barely hurts large models; shuffled injections even beat the clean pool at 330M/1B (e.g., 3.36 vs 3.40).
- Result 3: two scaling-law fits predict raw 240T-token CC overtakes RefinedWeb near 10^30 FLOPs (fits: 9.0×10^29 and 3.6×10^30).
- Supporting analyses: a GPT5-mini corpus study claiming CC documents “supporting” MMLU answers outnumber “refuting” ones ~10–60×, and a correct low-rank matrix-factorization proposition showing capacity-limited models suffer task interference while high-rank models absorb orthogonal “junk” tasks for free.
2. Wrong Assumptions (several found)
2.1 “Best achievable loss regardless of computational cost” is an economically vacuous objective
The core objective L*(D) = min_{M,N} ℓ(A(D, M, N)) deliberately ignores cost. By the paper's own Pareto analysis (Figure 2), at every compute budget actually tested (≤ ~10^20 FLOPs), filtered data dominates raw CC for most of the frontier; raw CC only wins at the very end. The paper concedes (“Compute” limitation) that below ~10^30 FLOPs filtering still matters. In other words, for every training run any organization can afford today (frontier ≈ 5×10^26 FLOPs) and for the next several years, the practical recommendation implied by the title is wrong by the paper's own data.
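To make the size of that gap concrete, the following minimal sketch computes the ratios between the compute figures quoted above. The constants are the approximate values cited in this report, not independently derived numbers, and the helper name is ours.

```python
import math

# Approximate figures quoted in this report (not re-derived here).
FRONTIER_FLOPS = 5e26    # largest training runs today
CROSSOVER_FLOPS = 1e30   # paper's predicted raw-CC crossover
TESTED_MAX_FLOPS = 1e20  # largest budget on the paper's Pareto frontier


def orders_of_magnitude(low: float, high: float) -> float:
    """Return log10(high / low)."""
    return math.log10(high / low)


gap_to_frontier = CROSSOVER_FLOPS / FRONTIER_FLOPS
oom_to_frontier = orders_of_magnitude(FRONTIER_FLOPS, CROSSOVER_FLOPS)
oom_to_tested = orders_of_magnitude(TESTED_MAX_FLOPS, CROSSOVER_FLOPS)

print(f"crossover / frontier: {gap_to_frontier:.0f}x ({oom_to_frontier:.1f} orders of magnitude)")
print(f"crossover / largest tested budget: {oom_to_tested:.0f} orders of magnitude")
```

Under these figures the predicted crossover sits about 2,000× above today's frontier and roughly ten orders of magnitude above the largest budget the paper actually ran.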
2.2 Validation loss on web-derived corpora is assumed to track capability
The paper asserts NLL “is known to correlate with downstream performance.” But its own cited reference, Saada et al. (arXiv:2510.00866, verified), shows that the two decouple under quality filtering: “while CQF [classifier-based quality filtering] improves downstream task performance, it does not necessarily enhance language modeling” on high-quality text. Quality filters like DCLM-Baseline were tuned to maximize benchmark accuracy (DCLM's 53-task suite, MMLU, CORE), not C4 perplexity. Judging filters by perplexity on lightly-filtered web text is therefore an assumption that bakes in the conclusion (see §4).
2.3 Extrapolating crossings through a regime their own sources call unreliable
Empirical loss crossings used to fit the scaling law occur at up to 121.6 epochs of data repetition. Muennighoff et al. (arXiv:2305.16264, verified), the paper's own anchor for the 4-epoch fit, found that repeating data beyond ~4 epochs gives rapidly diminishing returns and that with high repetition “the value of adding compute eventually decays to zero.” The authors acknowledge non-monotone validation losses at 100+ epochs and shade those regions as unreliable, yet the headline 10^30-FLOPs forecast is built by fitting second-degree polynomials (in log-log space) to a handful of points, then extrapolating pool size by >4 orders of magnitude (10B → 240T tokens). An R² > 0.99 on ~4–5 fitted points is not meaningful evidence of out-of-range validity.
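The last point can be illustrated with a purely synthetic sketch. The ground-truth curve, the five sample points, and the choice of x as "log10 of pool size relative to the smallest pool" are all assumptions invented for illustration; none of it is the paper's data or fitting code. The sketch fits a degree-2 polynomial to five points from a saturating curve, then extrapolates as far beyond the fitted range as 10B → 240T tokens.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    # Power sums for the 3x3 system (A^T A) c = A^T y with basis [1, x, x^2].
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        rest = sum(m[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (m[i][3] - rest) / m[i][i]
    return coeffs


def predict(coeffs, x):
    return coeffs[0] + coeffs[1] * x + coeffs[2] * x * x


def r_squared(xs, ys, coeffs):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def true_curve(x):
    # SYNTHETIC saturating ground truth, invented for this illustration.
    return 30.0 - 10.0 * math.exp(-x)


# Five fitted points spanning one order of magnitude (x in [0, 1]).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [true_curve(x) for x in xs]
coeffs = fit_quadratic(xs, ys)
r2 = r_squared(xs, ys, coeffs)

# Extrapolate as far past the largest fitted point as 10B -> 240T tokens.
x_far = 1.0 + math.log10(240e12 / 10e9)
extrapolation_error = abs(predict(coeffs, x_far) - true_curve(x_far))

print(f"in-range R^2: {r2:.5f}")
print(f"extrapolation distance: {x_far - 1.0:.2f} orders of magnitude")
print(f"absolute error at the extrapolated point: {extrapolation_error:.1f}")
```

In this toy setting the in-range fit is nearly perfect while the extrapolated value lands far from the true curve. That says nothing about which way the paper's own fit errs, only that in-range R² cannot distinguish the two cases.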
2.4 “Junk data” is assumed to be random, not adversarial
The robustness experiments use statistically random noise (random strings, shuffled words). The security literature shows the dangerous case is adversarial data, which is the primary thing filters remove on purpose (see §7). The paper's only counterargument — a GPT5-mini study of 4 MMLU subjects finding more “supporting” than “refuting” CC documents — is an LLM-as-judge analysis with no human validation, no judge-calibration, unclear denominators (“average judgements” per keyword-matched sample), and covers only naturally occurring misinformation, not injected attacks.
2.5 Smaller technical assumptions
- Contamination dismissal: “Since our experiments use pool sizes of only up to 10B tokens, we do not expect to suffer from test set contamination.” Contamination probability scales with sample size but is not eliminated; more importantly the validation sets themselves (C4 = filtered CC; FineWeb-Edu = filtered CC) overlap the training distribution by construction, which is a train/eval distribution-match advantage for the unfiltered pool independent of contamination.
- MoE-derived ratio applied to dense models: the 600 tokens-per-parameter anchor “following DeepSeek V4” is taken from a Mixture-of-Experts model (1.6T total / 49B active params — verified released 24 Apr 2026) while all experiments are dense; the paper itself warns MoE behavior may differ. The ratio also has no bibliography entry and could not be independently verified (see §6).
- Within-noise “benefits”: claims that models “in fact benefit” from poor data rest on margins like +20% random 3.38 vs pool 3.40, or +400% shuffled 3.36 vs 3.40, from single runs with no error bars, no repeated seeds, and no significance testing. The benchmark appendix shows rank reshuffling across panels consistent with noise.
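As a sketch of the check that repeated seeds would enable, the snippet below compares two conditions using the difference of means divided by its standard error. The per-seed losses are hypothetical values invented for illustration (the paper reports single runs), and the function name is ours.

```python
import statistics


def margin_in_standard_errors(runs_a, runs_b):
    """Difference of means divided by its (Welch-style) standard error."""
    mean_a, mean_b = statistics.mean(runs_a), statistics.mean(runs_b)
    var_a, var_b = statistics.variance(runs_a), statistics.variance(runs_b)
    standard_error = (var_a / len(runs_a) + var_b / len(runs_b)) ** 0.5
    return (mean_a - mean_b) / standard_error


# HYPOTHETICAL per-seed validation losses; not taken from the paper.
pool_only = [3.40, 3.43, 3.37, 3.41, 3.39]
pool_plus_junk = [3.38, 3.41, 3.36, 3.40, 3.35]

z = margin_in_standard_errors(pool_only, pool_plus_junk)
print(f"mean gap = {statistics.mean(pool_only) - statistics.mean(pool_plus_junk):.3f}")
print(f"gap in standard errors = {z:.2f}")
```

With a seed-to-seed spread of only a few hundredths of a nat, a 0.02 gap comes out below two standard errors. Whether the real spread is that large is exactly what the missing repeated seeds would reveal.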
3. Cheating & Fraud (not found)
- Authors verified real: Christopher Mohri is a Stanford CS PhD student (verified via Google Scholar profile with verified stanford.edu email, John Duchi's group page web.stanford.edu/~jduchi/group.html, and the Stanford NLP people page). John Duchi and Tatsunori Hashimoto are established Stanford professors. The self-citation pattern (Awasthi–Cortes–Mohri AISTATS 2023; Cheng–Asi–Duchi) matches the authors' real prior work.
- Internal consistency checks passed: e.g., Figure 1's 1B-model values (CC 3.37, RW 3.93) match Figure 5's 700M-pool panel; the claimed batch size of 2^19 = 524,288 tokens matches the released sweep script (64 batch × 1024 seq × 8 GPUs); the claimed vocabulary giving −log(1/V) ≈ 10.8 matches the released r50k_base tokenizer (ln 50257 = 10.825); Table 2 architecture/hyperparameters match the released YAML configs.
- Theory verified: Proposition 7.1's proof (Appendix C) is a standard and correct Eckart–Young–Mirsky argument; Fact C.2's algebra checks out.
- Code matches paper: the junk-data generator in the repo implements exactly the described 10,000-word random vocabulary (3–8 lowercase chars, 2,000 words/doc) and document-level word shuffling.
- No test-on-train, no manipulated figures, no fabricated references were found.
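The two arithmetic consistency checks in the list above can be reproduced in a few lines. The numbers are the ones quoted in this report; the variable names are ours.

```python
import math

# Sweep-script values quoted above: per-GPU batch x sequence length x GPUs.
batch_tokens = 64 * 1024 * 8
print(batch_tokens, batch_tokens == 2 ** 19)

# NLL of a uniform guess over the r50k_base vocabulary (50,257 tokens).
uniform_nll = math.log(50257)
print(f"{uniform_nll:.3f}")
```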
4. Potential for Cheating in Evaluation (high concern)
While we found no dishonesty, the evaluation design contains three mechanisms that systematically favor the paper's thesis. Reviewers and readers should treat the headline comparison as metric-dependent.
4.1 The headline metric is biased toward the unfiltered pool
“Dataset value” is measured almost entirely by average NLL on C4-English, FineWeb-Edu, and Cosmopedia. C4 and FineWeb-Edu are themselves filtered Common Crawl, distributionally far closer to raw CC than to DCLM-Baseline's heavily classifier-selected text. A model trained on raw CC gets an automatic distribution-match advantage on web-text perplexity that says little about knowledge or reasoning. The independently verified Saada et al. result (§2.2) shows precisely this dissociation, and DCLM itself (arXiv:2406.11794, verified: 240T-token pool, model-based filtering “key” to its results) was validated on 53 downstream tasks, not web perplexity. Conclusion-flipping risk: on MMLU-style evaluations, DCLM-Baseline-trained models are known to dominate at these scales, and the paper never tests that.
4.2 Measured-but-unreported benchmarks
The authors' released eval config (llama_*_dataless.yaml) computes hellaswag, boolq, piqa, social_iqa, winogrande, openbookqa, arc_easy, arc_challenge, race, commonsense_qa, gpqa_main_n_shot, truthfulqa_mc2 on every run — and mmlu is present but commented out. The paper reports only 3 of ~12 measured benchmarks (ARC-Easy, PIQA, SocialIQA), justified as the ones “easy enough … to provide signal at our scale.” That justification is plausible, but the selection was made post hoc by the authors, and the unreported results are not provided even in an appendix. This is the single largest selective-reporting risk in the paper.
4.3 Weak downstream evidence presented as confirmation
The three reported benchmarks sit barely above chance (SocialIQA 0.34–0.39 vs 0.333 random; ARC-Easy 0.26–0.42 vs 0.25; PIQA 0.52–0.64 vs 0.50), are described by the authors themselves as “much noisier,” and show rank reshuffling across model sizes. They cannot meaningfully confirm that “the trends are roughly the same.” Additionally, the paper reports the best of 5 checkpoints per run (a mild optimistic-selection bias) and contains no repeated-seed variance for any headline number.
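A rough sense of how close "barely above chance" is can be had from the binomial standard error at chance accuracy. In the sketch below, N_QUESTIONS = 2000 is an assumed placeholder for the evaluation-set size (we did not take it from the paper or the repo); the accuracy ranges and chance levels are the ones quoted above.

```python
import math


def standard_error_at_chance(chance: float, n_questions: int) -> float:
    """Binomial standard error of accuracy for a model guessing at chance."""
    return math.sqrt(chance * (1.0 - chance) / n_questions)


def z_above_chance(accuracy: float, chance: float, n_questions: int) -> float:
    """How many standard errors an observed accuracy sits above chance."""
    return (accuracy - chance) / standard_error_at_chance(chance, n_questions)


N_QUESTIONS = 2000  # ASSUMPTION: placeholder evaluation-set size

# (lowest reported accuracy, highest reported accuracy, chance level)
REPORTED = {
    "SocialIQA": (0.34, 0.39, 1.0 / 3.0),
    "ARC-Easy": (0.26, 0.42, 0.25),
    "PIQA": (0.52, 0.64, 0.50),
}

z_low = {}
for name, (low, high, chance) in REPORTED.items():
    z_low[name] = z_above_chance(low, chance, N_QUESTIONS)
    z_high = z_above_chance(high, chance, N_QUESTIONS)
    print(f"{name}: low end {z_low[name]:.2f} SE above chance, high end {z_high:.2f} SE")
```

Under that assumed set size, the low end of every reported range sits within two standard errors of chance, so differences of a point or two between data conditions are hard to separate from sampling noise.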
5. Irreproducibility Issues (moderate)
The release is a genuine good-faith reproducibility effort, but several gaps remain:
| Item | Status | Detail (verified by cloning github.com/chrismohrii/bitter-lesson-data-filtering) |
|---|---|---|
| Code | PARTIAL | Repo exists (6 stars, single commit 21b295b, 16 May 2026). Contains only sweep/config/data-gen glue; training and filtering live in external repos (Meta Lingua, DCLM) with no pinned commit hashes or version tags — future upstream changes can silently change results. No development history to audit. |
| Data | FRAGILE | Pre-processed pools are distributed via a personal Google Drive folder — mutable, unversioned, no checksums, no DOI, can vanish or be altered at any time. DCLM-Pool sampling seed for the original 670M/2B/10B draws is not recorded in the repo. |
| Seeds | MOSTLY OK | Training sweeps fix seed=1 model.seed=1 data.seed=1; but the junk-data generators default to --seed None (unseeded RNG) unless the flag is passed. |
| Configs | OK | Hyperparameters in repo match Table 2; !!!CHANGE_THIS!!! placeholders for cluster paths are normal practice. Note: the “tuned weight decay in [0.1, 0.5]” is in fact a 2-point grid {0.1, 0.5}. |
| Compute | GATED | >20,000 H200 GPU-hours (≈ low-six-figure USD) to replicate; the central 10^30-FLOPs claim is, by construction, unfalsifiable for years (frontier runs are ~5×10^26 FLOPs today). |
| License | MISSING | The repo has no license file, so third parties technically have no right to reuse or modify the code (the paper itself is CC-BY-4.0). |
| LLM-judge study | NOT REPRODUCIBLE | Table 1 depends on GPT5-mini (a closed, version-drifting commercial model); prompts/temperature/document sample lists are not in the repo. |
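The "no checksums" gap in the Data row is cheap to close. The following is a minimal sketch of a SHA-256 manifest check that a release could ship alongside the data; the function names and the manifest layout (relative path → hex digest) are our assumptions, not something present in the repo.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs."""
    bad = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of_file(target) != expected:
            bad.append(relative_path)
    return bad
```

A replicator would download the folder, run verify_manifest against a manifest committed to the Git repository, and refuse to train if the returned list is non-empty. Committing the manifest ties the mutable folder to an immutable commit hash.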
6. AI Slop & Citation Accuracy (AI slop not found; 2 sloppy citations)
The paper shows none of the hallmarks of LLM-generated content: prose is coherent and specific, the theory is correct, figures are mutually consistent, and every reference we spot-checked exists (DCLM 2406.11794; Muennighoff 2305.16264; Villalobos 2211.04325; Saada 2510.00866; Allen-Zhu & Li 2404.05405; Ru 2502.06604; Kim 2509.14786; Li 2505.04741; Fang 2503.07879; Goyal 2404.07177; Sardana 2401.00448; Sutton's Bitter Lesson; Meta Lingua repo; Epoch AI report). Two citation problems were found by checking the primary sources ourselves:
- The xAI citation does not contain the claimed number. The paper states frontier pretraining compute is “near 5e26 [xAI, 2025],” citing the Grok 4 model card (data.x.ai/2025-08-20-grok-4-model-card.pdf). We downloaded that PDF: it contains no FLOPs figure at all. The ~5×10^26 estimate for Grok 4 originates from Epoch AI's third-party estimates. The number itself is defensible; the attribution is wrong.
- “600:1 following DeepSeek V4” is unreferenced and ambiguous. DeepSeek V4 is real (preview released 24 Apr 2026; 1.6T-parameter MoE with 49B active), but no DeepSeek reference appears in the bibliography, no public document we found states a 600 tokens-per-parameter training ratio, and for an MoE the ratio differs by ~33× depending on whether total or active parameters are counted. One of the paper's two scaling-law anchors therefore rests on an unverifiable figure.
- Minor: Epoch AI's “1e29 FLOP training runs by 2030” citation was verified accurate against epoch.ai/publications/what-will-ai-look-like-in-2030 (David Owen, Sep 2025). The 240T DCLM-Pool figure and ≈300T-token internet-stock attribution (Villalobos et al.) were also verified accurate. The Allen-Zhu & Li citation is fairly made but glosses over a tension: that work shows junk data sharply reduces factual knowledge capacity — a quantity this paper never measures.
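The total-versus-active ambiguity in the 600:1 figure is simple arithmetic, sketched below with the parameter counts quoted above. The implied token counts are illustrative consequences of each reading, not numbers reported by DeepSeek or by the paper.

```python
TOTAL_PARAMS = 1.6e12    # total MoE parameters, as quoted above
ACTIVE_PARAMS = 49e9     # active parameters per token, as quoted above
TOKENS_PER_PARAM = 600   # the paper's unreferenced anchor

# Factor by which the implied training-token count changes with the reading.
ambiguity = TOTAL_PARAMS / ACTIVE_PARAMS

tokens_if_active = TOKENS_PER_PARAM * ACTIVE_PARAMS
tokens_if_total = TOKENS_PER_PARAM * TOTAL_PARAMS

print(f"total / active = {ambiguity:.1f}x")
print(f"implied tokens (active params): {tokens_if_active:.2e}")
print(f"implied tokens (total params):  {tokens_if_total:.2e}")
```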
7. Security Vulnerabilities (significant blind spot)
7.1 “Train on everything” ignores practical data poisoning (the big one)
The paper's recommendation — that at scale one should train directly on raw Common Crawl — addresses only random junk and never engages with adversarial data, despite two independently verified results that directly undercut the safety of that advice:
- Souly et al., arXiv:2510.07192 (Anthropic / UK AISI / Turing Institute, Oct 2025 — verified): backdoor poisoning needs a near-constant number of documents; ~250 poisoned documents compromised every model tested (600M→13B params) regardless of how much clean data surrounded them. Scale does not dilute poison — the opposite of the paper's “more compute fixes bad data” intuition.
- Carlini et al., arXiv:2302.10149 (verified): poisoning crawled web datasets is practical today — e.g., controlling 0.01% of LAION-400M-scale crawled data cost about $60 via expired domains (split-view poisoning) and snapshot frontrunning.
Filtering pipelines are one of the few existing (if imperfect) defenses against exactly this. Notably, the very Grok 4 model card the paper cites states that xAI applies processes to “ensure data quality and safety prior to training.” A paper advising “no filter” should, at minimum, scope its claim to non-adversarial settings; it does not, and its only factuality audit (GPT5-mini on 4 MMLU subjects) measures naturally occurring misinformation, not attacks. Raw CC also contains PII, CSAM-adjacent material, and toxic content that filters are legally and ethically required to remove — entirely unaddressed.
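The "scale does not dilute poison" point is easiest to see as a ratio. The sketch below holds the poisoned-document count fixed at the ~250 reported by Souly et al. and varies corpus size; the corpus sizes are hypothetical round numbers chosen only to show the trend. The last line restates the Carlini et al. figure of 0.01% of a 400M-sample crawl as an absolute count.

```python
POISONED_DOCS = 250  # near-constant count reported by Souly et al.

# HYPOTHETICAL corpus sizes in documents, chosen only to show the trend.
corpus_sizes = [10**7, 10**9, 10**11]

fractions = [POISONED_DOCS / total_docs for total_docs in corpus_sizes]
for total_docs, fraction in zip(corpus_sizes, fractions):
    print(f"{total_docs:.0e} documents -> poisoned fraction {fraction:.1e}")

# 0.01% of a 400M-sample crawl, in integer arithmetic.
carlini_samples = 400_000_000 // 10_000
print(f"0.01% of 400M samples = {carlini_samples:,}")
```

If the required count really is near-constant, the poisoned fraction an attacker must reach shrinks as the corpus grows, which is why a fraction-based intuition about "junk" does not transfer to targeted attacks.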
7.2 Concrete issues in the released artifacts
- trust_remote_code: true is set for the boolq and social_iqa eval tasks in every released training config. This causes the HuggingFace datasets library to download and execute arbitrary Python from the dataset repos at eval time, a known supply-chain code-execution vector for anyone replicating the work.
- Mutable data distribution: replication data is served from a personal Google Drive folder with no checksums or signatures; a compromised or silently edited folder would be undetectable. Ironically, this is the same “split-view” trust gap Carlini et al. exploit.
- Unpinned external dependencies: training (Lingua) and filtering (DCLM) repos are referenced at floating main, so replicators execute whatever code those repos contain at clone time.
- No credentials, secrets, or malicious code were found in the repository itself.
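Replicators who want to audit configs before running them could scan for the flag with a few lines of standard-library Python. This is a minimal sketch: the SAMPLE text is a hypothetical config fragment written for illustration (the real files' layout may differ), and the line-based regex is a deliberate simplification rather than a full YAML parse.

```python
import re

# Matches an enabled flag on its own line, with an optional trailing comment.
TRUST_PATTERN = re.compile(
    r"^\s*trust_remote_code\s*:\s*(true|yes|on)\s*(#.*)?$", re.IGNORECASE
)


def find_trust_remote_code(config_text: str) -> list[int]:
    """Return 1-based line numbers where trust_remote_code is enabled."""
    return [
        number
        for number, line in enumerate(config_text.splitlines(), start=1)
        if TRUST_PATTERN.match(line)
    ]


# HYPOTHETICAL config fragment; not copied from the released repo.
SAMPLE = """\
eval:
  tasks:
    - name: boolq
      trust_remote_code: true
    - name: piqa
    - name: social_iqa
      trust_remote_code: true
"""

flagged_lines = find_trust_remote_code(SAMPLE)
print(flagged_lines)
```

Flagged lines can then be reviewed by hand, and the corresponding dataset repos pinned to a specific revision before any evaluation is run.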
8. Fact-Check Ledger
| Claim in paper | Verdict | Evidence |
|---|---|---|
| DCLM-Pool is 240T tokens of pre-2023 CC | TRUE | DCLM abstract, arXiv:2406.11794 (“standardized corpus of 240T tokens extracted from Common Crawl”) |
| DCLM-Baseline ≈ 3.8T tokens, “~1%” of CC | ≈TRUE | 3.8T/240T = 1.6%; consistent with DCLM dataset release |
| Muennighoff et al.: ~4 epochs before diminishing returns | TRUE | arXiv:2305.16264 abstract (verified); same source warns repeated-data value → 0, which undercuts the paper's 121-epoch extrapolations |
| Internet text stock 200–500T tokens (Villalobos et al.) | CONSISTENT | arXiv:2211.04325 (Epoch AI; median ≈300T tokens) |
| Epoch AI forecasts 1e29-FLOP runs by 2030 | TRUE | epoch.ai “What will AI look like in 2030?” (D. Owen, Sep 2025): clusters “could support training runs of about 10^29 FLOP” |
| Frontier compute ≈ 5e26 FLOPs “[xAI 2025]” | NUMBER PLAUSIBLE, SOURCE WRONG | Cited Grok-4 model card PDF (downloaded) contains no FLOPs figure; 5e26 is Epoch AI's third-party estimate |
| 600 tokens/param “following DeepSeek V4” | UNVERIFIABLE | DeepSeek V4 exists (24 Apr 2026, MoE 1.6T/49B-active) but no source for 600:1; no bibliography entry; active-vs-total ambiguity |
| Config files released on GitHub | TRUE | github.com/chrismohrii/bitter-lesson-data-filtering (cloned & inspected; matches Table 2) |
| Tokenizer NLL of uniform-random ≈ 10.8 | TRUE | repo ships tiktoken r50k_base; ln(50257) = 10.825 |
| Batch size 2^19 tokens; wd “tuned in [0.1,0.5]” | TRUE (caveat) | Sweep script: 64×1024×8 = 524,288 ✓; weight decay is a 2-point grid {0.1, 0.5}, not a continuous tune |
| “NLL is known to correlate with downstream performance” | CONTESTED | Saada et al. arXiv:2510.00866 (verified): quality filtering improves benchmarks without improving NLL — direct counterexample for this paper's setting |
| Benchmarks “trends are roughly the same” as loss | OVERSTATED | Appendix B values are near chance; 9 of ~12 measured benchmarks (incl. commented-out MMLU) unreported |
| Junk-robustness implies raw CC is safe to train on | UNSUPPORTED | arXiv:2510.07192 (250-doc backdoors, any scale); arXiv:2302.10149 ($60 CC-style poisoning) — adversarial case untested |
| Authors are Stanford researchers | TRUE | Google Scholar (verified stanford.edu email), Duchi group page, Stanford NLP people page |
9. Overall Assessment
This is a real, competent, and honestly-limited research preprint — not fraud, not AI slop, and not deliberate cheating. Its core empirical observation (within these pools and this metric, big patient models close the gap with filtered data, and the paper's own Pareto frontier still shows filters winning at practical budgets) appears genuine and is backed by released, internally consistent code.
However, the memorable headline, “the best data filter is no data filter,” outruns the evidence in four specific, verifiable ways: (1) it holds only under a compute-unbounded objective the paper's own figures show is irrelevant below ~10^30 FLOPs; (2) it is demonstrated on a perplexity metric that structurally favors unfiltered web data, while most of the measured benchmark suite (including MMLU, where quality-filtered data is known to excel) went unreported; (3) the 10^30-FLOPs forecast extrapolates 4+ orders of magnitude through an epoching regime its own cited prior work labels unreliable; and (4) “no filtering” is unsafe advice in an adversarial web where ~250 planted documents can backdoor a model of any size. Readers should treat the paper as evidence that aggressive quality filtering has compute-dependent costs on web-text loss, not as evidence that data curation is obsolete.
Audit provenance: Paper read in full (main text + Appendices A–C) from the arXiv v1 PDF. Code audit: full clone of chrismohrii/bitter-lesson-data-filtering @ 21b295b. External sources visited and verified directly (12): arXiv:2605.19407 abstract page · GitHub repo · arXiv:2406.11794 (DCLM) · arXiv:2305.16264 (Muennighoff) · arXiv:2211.04325 (Villalobos) · arXiv:2510.00866 (Saada) · arXiv:2510.07192 (Souly/Anthropic-AISI) · arXiv:2302.10149 (Carlini) · arXiv:2404.05405 (Allen-Zhu & Li) · epoch.ai AI-2030 report · DeepSeek V4 release coverage & api-docs.deepseek.com · author-identity searches (Google Scholar / Stanford pages); plus the downloaded xAI Grok-4 model-card PDF. Every external factual statement in this report was checked against these primary sources on 12 June 2026. Statements about figure values reflect the v1 PDF text layer; minor PDF-extraction imprecision is possible for graphical elements.