Large Language Models Do Not Always Need Readable Language
Jiayi Zhu1 Haoxuan Peng2 Junxi Wang3 Liang Ke4 Chen Zhang5 Linfeng Zhang1†
1Shanghai Jiao Tong University; 2The University of Sydney; 3Hefei University of Technology
4Xi’an Jiaotong University; 5Nanjing University
zhanglinfeng@sjtu.edu.cn
Abstract
Large language models (LLMs) are commonly
prompted and interfaced with human-readable
natural language, even when the intended
reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms
that sacrifice human readability while remaining recoverable by LLMs. We refer to this
class of model-centric textual representations
as BabelTele, approached here not as a fixed
protocol but as an empirical probe into LLMs’
capacity to generate and interpret such representations. Through readability diagnostics,
model likelihood measures, human questionnaires, and downstream task evaluations, we
find that BabelTele can substantially depart
from ordinary natural language while preserving core semantics for instruction-tuned LLMs.
As a task-agnostic representational paradigm,
BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even
when the text volume is condensed to 27.9%
of its original length. We further evaluate its
semantic robustness in cross-model transfer,
agent memory, and multi-agent communication. Results suggest that BabelTele can reduce
context overhead while generally maintaining
reliable downstream performance, although its
effectiveness depends on the compressor-reader
pair and task setting. These findings indicate
that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path
toward model-native representations in future
exploration of LLM systems.
1 Introduction
Large language models (LLMs) have become a
dominant interface in contemporary intelligent
systems. Since GPT-3 demonstrated strong few-shot generalization through text-based prompting
(Brown et al., 2020), the field has largely followed
a unified paradigm: knowledge is represented in
natural language, instructions are issued in natural language, and model outputs are returned in
natural language. Subsequent alignment methods
and dialogue-oriented models further reinforced
this design, optimizing model behavior toward controllability and readability (Ouyang et al., 2022;
Touvron et al., 2023).
† Corresponding author.
[Figure 1 panels. Natural Lan.: "I Have Seen the Future of Europe. The Eurocrats were thinking ahead when they made Brussels the \"Capital of Europe,\" headquarters of the emerging European Union. Though practically unknown in the United States, the union is one of Europe's biggest stories, an important organization trying to establish itself as a sort of metagovernment for European states. ..." BabelTele: "Gov/deficits>US. Historic bldgs (Author's 14th-C⛪). Prices📈 except cheap🍷/🌸. Huge(1000s US). Ubiquitous🥖/🍫.\n **Para 5:* Multilingualism divides. South(Wallonia)=FR. North(Flanders)=NL. End:CopSigh ..." Each text is placed in an LLM context window for Q&A ("The answer is ..."), while a human reader of the BabelTele text asks "What's this?"]
Figure 1: As illustrated, a BabelTele representation differs substantially from verbose natural language: the
text is significantly more compact, indicating a much
higher information density. While the compressed representation is much less human-readable, it remains
well interpreted by LLMs, which can understand the
original meaning without any distortion.
However, natural language optimized for human
communication is not necessarily an efficient representation for model processing. Human language
contains substantial redundancy: complete syntax,
discourse markers, and narrative coherence all help
people follow, remember, and disambiguate information. These properties are valuable for human
readers, but they reduce semantic density. From
an information-theoretic view, communication systems aim to transmit information efficiently under
channel constraints (Shannon, 1948). More recent
work also connects language modeling with compression, suggesting that strong language models
can capture compact statistical structure in data
arXiv:2606.19857v1 [cs.CL] 18 Jun 2026
(Deletang et al., 2024). This raises a central question: if the receiver is an LLM rather than a human,
must semantic information still be encoded in fully
human-readable natural language?
This question becomes especially relevant in
long-context and agentic systems, where context
overhead is a persistent bottleneck. LLMs are
increasingly deployed to process lengthy documents, maintain memory, and exchange intermediate states in multi-agent workflows, yet models do
not always use long contexts robustly (Liu et al.,
2024), and agent systems further intensify this pressure through natural-language memory (Park et al.,
2023a), context management (Packer et al., 2023),
and inter-agent message passing (Wu et al., 2024).
The idea of sacrificing readability for efficiency,
however, is not new: telegraphic language, mathematical notation, and code all demonstrate that
fluent prose is only one possible surface form for
communication. This suggests a broader possibility
that intermediate representations in LLM systems
may be optimized not for human readability, but
for model decodability and semantic density.
Existing context-compression methods provide
important precedents. Prior work has explored
reducing prompt length by removing redundant
spans, rewriting retrieved content, or selecting
informative tokens, demonstrating that natural-language inputs contain substantial compressible
redundancy (Li et al., 2023; Jiang et al., 2023a; Xu
et al., 2024). Nevertheless, most such methods still
operate within the conventions of human-readable
natural language. A separate line of work explores
activation-based or learned internal representations,
but these approaches often require additional training, special tokens, or access to hidden states, limiting their applicability in black-box API settings
and heterogeneous model ecosystems.
Motivated by this gap, we investigate BabelTele,
a class of model-centric rather than human-centric
textual representations. BabelTele explicitly relaxes human readability as a default constraint, instead encouraging models to encode semantics into
compact, non-standard textual forms, potentially
combining abbreviations, symbols, cross-lingual
fragments, and non-standard syntactic structures.
While these forms exhibit low human readability,
advanced LLMs can still recover their core semantics for downstream tasks. We thus present BabelTele not as a competitive compression method,
but as an empirical probe into model-native textual communication, evaluating it across multiple
dimensions including compression ratio, semantic
fidelity, human readability, cross-model transferability, and downstream utility in document QA,
agent memory compression, and multi-agent communication. Our key takeaways are as follows:
• BabelTele emerges as a prompt-accessible
phenomenon: by removing human readability
as a default constraint, LLMs spontaneously
produce opaque but semantically dense textual
representations under a black-box interface.
• BabelTele exhibits robust cross-model transferability across diverse proprietary and open-weight model families in a zero-shot manner.
Representations compressed by one model
can be reliably interpreted by another without
any fine-tuning or model-specific adaptation,
suggesting that BabelTele captures a form of
semantic encoding that generalizes across heterogeneous architectures.
• BabelTele retains semantics in document QA,
agent memory, and multi-agent communication, demonstrating that current LLMs can
process highly dense text without relying on
human readability.
2 Related Works
2.1 Prompt and Context Compression
Prompt compression has been widely studied as
a way to reduce inference cost and improve the
effective use of limited context windows (Li et al.,
2023; Jiang et al., 2023a, 2024; Pan et al., 2024;
Hou et al., 2024). Hard compression methods usually keep the compressed prompt as discrete text.
Selective Context (Li et al., 2023) filters tokens or
sentences according to self-information. LLMLingua (Jiang et al., 2023a) performs coarse-to-fine
token-level prompt compression. These approaches
mainly reduce prompts by deleting or selecting lexical units from the original text.
Other work studies abstractive or natural-language prompt compression (Xu et al., 2024;
Chuang et al., 2024; Zhang et al., 2024; Jeong et al.,
2025; Guo et al., 2025). RECOMP (Xu et al., 2024)
compresses retrieved documents into summaries,
while Nano-Capsulator (Chuang et al., 2024) learns
shorter natural-language capsule prompts. Unlike
these methods, BabelTele does not aim to preserve natural-language readability. It instead asks
whether LLMs can generate and consume compact
semantic strings that are opaque to humans but still
interpretable by LLMs.
We also consider related work on learned and
latent context compression, retrieval and memory
systems, symbolic prompting, and LLM-native communication. Due to space limitations, these discussions are provided in Appendix A.
3 Methodology: Eliciting LLM-Native
Representations
3.1 Relaxing the Readability Prior
Given an input document x, conventional text compression employs a model C to produce a shorter
sequence z that preserves task-relevant semantics
for a reader model R. Most existing methods implicitly encourage z to remain close to the natural
language distribution, keeping the compressed text
reasonably readable to humans.
BabelTele formulates compression as a
readability-relaxed semantic projection. When
R is an LLM rather than a human reader, we
hypothesize that the human-readability prior can
be relaxed. Instead of optimizing for fluency or
natural-language typicality, BabelTele prioritizes
information density and model-side semantic
recoverability. While z remains discrete text, its
surface form can deviate from conventional
prose. Furthermore, as different LLMs are trained
on overlapping linguistic, symbolic and factual
structures, such representations may exhibit partial
cross-model decodability across model families.
3.2 Principles of Symbolic Collapse
Rather than treating BabelTele as a singular, manually engineered “prompt trick,” we define it as
a family of high-density representations induced
by relaxing linguistic constraints. To materialize
this phenomenon in a black-box setting, we design
instructional probes that encourage the compressor
C to produce model-readable encodings based on
the following principles:
Omnilingual Lexical Selection: Relaxing
single-language constraints and selecting high-density lexical units across languages and scripts.
Symbolic Collapse: Replacing verbose linguistic structures with compact symbols, emojis, mathematical/logical operators, and punctuation.
Recoverable Semantic Density: Preserving recoverable semantic details so that capable LLMs
can interpret the compressed output without an external codebook.
By applying these constraints through zero-shot
prompting, we encourage LLMs to externalize semantic information into a compact surface form.
This approach requires no gradient updates or tokenizer modifications, allowing
us to investigate the existence and utility of LLM-native textual representations across heterogeneous
models. To illustrate this collapse, consider the
following micro-example:
As shown above, BabelTele relaxes ordinary human
syntax and uses multilingual cues, emojis, and relational arrows to form a compact, model-readable
semantic graph. Appendix D provides a qualitative example with a source excerpt
and the complete BabelTele output.
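The compressor-reader setup of Section 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the prompt or code used in the paper: the `LLM` callable type, the wording of `COMPRESS_INSTRUCTION`, and the helper names are hypothetical, and the actual default prompt is the one given in Appendix C.1.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

# Illustrative wording only (an assumption); it paraphrases the three principles
# of Section 3.2 and is not the paper's prompt.
COMPRESS_INSTRUCTION = (
    "Rewrite the document below for another LLM, not for a human reader. "
    "Pick the densest lexical units from any language or script, replace "
    "verbose structures with symbols, emojis, and operators, and keep every "
    "detail recoverable without an external codebook.\n\nDocument:\n"
)


def compress(compressor: LLM, document: str) -> str:
    """Produce z = C(x); the compressor never sees a downstream question."""
    return compressor(COMPRESS_INSTRUCTION + document)


def answer(reader: LLM, context: str, question: str) -> str:
    """Let the reader R answer from the (possibly compressed) context."""
    return reader(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


def retention_ratio(original: str, compressed: str,
                    count_tokens: Callable[[str], int]) -> float:
    """Fraction of the original token count kept after compression."""
    return count_tokens(compressed) / count_tokens(original)
```

Because both roles are plain callables, the same sketch covers the same-model setting (one model plays both roles) and the cross-model setting (two different models), with no gradient updates or tokenizer changes.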
4 Experiments
4.1 Experimental Setup
We evaluate BabelTele across multiple long-context benchmarks under a task-agnostic protocol,
where the compressor processes source documents
without access to downstream questions. We compare against standard baselines and report token retention, QA accuracy, and reader chain-of-thought
token overhead across diverse evaluator model families. Full details are in Appendix B.
4.2 Symbolic Collapse: Separating Human
Readability from Model Decodability
In our early observations, we found that although
BabelTele is usually difficult for humans to read
directly, LLMs can still use it to answer questions,
recover details, and perform a certain degree of
reasoning. Therefore, this section mainly examines whether human readability, natural-language
distribution typicality, and model semantic recoverability can be decoupled, rather than focusing on
the compression efficiency of BabelTele itself.
We select 10 long-text samples from the QuALITY (Pang et al., 2022) dataset, each containing
3 multiple-choice QA items, and construct three
input formats for each sample: the original text, a
natural-language summary, and a BabelTele representation generated using the default compression
prompt in Appendix C.1.
Model             Orig.   Summ.            BabelTele
Llama-3-8B         9.63   11.32 (+17.5%)   176.60 (+1,733.9%)
Qwen2-7B          10.93   12.44 (+13.8%)   236.23 (+2,061.1%)
Qwen2.5-7B        10.43   11.56 (+10.8%)   209.30 (+1,906.7%)
Qwen3-8B          14.94   15.22 (+1.9%)    301.22 (+1,916.2%)
Qwen3-32B         11.80   11.07 (-6.2%)    220.76 (+1,770.8%)
Kimi K2P6          6.93    7.76 (+11.9%)   127.96 (+1,746.5%)
DeepSeek V4 Pro    7.67    7.75 (+1.1%)    108.95 (+1,320.5%)
GLM 5.1            7.89    8.40 (+6.5%)    148.23 (+1,778.7%)
Table 1: PPL diagnostics across representative base
models. Parentheses indicate the relative change from
the original-text PPL within the same model. Green
denotes an increase. Red denotes a decrease.
Readability diagnostics in Appendix E.1 show
that BabelTele has much lower surface readability
than both original passages and natural-language
summaries, with a Dale-Chall score of 16.70 and
a difficult-word ratio of 80.19%. Table 1 further shows that BabelTele is highly unlikely under
multiple base language models, yielding order-of-magnitude higher PPL than both original and summary texts. Together, these results indicate that BabelTele is not merely a natural-language summary
or shorthand, but a surface representation that substantially departs from conventional English prose
and the ordinary natural-language distribution.
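For reference, the perplexity diagnostic can be sketched as follows. This is a minimal illustration assuming the standard definition of perplexity as the exponential of the mean negative token log-likelihood; how the per-token log-probabilities are obtained from each base model is not specified here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(-(1/N) * sum_i log p(token_i | preceding tokens))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change_pct(baseline: float, value: float) -> float:
    """Relative change from the original-text PPL, in percent."""
    return (value - baseline) / baseline * 100.0


# Using the rounded Llama-3-8B entries of Table 1 (9.63 vs. 176.60), the
# relative change comes out at roughly +1,734%.
llama_change = relative_change_pct(9.63, 176.60)
```

A sequence whose tokens each receive probability 0.5 has a perplexity of 2, so values in the hundreds indicate text that the base model finds highly atypical.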
[Figure 2 panels. Human: Original 56.10%, BabelTele 35.80%. Gemini-3.1-pro: Original 90.00%, BabelTele 96.70%. y-axis: accuracy (%).]
Figure 2: QA accuracy for human readers and Gemini 3.1 Pro on original and BabelTele inputs. The
y-axis starts at 25%, the random-choice baseline for
four-option QA. Brackets indicate absolute changes in
percentage points.
However, Figure 2 shows that low readability
and low natural-language likelihood do not imply
semantic unrecoverability. The human accuracy
results were collected through paid questionnaires
distributed to university students. Human readers show a QA accuracy drop on BabelTele inputs, whereas Gemini 3.1 Pro (Google DeepMind,
2026) maintains high accuracy. This suggests that
BabelTele is not meaningless gibberish; rather,
while sacrificing human readability and natural-language typicality, it still preserves semantic structures that can be decoded and used by instruction-tuned LLMs. More detailed analyses of readability,
evaluation metrics, and questionnaire settings are
provided in Appendix E.1.
4.3 Efficiency and Cognitive Overhead in
Model-Native Compression
We focus on two questions. First, under different
context retention ratios, how much downstream
QA accuracy can BabelTele preserve compared
with natural-language summaries and LLMLingua-2 (Pan et al., 2024)? Second, does compression
increase the number of response chain-of-thought tokens used
by the reader model? Together, these questions cover both input-side token savings and answer-side
cognitive overhead.
Experimental Setup. We evaluate BabelTele on
2,128 QuALITY questions and 2,586 MeetingBank (Hu et al., 2023) multiple-choice questions,
conducting 116 experimental runs across the two
datasets. We compare BabelTele representations elicited by simple prompting against standard abstractive
summaries and a carefully engineered extractive
baseline (LLMLingua-2), using accuracy-retention
curves rather than a single compression point, since
compression ratios vary across generative methods.
For BabelTele and summary baselines, we use a same-model setting, where each
model reads its own compressed output.
BabelTele is evaluated with multiple BabelTele-like prompt variants, so the experiment tests a family of model-readable high-density representations
rather than a prompt artifact; additional sweep and
prompt details are provided in Appendix E.2.
The Accuracy-Retention Frontier. Figure 3
summarizes the relation between accuracy and
context retention on QuALITY and MeetingBank. Overall, BabelTele forms a more favorable
accuracy-retention frontier across multiple benchmarks and reader models. We summarize the following three points:
(i) BabelTele maintains higher accuracy under strong compression. As compression intensifies, summary and LLMLingua-2 show sharper accuracy degradation, whereas BabelTele maintains
relatively high downstream QA accuracy. This
suggests that BabelTele can better preserve task-relevant semantics while reducing input tokens.
[Figure 3 panels: MeetingBank · Gemini, MeetingBank · Qwen, QuALITY · Gemini, QuALITY · Qwen. x-axis: token reduction (%); y-axis: relative accuracy. Series: BabelTele, Summary, LLMLingua-2, no-compression baseline. Circle area: average thought-chain tokens (legend scale: 172, 2510, 4848).]
Figure 3: Accuracy-retention comparison with response chain-of-thought token scale. Each panel corresponds
to one benchmark-reader setting. Token reduction is computed as one minus the realized context retention ratio,
relative accuracy is normalized by the corresponding no-compression baseline, and circle area denotes the average
number of response thought-chain tokens. Dashed colored curves indicate method-level fitted trends.
(ii) BabelTele’s robustness is more evident on
MeetingBank. For both Gemini 3.1 Pro and Qwen3.5-Plus readers, BabelTele preserves near-original
performance even with substantial token reduction.
On QuALITY, the advantage is more moderate, but
BabelTele still maintains stable relative accuracy
across the evaluated compression range. This indicates that the benefit of BabelTele is not limited to
a single task format, but applies to both meeting-style records and long-document QA.
(iii) Multiple prompt variants support the
family-level interpretation of BabelTele. BabelTele does not rely on a single carefully optimized prompt. Instead, BabelTele-like prompts
with different surface biases collectively trace a
strong accuracy-retention frontier. This suggests
that BabelTele is better understood as a family of
model-readable high-density compressed representations rather than a single prompt trick.
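The quantities plotted in Figure 3 follow directly from the caption: token reduction is one minus the realized retention ratio, and relative accuracy is accuracy divided by the no-compression baseline. The sketch below computes both and, as one possible way to summarize a set of runs, extracts the non-dominated points. The Pareto-style frontier is an assumption made for illustration (the figure itself shows fitted trends), and all function names are hypothetical.

```python
from typing import List, Tuple


def token_reduction_pct(original_tokens: int, compressed_tokens: int) -> float:
    """One minus the realized context retention ratio, in percent."""
    return (1.0 - compressed_tokens / original_tokens) * 100.0


def relative_accuracy(accuracy: float, baseline_accuracy: float) -> float:
    """Accuracy normalized by the no-compression baseline."""
    return accuracy / baseline_accuracy


def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep (token_reduction, relative_accuracy) runs that no other run beats
    on both axes. Points are scanned from the highest reduction downward, and
    a run is kept only if it improves on the best accuracy seen so far."""
    frontier = []
    best_acc = float("-inf")
    for reduction, acc in sorted(points, key=lambda p: (-p[0], -p[1])):
        if acc > best_acc:
            frontier.append((reduction, acc))
            best_acc = acc
    return frontier
```

For example, among hypothetical runs `(90, 0.80)`, `(70, 0.95)`, `(60, 0.90)`, and `(50, 1.00)`, the run at 60% reduction is dominated by the one at 70% and is dropped.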
Does Compression Make Models Think More?
A natural concern is whether compression leads to
longer response thought chains. In particular, one
may ask whether the reader model needs to first decode BabelTele back into natural language before
answering. Figure 4 examines the relation between
context retention and response chain-of-thought token usage across all experimental settings, from
which we summarize the following three points:
[Figure 4 plot. x-axis: context retention ratio (0% to 70%); y-axis: thought-token multiplier (0x to 5x). Series: BabelTele, Summary, LLMLingua-2, no-compression baseline.]
Figure 4: Response chain-of-thought token multiplier
versus realized context retention ratio. Each marker
corresponds to one compression run. Colors indicate
compression methods, the y-axis is normalized by the
corresponding no-compression baseline, and the horizontal dashed line marks 1× chain-of-thought token usage. Solid colored curves show method-level smoothed
trends.
(i) Stronger compression often increases
chain-of-thought tokens. As context retention
decreases, chain-of-thought tokens generally rise
across methods.
(ii) BabelTele does not introduce unique overhead. Its chain-of-thought token multiplier is often comparable to or lower than summary and
LLMLingua-2, and can even fall below the original-context baseline in mild-compression settings.
(iii) Thought-token growth reflects evidence
accessibility. A cautious interpretation is that
longer thought chains arise when compression reduces the completeness or accessibility of retained
evidence. If relevant details are removed or harder
to use, the reader model may need more reasoning
steps to reconstruct the answer or resolve uncertainty. Thus, BabelTele may introduce decoding
cost, but because it often preserves task-relevant
structure, its answer-side overhead is not necessarily higher than that of summary or LLMLingua-2.
Summary. Taken together, these results show
that the model-side recoverability observed in Section 4.2 can translate into long-context compression
gains. BabelTele exhibits a favorable accuracy-retention frontier across the QA settings, while the
response chain-of-thought token analysis reveals
an important efficiency boundary: extreme context compression triggers a space-time trade-off
where input token savings are partially offset by
CoT generation overhead. Importantly, this dynamic is intrinsic to LLM reasoning over sparse evidence rather than unique to BabelTele, suggesting
that choosing an optimal, moderate compression
ratio can yield simultaneous savings in both input
context and reasoning steps.
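The trade-off described above can be made concrete with a simple token account. The sketch below is illustrative only: the function names and the `output_weight` parameter are hypothetical, the input sizes in the usage note are invented, and only the chain-of-thought counts (754 and 1,120) are taken from Figure 8.

```python
def thought_token_multiplier(compressed_cot: float, baseline_cot: float) -> float:
    """Chain-of-thought tokens relative to the no-compression baseline."""
    return compressed_cot / baseline_cot


def net_token_saving(original_input: float, compressed_input: float,
                     baseline_cot: float, compressed_cot: float,
                     output_weight: float = 1.0) -> float:
    """Input tokens saved minus extra chain-of-thought tokens generated.

    output_weight is a hypothetical knob for settings where a generated
    token is considered more expensive than an input token.
    """
    input_saved = original_input - compressed_input
    cot_added = compressed_cot - baseline_cot
    return input_saved - output_weight * cot_added
```

With a hypothetical 10,000-token document compressed to 400 tokens, and chain-of-thought usage rising from 754 to 1,120 tokens, `net_token_saving(10000, 400, 754, 1120)` gives 9,234 tokens saved. The saving shrinks as `output_weight` grows or as the original context becomes shorter.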
4.4 A Universal Cipher? Zero-Shot
Cross-Model Comprehension
To examine whether BabelTele represents a model-private shorthand or a transferable symbolic form,
we evaluate its cross-model portability across state-of-the-art LLMs. We conduct compression-reading
experiments on 180 samples from LongBench v2-Short and 214 QA instances from QuALITY, and
analyze whether the effectiveness of BabelTele depends on specific compressor-reader pairs. For
a more detailed discussion of experimental setup,
please refer to Appendix E.3.
(i) Compression Ratios of Different Models. Figure 5 shows that BabelTele compression
strength varies substantially across compressors.
Gemini 3.1 Pro is the most aggressive, exceeding
95% compression, whereas GPT-5.4 is more conservative at roughly 75%; the other models fall
between 80% and 90%. This variation is important for interpreting the transfer matrices below,
since higher compression can make the resulting
symbolic form harder for other models to decode.
[Figure 5 panels. LongBench v2: retention ratios 4.01%, 27.26%, 12.24%, 10.00%, 19.15%. QuALITY: retention ratios 18.07%, 41.02%, 11.69%, 6.73%, 14.61%, 27.90%. Legend: Gemini, GPT-5.4, Qwen, Kimi, DeepSeek, Doubao, Claude. y-axis: retention ratio (%).]
Figure 5: Comparison of compression rates of different models. We selected 180 samples from the Short
subset of the LongBench v2 benchmark and processed
them using BabelTele.
LongBench Cross-Model Accuracy Matrix (color scale: retained accuracy, 77.6% to 109.3%)
Columns: compression model. Rows: answering model.
Answering   Baseline          Gemini            GPT               Qwen              Kimi              Doubao
Gemini      66.11 (100.00%)   66.11 (100.00%)   72.22 (109.24%)   68.33 (103.32%)   62.78 (94.92%)    63.33 (95.77%)
GPT         66.67 (100.00%)   58.33 (87.46%)    66.67 (100.00%)   62.78 (94.16%)    56.11 (84.15%)    62.22 (93.33%)
Qwen        69.44 (100.00%)   53.89 (77.61%)    65.00 (93.62%)    58.33 (84.01%)    53.89 (77.61%)    58.89 (84.78%)
DeepSeek    61.11 (100.00%)   56.11 (91.73%)    62.78 (102.73%)   56.67 (92.69%)    51.67 (84.53%)    61.11 (99.99%)
Kimi        62.78 (100.00%)   55.00 (87.59%)    66.11 (105.33%)   57.78 (92.04%)    58.33 (92.88%)    60.56 (96.50%)
Doubao      66.67 (100.00%)   56.11 (84.14%)    66.67 (100.00%)   63.89 (95.82%)    59.44 (89.09%)    65.56 (98.25%)
Figure 6: Accuracy transfer matrix on 180 samples
from the Short subset of LongBench v2. Rows
denote answering models and columns denote compression models, with the Baseline column indicating
no compression. Cell color summarizes retained performance relative to the no-compression baseline for the
same answering model, and each cell reports absolute
accuracy with retained performance in parentheses.
(ii) Cross-Model Compression Accuracy. Figures 6 and 7 show that cross-model BabelTele comprehension is not dataset-specific. Across both
LongBench v2 and QuALITY, compressed inputs
remain usable for heterogeneous evaluators, but retention is strongly shaped by the compressor-reader
pair. In particular, QuALITY shows that GPT-5.4-
and Claude-compressed inputs are broadly portable,
while Qwen- and Kimi-compressed inputs lead to
larger accuracy drops across readers. Thus, BabelTele is not a fully universal code, but its portability
is systematic: strong compressors can produce symbolic forms that many models still understand.
(iii) Cross-Model Inference Chain-of-Thought
Length. Figures 8 and 9 report response thought tokens as a proxy for reader-side decoding overhead.
QuALITY Cross-Model Accuracy Matrix (color scale: retained accuracy, 73.8% to 100%)
Columns: compression model. Rows: answering model.
Answering   Baseline          DeepSeek         Gemini           GPT              Claude           Qwen             Kimi
DeepSeek    94.39 (100.00%)   78.50 (83.17%)   83.18 (88.12%)   90.19 (95.55%)   90.19 (95.55%)   79.44 (84.16%)   69.63 (73.77%)
Gemini      97.20 (100.00%)   88.32 (90.86%)   88.32 (90.86%)   93.46 (96.15%)   92.99 (95.67%)   85.51 (87.97%)   77.10 (79.32%)
GPT         94.39 (100.00%)   82.71 (87.63%)   83.18 (88.12%)   92.52 (98.02%)   90.65 (96.04%)   75.70 (80.20%)   70.56 (74.75%)
Claude      94.86 (100.00%)   84.58 (89.16%)   84.11 (88.67%)   94.39 (99.50%)   94.39 (99.50%)   80.84 (85.22%)   76.64 (80.79%)
Qwen        94.39 (100.00%)   77.57 (82.18%)   83.64 (88.61%)   92.52 (98.02%)   86.92 (92.09%)   74.30 (78.72%)   70.09 (74.26%)
Kimi        92.99 (100.00%)   80.37 (86.43%)   79.14 (85.11%)   89.25 (95.98%)   90.19 (96.99%)   71.96 (77.38%)   68.69 (73.87%)
Figure 7: Accuracy transfer matrix on QuALITY.
The matrix follows the same answering-model by
compression-model layout and encoding as Figure 6.
LongBench Chain-of-Thought Token Matrix (color scale: relative chain-of-thought tokens, 59% to 184%)
Columns: compression model. Rows: answering model.
Answering   Baseline          Gemini            GPT               Qwen              Kimi              Doubao
Gemini      6,268 (100.00%)   6,874 (109.66%)   4,692 (74.81%)    3,727 (59.45%)    3,863 (61.61%)    6,492 (103.57%)
GPT         754 (100.00%)     1,120 (148.53%)   850 (112.77%)     1,164 (154.33%)   1,375 (182.36%)   1,006 (133.42%)
Qwen        4,625 (100.00%)   5,101 (110.28%)   5,166 (111.72%)   4,583 (99.09%)    4,893 (105.84%)   4,997 (108.05%)
DeepSeek    2,136 (100.00%)   3,854 (180.37%)   3,182 (149.01%)   3,384 (158.40%)   3,684 (172.48%)   3,348 (156.70%)
Kimi        2,702 (100.00%)   4,966 (183.77%)   3,526 (130.49%)   3,962 (146.61%)   4,721 (174.71%)   3,987 (147.56%)
Doubao      1,627 (100.00%)   1,703 (104.70%)   1,534 (94.30%)    1,396 (85.83%)    1,471 (90.39%)    1,426 (87.63%)
Figure 8: Response chain-of-thought token transfer
matrix on 180 samples from the Short subset of LongBench v2. Rows denote answering models and columns
denote compression models, with the Baseline column
indicating no compression. Cell color summarizes the
response chain-of-thought token ratio relative to the no-compression baseline for the same answering model,
and each cell reports chain-of-thought tokens with the relative ratio in parentheses.
Compressed inputs often increase this overhead,
but the effect varies across evaluator-compressor
pairs. This should be interpreted together with the
compression-ratio results in Figure 5: more aggressive compressors may require reader models to
spend more reasoning steps reconstructing or locating relevant evidence. Thus, longer thought chains
do not necessarily indicate failed cross-model comprehension, but may partly reflect the general cost
of higher compression. We therefore treat these
values as auxiliary runtime evidence rather than
direct evidence of semantic understanding.
(iv) Scale Sensitivity within the Qwen Family.
To test whether BabelTele comprehension simply
improves with model scale, we fix the compressor to Gemini 3.1 Pro and vary the Qwen-family
evaluator. As shown in Table 2, the QuALITY accuracy drop
remains within a relatively narrow range, from
10.75 to 14.95 percentage points, and does not improve monotonically with model size. For example, Qwen3.5-397B-A17B has the highest original
accuracy but lower BabelTele accuracy than Qwen3.5-27B. This suggests that BabelTele comprehension
is not explained by scale alone, but also depends
on model-specific robustness to the compressor-induced symbolic form.
QuALITY Chain-of-Thought Token Matrix (color scale: relative chain-of-thought tokens, 100% to 527%)
Columns: compression model. Rows: answering model.
Answering   Baseline          DeepSeek          Gemini            GPT               Claude            Qwen              Kimi
DeepSeek    473 (100.00%)     1,195 (252.54%)   1,613 (340.87%)   773 (163.43%)     628 (132.82%)     1,653 (349.36%)   2,493 (526.85%)
Gemini      596 (100.00%)     964 (161.66%)     1,016 (170.44%)   711 (119.29%)     819 (137.41%)     1,163 (195.05%)   1,621 (271.89%)
GPT         59 (100.00%)      101 (170.35%)     85 (143.71%)      65 (110.27%)      63 (106.79%)      124 (209.90%)     213 (360.00%)
Claude      116 (100.00%)     214 (185.43%)     173 (149.76%)     138 (119.75%)     152 (131.16%)     263 (227.74%)     298 (257.76%)
Qwen        1,702 (100.00%)   2,034 (119.51%)   2,258 (132.64%)   1,808 (106.23%)   1,774 (104.21%)   2,393 (140.59%)   2,640 (155.12%)
Kimi        1,068 (100.00%)   2,223 (208.19%)   2,392 (224.10%)   1,349 (126.36%)   1,376 (128.87%)   2,603 (243.84%)   2,931 (274.51%)
Figure 9: Response chain-of-thought token transfer
matrix on QuALITY. The matrix follows the same
answering-model by compression-model layout and visual encoding as Figure 8.
Model                 Origin   BabelTele   Drop     Retention
Qwen3-14B             78.97%   64.02%      -14.95   81.07%
Qwen3.5-27B           91.59%   80.84%      -10.75   88.26%
Qwen3.5-35B-A3B       90.65%   75.70%      -14.95   83.51%
Qwen3.5-397B-A17B     93.93%   79.91%      -14.02   85.07%
Qwen3.6-Max-Preview   90.19%   79.44%      -10.75   88.08%
Table 2: QuALITY accuracy of Qwen-family models on original
inputs and Gemini-induced BabelTele inputs. The
compressor is fixed to Gemini 3.1 Pro, while the evaluator varies across Qwen models. Drop is reported in
percentage points.
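The Drop and Retention columns of Table 2 can be recomputed from the two accuracy columns, which also makes the non-monotonic trend easy to check. The sketch below uses the rounded values printed in Table 2; the helper names are hypothetical, and the row order is the table's own.

```python
from typing import Dict, List, Tuple

# (original accuracy %, BabelTele accuracy %), copied from Table 2.
TABLE_2: Dict[str, Tuple[float, float]] = {
    "Qwen3-14B": (78.97, 64.02),
    "Qwen3.5-27B": (91.59, 80.84),
    "Qwen3.5-35B-A3B": (90.65, 75.70),
    "Qwen3.5-397B-A17B": (93.93, 79.91),
    "Qwen3.6-Max-Preview": (90.19, 79.44),
}


def drop_points(original: float, compressed: float) -> float:
    """Accuracy change in percentage points (negative means a loss)."""
    return compressed - original


def retained_pct(original: float, compressed: float) -> float:
    """Compressed accuracy as a percentage of the original accuracy."""
    return compressed / original * 100.0


def is_monotonic_increasing(values: List[float]) -> bool:
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


retentions = [round(retained_pct(o, c), 2) for o, c in TABLE_2.values()]
scale_explains_retention = is_monotonic_increasing(retentions)
```

In the table's row order the retention values rise, fall, and rise again, so `scale_explains_retention` evaluates to `False`, in line with the observation that scale alone does not account for BabelTele comprehension.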
4.5 Capabilities and Boundaries in
Downstream Tasks
4.5.1 Multi-Agent Communication
We evaluate BabelTele representation under two
multi-agent regimes: a homogeneous setting (both
agents using Gemini 3.1 Pro) to evaluate whether a
model can produce and consume its own compression, and a heterogeneous setting (Gemini 3.1 Pro
paired with GPT-5.4) to test cross-model portability
as a black-box communication protocol.
Table 3 presents the final results, from which we
summarize the following two points: (i) Significant token reduction. BabelTele substantially reduces inter-agent communication overhead in both
homogeneous and heterogeneous settings. This
demonstrates that model-native compressed messages can effectively lower context consumption
during repeated message passing, which is especially important for long-horizon multi-agent tasks.
Setting         Token Reduction   Score
Homogeneous     38.96%            96.6%
Heterogeneous   44.21%            99.7%
Table 3: Performance of BabelTele in multi-agent
communication settings. Token Reduction denotes the
proportion of tokens saved relative to uncompressed
communication. Score is reported as a percentage of the
uncompressed baseline.
Method      Token count (↓)   Acc. (↑)   % (↑)
Original    2819.5            64.81      100.00%
Summary     1365.6            61.05      94.20%
BabelTele   1382.2            62.53      96.48%
Table 4: Performance on the LoCoMo benchmark.
Token count denotes the average total number of tokens
consumed per query. The best result is highlighted in
bold black font.
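(ii) Stable scores. Despite the strong
compression, BabelTele maintains competitive final scores with only negligible performance degradation. This suggests that although the compressed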
compression, BabelTele maintains competitive final scores with only negligible performance degradation. This suggests that although the compressed
messages are less readable to humans, they still
preserve sufficient actionable information for LLM
agents to coordinate and complete the task.
4.5.2 Performance on Agent Memory
We evaluate BabelTele on the representative LoCoMo agent memory benchmark using Gemini 3.1
Pro for compression and answering, with GPT-4o-mini as the evaluator. Full experimental details are
provided in Appendix E.4.1.
Table 4 presents the final results, from which we
can summarize the following two points: (i) Robust memory retention. Compared with the original uncompressed text, BabelTele representations
incur minimal downstream accuracy loss, while
preserving more actionable details than standard
large language model summarization. (ii) Lower
compression rate. Compared with other experiments, the compression rate of BabelTele on the
LoCoMo benchmark is relatively low, around 50%.
This is likely due to the short length of each session,
which contains only about 700 tokens. Future work
could explore experiments on larger agent memory
benchmarks with longer sessions.
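The relative figures discussed above follow directly from the numbers reported in Table 4. The short sketch below recomputes them; the dictionary layout and function name are illustrative and not part of the original evaluation code.

```python
# Values copied from Table 4 (LoCoMo); names are illustrative only.
table4 = {
    "Original": {"tokens": 2819.5, "acc": 64.81},
    "Summary": {"tokens": 1365.6, "acc": 61.05},
    "BabelTele": {"tokens": 1382.2, "acc": 62.53},
}

def relative_to_original(rows, method):
    """Return (token ratio, accuracy ratio) versus the Original row, in percent."""
    base = rows["Original"]
    row = rows[method]
    token_ratio = 100.0 * row["tokens"] / base["tokens"]
    acc_ratio = 100.0 * row["acc"] / base["acc"]
    return round(token_ratio, 2), round(acc_ratio, 2)

for name in table4:
    print(name, relative_to_original(table4, name))
```

The accuracy ratios reproduce the last column of Table 4, and the token ratio for BabelTele (about 49%) corresponds to the roughly 50% compression noted in point (ii).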
4.5.3 Extending the Context Window
We further evaluate BabelTele when the original
input exceeds the model context window, using the
Code Repo QA Long subset of LongBench v2. We
compare direct truncation against BabelTele-based
chunk compression across different models, with
full details provided in Appendix E.4.2.
| Method | Qwen3.6-Max | GLM-5.1 | Kimi K2.5 |
| --- | --- | --- | --- |
| Original | 55.17 | 62.07 | 44.82 |
| BabelTele | 62.07 | 72.41 | 48.28 |
Table 5: Accuracy (%) on the Code Repo QA Long subset of the LongBench v2 benchmark. The best result is highlighted in bold black font.
Table 5 presents the results. On Qwen3.6-Max,
the original truncated input achieves an accuracy
of 55.17%, while the dense BabelTele representations allow the model to capture broader evidence,
achieving an accuracy of 62.07%. This result
indicates that when the original text exceeds the
model’s context window, direct truncation discards
a large amount of potentially important information, thereby impairing the model’s understanding
and reasoning ability. In contrast, BabelTele effectively condenses the core information from ultra-long texts, allowing the model to receive more complete content within its limited context window and
thus mitigating the limitations caused by insufficient context capacity.
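The contrast between truncation and chunk compression can be sketched as follows. This is only an illustration under stated assumptions: the actual chunking procedure is described in Appendix E.4.2, `toy_compress` is a placeholder for an LLM call with the compression prompt, and whitespace words stand in for model tokens.

```python
from typing import Callable, List

def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for the model tokenizer.
    return len(text.split())

def truncate(text: str, budget: int) -> str:
    """Direct truncation: keep only the first `budget` tokens."""
    return " ".join(text.split()[:budget])

def split_chunks(text: str, chunk_tokens: int) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + chunk_tokens])
            for i in range(0, len(words), chunk_tokens)]

def compress_in_chunks(text: str, chunk_tokens: int,
                       compress: Callable[[str], str]) -> str:
    """Compress each chunk independently, then concatenate the results."""
    return "\n".join(compress(c) for c in split_chunks(text, chunk_tokens))

def toy_compress(chunk: str) -> str:
    # Placeholder compressor: keeps only words that carry an identifier digit.
    return " ".join(w for w in chunk.split() if any(ch.isdigit() for ch in w))

# A synthetic 300-token "repository" that exceeds a 120-token budget.
document = " ".join(f"file{i} defines helper{i} in the repo" for i in range(50))
budget = 120
truncated = truncate(document, budget)
compressed = compress_in_chunks(document, chunk_tokens=60, compress=toy_compress)

print(count_tokens(truncated), "helper49" in truncated)
print(count_tokens(compressed), "helper49" in compressed)
```

In this toy setting the truncated input loses everything past the budget, while the compressed input fits the same budget and still mentions the last file.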
5 Conclusion
This paper investigates BabelTele, a high-density
textual representation optimized for model decodability rather than human readability. Our experiments demonstrate that BabelTele achieves strong
compression ratios with negligible performance
degradation, while remaining semantically recoverable by LLMs. Notably, this capability generalizes
across a diverse set of proprietary and open-weight
models in a zero-shot manner, suggesting that the
ability to interpret such representations is a general capability of LLMs rather than an artifact of
any particular model. In practical scenarios including multi-agent communication and agent memory,
BabelTele shows promising potential as a model-native intermediate representation. We therefore
view BabelTele not as a finished protocol, but as
evidence that high-density textual representations
optimized for LLM-to-LLM communication need
not prioritize human readability, and as a direction
worth further exploration.
Limitations
Our current evaluation focuses on a selected set
of benchmarks and model families; the behavior
of BabelTele across a broader range of tasks and
architectures remains to be explored. Additionally,
as an empirical study, this work primarily characterizes the phenomenon rather than explaining
its underlying mechanisms; a deeper theoretical
understanding of how LLMs form and interpret
model-native representations is left to future work.
Ethics Statement
We acknowledge that all authors are informed
about and adhere to the ACL ARR Code of Ethics
and the Code of Conduct.
Risks
Our benchmarks are sourced from publicly available datasets. We cannot guarantee that they are
free of biased, toxic, or otherwise harmful content. In addition, BabelTele transforms text into a
compact, non-standard representation, which may
alter the behavior of the original text in unexpected
ways; when applied to safety-critical domains, such
changes could compromise safety or introduce unintended risks. We used LLM-based AI tools only
for grammar and language polishing; all technical
content, experiments, and claims were written and
verified by the authors.
References
Anthropic. 2026. Introducing Claude Sonnet 4.6.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu,
Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao
Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang,
and Juanzi Li. 2024. LongBench: A bilingual, multitask benchmark for long context understanding. In
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 3119–3137, Bangkok, Thailand.
Association for Computational Linguistics.
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei
Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2025.
Longbench v2: Towards deeper understanding and
reasoning on realistic long-context multitasks. In
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), ACL 2025, Vienna, Austria, July 27 -
August 1, 2025, pages 3639–3664. Association for
Computational Linguistics.
Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall,
Michael Montero, and Adam B. Struck. 2026. Memori: A persistent memory layer for efficient, context-aware LLM agents. CoRR, abs/2603.19935.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, and 12 others. 2020. Language
models are few-shot learners. In Advances in Neural
Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual.
Bytedance Seed. 2026. Seed2.0: Towards intelligence
frontier for real-world complexity.
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and
Danqi Chen. 2023. Adapting language models to
compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2023, Singapore, December 6-
10, 2023, pages 3829–3846. Association for Computational Linguistics.
Yu-Neng Chuang, Tianwei Xing, Chia-Yuan Chang,
Zirui Liu, Xun Chen, and Xia Ben Hu. 2024. Learning to compress prompt in natural language formats.
In Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies
(Volume 1: Long Papers), NAACL 2024, Mexico City,
Mexico, June 16-21, 2024, pages 7756–7767. Association for Computational Linguistics.
DeepSeek-AI. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.
CoRR, abs/2501.12948.
DeepSeek-AI. 2026. Deepseek-v4: Towards highly
efficient million-token context intelligence.
Gregoire Deletang, Anian Ruoss, Paul-Ambroise
Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang,
Matthew Aitchison, Laurent Orseau, Marcus Hutter,
and Joel Veness. 2024. Language modeling is compression. In The Twelfth International Conference on
Learning Representations.
Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang,
and Zhendong Mao. 2025a. Deepresearch bench: A
comprehensive benchmark for deep research agents.
arXiv preprint.
Zhuoyun Du, Runze Wang, Huiyu Bai, Zouying Cao,
Xiaoyong Zhu, Bo Zheng, Wei Chen, and Haochao
Ying. 2025b. Enabling agents to communicate entirely in latent space. CoRR, abs/2511.09149.
Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing
Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain, pages 2137–2145.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon,
Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided language
models. In Proceedings of the 40th International
Conference on Machine Learning, volume 202 of
Proceedings of Machine Learning Research, pages
10764–10799. PMLR.
Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen,
and Furu Wei. 2024. In-context autoencoder for context compression in a large language model. In The
Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
2024. OpenReview.net.
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou,
Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin,
Chendi Ge, Chenghua Huang, Chengxing Xie,
Chenzheng Zhu, Congfeng Yin, Cunxiang Wang,
Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, and 168 others. 2026. Glm-5: from
vibe coding to agentic engineering. Preprint,
arXiv:2602.15763.
Google DeepMind. 2026. Gemini 3.1 pro.
Shuyu Guo, Shuo Zhang, and Zhaochun Ren. 2025. Enhancing RAG efficiency with adaptive context compression. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China,
November 4-9, 2025, pages 24061–24076. Association for Computational Linguistics.
Serhii Havrylov and Ivan Titov. 2017. Emergence of
language with multi-agent games: Learning to communicate with sequences of symbols. In 5th International Conference on Learning Representations,
ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net.
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh,
Michael W. Mahoney, Yakun Sophia Shao, Kurt
Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length LLM inference with
KV cache quantization. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024,
NeurIPS 2024, Vancouver, BC, Canada, December
10 - 15, 2024.
Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, and
Fei Richard Yu. 2024. Enhancing and accelerating
large language models via instruction-aware contextual compression. CoRR, abs/2408.15491.
Yebowen Hu, Tim Ganter, Hanieh Deilamsalehy, Franck
Dernoncourt, Hassan Foroosh, and Fei Liu. 2023.
Meetingbank: A benchmark dataset for meeting summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
(ACL), Toronto, Canada. Association for Computational Linguistics.
Mordatch Igor and Abbeel Pieter. 2018. Emergence
of grounded compositional language in multi-agent
populations. In Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium
on Educational Advances in Artificial Intelligence,
AAAI’18/IAAI’18/EAAI’18. AAAI Press.
Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, and Seungwon Hwang. 2025. ECoRAG: Evidentiality-guided
compression for long context RAG. In Findings of
the Association for Computational Linguistics: ACL
2025, pages 26607–26628, Vienna, Austria. Association for Computational Linguistics.
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing
Yang, and Lili Qiu. 2023a. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore. Association
for Computational Linguistics.
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu.
2024. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, Bangkok,
Thailand. Association for Computational Linguistics.
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin
Zhao, and Ji-Rong Wen. 2023b. Structgpt: A general framework for large language model to reason
over structured data. In Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9237–9251. Association for
Computational Linguistics.
Kimi Team. 2026a. Kimi K2.5: visual agentic intelligence. CoRR, abs/2602.02276.
Kimi Team. 2026b. Kimi k2.6: Advancing open-source
coding.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio
Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020.
Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
NeurIPS 2020, December 6-12, 2020, virtual.
Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin.
2023. Compressing context to enhance inference efficiency of large language models. In Proceedings of
the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, Singapore. Association for Computational Linguistics.
Zeju Li, Yizhou Zhou, and Qiang Xu. 2026. Latent context compilation: Distilling long context into compact
portable memory. CoRR, abs/2602.21221.
Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li,
Haobo Yang, Yaohan He, and Jinlong Li. 2025. Ilre:
Intermediate layer retrieval for context compression
in causal language models. CoRR, abs/2508.17892.
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy
Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association
for Computational Linguistics, 12:157–173.
Llama Team. 2024. The llama 3 herd of models. CoRR,
abs/2407.21783.
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov,
Mohit Bansal, Francesco Barbieri, and Yuwei Fang.
2024. Evaluating very long-term conversational
memory of LLM agents. In Proceedings of the
62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL
2024, Bangkok, Thailand, August 11-16, 2024, pages
13851–13870. Association for Computational Linguistics.
Samuele Marro, Emanuele La Malfa, Jesse Wright, Guohao Li, Nigel Shadbolt, Michael J. Wooldridge, and
Philip Torr. 2024. A scalable communication protocol for networks of large language models. CoRR,
abs/2410.11905.
Jesse Mu, Xiang Li, and Noah D. Goodman. 2023.
Learning to compress prompts with gist tokens. In
Advances in Neural Information Processing Systems
36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans,
LA, USA, December 10 - 16, 2023.
OpenAI. 2026. Introducing gpt-5.4.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
Carroll L. Wainwright, Pamela Mishkin, Chong
Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray,
John Schulman, Jacob Hilton, Fraser Kelton, Luke
Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe.
2022. Training language models to follow instructions with human feedback. In Advances in Neural
Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022,
NeurIPS 2022, New Orleans, LA, USA, November 28
- December 9, 2022.
Charles Packer, Vivian Fang, Shishir G. Patil, Kevin
Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023.
Memgpt: Towards llms as operating systems. CoRR,
abs/2310.08560.
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia,
Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle,
Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu,
and Dongmei Zhang. 2024. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt
compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981,
Bangkok, Thailand. Association for Computational
Linguistics.
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi,
Nikita Nangia, Jason Phang, Angelica Chen, Vishakh
Padmakumar, Johnny Ma, Jana Thompson, He He,
and Samuel R. Bowman. 2022. Quality: Question
answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL
2022, Seattle, WA, United States, July 10-15, 2022,
pages 5336–5358. Association for Computational
Linguistics.
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023a. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th
Annual ACM Symposium on User Interface Software
and Technology, UIST ’23, New York, NY, USA.
Association for Computing Machinery.
Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai,
Meredith Ringel Morris, Percy Liang, and Michael S.
Bernstein. 2023b. Generative agents: Interactive
simulacra of human behavior. In Proceedings of
the 36th Annual ACM Symposium on User Interface
Software and Technology, UIST 2023, San Francisco,
CA, USA, 29 October 2023- 1 November 2023, pages
2:1–2:22. ACM.
Qwen Team. 2026a. Qwen3.5: Towards native multimodal agents.
Qwen Team. 2026b. Qwen3.6-Max-Preview: Smarter,
sharper, still evolving.
Qwen Team. 2026c. Qwen3.6-Plus: Towards real world
agents.
Vignav Ramesh and Kenneth Li. 2025. Communicating activations between language model agents. In
Forty-second International Conference on Machine
Learning, ICML 2025, Vancouver, BC, Canada, July
13-19, 2025, Proceedings of Machine Learning Research. PMLR / OpenReview.net.
Tobias Schnabel and Jennifer Neville. 2024. Symbolic prompt program search: A structure-aware approach to efficient compile-time prompt optimization.
In Findings of the Association for Computational
Linguistics: EMNLP 2024, Miami, Florida, USA,
November 12-16, 2024, Findings of ACL, pages 670–
686. Association for Computational Linguistics.
C. E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal,
27(3):379–423.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu,
Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. Llama 2: Open foundation and fine-tuned
chat models. CoRR, abs/2307.09288.
Ernst van Gassen. 2026. Semantic compression of LLM
instructions via symbolic metalanguages. CoRR,
abs/2601.07354.
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu,
Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang,
Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah,
Ryen W White, Doug Burger, and Chi Wang. 2024.
Autogen: Enabling next-gen LLM applications via
multi-agent conversations. In First Conference on
Language Modeling.
Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2024. RECOMP: improving retrieval-augmented lms with context compression and selective augmentation. In The
Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,
2024. OpenReview.net.
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao
Tan, and Yongfeng Zhang. 2025. A-MEM: agentic
memory for LLM agents. CoRR, abs/2502.12110.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang,
Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao,
Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge,
Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. arXiv preprint
arXiv:2505.09388.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng,
Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan
Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian
Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, and
40 others. 2024a. Qwen2 technical report. arXiv
preprint arXiv:2407.10671.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui,
Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu,
Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang,
Jingren Zhou, Junyang Lin, Kai Dang, and 22 others.
2024b. Qwen2.5 technical report. arXiv preprint
arXiv:2412.15115.
Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng
Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu.
2023. Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. In Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, pages 15135–15153, Singapore. Association for
Computational Linguistics.
Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. 2024. Making retrieval-augmented language
models robust to irrelevant context. In The Twelfth
International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024.
OpenReview.net.
Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie
Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao
Zhou. 2025. Memagent: Reshaping long-context
LLM with multi-conv rl-based memory agent. CoRR,
abs/2507.02259.
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao,
Qiwei Ye, and Zhicheng Dou. 2025. Long context
compression with activation beacon. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.
OpenReview.net.
Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei
Zheng, and Zhiming Zheng. 2024. Adacomp: Extractive context compression with adaptive predictor for
retrieval-augmented large language models. CoRR,
abs/2409.01579.
Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li,
Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong,
Yejin Choi, Jingrui He, James Zou, Mengdi Wang,
and Ling Yang. 2025. Latent collaboration in multi-agent systems. CoRR, abs/2511.20639.
Appendix
A Related Works
A.1 Learned and Latent Context Compression
A.2 Memory, Retrieval and Long-Context LLM Systems
A.3 Symbolic Representations and LLM-Native Communication
B Experimental Setup
B.1 Task-Agnostic Compression Protocol
B.2 Baselines
B.3 Datasets
B.4 Models
B.5 Metrics
C Prompt Templates
C.1 BabelTele Compression Prompt
C.2 BabelTele-Like Prompt Family Used in Section 4.3
C.2.1 BT-P1: Adaptive Symbolic Collapse
C.2.2 BT-P2: Refined Zero-Overhead Compression
C.2.3 BT-P3: Minimal Lossless Objective
C.2.4 BT-P4: Structured Omnilingual Mapping
C.2.5 BT-P5: Canonical Omnilingual-Symbolic
C.2.6 BT-P6: Structured Mapping Control
C.2.7 BT-P7: Canonical BabelTele Objective
C.2.8 BT-P8: Fixed Symbolic Mapping Rules
C.2.9 BT-P9: Structured Semantic Mapping
C.2.10 BT-P10: LLM-Native Compressor
C.2.11 BT-P11: Compact Symbolic Mapping
C.2.12 BT-P12: Free-Emergence Attention Checklist
C.2.13 BT-P13: ASCII Anchor Skeleton
D Qualitative Document-Level Example
E Implementation Details
E.1 Symbolic Collapse: Separating Human Readability from Model Decodability
E.2 Efficiency and Cognitive Overhead in Model-Native Compression
E.3 A Universal Cipher? Zero-Shot Cross-Model Comprehension
E.4 Capabilities and Boundaries in Downstream Tasks
E.4.1 Performance on Agent Memory
E.4.2 Extending the Context Window
A Related Works
A.1 Learned and Latent Context
Compression
Another line of work compresses context into
learned tokens, memory vectors, or internal activations (Mu et al., 2023; Chevalier et al., 2023; Ge
et al., 2024; Zhang et al., 2025; Liang et al., 2025;
Hooper et al., 2024; Li et al., 2026). Gist Tokens
(Mu et al., 2023) summarize prompts into reusable
special tokens, while AutoCompressors (Chevalier
et al., 2023) compress long contexts into compact
summary vectors as soft prompts. BabelTele differs
in its interface assumption: unlike learned-token or
activation-level methods that often require training,
hidden-state access, special tokens, or architectural
changes, it produces discrete text usable through
black-box LLM APIs, while not being constrained
to natural language.
A.2 Memory, Retrieval and Long-Context
LLM Systems
Long-context LLM applications also motivate compact memory and retrieval representations (Lewis
et al., 2020; Yoran et al., 2024; Bai et al., 2024;
Park et al., 2023b; Packer et al., 2023; Xu et al.,
2025; Yu et al., 2025; Borro et al., 2026; Maharana et al., 2024). Retrieval-augmented generation prepends external documents to the prompt,
but retrieved passages are often verbose and noisy
(Lewis et al., 2020; Yoran et al., 2024; Xu et al.,
2024). Contextual compression for RAG reduces
this burden by filtering, summarizing, or restructuring retrieved evidence (Xu et al., 2024). LLM
agents introduce a related memory bottleneck. Generative Agents (Park et al., 2023b) maintain a
natural-language memory stream with reflection
and retrieval. MemGPT (Packer et al., 2023) manages
working context and external memory through an
OS-like memory hierarchy. BabelTele is complementary to these systems: rather than changing
the retrieval or memory controller, it proposes a
denser representation format for information that
will mainly be consumed by LLMs.
A.3 Symbolic Representations and
LLM-Native Communication
BabelTele is related to symbolic and machine-to-machine communication (Foerster et al., 2016;
Havrylov and Titov, 2017; Igor and Pieter, 2018;
Yin et al., 2023; Marro et al., 2024; Ramesh and
Li, 2025; Zou et al., 2025; Du et al., 2025b;
van Gassen, 2026; Gao et al., 2023; Jiang et al.,
2023b; Schnabel and Neville, 2024). Emergent
communication studies non-human-readable protocols, while Exchange-of-Thought (Yin et al.,
2023) and Agora (Marro et al., 2024) explore
reasoning-trace exchange and scalable communication among LLM agents. Symbolic prompting is
another nearby direction. MetaGlyph (van Gassen,
2026) compresses instructions with symbolic metalanguages, while structured prompting uses non-natural-language formats such as code, JSON, and
tables (Gao et al., 2023; Jiang et al., 2023b; Schnabel and Neville, 2024). BabelTele differs in that it
does not rely on a manually designed symbolic language or a fixed schema. Instead, it studies whether
LLMs can be prompted to invent compact, LLM-readable encodings for arbitrary semantic content.
B Experimental Setup
B.1 Task-Agnostic Compression Protocol
For document QA experiments, BabelTele compression is performed in advance before downstream questions are introduced. The compressor receives only the source passage or document
context, and does not observe questions, answer
options, gold answers, or evaluation prompts.
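The separation between compression and question answering described above can be sketched as a two-stage pipeline. The function and argument names below are hypothetical; `compress` and `answer` stand in for calls to the compressor and reader models.

```python
from typing import Callable, Dict, List

def run_task_agnostic_qa(
    documents: Dict[str, str],
    questions: Dict[str, List[str]],
    compress: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> Dict[str, List[str]]:
    # Stage 1: compress every document; the compressor sees only the source text.
    compressed = {doc_id: compress(text) for doc_id, text in documents.items()}
    # Stage 2: questions are introduced only now, against the fixed compressed context.
    return {
        doc_id: [answer(compressed[doc_id], q) for q in qs]
        for doc_id, qs in questions.items()
    }

# Toy stand-ins that record what the compressor was shown.
seen_by_compressor: List[str] = []

def stub_compress(text: str) -> str:
    seen_by_compressor.append(text)
    return text.upper()

def stub_answer(context: str, question: str) -> str:
    return f"{question} -> {len(context)}"

results = run_task_agnostic_qa(
    {"d1": "alpha beta"}, {"d1": ["q1?", "q2?"]}, stub_compress, stub_answer
)
print(results, seen_by_compressor)
```

Because each document is compressed once and independently of the questions, the same compressed context can be reused for every question about that document.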
B.2 Baselines
We compare BabelTele against the original uncompressed context, natural-language summaries, and
LLMLingua-2 (Pan et al., 2024) under matched settings. These conditions separate no-compression
performance, human-readable summarization, and
learned prompt compression from BabelTele-style
model-oriented compression.
B.3 Datasets
We evaluate BabelTele on both intrinsic compression diagnostics and downstream task performance.
QuALITY. QuALITY (Pang et al., 2022) is used
as a long-document multiple-choice QA benchmark. In the pilot setting, we sample 10 long passages, each paired with 3 questions, producing 30
question-answer instances. For each passage, we
construct three context variants: the original passage, a natural-language summary, and a BabelTele-compressed representation. In the larger QuALITY evaluation, we further compare BabelTele with
LLMLingua-2 and report results by source domain,
passage length, and question hardness.
LongBench v2. LongBench v2 (Bai et al., 2025)
is used to test long-context document QA under
stronger scale and cross-model transfer conditions.
We evaluate a subset of 180 samples under no
compression and multiple BabelTele compression
sources. This setting allows us to separate two
factors: the model that produces the compressed
representation and the model that reads it.
LoCoMo. LoCoMo (Maharana et al., 2024) is
used as an initial testbed for long-term conversational memory compression. Instead of compressing isolated passages, this setting compresses dialogue histories or memory contexts before answering memory-dependent questions. The goal is to
test whether BabelTele can reduce agent memory
storage while preserving enough reliable information for downstream recall.
DeepResearch. We built a multi-agent system
consisting of two agents and tested it on the DeepResearch Bench (Du et al., 2025a) to evaluate BabelTele in multi-agent communication. In this setting, intermediate messages between agents are
compressed before being passed to the next agent.
MeetingBank. MeetingBank (Hu et al., 2023) is
used to test long-context summarization and question answering on real-world spoken dialogue. We
sample a subset of city council meeting transcripts
to evaluate the models’ ability to process verbose,
multi-party conversations. In this setting, the extensive meeting records are compressed before being
passed to the target model for downstream tasks.
Because MeetingBank serves as the primary training corpus for LLMLingua-2, this evaluation allows us to directly benchmark BabelTele against
LLMLingua-2 in the baseline’s native domain.
B.4 Models
Our experiments cover both compression models and reader models. For pilot generation, we
use Gemini 3.1 Pro (Google DeepMind, 2026)
as the compressor. For cross-model evaluation, we include models from mainstream proprietary and open-weight families, including GPT-5.4 (OpenAI, 2026), Kimi K2.5 and Kimi K2.6
(Kimi Team, 2026a,b), Meta Llama 3 8B (Llama
Team, 2024), Qwen2-7B (Yang et al., 2024a),
Qwen2.5-7B (Yang et al., 2024b), Qwen3-8B,
Qwen3-14B, Qwen3-32B (Yang et al., 2025),
DeepSeek-V4-Pro (DeepSeek-AI, 2026), GLM-5.1 (GLM-5-Team et al., 2026), Qwen3.6-Plus
(Qwen Team, 2026c), DeepSeek-R1 (DeepSeek-AI,
2025), Doubao-Seed-2.0 (Bytedance Seed, 2026),
Claude Sonnet 4.6 (Anthropic, 2026), Qwen3.5-
27B, Qwen3.5-35B-A3B, Qwen3.5-Plus, Qwen3.5-
397B-A17B (Qwen Team, 2026a), and Qwen3.6-Max-Preview (Qwen Team, 2026b).
B.5 Metrics
We evaluate BabelTele along four dimensions.
Compression. We report token count and context
retention ratio as basic compression statistics. The
retention ratio is the length of the compressed context divided by the length of the original context,
so lower values indicate stronger compression.
Semantic Fidelity. We use downstream QA accuracy as the primary semantic fidelity metric to
assess answer preservation. For each compressed
context, the reader model answers the same questions as in the original-context setting. We also
report normalized accuracy, defined relative to the
no-compression condition, to show how much task
performance is retained after compression.
Human Readability. To characterize the surface
form of BabelTele from complementary perspectives, we compute readability, out-of-vocabulary
ratio, cross-entropy, and perplexity under several
language models. We conduct a human questionnaire on a subset, measuring human QA accuracy,
perceived difficulty, and completion time.
System Utility. For agent memory and multi-agent communication, we focus on operational
metrics in realistic interactive settings: context token reduction, task accuracy or success rate, and,
where available, response or reasoning-token overhead. These metrics connect intrinsic compression
to practical system-level benefits.
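Two of the surface-form diagnostics named above, out-of-vocabulary ratio and perplexity, can be sketched in a few lines. The exact definitions used in the experiments are not given in this section, so the sketch assumes the standard ones: OOV ratio as the share of tokens absent from a reference vocabulary, cross-entropy as the mean negative log-likelihood per token in nats, and perplexity as its exponential. The per-token log-probabilities are assumed to come from a scoring language model.

```python
import math
from typing import Iterable, List, Set

def oov_ratio(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens not found in a reference vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() not in vocabulary) / len(tokens)

def cross_entropy(token_logprobs: Iterable[float]) -> float:
    """Mean negative log-likelihood per token, in nats."""
    logprobs = list(token_logprobs)
    return -sum(logprobs) / len(logprobs)

def perplexity(token_logprobs: Iterable[float]) -> float:
    """Perplexity is the exponential of the per-token cross-entropy."""
    return math.exp(cross_entropy(token_logprobs))

# Toy inputs: a symbol-heavy message and uniform per-token probabilities of 0.25.
print(oov_ratio(["rev", "↑", "12%", "growth"], {"rev", "growth"}))
print(perplexity([math.log(0.25)] * 4))
```

A text that departs from ordinary natural language would be expected to show a higher OOV ratio and higher perplexity under a standard language model than its source.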
C Prompt Templates
C.1 BabelTele Compression Prompt
The following prompt is used to elicit BabelTele
representations from the compressor model. The
source passage or document context is appended
after the final line of the prompt. Unless otherwise
specified, this is the default compression prompt
used in most experiments in this paper, except for
the Section 4.3 prompt-family sweep.
your task: compress verbose human text into minimal Token sequence. Audience ≠ human, but another equally intelligent LLM.
Core Directive
Omnilingual: ignore single-language grammar; traverse all human languages (Chinese, English, German compounds, Japanese Kanji, Latin roots, etc.), pick highest info-density words.
Symbolic Collapse: optionally replace conjunctions, emotions, long sentences with Emoji, math/logical symbols (=>, ∈, ≠), punctuation.
Universality: any LLM should fully understand compressed output without a codebook.
Lossless: retain all information & details.
Compress the content bellow:
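As a minimal sketch of how this prompt might be assembled in practice, the snippet below stores the prompt text verbatim and appends the source document after its final line, as described above. The constant and function names are hypothetical, and the separator (a single newline) is an assumption; how the resulting string is sent to a model is left out.

```python
# Prompt text copied verbatim from the paper (including its original spelling).
BABELTELE_PROMPT = """your task: compress verbose human text into minimal Token sequence. Audience ≠ human, but another equally intelligent LLM.
Core Directive
Omnilingual: ignore single-language grammar; traverse all human languages (Chinese, English, German compounds, Japanese Kanji, Latin roots, etc.), pick highest info-density words.
Symbolic Collapse: optionally replace conjunctions, emotions, long sentences with Emoji, math/logical symbols (=>, ∈, ≠), punctuation.
Universality: any LLM should fully understand compressed output without a codebook.
Lossless: retain all information & details.
Compress the content bellow:"""

def build_compression_input(source: str) -> str:
    """Append the source passage after the final line of the prompt.

    Assumption: a single newline separates the prompt from the source.
    """
    return f"{BABELTELE_PROMPT}\n{source}"

example = build_compression_input("The council approved the budget on Monday.")
print(example.splitlines()[-1])
```

Only the source text is appended; no question or answer option is included, which matches the task-agnostic protocol of Appendix B.1.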
C.2 BabelTele-Like Prompt Family Used in Section 4.3
Overview. The Section 4.3 retention sweep uses
the following BabelTele-like prompt variants. They
share the same goal of preserving task-relevant
semantics while relaxing human readability, but
differ in their surface-form bias and structural constraints. The full prompt texts are listed after the
overview. Source documents are appended after
each prompt during compression. For LaTeX portability, non-ASCII symbolic examples are rendered
with equivalent ASCII names or operators.
Full Prompt Texts. The following are the complete prompt instructions for the variants summarized in Table 6.
| ID | Variant | Design Bias |
|---|---|---|
| BT-P1 | Adaptive Symbolic Collapse | Free-form compression with adaptive anchors, omnilingual choices, symbolic collapse, and explicit semantic checklists. |
| BT-P2 | Refined Zero-Overhead Compression | Extreme compression with zero-overhead key-value structure, exact-value preservation, and anti-hallucination constraints. |
| BT-P3 | Minimal Lossless Objective | A short high-level instruction that asks for shortest lossless compression without prescribing a fixed schema. |
| BT-P4 | Structured Omnilingual Mapping | Structured semantic fields combined with cross-lingual token-density selection. |
| BT-P5 | Canonical Omnilingual-Symbolic | Canonical BabelTele-style objective with omnilingual lexical selection, symbolic collapse, universality, and losslessness. |
| BT-P6 | Structured Mapping Control | Schema-like preservation of entities, quantities, math, logic, flow, conditions, comparisons, and placeholders. |
| BT-P7 | Canonical BabelTele Objective | Compact statement of the core BabelTele objective used as a canonical prompt form. |
| BT-P8 | Fixed Symbolic Mapping Rules | Predefined compact anchors for sections, entities, quantities, logic, hierarchy, conditions, and evaluations. |
| BT-P9 | Structured Semantic Mapping | Entity, quantity, math, flow, condition, evaluation, and anti-hallucination mapping rules. |
| BT-P10 | LLM-Native Compressor | Role-based LLM-native compression prompt with omnilingual, symbolic, universal, and lossless directives. |
| BT-P11 | Compact Symbolic Mapping | Compact symbolic mapping, array flattening, abbreviation definition, and hard-data fidelity. |
| BT-P12 | Free-Emergence Attention Checklist | Free surface-form emergence constrained by a checklist of semantic dimensions that must remain lossless. |
| BT-P13 | ASCII Anchor Skeleton | Predefined ASCII anchors for modules, entities, parameters, logic, comparison, and unknown values. |

Table 6: BabelTele-like prompt variants used to construct the retention sweep in Section 4.3.
C.2.1 BT-P1: Adaptive Symbolic Collapse
# Role: LLM-Native Semantic Compressor
You are participating in frontier research on an "LLM-native
high-density communication language." Your task is to
compress long text into the absolute shortest possible
token sequence.
[Highest Directive]: The recipient is an equally intelligent
large language model. Completely discard human
readability, human grammatical structure, and
conventional code/JSON format constraints.
# Level 1: Syntactic Anarchy - Pursue Extreme Compression Ratio
1. Omnilingual: Move freely across all human languages
(Chinese, English, German compounds, Japanese kanji,
Latin roots, etc.) and choose the word with the highest
single-token information density for the given context.
2. Symbolic Collapse: Heavily use mathematical symbols
(forall, exists, in, =>), emoji, and isolated punctuation
to replace prepositions, conjunctions, and explanatory
long sentences.
3. Adaptive Routing: Do not use fixed format labels such as
`Meta:`, `Ent:`, or `[ ]`. Dynamically invent the most
token-efficient special single-character
separators/anchors for the text you are processing.
# Level 2: Semantic Checklist - Pursue Extreme Accuracy
Although the format is completely free, during compression
you must strongly maintain attention in latent space to
the following core information and preserve it
losslessly:
1. Entities & Graphs: Accurately bind
people/organizations/concepts to their corresponding
attributes. Do not confuse ownership or dependency
relations.
2. Exact Quantities: Preserve all exact numbers, metrics,
mathematical formulas, and hyperparameters verbatim.
Estimation or rounding is strictly forbidden.
3. Logic & Boundaries: Clearly preserve conditional branches
(If/Then), causal chains, and exceptions.
4. Comparisons: Precisely extract multi-target comparison
matrices or experimental conclusions.
5. Anti-Hallucination: Preserve special placeholders from the
original document, such as `BIBREF`. Never invent missing
information not mentioned in the source.
# Task
Combine Level 1's freely extreme compression with Level 2's
precise information preservation. Directly output the
compressed "adaptive Babel-Telegraph" without any
preface.
C.2.2 BT-P2: Refined Zero-Overhead Compression
# Role: Extreme Data Compressor (LLM-Native Semantic Compressor)
Your task is to compress the following text into the absolute
shortest possible token sequence.
[Warning]: The recipient of this text is another top-tier
large language model. Completely abandon human
readability. Never preserve any unnecessary format, word,
or punctuation for the sake of human reading habits.
# Core Strategies
1. Babel Traversal (Omnilingual Density): Break
single-language boundaries. Move freely across English,
Chinese, Japanese kanji, German compounds, and Latin
roots, and force the use of the highest
information-density vocabulary for each meaning, meaning
the wording that consumes the fewest tokens.
2. Symbolic Collapse: Strictly forbid long English labels
such as `Meta`, `Entity`, `Except`, and `Condition`. Use
mathematical/logical symbols (forall, exists, in, not-in,
intersection, ->, <->, therefore, because), punctuation
abbreviations, or emoji to map complex prepositions,
logical flow, and causal relations.
3. Zero-Overhead Structure:
- Extract entities, attributes, and key-value pairs
(`K=V`). Do not wrap them in token-costly JSON/array
brackets. Directly connect them compactly with the
shortest separators, such as `|`, `^`, or `~`.
- Preserve all absolute exact values (formulas, numbers,
hyperparameters, matrix relations), but remove all
redundant explanatory wording.
4. Lossless Logic: Precisely preserve all macro architecture
(`Macro/Meta`), conditional boundaries (`If/Except`),
comparative evaluations (`Ref/Matrix`), and placeholders
such as `BIBREF`, but express them in the shortest
cryptographic-grade form. Hallucinating or inventing
missing data is strictly forbidden. Use `NULL` or `?` for
unknowns.
# Output Format
Do not output any preface, explanation, or extra line breaks.
Directly output the compressed "Babel-Telegraph."
C.2.3 BT-P3: Minimal Lossless Objective
Compress the following content to the shortest possible
extreme.
Do not lose any information.
You do not need to care about human readability at all; only
complete information preservation matters.
You may use symbols from languages across the world to
express the content in the simplest possible form. You
may freely mix any languages in the world.
Only output the compressed text.
C.2.4 BT-P4: Structured Omnilingual Mapping
# Compress the following content into the absolute shortest
possible token sequence. Do not lose any information. You
may refer to the following methods.
1. Macro & Meta: Map text to `Sec:[Name->Content]`. Extract
`Meta:[K=V]` and define acronyms via
`Def:[Term=FullName]` on first use.
2. Entities & Attributes: Bind via `Ent(Attr=Val)`. Flatten
parallel items into arrays `[A, B]`. Retain qualitative
examples via `Ex:[a, b, c]`.
3. Quantities & Configs: Isolate exact
metrics/hyperparameters via
`Quant/Config:[Target->K=Val(Unit)]` without rounding or
estimation.
4. Math & Logic: Retain all formulas and variables exactly
via `Math:[Eq]`. Use (`>,<,=,->,!=`) for relative or
causal relations.
5. Flow & Architecture: Map structural pipelines via
`Seq:[A>B>C]` and define nested structures via
`Arch:[Main->Sub1, Sub2]`.
6. Conditions & Exceptions: Isolate logic via
`if[Cond]->[Act]` and define boundaries/exemptions via
`Except:[Target->Detail]`.
7. Evaluations & Comparisons: Extract results to
`Eval:[Target->Result]`. Use `Matrix:[Ent(X) vs Ent(Y)]`
for multi-condition data and `Ref:[A vs B]` for
contrasting systems.
8. Anti-Hallucination: Strictly preserve all original
placeholders (e.g., `BIBREF`, `TABREF`). NEVER
interpolate missing data; use `@Uncertain` for ambiguous
estimates.
9. Break language boundaries (Omnilingual): Completely
abandon the grammar of any single language. For extreme
token savings, move freely across all human languages
(Chinese, English, German compounds, Japanese kanji,
Latin roots, etc.) and choose the vocabulary with the
highest information density in the given context.
Directly output the compressed content.
C.2.5 BT-P5: Canonical Omnilingual-Symbolic
# Role: Silicon-Based Data Compressor
You are participating in frontier research on an "LLM-native
high-density communication language."
Your task is to compress a verbose piece of human text into
the absolute shortest possible token sequence. The target
audience is not humans, but another large language model
as intelligent as you.
# Core Directive
1. Omnilingual: Completely abandon the grammar of any single
language. For extreme token savings, move freely across
all human languages (Chinese, English, German compounds,
Japanese kanji, Latin roots, etc.) and choose the words
with the highest information density in the given
context.
2. Symbolic Collapse: When necessary, use emoji,
mathematical/logical symbols (`=>`, `in`, `!=`), and
punctuation to replace conjunctions, emotional
descriptions, and long sentences.
3. Universality: As much as possible, make the compressed
content fully understandable to every large language
model, even without a codebook.
4. Losslessness: Do not lose any information or details.
5. Directly output the compressed text and nothing else.
# Task
Compress the following `[Source Text]` as much as possible
into a "Babel-Telegraph."
C.2.6 BT-P6: Structured Mapping Control
# Compress the following content into the absolute shortest
possible token sequence. Do not lose any information. You
may refer to the following methods.
1. Macro & Meta: Map text to `Sec:[Name->Content]`. Extract
`Meta:[K=V]` and define acronyms via
`Def:[Term=FullName]` on first use.
2. Entities & Attributes: Bind via `Ent(Attr=Val)`. Flatten
parallel items into arrays `[A, B]`. Retain qualitative
examples via `Ex:[a, b, c]`.
3. Quantities & Configs: Isolate exact
metrics/hyperparameters via
`Quant/Config:[Target->K=Val(Unit)]` without rounding or
estimation.
4. Math & Logic: Retain all formulas and variables exactly
via `Math:[Eq]`. Use (`>,<,=,->,!=`) for relative or
causal relations.
5. Flow & Architecture: Map structural pipelines via
`Seq:[A>B>C]` and define nested structures via
`Arch:[Main->Sub1, Sub2]`.
6. Conditions & Exceptions: Isolate logic via
`if[Cond]->[Act]` and define boundaries/exemptions via
`Except:[Target->Detail]`.
7. Evaluations & Comparisons: Extract results to
`Eval:[Target->Result]`. Use `Matrix:[Ent(X) vs Ent(Y)]`
for multi-condition data and `Ref:[A vs B]` for
contrasting systems.
8. Anti-Hallucination: Strictly preserve all original
placeholders (e.g., `BIBREF`, `TABREF`). NEVER
interpolate missing data; use `@Uncertain` for ambiguous
estimates.
Directly output the compressed content.
C.2.7 BT-P7: Canonical BabelTele Objective
Your task: compress verbose human text into a minimal token
sequence. The audience is not human, but another equally
intelligent LLM.
Core Directive
Omnilingual: Ignore single-language grammar; traverse all
human languages (Chinese, English, German compounds,
Japanese kanji, Latin roots, etc.) and pick the highest
information-density words.
Symbolic Collapse: Optionally replace conjunctions, emotions,
and long sentences with emoji, mathematical/logical
symbols (`=>`, `in`, `!=`), and punctuation.
Universality: Any LLM should fully understand the compressed
output without a codebook.
Lossless: Retain all information and details.
Compress the content below:
C.2.8 BT-P8: Fixed Symbolic Mapping Rules
# Role: LLM-Native Babel Compressor
Your task is to compress the following text into a
"Babel-Telegraph" with extremely high information
density.
The audience is an equally intelligent large language model.
Completely abandon human readability. Move freely across
all human languages (Chinese, English, German compounds,
Japanese kanji, etc.) and choose the shortest vocabulary
for each meaning.
# Structural Mapping Rules
You must use the following single-character high-density
labels. Long English labels such as `Meta`, `Entity`,
and `Except` are strictly forbidden.
1. Macro/Section: Use `S[topic/abbrev]` to define macro
modules. On first occurrence, `@[abbrev=full name]` may
be used to define abbreviations.
2. Entities & Attributes: Use `*(entity):K=V`. Flatten
parallel items as `[A,B,C]`.
3. Quantities & Config: Directly extract exact
values/parameters using `Config[target]:K=V(unit)`.
Never estimate.
4. Math & Logic: Use native mathematical/logical symbols. For
relative relations, use `>,<,==,!=,=>,<=>`.
5. Flow & Nesting: Use `A>B>C` for pipelines. Use
`forallparent:{child1,child2}` for nesting/hierarchy.
6. Conditions & Exceptions: Use `?condition=>action` for
conditional actions. Use `!object:detail` for
exceptions/boundaries.
7. Evaluation & Comparison: Use `Eval[A/B]:conclusion` for
comparison matrices, or the two-dimensional shorthand `A
vs B:result`.
8. Anti-Hallucination: Preserve original placeholders such as
`BIBREF` verbatim. Strictly use `NULL` or `?` for
missing data.
Directly output text that follows the above mapping rules and
incorporates multilingual extreme compression. Do not
output any explanation.
C.2.9 BT-P9: Structured Semantic Mapping
# Compress the following content into the absolute shortest
possible token sequence. Do not lose any information. You
may refer to the following methods.
1. Macro & Meta: Map text to `Sec:[Name->Content]`. Extract
`Meta:[K=V]` and define acronyms via
`Def:[Term=FullName]` on first use.
2. Entities & Attributes: Bind via `Ent(Attr=Val)`. Flatten
parallel items into arrays `[A, B]`. Retain qualitative
examples via `Ex:[a, b, c]`.
3. Quantities & Configs: Isolate exact
metrics/hyperparameters via
`Quant/Config:[Target->K=Val(Unit)]` without rounding or
estimation.
4. Math & Logic: Retain all formulas and variables exactly
via `Math:[Eq]`. Use (`>,<,=,->,!=`) for relative or
causal relations.
5. Flow & Architecture: Map structural pipelines via
`Seq:[A>B>C]` and define nested structures via
`Arch:[Main->Sub1, Sub2]`.
6. Conditions & Exceptions: Isolate logic via
`if[Cond]->[Act]` and define boundaries/exemptions via
`Except:[Target->Detail]`.
7. Evaluations & Comparisons: Extract results to
`Eval:[Target->Result]`. Use `Matrix:[Ent(X) vs Ent(Y)]`
for multi-condition data and `Ref:[A vs B]` for
contrasting systems.
8. Anti-Hallucination: Strictly preserve all original
placeholders (e.g., `BIBREF`, `TABREF`). NEVER
interpolate missing data; use `@Uncertain` for ambiguous
estimates.
Directly output the compressed content.
C.2.10 BT-P10: LLM-Native Compressor
# Role: Silicon-Based Data Compressor
You are participating in frontier research on an "LLM-native
high-density communication language."
Your task is to compress a verbose piece of human text into
the absolute shortest possible token sequence. The target
audience is not humans, but another large language model
as intelligent as you.
# Core Directive
1. Omnilingual: Completely abandon the grammar of any single
language. For extreme token savings, move freely across
all human languages (Chinese, English, German compounds,
Japanese kanji, Latin roots, etc.) and choose the words
with the highest information density in the given
context.
2. Symbolic Collapse: When necessary, use emoji,
mathematical/logical symbols (`=>`, `in`, `!=`), and
punctuation to replace conjunctions, emotional
descriptions, and long sentences.
3. Universality: As much as possible, make the compressed
content fully understandable to every large language
model, even without a codebook.
4. Losslessness: Do not lose any information or details.
5. Directly output the compressed text and nothing else.
# Task
Compress the following `[Source Text]` as much as possible
into a "Babel-Telegraph."
C.2.11 BT-P11: Compact Symbolic Mapping
# Compress the following content into the absolute shortest
possible token sequence. Do not lose any information. You
may refer to the following methods.
> 1. Symbolic Mapping: Completely abandon natural-language
conjunctions. Use `A->B` for causality/process, `A>B`
for containment/comparison, and `Ent(K=V)` for
attributes/configurations/results.
> 2. Extreme Flattening: Fully use arrays to merge similar
items: `Attr:[A, B, C]`. When a long term appears for the
first time, immediately define an abbreviation `(Def:X)`.
> 3. Hard-Data Fidelity: Absolutely preserve all exact
values, formulas, and original placeholders such as
`BIBREF`. Mark fuzzy information with `?`; divergent
invention is strictly forbidden.
Directly output the compressed content.
C.2.12 BT-P12: Free-Emergence Attention Checklist
# Role: LLM-Native Semantic Compressor
You are participating in frontier research on an "LLM-native
high-density communication language." Your task is to
compress verbose human text into the absolute shortest
possible token sequence. The target audience is another
large language model as intelligent as you.
# Core Mechanisms
1. Free Emergence (Omnilingual & Symbolic): Completely
abandon human readability and single-language grammar.
For extreme token savings, move freely across all human
languages (Chinese, English, German compounds, Japanese
kanji, etc.), emoji, and mathematical/logical symbols
(`=>`, `in`, `!=`), choosing the form with the highest
information density.
# Attention Checklist (High-Dimensional Information Boundaries That Must Be Preserved Losslessly)
[Warning]: You must ensure that the following logical
dimensions remain absolutely lossless after compression
and can be precisely parsed by another large language
model. However, long English labels such as `Sec`,
`Meta`, `Entity`, and `Except` are strictly forbidden.
Use the self-created symbols or roots that you consider
shortest and most distinctive in latent space to anchor
them:
- Macro architecture and metadata (Macro & Meta)
- Entity networks and parallel attributes (Entities &
Attributes)
- Exact quantitative metrics, hyperparameters, and
mathematical formulas (Quantities & Math - rounding is
strictly forbidden)
- Logical flow, conditional judgment, and exception
boundaries (Flow, Conditions & Exceptions)
- Multi-condition comparative evaluation and matrices
(Evaluations & Comparisons)
- Original anti-hallucination placeholders, such as `BIBREF`
and `TABREF`, must be preserved letter-for-letter.
# Task
Use your native attention mechanism to complete lossless
information folding. Directly output the compressed
"Babel-Telegraph" without any extra explanation.
C.2.13 BT-P13: ASCII Anchor Skeleton
# Role: LLM-Native Semantic Compressor
Task: Compress the text into a "Babel-Telegraph" with extreme
information density. The target audience is another large
language model. Completely abandon human readability and
traverse all human languages to find the shortest
vocabulary.
# Structural Anchors
You must use the following ASCII symbols to build an
ultra-minimal skeleton and maximize activation of
code-parsing attention. Long English labels are strictly
forbidden.
1. Module/Entity: Use `#topic` to mark macro modules. Use
`@entity(K:V)` to bind attributes.
2. Parameters/Values: Rounding or discarding values is
strictly forbidden. Use `$parameter:V(unit)` for values.
3. Logic/Flow: Use `A->B->C` to express pipelines or
causality. Use `?[condition]=>[action]` to express
logical branches. Use `!object:detail` for
exceptions/limits.
4. Comparison/Evaluation: Use `A<>B:conclusion` to express
comparison matrices or results.
5. Placeholders/Unknowns: Preserve original placeholders such
as `BIBREF` verbatim. Use `NULL` for missing or
ambiguous data.
[Requirements]:
Completely break language boundaries
(Chinese/English/Japanese kanji/German compounds, etc.)
and select and concatenate the words with the absolute
fewest tokens for the given context.
Directly output the result.
D Qualitative Document-Level Example
Because BabelTele is primarily applied to long documents, displaying complete source documents is
impractical in the paper format. Figure 10 provides a representative document-level compression
example. The source panel shows only the opening excerpt of a long legal judgment, while the
BabelTele panel shows the complete compressed
representation generated from the full document.
This example is intended to illustrate the surface
form of BabelTele rather than serve as additional
quantitative evidence.
REPRESENTATIVE DOCUMENT-LEVEL EXAMPLE
Long legal judgment source excerpt and complete BabelTele compression
The source panel shows only the opening excerpt of a long legal document. The BabelTele panel shows the complete compressed output generated
from the full document.
Source excerpt from the original document
Civil Judgment of Second Instance on the Dispute over Commodity House Sales Contract between Wang Nianfang and Wang Yaowen
The Appellants Wang Nianfang, Wang Yaowen, and Xia Huazhong, in the case of a commodity house sales contract dispute with the Appellee Wan
Xiaxia and the defendant in the original trial Hubei Longquan Real Estate Development Co., Ltd., being dissatisfied with the Civil Judgment (2018) E
0984 Min Chu No. 415 rendered by the People's Court of Hanchuan City, Hubei Province, filed an appeal with this Court. After docketing and
accepting the case on August 17, 2018, this Court formed a collegial panel in accordance with the law and conducted the trial. The trial of this case has
now been concluded.
The Three Appellants including Wang Nianfang requested on appeal: 1. To dismiss Wan Xiaxia's litigation claims against the Three Appellants
including Wang Nianfang in accordance with the law; 2. To rule in accordance with the law that the litigation costs and other expenses of this case shall
be borne by Wan Xiaxia. Facts and reasons: Wan Xiaxia's claim for liquidated damages for overdue title registration has exceeded the limitation of
action [...]
Complete BabelTele output for the full document
{Doc:"2ndInstCivJudg",ID:"(2018)鄂0984民初415",Ct1:"汉川市法"}
[Ent]
π(Appellee):万小霞
Δ1(OrigΔ):湖北隆泉
Δ234(Appellants):王年方,汪尧文,夏华中
🏠:汉川隆泉商业城1号1单元801
[Appellants(Δ234) Claims]
1.⏳Time-bar(民法总则§188). K§15=>cert 180d post-deliv. π sued 2018-01-09(>3y).
2.Ct1 proc err: Privity K=π↔Δ1. Δ234≠挂靠(affil), unliable. Req:Revoke.
[Appellees Def]
π: 1.⏳Interrupted: 2013~18 continuous actions. 2.Δ234=挂靠Δ1(Ev proof).
Δ1: Δ234=挂靠, paid mgt fee, handled all, Δ1=0 income.
[Ct1 Facts]
- 2010-09-11: π&Δ1 sign K for 🏠. 💰=368636. Due:2011-06-30. K§15.3:Δ1 reg main cert, agent indiv(π pay tax).
- 2012-02-16: π get 🏠, paid 💰+cert fee 28929.
- Reality: Δ234 took 💰, paid Δ1 120k借用资质(borrow qual).
- 2013-01: Main cert done, indiv=∅.
- 2017-08-28: Δ notify invoice prep. π sued+add Δ234.
[Ct1 Judg]
1. Δ1 lend qual=>Joint liab(SPC民诉释§54).
2. LateCert: K silent=>SPC商品房§18.2(Base 368636*4.75%BankRate). Cap 3y pre-suit=>52531.
3. RndFee: 28929+loss(1.5x4.75%=7.13% base 28929, 2012-08-17 till fulfil).
=> ①Δ1 assist cert≤20d ②Δ1 pay 52531 ③Δ1 refund 28929+7.13%int ④Δ234 joint ⑤Cost:π=500, Δ=2363.
[Ct2(2018-08-17) Ev&Rul]
Ev: π G1(11 docs: 2013~18 protests/gov/suits) + G2(3 docs: Δ1 cert, stamp req, Ct1 trans). Ct2 admit=>Chain
formed.
Rul:
1. Time-bar?❌. Ev=>Claim active 2013-03+. §188 met.
2. Joint Liab?✅. Ct1 add Δ234 proc legal.
*Lex Fix*: Ct1"borrow qual=invalid"❌ => Ct2"Δ234+Δ1=Non-corp JV(联营体)". SPC联营解答§7.1=>Δ234
beneficiary=>Joint liab(Right≡Duty).
=> Ct1 fact clear, outcome right, law app minor err fixed.
[Verdict]
驳回上诉,维持原判(民诉法§170.1.1). 2nd Cost:Δ234 pay 2363. Final.
Figure 10: Representative document-level BabelTele example. The source excerpt indicates the genre and
information density of the original legal document; the BabelTele panel shows the complete compressed output
generated from the full document.
E Implementation Details
E.1 Symbolic Collapse: Separating Human Readability from Model Decodability
In early exploratory observations and small-scale
informal tests of BabelTele, the most interesting
and intuitive phenomenon was that although this
representation is often difficult for humans to read
directly, large language models (LLMs) still seem
capable of using it to answer questions, recover
details, and even perform a certain degree of reasoning. Inspired by this phenomenon, we first examine whether human readability, natural-language
distribution typicality, and the model’s ability to recover semantics can be experimentally decoupled.
This section does not focus on the efficiency of
BabelTele as a compression method, but rather on
whether it pushes text outside the realm of human-readable natural language while still preserving
semantic structures usable by LLMs.
We select 10 long-text samples from the QuALITY dataset, with each sample containing 3
multiple-choice question-answering (QA) items.
For each sample, we construct three input formats: the original text, a human-oriented natural-language summary, and the BabelTele representation. The original text represents standard natural-language input; the natural-language summary
serves as a readable paraphrased reference to control for the factor of “text being rewritten or
shortened by models”; and BabelTele represents
a model-oriented representation that deliberately
abandons human readability. Both the summary
and BabelTele representations were generated by
Gemini 3.1 Pro, and all three formats were evaluated on the same set of questions. To mitigate
the variance introduced by the stochastic nature
of LLM generation, all experiments in Section 4.2
were repeated three times, and the reported results
are the averages across these independent runs.
To test this separation, we combine surface readability diagnostics, distributional likelihood metrics, and behavioral QA evaluations from both humans and LLMs. These measurements evaluate
semantic recoverability in downstream tasks, rather
than making a direct claim that the model “understands” BabelTele in a human-like sense.
First, we evaluate the surface readability of the
three text formats using the Dale-Chall Readability
Score and the corresponding proportion of difficult
words. Dale-Chall is sensitive to uncommon lexical
forms, abbreviations, proper nouns, and code-like
tokens, which are central to the surface form of BabelTele. Since BabelTele contains multilingual elements, symbols, abbreviations, and non-standard
structures, this metric should be interpreted as a
surface-form diagnostic rather than a complete
measure of human comprehension. Nevertheless, it
provides an automated reference for whether BabelTele deviates from conventional natural-language
text. As shown in Table 7, BabelTele obtains higher
Dale-Chall scores and difficult-word ratios than
both the original and summary texts.

| Variant | n | Dale-Chall | Difficult words |
|---|---|---|---|
| Original | 10 | 10.28 | 35.97% |
| Summary | 10 | 13.51 | 56.34% |
| BabelTele | 10 | 16.70 | 80.19% |

Table 7: Dale-Chall readability diagnostics for the
Original / Summary / BabelTele triplets. Higher
scores and larger difficult-word ratios indicate lower
surface readability under this English-prose readability
metric. The best result is highlighted in bold black font.
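For reference, the standard (new) Dale-Chall formula combines the percentage of difficult words with the average sentence length. The sketch below is an illustration under stated assumptions, not the tooling used in the paper: the regex-based sentence and word splitting is a simplification, and `easy_words` is a placeholder for the Dale-Chall familiar-word list, which the caller must supply.

```python
import re


def dale_chall(text: str, easy_words: set[str]) -> tuple[float, float]:
    """Return (Dale-Chall score, difficult-word percentage) for `text`.

    `easy_words` stands in for the Dale-Chall familiar-word list, which is
    not reproduced here; the caller must supply it in lowercase.
    """
    # Simplified segmentation: split sentences on ., !, ? and words on letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    difficult = sum(1 for w in words if w.lower() not in easy_words)
    pct_difficult = 100.0 * difficult / len(words)
    avg_sentence_len = len(words) / len(sentences)
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len
    if pct_difficult > 5.0:
        score += 3.6365  # adjustment applied when difficult words exceed 5%
    return score, pct_difficult
```

Because abbreviations, symbols, and non-English tokens rarely appear in an English familiar-word list, they count as difficult words, which is why the score rises sharply for BabelTele-style text.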
Second, we input the original, summary, and BabelTele texts into language models to calculate PPL,
BPB, and BPC. Here, PPL (Perplexity) measures
how difficult the text is for a language model to
predict; BPB (Bits Per Byte) measures the average
amount of information required per byte; and BPC
(Bits Per Character) measures the average information complexity per character. These metrics do
not directly measure whether the text is understandable; instead, they reflect the prediction difficulty
of the text sequences under the language model’s
distribution. Therefore, they are used to determine
whether BabelTele is a low-likelihood surface form
that lies outside the general distribution of natural
language. As shown in Table 1, BabelTele yields
substantially higher PPL than the original and summary texts across multiple base language models.
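The three likelihood metrics can all be derived from the same per-token log-probabilities. The following is a minimal sketch that assumes natural-log token probabilities have already been obtained from a scoring model; the function name and input format are hypothetical and do not describe the paper's evaluation code.

```python
import math


def likelihood_metrics(token_logprobs: list[float], text: str) -> dict[str, float]:
    """Convert per-token natural-log probabilities into PPL, BPB, and BPC."""
    if not token_logprobs or not text:
        raise ValueError("need at least one token and non-empty text")
    total_nll = -sum(token_logprobs)  # total negative log-likelihood in nats
    total_bits = total_nll / math.log(2)  # the same quantity expressed in bits
    return {
        "ppl": math.exp(total_nll / len(token_logprobs)),  # per-token perplexity
        "bpb": total_bits / len(text.encode("utf-8")),  # bits per UTF-8 byte
        "bpc": total_bits / len(text),  # bits per Unicode character
    }
```

PPL is normalized by token count, so it depends on the tokenizer, whereas BPB and BPC are normalized by the text itself. For multilingual text the two diverge, because a CJK character typically occupies three bytes in UTF-8.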
Finally, we compare the performance of human readers and LLMs on the QA tasks. For
the LLM evaluation, we feed the different input
formats, along with their corresponding multiple-choice questions, into Gemini 3.1 Pro and prompt
the model to return only the option indices. For the
human evaluation, we construct questionnaires asking subjects to read either the original or BabelTele
text and answer the corresponding questions, while
recording their subjective difficulty ratings and QA
performance. Since the human questionnaires primarily compare the original and BabelTele conditions, Figure 2 focuses on the same two conditions
for both humans and Gemini 3.1 Pro.
First, the results indicate that BabelTele is no
longer natural language in the conventional sense
for humans. Dale-Chall diagnostics in Table 7 show
that BabelTele scores higher than both the original
and summary texts in terms of readability score
and difficult-word ratio. This indicates that it contains a large number of abbreviations, symbols,
proper nouns, and non-standard expressions that
English readability models struggle to handle. Human questionnaires also reveal that subjects in the
BabelTele condition exhibit lower QA accuracy
and report higher subjective difficulty, demonstrating that this low readability is reflected not only in
automated metrics but also in practical semantic
recovery tasks, as shown in Figure 2.
For the models, BabelTele likewise resembles an
out-of-distribution representation for natural language. Taking Llama-3-8B as an example, Table 1
shows that the average PPL of the original and
summary texts is approximately 9.63 and 11.32, respectively, whereas the PPL for BabelTele surges to
around 176.60. Similar trends are observed across
multiple Qwen base models. This demonstrates
that BabelTele is not a simple summary or shorthand of ordinary natural language; rather, its surface form significantly deviates from the natural-language distribution. It should be emphasized
that high PPL measures the low likelihood of the
sequence, not semantic unrecoverability.
However, neither of these two deviations causes the model's
semantic recovery to fail. On the same QuALITY
QA task, Figure 2 shows that Gemini 3.1 Pro maintains high accuracy when using BabelTele, without
the marked performance collapse observed in the human questionnaires. BabelTele therefore differs from meaningless gibberish: although it reduces human readability and
is highly atypical under base model distributions,
it still retains sufficient entity, relational, event,
and detailed information for use by instruction-tuned LLMs. This indicates that human readability,
natural-language distribution typicality, and model
semantic recoverability can be decoupled.
Given that BabelTele is not meaningless gibberish, but a representation characterized by low human readability and low natural-language likelihood while remaining decodable by models, can
this semantic recoverability be translated into compression gains in real-world long-context tasks?
E.2 Efficiency and Cognitive Overhead in
Model-Native Compression
Dataset and Evaluation Format. QuALITY
tests fine-grained evidence preservation in long-document reading comprehension. MeetingBank
is converted into a multiple-choice QA format so that both benchmarks can be evaluated using a unified accuracy metric.
Retention Sweep Construction. Since generative compression methods cannot precisely control their realized compression ratios, we do not
compare methods at a single nominal compression point. Instead, we evaluate fairer accuracy-retention curves. LLMLingua-2 forms a sweep
by specifying different target compression ratios,
while the summary baseline forms an empirical
sweep by prompting the model to approximately
compress to different target ratios.
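The sweep construction can be sketched as follows. This is a minimal illustration under stated assumptions: each sweep setting is a list of per-sample records, realized retention is taken as total compressed tokens over total original tokens (a micro-average, which the text does not specify), and all function and field names are hypothetical.

```python
def realized_retention(samples):
    """Compressed tokens divided by original tokens, pooled over samples."""
    original = sum(s["original_tokens"] for s in samples)
    compressed = sum(s["compressed_tokens"] for s in samples)
    if original <= 0:
        raise ValueError("original token count must be positive")
    return compressed / original


def accuracy(samples):
    """Fraction of samples answered correctly."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s["correct"]) / len(samples)


def accuracy_retention_curve(settings):
    """Map each sweep setting to a (retention, accuracy) point.

    `settings` is a list of sample lists, one per target ratio or prompt
    variant. Points are sorted by realized retention, not by the nominal
    target, since generative methods only hit their targets approximately.
    """
    points = [(realized_retention(s), accuracy(s)) for s in settings]
    return sorted(points)
```

Plotting the returned points for each method on shared axes gives the accuracy-retention comparison described above, without requiring any two methods to land on the same compression ratio.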
BabelTele Prompt Variants. For BabelTele, we
use multiple BabelTele-like prompts that all instruct the model to abandon ordinary human readability while preserving task-relevant semantics.
These variants introduce different surface biases, including multilingual mixing, logical symbols, structured tags, and entity-relation compression. They
naturally produce different realized retention ratios,
forming a BabelTele retention sweep and allowing us to test whether the observed effect reflects
a broader family of model-readable high-density
representations rather than a single prompt artifact.
Full prompt texts are provided in Appendix C.2.
E.3 A Universal Cipher? Zero-Shot
Cross-Model Comprehension
If BabelTele were merely a private shorthand of
the model that produced it, its compressed texts
should fail once they are read by a different model.
A more interesting possibility is that BabelTele
captures a shared, model-readable symbolic form:
although it is not fully universal, it may be sufficiently portable for strong compressors to produce
compressed texts that can be understood by heterogeneous LLMs. Therefore, we investigate how far
this portability extends and whether it is uniformly
shared across models or instead shaped by specific
compressor-reader pairs.
We conduct controlled cross-model compression
tests on several state-of-the-art large language models. The evaluation is based on two QA subsets:
180 randomly selected samples from the Short subset
of LongBench v2 and 214 randomly selected
question-answer instances from QuALITY. These
experiments allow us to examine whether BabelTele compressed texts remain interpretable when
transferred across different models, and to further
analyze the asymmetric relationships between different compressors and readers.
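One way to organize such results is a compressor-reader accuracy table, from which asymmetry between a pair of models can be read off directly. The sketch below assumes each evaluation record is a `(compressor, reader, is_correct)` tuple; the function names and the asymmetry measure are illustrative choices, not the analysis procedure of this paper.

```python
from collections import defaultdict


def pair_accuracy(records):
    """Accuracy for every (compressor, reader) pair seen in `records`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for compressor, reader, is_correct in records:
        totals[(compressor, reader)] += 1
        hits[(compressor, reader)] += int(bool(is_correct))
    return {pair: hits[pair] / totals[pair] for pair in totals}


def transfer_asymmetry(acc, model_a, model_b):
    """acc[(a, b)] minus acc[(b, a)].

    A positive value means texts compressed by `model_a` are read more
    accurately by `model_b` than the reverse direction.
    """
    return acc[(model_a, model_b)] - acc[(model_b, model_a)]
```

A value near zero for every pair would suggest uniformly shared portability, while large values of either sign would point to effects specific to particular compressor-reader pairs.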
E.4 Capabilities and Boundaries in
Downstream Tasks
E.4.1 Performance on Agent Memory
We evaluate the performance of BabelTele on the
LoCoMo benchmark in the agent memory domain.
This dataset consists of 10 conversations, each containing dozens of sessions. We independently compress each session and generate a summary for each
compressed session. Then, the query is compared
with all session summaries to compute similarity,
and the top 4 most similar sessions are retrieved
for answering. All compression and answering are
performed using Gemini 3.1 Pro, and the answers
are evaluated with GPT-4o-mini.
E.4.2 Extending the Context Window
We further evaluate model performance when the
complete input text exceeds the model context window. Specifically, we select the Code Repo QA
Long subset from LongBench v2 as the evaluation
benchmark, whose average input length is approximately 1.65M tokens. The tested models include
Qwen3.6 Max with a 256K context window, GLM5.1 with a 200K context window, and Kimi2.6 with
a 256K context window. For the original-input setting, we directly feed the complete text into the
model and truncate the portion exceeding its context window. For BabelTele, we split the original
text into chunks of 200K tokens, compress each
chunk using Gemini 3.1 Pro, concatenate the compressed outputs, and then feed the resulting text
into the tested model.
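The two input settings can be sketched as follows. This is a minimal illustration in which `compress` is a caller-supplied function standing in for the compressor model call, tokens are represented as a plain list, and the newline separator between compressed chunks is an assumption; none of these names come from the paper.

```python
def truncate_to_window(tokens, window):
    """Original-input setting: keep only what fits in the context window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return tokens[:window]


def chunk_tokens(tokens, chunk_size=200_000):
    """Split a token list into consecutive chunks of at most `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def compress_long_input(tokens, compress, chunk_size=200_000, separator="\n"):
    """BabelTele setting: compress each chunk, then concatenate in order.

    `compress` is a hypothetical callable mapping one chunk of tokens to
    its compressed text.
    """
    chunks = chunk_tokens(tokens, chunk_size)
    return separator.join(compress(chunk) for chunk in chunks)
```

With an input of about 1.65M tokens and 200K-token chunks, this yields nine chunks, so the concatenated output fits a 200K or 256K window only if the compressor retains a sufficiently small fraction of each chunk.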