This document is a Unicode stress-test and demo for UniWorld.
It is intentionally long and visually dense. Treat it as both:
- A presentation of what UniWorld enables.
- A test fixture for rendering, fonts, and text handling across scripts.
uniworld), JavaScript/WASM, C, Go – same behavior, same tests.
Use this file to:
- Inspect how your editor, renderer, or terminal handles complex Unicode.
- Verify that UniWorld-based tools respect grapheme clusters, widths, and breaks.
- Copy-paste samples into your own test harnesses.
Regular ASCII Latin:
The quick brown fox jumps over the lazy dog.
Latin with diacritics and composed vs decomposed forms:
é, ö, ñ, å, ç, ů, ẞ, Ǘ
e + COMBINING ACUTE: é
o + COMBINING DIAERESIS: ö
n + COMBINING TILDE: ñ
NFKC/NFKD-sensitive examples (ligatures, compatibility forms):
ﬀ, ﬁ, ﬂ, ﬃ, ﬄ (should normalize to ff, fi, fl, ffi, ffl under NFKC/NFKD).
½ vs 1/2, ⅓ vs 1/3.
These samples exercise:
- Normalization (NFC/NFD/NFKC/NFKD).
- Grapheme clustering for combining marks.
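The normalization behavior described above can be checked independently of UniWorld. The following is a minimal sketch that uses only Python's standard `unicodedata` module; it does not call any UniWorld API, and the variable names are illustrative.

```python
import unicodedata

# Decomposed input: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# Compatibility normalization expands ligatures and vulgar fractions.
ligature_fi = "\ufb01"   # LATIN SMALL LIGATURE FI
one_half = "\u00bd"      # VULGAR FRACTION ONE HALF
fi_nfkc = unicodedata.normalize("NFKC", ligature_fi)
half_nfkc = unicodedata.normalize("NFKC", one_half)

# Canonical normalization (NFC) leaves the ligature untouched.
fi_nfc = unicodedata.normalize("NFC", ligature_fi)

for label, value in [("NFC e+acute", composed),
                     ("NFKC fi ligature", fi_nfkc),
                     ("NFKC one half", half_nfkc)]:
    print(label, [f"U+{ord(ch):04X}" for ch in value])
```

Note that NFKC maps ½ to the digits 1 and 2 separated by U+2044 FRACTION SLASH, not by the ASCII solidus, so a byte-for-byte comparison against the literal string 1/2 will not match.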
Greek:
α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ τ υ φ χ ψ ω
Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω
ΟΔΥΣΣΕΥΣ → lowercase should end with final sigma.
Cyrillic (Russian phrase):
Привет, мир! Это тест Юникода.
These samples exercise:
- Case mapping (upper/lower/title).
- Locale-neutral casing and Greek final sigma.
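As a reference point for the final-sigma expectation, here is a small sketch using Python's built-in `str.lower()` and `str.upper()`, which apply the context-sensitive final-sigma rule. It illustrates the expected result only and says nothing about how UniWorld implements casing.

```python
word = "ΟΔΥΣΣΕΥΣ"
lowered = word.lower()

# The two medial sigmas become U+03C3; the word-final one becomes U+03C2.
sigma_forms = [f"U+{ord(ch):04X}" for ch in lowered if ch in "σς"]
print(lowered, sigma_forms)

# A capital sigma with no preceding cased letter is not treated as final.
lone_sigma = "Σ".lower()

# Cyrillic casing is a simple one-to-one mapping.
shouted = "Привет".upper()
```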
Hebrew:
שלום עולם (shalom olam – “peace, world / hello, world”)
שלום (שלום), 123, סוף.
Arabic:
مرحبا بالعالم (marḥaban bil-ʿālam – “hello, world”)
النسخة 2.0 من UniWorld
تجربة UniWorld مع النص المختلط: English + العربية + 1234
Mixed BiDi paragraph:
RTL: مرحبا بالعالم, then LTR: Hello, and back to RTL: שלום. These lines should be rendered in correct visual order under UAX #9, with appropriate cursor mapping if UniWorld’s bidi and cursor logic is used.
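UAX #9 reordering starts from the bidirectional class of each character. The sketch below only inspects those classes with Python's standard `unicodedata.bidirectional()`; it does not perform reordering, and `bidi_classes` is a hypothetical helper name, not a UniWorld function.

```python
import unicodedata

def bidi_classes(text: str) -> list[str]:
    """Return the Bidi_Class of every code point in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

latin = "Hi"
hebrew = "\u05e9\u05dc\u05d5\u05dd"   # שלום
arabic_meem = "\u0645"                # first letter of مرحبا
digit = "1"

print(bidi_classes(latin))        # strong left-to-right letters
print(bidi_classes(hebrew))       # strong right-to-left letters
print(bidi_classes(arabic_meem))  # Arabic letters have their own class
print(bidi_classes(digit))        # European digits are weak
```

Digits such as 123 and 1234 in the samples above are weak types, which is why their visual position depends on the surrounding strong characters.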
नमस्ते दुनिया (namaste duniya – “hello, world”)
क् + ष → क्ष, श् + र → श्र
प्रोग्रामिंग (programming)
হ্যালো বিশ্ব (hyalo biśśo – “hello, world”)
ক্ষ, জ্ঞ, ম্প্র
ਸਤ ਸ੍ਰੀ ਅਕਾਲ (sat sri akal – traditional greeting)
வணக்கம் உலகம் (vaṇakkam ulagam – “hello, world”)
හෙලෝ වර්ල්ඩ් (hello world, approximated)
These samples exercise:
- Grapheme cluster boundaries for Indic consonant + virama + consonant sequences.
- Line breaking around scripts with complex clusters.
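To make the consonant + virama + consonant structure concrete, this sketch spells out the conjunct क्ष code point by code point using Python's standard `unicodedata` module. It only inspects character properties and does not perform segmentation. Whether a segmenter keeps the whole conjunct in one cluster can depend on which Unicode version of UAX #29 it implements.

```python
import unicodedata

VIRAMA = "\u094d"                      # DEVANAGARI SIGN VIRAMA
ksha = "\u0915" + VIRAMA + "\u0937"    # KA + VIRAMA + SSA, rendered as one conjunct

names = [unicodedata.name(ch) for ch in ksha]
virama_ccc = unicodedata.combining(VIRAMA)       # canonical combining class
virama_category = unicodedata.category(VIRAMA)   # general category

for ch, name in zip(ksha, names):
    print(f"U+{ord(ch):04X} {name}")
```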
These scripts traditionally do not use spaces between words, so line breaking depends on dictionary-based segmentation.
สวัสดีชาวโลกนี่คือการทดสอบการตัดคำด้วย UniWorld
ສະບາຍດີໂລກນີ້ແມ່ນການທົດສອບການຕັດຄໍາ
សួស្តីពិភពលោកនេះគឺជាការធ្វើតេស្តការកាត់ពាក្យ
မင်္ဂလာပါကမ္ဘာလောကဤသည်မှာUniWorld၏စမ်းသပ်ခြင်းဖြစ်သည်
When viewed in a UniWorld-aware editor or test harness:
你好,世界 你好,世界 (same orthography here, different fonts/locale may differ)
こんにちは、UniWorld ライブラリへようこそ – Unicode テキスト処理のテストです。
안녕하세요, 유니월드 라이브러리입니다.
Display width checks (should be full-width = 2 columns per CJK ideograph/Hangul syllable):
ASCII: [H][e][l][l][o] -> width 5
CJK: [你][好][世][界] -> width 8 (4 × 2)
Mixed: [H][i][!][你][好] -> width 2 + 1 + 4 = 7
These samples exercise:
- East Asian Width handling in display_width.
- Grapheme clustering for Hangul and kana with diacritics.
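The width figures above can be reproduced with a rough approximation built on Python's standard `unicodedata.east_asian_width()`. This is a simplified sketch, not UniWorld's `display_width`: the name `sketch_display_width` is hypothetical, ambiguous-width characters are assumed to be narrow, and emoji sequences are not handled as clusters.

```python
import unicodedata

def sketch_display_width(text: str) -> int:
    """Approximate terminal column count for text.

    Assumptions: Wide/Fullwidth characters take 2 columns, combining marks
    and format characters take 0, everything else takes 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # zero-width: attaches to the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width

for sample in ["Hello", "你好世界", "Hi!你好", "안녕"]:
    print(sample, sketch_display_width(sample))
```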
ሰላም ዓለም (selam ālem – “hello, world”)
ⵣⵉⵙⵉⵏ ⵏ ⵍⵎⵎⵉ (zisin n lmmi – sample text)
ᎣᏏᏲ ᎢᎬᏱ (osiyo igvyi – “hello there, friend” approximation)
Unified Canadian Aboriginal Syllabics (UCAS), used for Cree, Inuktitut, Ojibwe, and other Indigenous languages:
ᐊᐦᐊᐤ ᐊᐧᐊᐧᐱᐩ (sample greeting)
ᐃᓄᒃᑎᑐᑦ (Inuktitut – the language name)
ᒥᓂᐦᐅᒋᐊᐧᐤ‑ᐃᐦᐃ (Canadian syllabics hyphen U+1400)
These test:
- Line breaking: UCAS block (U+1400–U+167F) is AL (alphabetic) in UAX #14; hyphen U+1400 allows break after.
- Grapheme clustering: one cluster per syllabic; no special GCB rules beyond defaults.
- LTR; no dictionary-based segmentation (word boundaries follow spaces).
Section 8 (African and Other) tests:
- Unicode range coverage.
- Grapheme clustering in less-common scripts.
Each should be a single grapheme cluster when treated by UniWorld.
👍 👍🏻 👍🏼 👍🏽 👍🏾 👍🏿
👋 👋🏻 👋🏼 👋🏽 👋🏾 👋🏿
Each base + modifier pair must be a single grapheme cluster (emoji base + Extend).
ZWJ (Zero Width Joiner, U+200D) sequences combine multiple emoji into a single rendered glyph. UniWorld must treat each full ZWJ sequence as one grapheme cluster.
Examples (codepoint sequences):
These are the most complex grapheme clusters in Unicode: multiple ExtPict code points joined by ZWJ into a single user-perceived character. UniWorld handles them correctly; the rendered ZWJ emoji are omitted in this PDF only because headless browser print-to-PDF has layout issues with such sequences, not because of any Unicode limitation in the library.
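To show why these sequences are demanding, the sketch below builds one well-known ZWJ sequence (man, woman, girl) from its code points and groups it with a deliberately simplified rule. The sequence is chosen here for illustration and is not taken from the omitted examples. `emoji_clusters_sketch` is a hypothetical helper, not UniWorld's segmenter: it attaches skin-tone modifiers, U+FE0F, and ZWJ to the previous cluster and joins whatever follows a ZWJ, without checking the Extended_Pictographic property as UAX #29 does.

```python
ZWJ = "\u200d"    # ZERO WIDTH JOINER
VS16 = "\ufe0f"   # VARIATION SELECTOR-16

family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"
thumbs_medium = "\U0001F44D\U0001F3FD"   # thumbs up + skin-tone modifier

def emoji_clusters_sketch(text: str) -> list[str]:
    """Group code points into clusters using a simplified emoji-only rule."""
    clusters: list[str] = []
    for ch in text:
        is_modifier = "\U0001F3FB" <= ch <= "\U0001F3FF"
        follows_zwj = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (ch in (ZWJ, VS16) or is_modifier or follows_zwj):
            clusters[-1] += ch     # extend the current cluster
        else:
            clusters.append(ch)    # start a new cluster
    return clusters

print(len(family), "code points ->", len(emoji_clusters_sketch(family)), "cluster")
```

A naive `len()` reports five code points for the family sequence, which is the number a cursor or truncation routine must not use.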
Regional Indicator pairs (U+1F1E6..U+1F1FF) form flag emoji:
Regional indicators must pair: RI + RI = one cluster. UniWorld's grapheme segmentation must not split inside a flag pair.
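The pairing rule can be sketched in a few lines. The helper names below are hypothetical and the function handles only the Regional Indicator rule (pairs are formed from the start of each run, and an odd trailing indicator stands alone); every other character is treated as its own cluster.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF   # REGIONAL INDICATOR SYMBOL LETTER A..Z

def is_regional_indicator(ch: str) -> bool:
    return RI_FIRST <= ord(ch) <= RI_LAST

def pair_regional_indicators(text: str) -> list[str]:
    """Split text so that consecutive regional indicators are grouped in pairs."""
    clusters = []
    i = 0
    while i < len(text):
        if (is_regional_indicator(text[i])
                and i + 1 < len(text)
                and is_regional_indicator(text[i + 1])):
            clusters.append(text[i:i + 2])   # one flag = two indicators
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters

flag_us = "\U0001F1FA\U0001F1F8"   # indicators U + S
flag_fr = "\U0001F1EB\U0001F1F7"   # indicators F + R
print(len(pair_regional_indicators(flag_us + flag_fr)))
```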
Layered combining marks:
à̀ (a + grave + grave)
o͂̂ (o + tilde + circumflex)
ḗ (e with acute and macron when normalized)
Visually identical, canonically equivalent sequences that differ at the code point level:
Å vs Å (precomposed vs decomposed).
These cases stress:
- Canonical equivalence under NFC/NFD.
- Cluster boundaries in the presence of multiple combining marks.
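Canonical equivalence can be tested by comparing strings after normalizing both sides to the same form. This is a minimal sketch with Python's standard `unicodedata`; `canonically_equal` is a hypothetical helper name.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent (compared in NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"    # A + COMBINING RING ABOVE
angstrom = "\u212b"       # ANGSTROM SIGN, canonically equivalent to U+00C5

# Stacked marks: only the first grave composes with the base under NFC.
double_grave = unicodedata.normalize("NFC", "a\u0300\u0300")

# e + macron + acute composes fully to a single code point.
e_macron_acute = unicodedata.normalize("NFC", "e\u0304\u0301")

print(precomposed == decomposed, canonically_equal(precomposed, decomposed))
```

A raw `==` comparison reports the two spellings of Å as different, which is why comparisons on user text usually normalize first.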
Unicode box-drawing table (small preview):
┌────────────────┐
│ UniWorld Box   │
├────────────────┤
│ [x] Graphemes  │
│ [x] Width      │
│ [x] Bidi       │
│ [x] Breaks     │
└────────────────┘
Monospace art with mixed characters:
Column ruler:
12345678901234567890
ASCII: Hello, world!
CJK: 你好世界
Mixed: Hi你好
These test:
- Display width alignment in monospace contexts.
- Handling of box drawing characters.
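Aligning mixed-width rows inside a box requires padding by display columns rather than by code point count. The sketch below is one way to do that with Python's standard library; `cell_width` and `pad_to_columns` are hypothetical names, and box-drawing characters (East Asian Width "A", ambiguous) are assumed to occupy one column.

```python
import unicodedata

def cell_width(ch: str) -> int:
    """Columns for one code point: 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def pad_to_columns(text: str, columns: int) -> str:
    """Right-pad text with spaces until it fills the given number of columns."""
    used = sum(cell_width(ch) for ch in text)
    return text + " " * max(0, columns - used)

rows = ["ASCII: Hello", "CJK: 你好世界", "Mixed: Hi你好"]
for row in rows:
    print("│" + pad_to_columns(" " + row, 16) + "│")
```

Padding with `str.ljust()` instead would count each ideograph as one column and push the right border out of alignment on the CJK rows.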
Hello, مرحبا, שלום, नमस्ते, สวัสดี, 你好, 안녕하세요 -- this single paragraph mixes Latin, Arabic, Hebrew, Devanagari, Thai, Han, and Hangul. 12345. UniWorld’s job is to ensure that segmentation (grapheme/word/sentence), bidi reordering, and line breaking all behave as the Unicode Standard intends, regardless of which binding (Rust, Python, JS/WASM, C, Go) you use.
You can use this paragraph to:
- Test cursor movement (logical and visual) across mixed scripts.
- Check line wrapping with different terminal widths.
- Verify case folding (HELLO vs hello vs Straße vs STRASSE).
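For the case-folding check, Python's built-in `str.casefold()` gives a reference result to compare against. This sketch uses only the standard library and does not assume anything about UniWorld's own case-folding function.

```python
samples = ["HELLO", "hello", "Straße", "STRASSE"]
folded = [s.casefold() for s in samples]

# Lowercasing alone keeps the sharp s, so the two German spellings differ.
lower_matches = "Straße".lower() == "STRASSE".lower()

# Case folding expands the sharp s to "ss", so they compare equal.
fold_matches = "Straße".casefold() == "STRASSE".casefold()

print(folded, lower_matches, fold_matches)
```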
Some suggested experiments:
Grapheme clusters:
Run uniworld.segment.grapheme_boundaries() (Python) or grapheme_boundaries() (Rust/JS) on each section.
Verify that:
Word boundaries:
Run word_boundaries() and inspect how mixed scripts and punctuation are grouped.
Confirm that dictionary-based segmentation improves Thai/Lao/Khmer/Myanmar breaks.
Normalization:
Ensure canonically equivalent strings compare equal after normalization.
Line breaking:
Run line_break_opportunities_with_dictionary() on the SE Asian sections.
Render with and without dictionary support to see the difference.
Display width and truncation:
Run display_width() and truncate_display_width() on CJK/emoji sections at various widths to confirm visual clamping.
UniWorld exists so that every script gets first-class treatment in modern software:
If this document renders cleanly in your environment – with legible, correctly ordered text across scripts – then UniWorld is doing its job underneath. If you find edge cases, mis-renderings, or missing coverage, they become actionable test cases we can add to the suite.
This file is a test output, but it is also an invitation: build a world where every script works correctly by default.