Metadata-Version: 2.4
Name: opencc-py
Version: 1.4.0
Summary: Conversion between Traditional and Simplified Chinese (pure Python)
Home-page: https://github.com/BYVoid/OpenCC
Author: OpenCC contributors
License: Apache-2.0
Keywords: opencc,convert,chinese
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Localization
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opencc-data==1.4.0
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# opencc-py (OpenCC Pure Python Implementation)

This directory contains a pure Python implementation of the OpenCC Chinese
conversion algorithm. It provides the same import surface as the official
`opencc` Python package:

```python
import opencc

converter = opencc.OpenCC("s2t")
print(converter.convert("汉字"))  # 漢字
```

## Data Dependency

The package does not bundle OpenCC configs or dictionaries directly. Built-in
conversion data is loaded from the
`opencc-data` PyPI package at runtime.

This keeps the pure Python package small and avoids depending on generated files
under the OpenCC source tree. The package reads the following from
`opencc-data`:

- config JSON files from `opencc_data.config_path()`
- dictionary text files from `opencc_data.data_path()`
- test cases (used by the test suite) from `opencc_data.test_data_path()`

Custom config files are still supported. When a custom config references a local
dictionary path such as `CustomPhrases.ocd2`, the pure Python implementation
looks for the corresponding `CustomPhrases.txt` next to the config file.
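
The lookup rule above can be sketched roughly as follows. This is only an
illustration: `resolve_text_dictionary` is a hypothetical helper, not part of
the `opencc-py` API, and it assumes the dictionary reference is a path relative
to the config file.

```python
from pathlib import Path


def resolve_text_dictionary(config_file: str, dict_reference: str) -> Path:
    """Map a reference such as 'CustomPhrases.ocd2' to the '.txt' file
    with the same stem, located relative to the config file's directory.

    Hypothetical helper for illustration only.
    """
    config_dir = Path(config_file).resolve().parent
    # Swap the binary-dictionary suffix for '.txt' and anchor it at the config.
    return config_dir / Path(dict_reference).with_suffix(".txt")


if __name__ == "__main__":
    print(resolve_text_dictionary("configs/custom.json", "CustomPhrases.ocd2"))
```

The sketch only computes the candidate path; whether the file exists, and how a
missing file is reported, is left to the real implementation.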

## Installation

The PyPI package name is `opencc-py`. Users can install it with pip:

```bash
python -m pip install opencc-py
```

The package version matches its `opencc-data` version, and the package declares
that data release as an exact-version dependency (`opencc-data==1.4.0` for this
release), so pip installs the compatible data package automatically.

For local development, install from this directory:

```bash
python -m pip install .
```

Or install in editable mode:

```bash
python -m pip install -e .
```
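
To confirm that an environment has matching `opencc-py` and `opencc-data`
versions, a check along these lines may help. It uses only the standard
library's `importlib.metadata`; the function name `installed_versions_match` is
made up for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions_match(package: str = "opencc-py",
                             data_package: str = "opencc-data"):
    """Return True or False when both distributions are installed,
    or None when either one is missing."""
    try:
        return version(package) == version(data_package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_versions_match())
```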

## Supported Configs

`opencc.CONFIGS` is populated from the configs exposed by `opencc-data`.

```python
import opencc

print(opencc.CONFIGS)
```

The standard mmseg configs and configs that do not require segmentation are
supported. Jieba plugin configs are not included in `opencc-data`, so they are
not exposed as built-in configs by this package.

## Testing

Install test dependencies, then run pytest from the repository root:

```bash
python -m pip install -r python-pure/tests/requirements_lock.txt
PYTHONPATH=python-pure python -m pytest python-pure/tests
```

The tests verify:

- importing and initializing every built-in config
- conversion against `opencc-data` test cases
- custom config and local dictionary resolution
- golden output compatibility for supported configs

## OpenCC 1.3.2 Feature Coverage

The following OpenCC 1.3.2 features are fully supported:

- **CJK Compatibility Ideographs normalization** — all built-in configs include
  a pre-processing normalization step that maps U+F900–U+FAFF characters to
  their canonical code points before conversion.
- **`match_policy: union`** — dictionary groups with `"match_policy": "union"`
  return the globally longest match across all sub-dictionaries.
- **`normalization` config field** — custom configs may add a `normalization`
  array to apply conversion steps before segmentation.
- **New configs** — `s2hkp` and `hk2sp` (Simplified ↔ Hong Kong, with phrase
  conversion) are available through `opencc-data`.
- **Tofu-risk dictionary suppression** — pass
  `include_tofu_risk_dictionaries=False` to `OpenCC()` to exclude dictionaries
  that may produce characters absent from modern CJK fonts.
- **JSONC** — config files may use `//` line comments and `/* */` block
  comments; the pure Python backend strips them before JSON parsing.
- **Inline dictionaries** — `{"type": "inline", "entries": {"key": "value",
  ...}}` dict nodes are supported in custom configs.
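
As a rough illustration of the JSONC and inline-dictionary items above, the
sketch below strips comments from a config string and then reads an inline
dictionary node. It is not the package's actual parser. `strip_jsonc_comments`
and the example config are hypothetical, and placing the inline node directly
under a `conversion_chain` step is an assumption about the config layout.

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // line comments and /* */ block comments outside strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so an escaped quote
                # does not end the string early.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            end = text.find("\n", i)
            i = n if end == -1 else end
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical custom config, used only for this sketch.
EXAMPLE_CONFIG = """
{
  // Line comments are allowed in JSONC.
  "name": "inline example",
  "conversion_chain": [{
    "dict": {
      "type": "inline",
      /* "//" inside a string value is not a comment */
      "entries": {"http://": "HTTP://", "汉字": "漢字"}
    }
  }]
}
"""

config = json.loads(strip_jsonc_comments(EXAMPLE_CONFIG))
entries = config["conversion_chain"][0]["dict"]["entries"]

if __name__ == "__main__":
    print(entries)
```

Tracking string state matters here: a stripper based on a plain regular
expression would also cut the `//` inside `"http://"`.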

## Differences from the Official Implementation

This package intentionally implements only the pieces needed for pure Python
text conversion. The official Python implementation is the `opencc` PyPI
package. Compared with the official C++ library and command-line tools, this
package omits the following lower-level pieces:

- binary dictionary loading for `.ocd2`/`.ocd`; built-in dictionaries are read
  from `.txt` data supplied by `opencc-data`
- dictionary compilation and extraction tools such as `opencc_dict` and
  `opencc_phrase_extract`
- the C API, shared-library loading behavior, and ABI/plugin compatibility
  guarantees
- native CLI behavior, including streaming I/O, command-line option parity, and
  platform-specific path handling
- package, runfiles, and source-tree data discovery fallbacks; built-in data
  comes from `opencc-data`
- automatic loading of optional plugin configs or plugin resources, including
  the Jieba plugin package layout
- performance optimizations from marisa-trie, Darts, and the C++ segmentation
  implementation

The conversion semantics still mirror OpenCC's config-driven pipeline: mmseg
segmentation, ordered dictionary groups, longest-prefix matching within a
dictionary, conversion chains, normalization, and optional suppression of
tofu-risk dictionaries.
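
The matching part of that pipeline can be sketched as below. This is a
simplified model, not the package's code: it skips mmseg segmentation and
normalization, uses plain `dict` objects instead of real dictionary files, and
all function names are hypothetical. The `"first"` policy name is also an
assumption; it models a group in which the first dictionary with a match wins,
in contrast to `union`.

```python
def longest_prefix(dictionary, text, start):
    """Return the longest key that is a prefix of text[start:], or None."""
    max_len = max(map(len, dictionary), default=0)
    for length in range(min(max_len, len(text) - start), 0, -1):
        key = text[start:start + length]
        if key in dictionary:
            return key
    return None


def group_match(dicts, text, start, match_policy="first"):
    """Look up one position in an ordered group of dictionaries.

    "first": the first dictionary with any match wins (assumed default).
    "union": the longest match across all dictionaries wins.
    """
    best = None
    for d in dicts:
        key = longest_prefix(d, text, start)
        if key is None:
            continue
        if match_policy != "union":
            return key, d[key]
        if best is None or len(key) > len(best[0]):
            best = (key, d[key])
    return best


def convert(text, dicts, match_policy="first"):
    """Convert text left to right; unmatched characters pass through."""
    out, i = [], 0
    while i < len(text):
        hit = group_match(dicts, text, i, match_policy)
        if hit is None:
            out.append(text[i])
            i += 1
        else:
            out.append(hit[1])
            i += len(hit[0])
    return "".join(out)


def run_chain(text, chain):
    """Apply each (dicts, match_policy) stage to the previous stage's output."""
    for dicts, policy in chain:
        text = convert(text, dicts, policy)
    return text


# Toy data: the character dictionary is deliberately listed first.
CHARS = {"发": "發", "头": "頭"}
PHRASES = {"头发": "頭髮"}

if __name__ == "__main__":
    print(convert("头发", [CHARS, PHRASES]))            # 頭發
    print(convert("头发", [CHARS, PHRASES], "union"))   # 頭髮
```

With the character dictionary listed first, the ordered lookup stops at the
single-character match, while `union` finds the longer phrase entry in the
second dictionary.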

## License and Compliance

This package is distributed under the Apache License 2.0.

This project is a derivative work of
OpenCC. Runtime conversion data is provided
by the `opencc-data` PyPI package.

---

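# opencc-py (OpenCC Pure Python Implementation)

This directory contains a pure Python implementation of the OpenCC Chinese
conversion algorithm. It provides the same import interface as the official
`opencc` Python package: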

```python
import opencc

converter = opencc.OpenCC("s2t")
print(converter.convert("汉字"))  # 漢字
```

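## Data Dependency

This package no longer embeds OpenCC configs or dictionaries directly. Built-in
conversion data is loaded at runtime from the `opencc-data` PyPI package.

This keeps the pure Python package small and avoids depending on generated files
under the OpenCC source tree. The converter reads:

- config JSON files provided by `opencc_data.config_path()`
- dictionary text files provided by `opencc_data.data_path()`
- test cases provided by `opencc_data.test_data_path()`

Custom configs are still supported. When a custom config references a local
dictionary path such as `CustomPhrases.ocd2`, the pure Python implementation
looks for the corresponding `CustomPhrases.txt` next to the config file.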

## Installation

The PyPI package name is `opencc-py`. Users can install it with pip:

```bash
python -m pip install opencc-py
```

For a local development install from this directory:

```bash
python -m pip install .
```

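The version of this package matches the `opencc-data` version, and the package
declares the data package of the same version as an exact install dependency,
so pip installs the compatible data package automatically.

Editable development mode is also available: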

```bash
python -m pip install -e .
```

## Supported Configs

`opencc.CONFIGS` is generated from the configs provided by `opencc-data`.

```python
import opencc

print(opencc.CONFIGS)
```

The standard mmseg configs and configs that do not require segmentation are all
supported. Jieba plugin configs are not included in `opencc-data`, so this
package does not list them as built-in configs.

## Testing

Install the test dependencies first, then run pytest from the repository root:

```bash
python -m pip install -r python-pure/tests/requirements_lock.txt
PYTHONPATH=python-pure python -m pytest python-pure/tests
```

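The tests verify:

- every built-in config can be imported and initialized
- conversion results match the `opencc-data` test cases
- custom config and local dictionary resolution
- golden output compatibility for supported configs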

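## OpenCC 1.3.2 Feature Support Status

The following OpenCC 1.3.2 features are fully supported:

- **CJK Compatibility Ideographs normalization**: all built-in configs include
  a normalization pre-processing step that maps characters in the
  U+F900–U+FAFF block to their standard code points before conversion.
- **`match_policy: union`**: a dictionary group with `"match_policy": "union"`
  takes the longest prefix match across all of its sub-dictionaries.
- **`normalization` config field**: custom configs may add a `normalization`
  array to insert normalization steps before segmentation.
- **New configs**: `s2hkp` and `hk2sp` (Simplified ↔ Hong Kong Traditional,
  with phrase conversion) are provided through `opencc-data`.
- **Tofu-risk dictionary suppression**: pass
  `include_tofu_risk_dictionaries=False` when constructing `OpenCC()` to disable
  dictionaries that may output characters missing from modern fonts.
- **JSONC**: config files support `//` line comments and `/* */` block
  comments; the pure Python backend strips the comments before parsing JSON.
- **Inline dictionary**: custom configs support `{"type": "inline", "entries":
  {"key": "value", ...}}` nodes.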

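## Differences from the Official Implementation

This package deliberately implements only the parts needed for pure Python text
conversion. The official Python implementation is the `opencc` package on PyPI.
Compared with the official C++ library and command-line tools, this package
omits the following lower-level implementation details:

- loading of `.ocd2` / `.ocd` binary dictionaries; built-in dictionaries are
  read from the `.txt` data provided by `opencc-data`
- dictionary compilation and extraction tools such as `opencc_dict` and
  `opencc_phrase_extract`
- the C API, shared-library loading behavior, and ABI/plugin compatibility
  guarantees
- native CLI behavior, including streaming I/O, full command-line option
  parity, and platform-specific path handling
- package, runfiles, and source-tree data discovery fallbacks; built-in data
  always comes from `opencc-data`
- automatic loading of optional plugin configs or plugin resources, including
  the Jieba plugin package layout
- performance optimizations from marisa-trie, Darts, and the C++ segmentation
  implementation

The conversion semantics still follow OpenCC's config-driven pipeline: mmseg
segmentation, ordered dictionary groups, longest-prefix matching within a
dictionary, conversion chains, normalization, and optional suppression of
tofu-risk dictionaries.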

## License and Compliance

This package is released under the Apache License 2.0.

This project is a derivative work of OpenCC. Runtime conversion data is
provided by the `opencc-data` PyPI package.
