Metadata-Version: 2.4
Name: opencc-py
Version: 1.3.2.dev20260628
Summary: Conversion between Traditional and Simplified Chinese (pure Python)
Home-page: https://github.com/BYVoid/OpenCC
Author: OpenCC contributors
License: Apache-2.0
Keywords: opencc,convert,chinese
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Localization
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opencc-data==1.3.2.dev20260628
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# opencc-py (OpenCC Pure Python Implementation)

This directory contains a pure Python implementation of the OpenCC Chinese
conversion algorithm. It provides the same import surface as the official
`opencc` Python package:

```python
import opencc

converter = opencc.OpenCC("s2t")
print(converter.convert("汉字"))  # 漢字
```

## Data Dependency

The package does not bundle OpenCC configs or dictionaries directly. Built-in
conversion data is loaded from the
`opencc-data` PyPI package at runtime.

This keeps the pure Python package small and avoids depending on generated files
under the OpenCC source tree. Three locations are read from `opencc-data`:

- config JSON files from `opencc_data.config_path()`
- dictionary text files from `opencc_data.data_path()`
- test cases from `opencc_data.test_data_path()` (used by the test suite)

Custom config files are still supported. When a custom config references a local
dictionary path such as `CustomPhrases.ocd2`, the pure Python implementation
looks for the corresponding `CustomPhrases.txt` next to the config file.
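
The lookup described above can be sketched as a small path rewrite. This is a minimal illustration, not the package's actual code: `resolve_text_dictionary` is a hypothetical helper, and it assumes that only the file name of the reference matters and that the binary suffix is simply swapped for `.txt`.

```python
from pathlib import Path


def resolve_text_dictionary(config_file: str, dict_ref: str) -> Path:
    """Map a dictionary reference in a custom config to a sibling .txt file.

    Hypothetical helper for illustration; the real lookup may differ.
    """
    config_dir = Path(config_file).parent
    # Keep only the file name and swap the binary suffix (.ocd2/.ocd) for .txt.
    txt_name = Path(dict_ref).with_suffix(".txt").name
    return config_dir / txt_name


if __name__ == "__main__":
    # "my_configs/custom.json" is a made-up path used only for this example.
    print(resolve_text_dictionary("my_configs/custom.json", "CustomPhrases.ocd2"))
```

Under these assumptions, a config at `my_configs/custom.json` that references `CustomPhrases.ocd2` resolves to `my_configs/CustomPhrases.txt`.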

## Installation

The PyPI package name is `opencc-py`. Users can install it with pip:

```bash
python -m pip install opencc-py
```

For local development from this directory:

```bash
python -m pip install .
```

Or use editable development mode:

```bash
python -m pip install -e .
```

The package version matches its `opencc-data` version and declares the matching
data package as an exact install dependency, so pip installs the compatible data
package automatically.
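
Because the data package is pinned to the same version as `opencc-py`, an environment can be sanity-checked by comparing the two installed versions. The sketch below uses only the standard library's `importlib.metadata`; the helper names `installed_version` and `versions_in_sync` are illustrative and are not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def versions_in_sync(code_version: Optional[str], data_version: Optional[str]) -> bool:
    """True only when both distributions are installed at the same version."""
    return code_version is not None and code_version == data_version


if __name__ == "__main__":
    code = installed_version("opencc-py")
    data = installed_version("opencc-data")
    print(f"opencc-py={code} opencc-data={data} in_sync={versions_in_sync(code, data)}")
```

A mismatch would usually mean one of the two packages was installed or upgraded on its own, for example with pip's `--no-deps` option.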

## Supported Configs

`opencc.CONFIGS` is populated from the configs exposed by `opencc-data`.

```python
import opencc

print(opencc.CONFIGS)
```

The standard mmseg configs and configs that do not require segmentation are
supported. Jieba plugin configs are not included in `opencc-data`, so they are
not exposed as built-in configs by this package.
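
Since plugin configs are not listed, it can help to check a requested name against `opencc.CONFIGS` before building a converter. The following is a sketch under assumptions: the exact form of the entries in `opencc.CONFIGS` is not specified here, so the hypothetical helper `require_builtin_config` tolerates names with or without a `.json` suffix.

```python
from typing import Iterable


def _stem(name: str) -> str:
    """Drop a trailing .json so 's2t' and 's2t.json' compare equal."""
    return name[:-5] if name.endswith(".json") else name


def require_builtin_config(name: str, available: Iterable[str]) -> str:
    """Return `name` if it is listed in `available`, otherwise raise ValueError."""
    known = sorted({_stem(entry) for entry in available})
    if _stem(name) not in known:
        raise ValueError(f"{name!r} is not a built-in config; choose from {known}")
    return name


def make_converter(name: str):
    """Build a converter only for configs that the package reports as built in."""
    import opencc  # imported lazily so the helpers above work without the package

    return opencc.OpenCC(require_builtin_config(name, opencc.CONFIGS))
```

With the package installed, `make_converter("s2t")` would return a converter, while a name that is missing from `opencc.CONFIGS` fails early with a message listing the available configs.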

## Testing

Install test dependencies, then run pytest from the repository root:

```bash
python -m pip install -r python-pure/tests/requirements_lock.txt
PYTHONPATH=python-pure python -m pytest python-pure/tests
```

The tests verify:

- importing and initializing every built-in config
- conversion against `opencc-data` test cases
- custom config and local dictionary resolution
- golden output compatibility for supported configs
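
The case-based checks in the list above follow a common pattern: run each input through a converter and collect any outputs that differ from the expected text. The sketch below shows that pattern only. It assumes cases are available as `(input, expected)` pairs, which may not match the on-disk format used by `opencc-data`, and `find_mismatches` is a hypothetical helper rather than part of the test suite.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    convert: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case whose output differs."""
    mismatches = []
    for source, expected in cases:
        actual = convert(source)
        if actual != expected:
            mismatches.append((source, expected, actual))
    return mismatches


if __name__ == "__main__":
    # str.upper stands in for a real converter's convert method.
    print(find_mismatches(str.upper, [("a", "A"), ("b", "x")]))
```

In a real test, `convert` would be the bound `convert` method of an `opencc.OpenCC` instance, and an empty result would mean every case passed.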

## Differences from the Official Implementation

This package intentionally implements only the pieces needed for pure Python
text conversion. The official Python implementation is the `opencc` PyPI
package. Compared with the official C++ library and command-line tools, this
package omits the following lower-level pieces:

- binary dictionary loading for `.ocd2`/`.ocd`; built-in dictionaries are read
  from `.txt` data supplied by `opencc-data`
- dictionary compilation and extraction tools such as `opencc_dict` and
  `opencc_phrase_extract`
- the C API, shared-library loading behavior, and ABI/plugin compatibility
  guarantees
- native CLI behavior, including streaming I/O, command-line option parity, and
  platform-specific path handling
- package, runfiles, and source-tree data discovery fallbacks; built-in data
  comes from `opencc-data`
- automatic loading of optional plugin configs or plugin resources, including
  the Jieba plugin package layout
- performance optimizations from marisa-trie, Darts, and the C++ segmentation
  implementation

The conversion semantics still mirror OpenCC's config-driven pipeline: mmseg
segmentation, ordered dictionary groups, longest-prefix matching within a
dictionary, conversion chains, normalization, and optional suppression of
tofu-risk dictionaries.
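
Two of these ideas, longest-prefix matching within a dictionary and conversion chains, can be illustrated with a short sketch. It is a simplified model, not this package's implementation: it skips segmentation, dictionary groups, and normalization, the function names are made up for the example, and the dictionary entries are toy data rather than real OpenCC data.

```python
from typing import Dict, Iterable


def convert_once(text: str, dictionary: Dict[str, str]) -> str:
    """Convert left to right, always taking the longest key that matches."""
    if not dictionary:
        return text
    max_len = max(len(key) for key in dictionary)
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a key matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in dictionary:
                out.append(dictionary[piece])
                pos += length
                break
        else:
            # No key starts here: copy one character through unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


def convert_chain(text: str, chain: Iterable[Dict[str, str]]) -> str:
    """Apply each dictionary in order, feeding each output into the next step."""
    for dictionary in chain:
        text = convert_once(text, dictionary)
    return text


# Toy entries for illustration only; not taken from the OpenCC dictionaries.
TOY_PHRASES = {"头发": "頭髮", "发": "發", "头": "頭"}

if __name__ == "__main__":
    print(convert_once("发头发", TOY_PHRASES))
    print(convert_chain("abc", [{"ab": "X"}, {"X": "Y", "c": "z"}]))
```

In the first call, the two-character key `头发` wins over the single-character keys once the scan reaches it, which is the point of preferring the longest match. The second call shows a chain, where the output of the first dictionary becomes the input of the second.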

## License and Compliance

This package is distributed under the Apache License 2.0.

This project is a derivative work of OpenCC. Runtime conversion data is
provided by the `opencc-data` PyPI package.

---

# opencc-py (OpenCC 純 Python 實作)

此目錄包含 OpenCC 中文轉換演算法的純
Python 實作，提供與 Python package 相同的匯入介面：

```python
import opencc

converter = opencc.OpenCC("s2t")
print(converter.convert("汉字"))  # 漢字
```

## 資料依賴

此 package 不再直接內嵌 OpenCC config 或 dictionary。內建轉換資料會在執行時
從 PyPI package `opencc-data` 載入。

這能讓 pure Python package 保持精簡，並避免依賴 OpenCC source tree 底下的
生成檔案。converter 會讀取：

- `opencc_data.config_path()` 提供的 config JSON 檔案
- `opencc_data.data_path()` 提供的 dictionary text 檔案
- `opencc_data.test_data_path()` 提供的測試案例

自訂 config 仍然支援。當自訂 config 參照本地 dictionary 路徑，例如
`CustomPhrases.ocd2`，純 Python 實作會在 config 檔案旁尋找對應的
`CustomPhrases.txt`。

## 安裝

PyPI package 名稱是 `opencc-py`。使用者可以透過 pip 安裝：

```bash
python -m pip install opencc-py
```

從此目錄進行本地開發安裝：

```bash
python -m pip install .
```

此 package 的版本會與 `opencc-data` 版本一致，並將相同版本的資料 package
宣告為精確安裝依賴，因此 pip 會自動安裝相容的資料 package。

也可以使用 editable development mode：

```bash
python -m pip install -e .
```

## 支援的 Configs

`opencc.CONFIGS` 由 `opencc-data` 提供的 configs 產生。

```python
import opencc

print(opencc.CONFIGS)
```

標準 mmseg configs 與不需要 segmentation 的 configs 皆受支援。Jieba plugin
configs 不包含在 `opencc-data` 中，因此此 package 不會把它們列為內建 configs。

## 測試

先安裝測試依賴，再從 repository root 執行 pytest：

```bash
python -m pip install -r python-pure/tests/requirements_lock.txt
PYTHONPATH=python-pure python -m pytest python-pure/tests
```

測試會驗證：

- 每個內建 config 都能 import 與初始化
- 轉換結果符合 `opencc-data` 測試案例
- 自訂 config 與本地 dictionary 解析
- 支援 configs 的 golden output 相容性

## 與官方實作的差異

此 package 刻意只實作純 Python 文字轉換所需的部分。相較於官方 C++ library
與 command-line tools，它省略了幾個較底層的實作細節。官方 Python 實作是 PyPI
上的 `opencc` package。

- `.ocd2` / `.ocd` 二進位 dictionary 載入；內建 dictionary 會讀取
  `opencc-data` 提供的 `.txt` 資料
- `opencc_dict`、`opencc_phrase_extract` 等 dictionary 編譯與抽取工具
- C API、shared-library 載入行為，以及 ABI/plugin 相容性保證
- native CLI 行為，包括 streaming I/O、命令列選項完整對齊，以及平台相關路徑處理
- package、runfiles、source-tree 資料搜尋 fallback；內建資料一律來自
  `opencc-data`
- optional plugin configs 或 plugin resources 的自動載入，包括 Jieba plugin 的
  package layout
- marisa-trie、Darts 與 C++ segmentation 實作帶來的效能最佳化

轉換語意仍會對齊 OpenCC 的 config-driven pipeline：mmseg segmentation、
ordered dictionary groups、dictionary 內 longest-prefix matching、conversion
chains、normalization，以及 tofu-risk dictionaries 的可選停用。

## License 與合規

此 package 以 Apache License 2.0 發佈。

此專案屬於 OpenCC 的衍生作品。執行時轉換
資料由 PyPI package `opencc-data`
提供。
