Metadata-Version: 2.4
Name: twgy
Version: 3.0.0
Summary: Taiwan Mandarin Phonetic Similarity Processor - 台灣國語語音相似性處理系統
Home-page: https://github.com/yourusername/twgy
Author: TWGY Development Team
Author-email: TWGY Development Team <twgy.dev@example.com>
License: MIT
Project-URL: Homepage, https://github.com/twgy-team/twgy-v3
Project-URL: Repository, https://github.com/twgy-team/twgy-v3.git
Project-URL: Documentation, https://twgy-v3.readthedocs.io/
Project-URL: Bug Tracker, https://github.com/twgy-team/twgy-v3/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pypinyin>=0.44.0
Requires-Dist: dimsim>=0.2.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: requests>=2.25.0
Requires-Dist: pyyaml>=5.4.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.8; extra == "dev"
Requires-Dist: mypy>=0.812; extra == "dev"
Requires-Dist: jupyter>=1.0; extra == "dev"
Provides-Extra: full
Requires-Dist: sentence-transformers>=2.0; extra == "full"
Requires-Dist: faiss-cpu>=1.7.0; extra == "full"
Requires-Dist: torch>=1.9.0; extra == "full"
Provides-Extra: api
Requires-Dist: fastapi>=0.68.0; extra == "api"
Requires-Dist: uvicorn>=0.15.0; extra == "api"
Requires-Dist: pydantic>=1.8.0; extra == "api"
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# TWGY - Taiwan Mandarin Phonetic Similarity Processor
**台灣國語語音相似性處理系統**

[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![PyPI Version](https://img.shields.io/pypi/v/twgy.svg)](https://pypi.org/project/twgy/)

TWGY is a phonetic similarity processing system optimized for Taiwan Mandarin pronunciation variations. It provides ASR (automatic speech recognition) post-processing and phonetic similarity analysis for Chinese text.

## 🎯 Key Features

### Core Functionality
- **Three-Layer Architecture**: L1 consonant filtering → L2 first/last similarity → L3 full phonetic analysis
- **Taiwan Mandarin Optimized**: Handles common Taiwan pronunciation variations:
  - 平翹舌不分 (Retroflex/non-retroflex confusion)
  - 前後鼻音不分 (Front/back nasal confusion)
  - 邊鼻音不分 (Lateral/nasal confusion)
- **170,000+ Word Dictionary**: Comprehensive Chinese word coverage
- **High Performance**: <250ms processing time with concurrent query support
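
To make the three-layer idea concrete, here is a minimal, self-contained sketch of a filter-then-rerank pipeline that merges the three confusion pairs listed above. It is not TWGY's implementation: the toy `LEXICON`, the merge tables, the function names, and the 1.0/0.8 scoring weights are all assumptions chosen for illustration, and tones are ignored.

```python
# Toy lexicon: word -> [(initial, final), ...] per syllable (tones omitted).
# Hypothetical hand-written data; TWGY loads its own dictionary.
LEXICON = {
    "知道": [("zh", "i"), ("d", "ao")],
    "指導": [("zh", "i"), ("d", "ao")],
    "資料": [("z", "i"), ("l", "iao")],
    "吃飯": [("ch", "i"), ("f", "an")],
    "牛奶": [("n", "iu"), ("n", "ai")],
}

# Collapse each confusion pair onto one canonical symbol (assumed mapping).
INITIAL_MERGE = {"zh": "z", "ch": "c", "sh": "s", "l": "n"}   # retroflex, lateral/nasal
FINAL_MERGE = {"ing": "in", "eng": "en", "ang": "an"}         # front/back nasal


def normalize(syllables):
    """Map every syllable to its merged (initial, final) form."""
    return [(INITIAL_MERGE.get(i, i), FINAL_MERGE.get(f, f)) for i, f in syllables]


def l1_filter(query, lexicon):
    """L1: keep words whose merged initials equal the query's merged initials."""
    target = [i for i, _ in normalize(query)]
    return [w for w, s in lexicon.items() if [i for i, _ in normalize(s)] == target]


def l2_score(query, cand):
    """L2: share of the first and last syllables that match after merging."""
    q, c = normalize(query), normalize(cand)
    return ((q[0] == c[0]) + (q[-1] == c[-1])) / 2


def l3_score(query, cand):
    """L3: per-syllable score, 1.0 for an exact match, 0.8 if equal only after merging."""
    total = 0.0
    for raw_q, raw_c, nq, nc in zip(query, cand, normalize(query), normalize(cand)):
        if raw_q == raw_c:
            total += 1.0
        elif nq == nc:
            total += 0.8
    return total / max(len(query), len(cand))


def rerank(query, lexicon=LEXICON, l2_min=0.5, top_k=5):
    """Run L1 -> L2 -> L3 and return (word, score) pairs, best first."""
    survivors = l1_filter(query, lexicon)
    survivors = [w for w in survivors if l2_score(query, lexicon[w]) >= l2_min]
    scored = [(w, l3_score(query, lexicon[w])) for w in survivors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# "資道" (zi dao) is a plausible ASR rendering of "知道" (zhi dao).
print(rerank([("z", "i"), ("d", "ao")]))
```

The cheap L1 check discards most of the lexicon before the costlier scoring runs, which is the usual motivation for a layered design like this one.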

### Advanced Features
- **DimSim Integration**: Enhanced similarity scoring with deep learning models
- **Batch Processing**: Efficient handling of multiple queries
- **Caching System**: Optimized performance with intelligent caching
- **Training Data Collection**: Automatic data logging for model improvement
- **CLI**: Command-line tools for quick, scriptable use
- **RESTful API Ready**: Can be easily wrapped into web services

## 📦 Installation

### From PyPI (Recommended)
```bash
pip install twgy
```

### From Source
```bash
git clone https://github.com/yourusername/twgy
cd twgy
pip install -e .
```

### Development Installation
```bash
git clone https://github.com/yourusername/twgy
cd twgy
pip install -e ".[dev]"
```

### Optional Dependencies
```bash
# For enhanced features
pip install "twgy[full]"

# For API development
pip install "twgy[api]"

# All features
pip install "twgy[full,api,dev]"
```

## 🚀 Quick Start

### Basic Usage
```python
from twgy import PhoneticReranker

# Initialize the reranker
reranker = PhoneticReranker()

# Find similar words
result = reranker.rerank("知道")
print(result.candidates[:5])
# Output: ['知道', '指導', '智道', '志道', '制導']

# Check processing details
print(f"Processing time: {result.processing_time_ms:.1f}ms")
print(f"Pipeline: {result.l1_candidates_count} → {result.l2_candidates_count} → {result.l3_candidates_count}")
```

### Convenience Functions
```python
from twgy import quick_rerank, get_similar_words, batch_process

# Quick single query
similar = quick_rerank("知道", max_candidates=5)
print(similar)
# Output: ['知道', '指導', '智道', '志道', '制導']

# Get similarity scores
similar_with_scores = get_similar_words("知道", threshold=0.7)
for item in similar_with_scores[:3]:
    print(f"{item['word']}: {item['similarity']:.2f}")
# Output:
# 指導: 0.85
# 智道: 0.80
# 志道: 0.75

# Batch processing
words = ["知道", "資道", "吃飯"]
results = batch_process(words)
for result in results:
    print(f"{result.query}: {len(result.candidates)} candidates")
```

### Advanced Configuration
```python
from twgy import PhoneticReranker, RerankerConfig

# Custom configuration
config = RerankerConfig(
    l3_top_k=20,                    # Return top 20 candidates
    enable_dimsim=True,             # Enable DimSim reranking
    dimsim_stage="L2",              # Apply DimSim at L2 stage
    dimsim_weight=0.3,              # DimSim score weight
    max_processing_time_ms=500.0,   # Performance timeout
    enable_training_data_logging=True  # Collect training data
)

reranker = PhoneticReranker(config)
result = reranker.rerank("語音辨識")
```
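
The README does not state how `dimsim_weight` combines the two scores. One common choice is linear interpolation, sketched below purely as an assumption; `blend_scores` and the sample scores are hypothetical and are not part of the TWGY API.

```python
def blend_scores(phonetic: float, dimsim: float, dimsim_weight: float = 0.3) -> float:
    """Linearly interpolate between two similarity scores in [0, 1]."""
    if not 0.0 <= dimsim_weight <= 1.0:
        raise ValueError("dimsim_weight must be between 0 and 1")
    return (1.0 - dimsim_weight) * phonetic + dimsim_weight * dimsim


# Made-up (phonetic, dimsim) scores for two candidates.
candidates = {"指導": (0.90, 0.70), "資料": (0.60, 0.95)}
ranked = sorted(candidates, key=lambda w: blend_scores(*candidates[w]), reverse=True)
print(ranked)
```

Under this assumption, a weight of 0 ignores DimSim entirely and a weight of 1 ranks by DimSim alone.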

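## 🚀 Quick Start (Source Checkout)

### Requirements

- Python 3.8+
- MoeDict (萌典) data installed (170,000 words)
- MPS/CUDA acceleration recommended

### Installation and Initialization

```bash
# Enter the project directory
cd TWGY_V3

# Install dependencies
pip install -r requirements.txt
```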

### Basic Usage

```python
from src.phonetic_reranker import PhoneticReranker

# Initialize the system (automatically loads the 170,000-word dictionary)
reranker = PhoneticReranker()

# ASR error correction
result = reranker.rerank("資道")  # Misrecognized input
print(result.candidates[:5])     # ['知道', '自動', '指導', '資料', '指標']
print(f"Processing time: {result.processing_time_ms:.1f}ms")  # Processing time: 142.3ms
print(f"Confidence: {result.confidence_score:.2f}")           # Confidence: 0.78

# Batch-process ASR output
queries = ["資道", "次飯", "醬瓜"]
results = reranker.batch_rerank(queries)
for result in results:
    print(f"{result.query} → {result.candidates[0]}")
    # 資道 → 知道
    # 次飯 → 吃飯
    # 醬瓜 → 將瓜
```

### Advanced Configuration

```python
from src.phonetic_reranker import PhoneticReranker, RerankerConfig

# Custom configuration
config = RerankerConfig(
    l3_top_k=20,                        # Return the top 20 candidates
    enable_training_data_logging=True,  # Enable training data collection
    max_processing_time_ms=200.0        # Limit processing time to 200ms
)

reranker = PhoneticReranker(config)

# Process a query with data collection enabled
result = reranker.rerank("知道")

# Export the training data when the session ends
session_summary = reranker.finalize_session()
print(f"Collected {session_summary.total_queries} training cases")
```

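## 🧪 Testing and Validation

### Running the Full Test Suite

```bash
# Core component tests
python test_l1_consonant_filter.py        # L1 consonant (initial) filter tests
python test_l2_first_last_reranker.py     # L2 first/last reranking tests
python test_l3_full_phonetic.py           # L3 full phonetic ranking tests

# Integration tests
python test_l1_l2_integration.py          # L1+L2 integration tests
python test_full_pipeline.py              # Full three-layer tests

# Main API tests
python src/phonetic_reranker.py           # Main API functional tests

# Final deployment validation (89.5% pass rate)
python test_final_deployment.py           # Deployment readiness validation
```

### Usage Example

```bash
# Full usage example demo
python example_usage.py
```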

## 📝 Use Cases

### 1. ASR Error Correction
```python
# ASR post-processing (assumes `reranker` was created as in Basic Usage)
asr_errors = ["資道", "次飯", "醬瓜"]
for asr_output in asr_errors:
    result = reranker.rerank(asr_output)
    corrected = result.candidates[0]
    print(f"ASR correction: {asr_output} → {corrected}")
    # ASR correction: 資道 → 知道
    # ASR correction: 次飯 → 吃飯
    # ASR correction: 醬瓜 → 將瓜
```

### 2. Phonetically Similar Word Search
```python
# Find phonetically similar words
similar_words = reranker.get_similar_words(
    "知道",
    similarity_threshold=0.6,
    max_results=10
)
for sim_word in similar_words:
    print(f"{sim_word['word']}: {sim_word['similarity']:.2f}")
```

### 3. Batch Processing Service
```python
# Efficient batch processing (supports concurrency)
batch_queries = ["資道", "次飯", "醬瓜", "安全"] * 25  # 100 queries
batch_results = reranker.batch_rerank(batch_queries)

# Summarize the batch results
successful = [r for r in batch_results if not r.error]
# max(..., 1) avoids a ZeroDivisionError when every query failed
avg_time = sum(r.processing_time_ms for r in successful) / max(len(successful), 1)
print(f"Batch: {len(successful)}/{len(batch_queries)} succeeded")
print(f"Average processing time: {avg_time:.1f}ms")
```
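
How `batch_rerank` achieves concurrency internally is not documented here. As a rough illustration of the pattern, the sketch below fans queries out over a thread pool and records per-query errors instead of aborting the batch. `BatchItem`, `run_batch`, and `fake_rerank` are hypothetical names for this example only; threads help mainly when the per-query work is I/O-bound or releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BatchItem:
    query: str
    candidates: List[str]
    error: Optional[str] = None


def run_batch(rerank: Callable[[str], List[str]],
              queries: List[str],
              max_workers: int = 4) -> List[BatchItem]:
    """Apply `rerank` to every query concurrently, capturing failures per item."""

    def one(query: str) -> BatchItem:
        try:
            return BatchItem(query, rerank(query))
        except Exception as exc:  # keep the batch going if one query fails
            return BatchItem(query, [], error=str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(one, queries))


def fake_rerank(query: str) -> List[str]:
    """Stand-in for a real reranker so the sketch runs without TWGY installed."""
    if not query:
        raise ValueError("empty query")
    return [query]


results = run_batch(fake_rerank, ["資道", "", "次飯"])
for item in results:
    print(item.query, item.candidates, item.error)
```

Returning an `error` field per item mirrors the `r.error` check used in the batch example above, so one bad input does not discard the other results.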

## 🔧 Development

### Setup Development Environment
```bash
git clone https://github.com/yourusername/twgy
cd twgy
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev]"
```

### Running Tests
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=twgy

# Run performance tests
pytest -m performance

# Run specific test categories
pytest -m "not slow"
```
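
The `-m performance` and `-m "not slow"` selections depend on tests being tagged with `@pytest.mark.performance` and `@pytest.mark.slow`. pytest warns about markers it does not know, and rejects them under `--strict-markers`. Whether this project already registers them is not shown here, so the following `conftest.py` fragment is a sketch that assumes it does not; the marker descriptions are illustrative.

```python
# conftest.py (sketch): register the custom markers used by the commands above.
def pytest_configure(config):
    """pytest hook, called once at startup with the active Config object."""
    config.addinivalue_line(
        "markers", "performance: latency and throughput checks"
    )
    config.addinivalue_line(
        "markers", "slow: tests that take a long time to run"
    )
```

Markers can equally be declared under `[tool.pytest.ini_options]` in `pyproject.toml`; the hook form keeps the registration next to the test code.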

### Code Quality
```bash
# Format code
black twgy/

# Check style
flake8 twgy/

# Type checking
mypy twgy/
```

### Building Package
```bash
# Build distribution
python -m build

# Install locally
pip install dist/twgy-3.0.0-py3-none-any.whl
```

## 📊 Performance Benchmarks

### Processing Speed
- **Simple queries** (e.g., "知道"): ~50-100ms
- **Medium queries** (e.g., "語音辨識"): ~100-200ms
- **Complex queries** (e.g., compound terms): ~200-250ms
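
Latency figures like these depend on hardware and cache state, so it is worth measuring them locally. The helper below is a generic timing sketch, not part of TWGY: `measure_latency_ms` is a hypothetical name, and a stub function stands in for `reranker.rerank` so the example runs on its own.

```python
import math
import statistics
import time


def measure_latency_ms(fn, queries, repeats=3):
    """Time fn(query) for each query `repeats` times; report summary stats in ms."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest sample.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "n": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def stub_rerank(query):
    """Placeholder workload; swap in reranker.rerank to measure the real system."""
    return sorted(query)


stats = measure_latency_ms(stub_rerank, ["知道", "語音辨識"], repeats=5)
print(stats)
```

Reporting the median and a high percentile together separates typical latency from cold-cache outliers, which a single average would hide.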

### Memory Usage
- **Initial load**: ~100MB (dictionary + models)
- **With caches**: ~150MB (includes L1/L2/L3 caches)
- **Peak usage**: ~200MB (during batch processing)

### Accuracy Metrics
- **Exact match in top-5**: >95%
- **Phonetically similar in top-10**: >90%
- **Handles Taiwan variations**: >85%

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Areas for Contribution
- **Performance optimization**: Faster algorithms, better caching
- **Accuracy improvement**: Better phonetic models, more test cases
- **Language support**: Additional Chinese variants, multilingual support
- **Integration**: Web APIs, cloud deployment, ML pipeline integration

### Development Workflow
1. Fork the repository
2. Create a feature branch
3. Make changes with tests
4. Run quality checks
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📞 Support

- **Documentation**: [https://twgy.readthedocs.io/](https://twgy.readthedocs.io/)
- **Issues**: [GitHub Issues](https://github.com/yourusername/twgy/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yourusername/twgy/discussions)
- **Email**: twgy.dev@example.com

## 🙏 Acknowledgments

- **Dictionary Sources**: Various open Chinese dictionaries and corpora
- **Research**: Based on Taiwan Mandarin phonetic variation studies
- **DimSim**: Integration with DimSim similarity models
- **Community**: Contributors and users who provided feedback

## 🔄 Changelog

### v3.0.0 (Current)
- Complete rewrite with three-layer architecture
- DimSim integration for enhanced accuracy
- Comprehensive CLI interface
- Performance optimizations (<250ms processing)
- Training data collection capabilities
- Improved Taiwan Mandarin variation handling

### v2.x (Legacy)
- Basic phonetic similarity processing
- Limited dictionary coverage
- Single-layer processing

---

**Made with ❤️ for the Chinese NLP community**

*TWGY v3.0.0 - Empowering Chinese language processing with Taiwan Mandarin phonetic intelligence*

