Metadata-Version: 2.4
Name: wordextractor
Version: 0.2.0
Summary: Pipeline for extracting unregistered Korean blend words (혼성어) from corpora
Project-URL: Homepage, https://github.com/yourname/wordextractor
Project-URL: Repository, https://github.com/yourname/wordextractor
Author-email: Your Name <your@email.com>
License-Expression: MIT
License-File: LICENSE
Keywords: NLP,blend-word,corpus-linguistics,korean,혼성어
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: ahocorasick-rs>=0.22
Requires-Dist: click>=8.0
Requires-Dist: kiwipiepy>=0.18
Requires-Dist: openpyxl>=3.1
Requires-Dist: pandas>=2.0
Requires-Dist: polars>=0.20
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.60
Requires-Dist: xlrd>=2.0
Provides-Extra: all
Requires-Dist: beautifulsoup4>=4.12; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: python-dotenv>=1.0; extra == 'all'
Requires-Dist: selenium>=4.10; extra == 'all'
Provides-Extra: dev
Requires-Dist: beautifulsoup4>=4.12; extra == 'dev'
Requires-Dist: build; extra == 'dev'
Requires-Dist: openai>=1.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: python-dotenv>=1.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: selenium>=4.10; extra == 'dev'
Provides-Extra: llm
Requires-Dist: openai>=1.0; extra == 'llm'
Requires-Dist: python-dotenv>=1.0; extra == 'llm'
Provides-Extra: naver
Requires-Dist: beautifulsoup4>=4.12; extra == 'naver'
Requires-Dist: selenium>=4.10; extra == 'naver'
Description-Content-Type: text/markdown

# wordextractor

A pipeline for extracting unregistered Korean blend words (혼성어) from corpora.

## Installation

```bash
# Base install (steps 1-4)
pip install wordextractor

# With LLM annotation support (step 5)
pip install "wordextractor[llm]"

# With Naver search support (step 6)
pip install "wordextractor[naver]"

# Everything
pip install "wordextractor[all]"
```

## Pipeline overview

| Step | Description | Key dependencies |
|------|-------------|------------------|
| step1 | Extract n-gram patterns from registered blend words | pandas |
| step2 | Build an eojeol (word-form) frequency list from the corpus | polars |
| step3 | Pattern matching + dictionary filtering + morphological analysis | ahocorasick-rs, kiwipiepy |
| step4 | Extract usage examples from the corpus | polars, ahocorasick-rs |
| step5 | LLM-assisted blend-word judgment (OpenAI Batch API) | openai |
| step6 | Search Naver News for the earliest appearance date | selenium |

## Usage

### 1. Write a configuration file

Create a `config.yaml`. See [examples/config.yaml](examples/config.yaml) for a reference.
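As a rough sketch of what such a file might look like; the key names below (`wordlist_path`, `corpus_dir`, `output_dir`) are illustrative assumptions, not the actual schema — consult [examples/config.yaml](examples/config.yaml) for the authoritative keys:

```yaml
# Hypothetical config sketch; the real key names live in examples/config.yaml.
wordlist_path: data/wordlist.xlsx   # registered blend-word list (step 1 input)
corpus_dir: data/corpus             # directory of SC_YYYYMM.parquet files
output_dir: out                     # where each step writes its results
```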

### 2. Run via the CLI

```bash
# Run individual steps
wordextractor -c config.yaml step1
wordextractor -c config.yaml step2
wordextractor -c config.yaml step3

# Run the full pipeline
wordextractor -c config.yaml run-all

# Run only a sub-range of steps
wordextractor -c config.yaml run-all --start 3 --end 5

# Inspect the resolved configuration
wordextractor -c config.yaml show-config
```

### 3. Use the Python API

```python
from wordextractor import PipelineConfig
from wordextractor.steps.step1_extract_patterns import run as run_step1
from wordextractor.steps.step3_pattern_matching import run as run_step3

cfg = PipelineConfig.from_yaml("config.yaml")
run_step1(cfg)
run_step3(cfg)
```

## Required resources

- `wordlist.xlsx` — list of registered blend words (must contain the columns `혼성어(색인표제어)` and `음절 수`)
- Directory of Urimalsaem (우리말샘) XLS files (optional)
- Corpus Parquet files (named `SC_YYYYMM.parquet`)
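A minimal sketch of building a `wordlist.xlsx` with the two required columns, using pandas and openpyxl (both base dependencies). The two entries are illustrative well-known blends (라볶이 = 라면 + 떡볶이, 아점 = 아침 + 점심), not a real registered word list:

```python
# Sketch: write a minimal wordlist.xlsx with the columns wordextractor expects.
# The rows here are placeholder examples, not actual registry data.
import pandas as pd

df = pd.DataFrame(
    {
        "혼성어(색인표제어)": ["라볶이", "아점"],  # blend word (index headword)
        "음절 수": [3, 2],                        # syllable count
    }
)
df.to_excel("wordlist.xlsx", index=False)  # uses the openpyxl engine
```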

## License

MIT
