Metadata-Version: 2.2
Name: kotok
Version: 1.0.0
Summary: Korean morphological analyzer based on the BERT architecture
Home-page: http://github.com/Daeun271/kotok
Author: Daeun Jung
Author-email: Daeun.Jung@ruhr-uni-bochum.de
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: transformers>=4.48.0
Requires-Dist: torch>=2.6.0
Requires-Dist: accelerate>=1.3.0
Requires-Dist: tensorboard>=2.18.0
Requires-Dist: kiwipiepy>=0.20.0
Requires-Dist: tqdm>=4.67.0
Requires-Dist: requests>=2.32.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# kotok

kotok is a Korean morphological analysis tool based on the BERT architecture. It is able to lemmatize and POS-tag Korean sentences. Furthermore, it can detect and correct spacing as well as spelling errors in Korean text.

## Features

kotok has the following features:
1. Correct spacing errors
1. Correct spelling errors
1. Split Korean text into morphemes
1. Assign POS tags to morphemes (using the Sejong POS tag set)
1. Lemmatize morphemes

## Implementation details

### Spacing error detection and correction
All code related to spacing is located in the `kotok/spacing` directory.

Spacing errors are detected by a BERT model that has been fine-tuned for token classification. The model is trained on a dataset of Korean text with simulated spacing errors and predicts, for each token, whether a space is missing or superfluous.
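
As a rough illustration of what "simulated spacing errors" can look like, the sketch below flips the spacing in front of randomly chosen characters and records a label for each character. The function name, the label names (`ok`, `missing_space`, `extra_space`) and the character-level granularity are assumptions made for this example; the actual data generation in `kotok/spacing` works on tokenizer tokens and may differ.

```python
import random


def simulate_spacing_errors(
    text: str, error_rate: float, rng: random.Random
) -> tuple[str, list[str]]:
    """Corrupt the spacing of `text` and return (corrupted text, per-character labels).

    Hypothetical label scheme: each non-space character is labeled according to
    what happened to the gap directly in front of it.
    """
    chars: list[str] = []
    gold_space_before: list[bool] = []
    for word in text.split():
        for j, ch in enumerate(word):
            chars.append(ch)
            # A space precedes the first character of every word but the first.
            gold_space_before.append(j == 0 and len(chars) > 1)

    out: list[str] = []
    labels: list[str] = []
    for i, ch in enumerate(chars):
        has_space = gold_space_before[i]
        label = "ok"
        # The very first character has no gap in front of it.
        if i > 0 and rng.random() < error_rate:
            has_space = not has_space
            label = "extra_space" if has_space else "missing_space"
        if has_space:
            out.append(" ")
        out.append(ch)
        labels.append(label)
    return "".join(out), labels


corrupted, labels = simulate_spacing_errors("나는 학교에 간다", 0.3, random.Random(0))
print(corrupted, labels)
```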

Spacing errors are corrected by inserting or removing spaces between tokens based on the predictions of the spacing error detection model. All spacing possibilities are considered and the one that achieves the lowest error score using the spacing error detection model is chosen to be the correct spacing variant. See `kotok/spacing/inference.py` for the implementation.
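
The selection step can be sketched as follows. This is a minimal illustration, not the code in `kotok/spacing/inference.py`: it enumerates spacing variants at the character level and takes the scoring function as a parameter. `spacing_variants`, `correct_spacing` and the vocabulary-based `toy_error_score` are hypothetical stand-ins; in kotok the score comes from the spacing error detection model.

```python
from itertools import product
from typing import Callable, Iterator


def spacing_variants(text: str) -> Iterator[str]:
    """Yield every way of placing or omitting a space between adjacent characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return
    for gaps in product((False, True), repeat=len(chars) - 1):
        out = [chars[0]]
        for has_space, ch in zip(gaps, chars[1:]):
            if has_space:
                out.append(" ")
            out.append(ch)
        yield "".join(out)


def correct_spacing(text: str, error_score: Callable[[str], float]) -> str:
    """Return the spacing variant with the lowest error score."""
    return min(spacing_variants(text), key=error_score, default=text)


# Toy scorer (assumption): counts chunks that are not in a tiny vocabulary.
VOCAB = {"나는", "학교에", "간다"}


def toy_error_score(variant: str) -> float:
    return float(sum(chunk not in VOCAB for chunk in variant.split(" ")))


print(correct_spacing("나는학교에 간 다", toy_error_score))  # 나는 학교에 간다
```

Exhaustive enumeration grows as 2^(n-1) with the number of characters, so this sketch is only practical for short inputs.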

### Spelling error detection and correction
All code related to spelling is located in the `kotok/error` directory.

Spelling errors are detected by a BERT model that has been fine-tuned for token classification. The model is trained on a dataset of Korean text with simulated spelling errors and predicts, for each token, whether it is misspelled.

The simulated spelling errors are generated by the `TypoTransformer` class in `kotok/error/typo.py`, which produces likely misspellings based on common Korean typo patterns.
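
To give a feel for pattern-based typo generation, the sketch below decomposes Hangul syllables into their jamo indices using the standard Unicode arithmetic and swaps the vowels ㅐ and ㅔ, which are frequently confused in writing. It is not the `TypoTransformer` implementation: the `VOWEL_SWAPS` table and the `typo_variants` function are assumptions chosen for illustration, and the real class presumably covers many more patterns.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable, 가
HANGUL_LAST = 0xD7A3  # code point of the last precomposed syllable, 힣
NUM_VOWELS = 21
NUM_FINALS = 28  # 27 final consonants plus "no final consonant"

# Hypothetical confusion table, keyed by vowel index: ㅐ (1) <-> ㅔ (5).
VOWEL_SWAPS = {1: 5, 5: 1}


def decompose(syllable: str) -> tuple[int, int, int]:
    """Split a precomposed syllable into (initial, vowel, final) indices."""
    code = ord(syllable) - HANGUL_BASE
    initial, rest = divmod(code, NUM_VOWELS * NUM_FINALS)
    vowel, final = divmod(rest, NUM_FINALS)
    return initial, vowel, final


def compose(initial: int, vowel: int, final: int) -> str:
    """Inverse of decompose()."""
    return chr(HANGUL_BASE + (initial * NUM_VOWELS + vowel) * NUM_FINALS + final)


def typo_variants(word: str) -> list[str]:
    """Return variants of `word` with one confusable vowel swapped."""
    variants = []
    for i, ch in enumerate(word):
        if not HANGUL_BASE <= ord(ch) <= HANGUL_LAST:
            continue
        initial, vowel, final = decompose(ch)
        if vowel in VOWEL_SWAPS:
            swapped = compose(initial, VOWEL_SWAPS[vowel], final)
            variants.append(word[:i] + swapped + word[i + 1:])
    return variants


print(typo_variants("내게"))  # ['네게', '내개']
```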

Spelling errors are corrected by replacing the misspelled token with the token corrections generated by the `TypoTransformer` class. The token correction with the highest probability of being the correct spelling is chosen as the corrected token. See `kotok/error/inference.py` for the implementation.
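
A minimal sketch of the candidate selection, under the assumption that a callable returns the probability that the token at a given position is spelled correctly. `correct_token`, `prob_correct` and the word-list-based `toy_prob_correct` are hypothetical names for this example; keeping the original token when no candidate scores higher is a choice made here, not necessarily what `kotok/error/inference.py` does.

```python
from typing import Callable, Sequence


def correct_token(
    tokens: Sequence[str],
    index: int,
    candidates: Sequence[str],
    prob_correct: Callable[[Sequence[str], int], float],
) -> str:
    """Return the candidate most likely to be the correct spelling at `index`."""
    best = tokens[index]
    best_prob = prob_correct(tokens, index)
    for candidate in candidates:
        trial = list(tokens)
        trial[index] = candidate
        prob = prob_correct(trial, index)
        if prob > best_prob:
            best, best_prob = candidate, prob
    return best


# Toy probability function (assumption): known words are probably correct.
KNOWN_WORDS = {"오늘", "날씨가", "좋다"}


def toy_prob_correct(tokens: Sequence[str], index: int) -> float:
    return 0.9 if tokens[index] in KNOWN_WORDS else 0.1


sentence = ["오늘", "날시가", "좋다"]
print(correct_token(sentence, 1, ["날씨가", "날시거"], toy_prob_correct))  # 날씨가
```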

### Morpheme splitting, POS-tagging and lemmatization
All code related to morpheme splitting, POS-tagging and lemmatization is located in the `kotok` directory.

Morpheme splitting and POS-tagging are performed by fine-tuning a BERT model for token classification. Training data is generated by tokenizing plain text files with [Kiwi](https://github.com/bab2min/Kiwi/tree/main).

The resulting morphemes are lemmatized (=stemmed) by a rule-based lemmatizer. The lemmatizer is based on the Korean stemming system by [Yomitan](https://github.com/yomidevs/yomitan/blob/master/ext/js/language/ko/korean-transforms.js/), a web-browser dictionary.

## Installation for development

#### Create and activate a virtual environment:

Linux and macOS:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

Windows (PowerShell):
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
```

#### Install the required packages:
```bash
pip install -r requirements.txt
```

## Train the classification models

To run kotok, the classification models need to be trained. If not using pre-trained model files, follow the instructions below to generate the models from scratch.

### Acquire training data

The default training data set can be downloaded by running the following command:
```bash
python -m kotok data_dl
```

If a custom training data set is to be used, place plain text files into the `data/txt` directory. The directory is recursively searched for all `.txt` files.

### Train the classification models

Choose a BERT-based tokenizer model to fine-tune for the three classification tasks. Specify the model name or path with the `-m` option in all of the following commands. The best results have been observed with the [`klue/bert-base`](https://huggingface.co/klue/bert-base) model.

#### Train the spacing error classification model
```bash
# Simulate spacing errors in the training data and label them
python -m kotok.spacing data -m <tokenizer model name or path>

# Train the spacing error classification model
python -m kotok.spacing train -m <tokenizer model name or path> -o <output model directory>
```

#### Train the spelling error classification model
```bash
# Simulate spelling errors in the training data and label them
python -m kotok.error data -m <tokenizer model name or path>

# Train the spelling error classification model
python -m kotok.error train -m <tokenizer model name or path> -o <output model directory>
```

#### Train the POS-tagging and lemmatization model
```bash
# Prepare training and validation data
python -m kotok data -m <tokenizer model name or path>

# Train the POS-tagging and lemmatization model
python -m kotok train -m <tokenizer model name or path> -o <output model directory>
```

## Run kotok as a command line tool

Run the following command to start the command line interface, allowing for the input of Korean text to be analyzed:
```bash
python -m kotok inference \
    -m <pos tokenizer model name or path> -cm <fine-tuned pos classification model directory> \
    -em <error tokenizer model name or path> -ecm <fine-tuned error classification model directory> \
    -sm <spacing tokenizer model name or path> -scm <fine-tuned spacing error classification model directory>
```

### User dictionary

User dictionary entries are stored in tsv (=tab-separated values) files with the following format:
```
<word> <pos tag>
```
If words should be tagged even if they only begin with the specified word, the following format should be used. This is useful for specifying words that conjugate, such as verbs.
```
<word>* <pos tag>
```
If the POS tag should be enforced, i.e. the user dictionary entry should only be applied if the POS tag makes sense in the context, the following format should be used:
```
<word> <pos tag>!
```
If the word makes sense in place of multiple POS tags, the following format should be used. This is especially useful for POS tags that are closely related, such as NNG and NNP.
```
<word> <pos tag>!<check pos tag1>,<check pos tag2>,...
```

To enable the user dictionary, the `-u` option should be used with the path to the user dictionary file or directory. If a directory is specified, all tsv files in the directory are loaded recursively.
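
The entry formats above can be read with a few lines of Python. The sketch below is not kotok's loader: the `UserDictEntry` dataclass, its field names and `parse_entry` are assumptions for illustration, and the word and tag columns are assumed to be separated by a single tab character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserDictEntry:
    word: str
    pos: str
    prefix_match: bool = False      # "<word>*": also match words starting with it
    check_context: bool = False     # "<pos tag>!": apply only if the POS fits
    check_tags: tuple[str, ...] = ()  # tags listed after "!"


def parse_entry(line: str) -> UserDictEntry:
    """Parse one tab-separated user dictionary line."""
    word, tag_field = line.rstrip("\n").split("\t")
    prefix_match = word.endswith("*")
    if prefix_match:
        word = word[:-1]
    pos, bang, checks = tag_field.partition("!")
    check_tags = tuple(t.strip() for t in checks.split(",") if t.strip())
    return UserDictEntry(word, pos, prefix_match, bool(bang), check_tags)


for raw in ["카카오\tNNP", "먹*\tVV", "서울\tNNP!", "한강\tNNP!NNG,NNP"]:
    print(parse_entry(raw))
```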

### Further options
Further command line options can be found by running `python -m kotok inference --help`.

## Use kotok as a library

To use kotok as a library, the `Analyzer` class can be imported and used as follows:

```python
from kotok import Analyzer

analyzer = Analyzer(
    model="<pos tokenizer model name or path>",
    classification_model="<fine-tuned pos classification model directory>",
    error_model="<error tokenizer model name or path>",
    error_classification_model="<fine-tuned error classification model directory>",
    spacing_model="<spacing tokenizer model name or path>",
    spacing_classification_model="<fine-tuned spacing error classification model directory>",
    lemma_data="<lemmatization data directory>",
)

result = analyzer.analyze("안녕하세요.")
print(result) # [안녕/NNG, 하/XSA, 세요/EF, ./SF]
```

Detailed information on the `Analyzer` class can be found by checking the docstrings of the class.

## License

kotok is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) file for more information.
