Metadata-Version: 2.4
Name: gec-datasets
Version: 0.2.0
Summary: A tools to easily load Grammatical Error Correction datasets.
Author: gotuiyan
License-File: LICENSE
Requires-Python: >=3.11.0
Requires-Dist: gdown>=5.2.0
Requires-Dist: gecommon>=0.1.0
Requires-Dist: requests>=2.32.3
Description-Content-Type: text/markdown

# gec-datasets

This library is to handle datasets of Grammatical Error Correction.

# Install

```sh
pip install gec-datasets
```


# Usage

### API
```python
from gec_datasets import GECDatasets
gec = GECDatasets(
    base_path='gec_datasets_base/'
)
conll14 = gec.load('conll14')

assert conll14.srcs is not None
assert conll14.refs is not None
# The number of sentences is 1312.
assert len(conll14.srcs) == 1312
# CoNLL-2014 contains two official references.
assert len(conll14.refs) == 2
# Each reference also contains 1312 sentences.
assert len(conll14.refs[0]) == 1312
assert len(conll14.refs[1]) == 1312
```

Available ids can be found by:
```python
import gec_datasets
print(gec_datasets.available())
```

### CLI
You can specify multiple ids of the data you want to download in the `--ids` field.

```sh
gecdatasets-download --base_path "gec_datasets_base/" --ids conll14 bea19-dev
```

Available ids can be found by:
```sh
gecdatasets-available
```


In both API and CLI, datasets will be stored under `base_path=`.  
The first time it is downloaded automatically, and thereafter it is loaded from the saved files.

When you call `gec.load('sample')`, gec-datasets simply refers to `<base_path>/'sample'/{src.txt|ref0.txt|...}`.

```
gec_datasets_base/
├── conll14
│   ├── ref0.txt
│   ├── ref1.txt
│   └── src.txt
├── bea19-dev
│   ├── ref0.txt
│   ├── src.txt
├── bea19-test
│   └── src.txt
...
```

# Supported datasets

### Public datasets

|ID `.load(ID)`|Description|
|:--|:--|
|'conll13'|CoNLL-2013 test set [[Ng+ 2013]](https://aclanthology.org/W13-3601/).|
|'conll14'|CoNLL-2014 test set [[Ng+ 2014]](https://aclanthology.org/W14-1701/).|
|'jfleg-test'|JFLEG test set [[Napoles+ 2017]](https://aclanthology.org/E17-2037/).|
|'jfleg-dev'|JFLEG development set [[Napoles+ 2017]](https://aclanthology.org/E17-2037/).|
|'fce-test'|FCE test set [[Yannakoudakis+ 2011]](https://aclanthology.org/P11-1019/)|
|'fce-dev'|FCE development set [[Yannakoudakis+ 2011]](https://aclanthology.org/P11-1019/).|
|'fce-train'|FCE training set [[Yannakoudakis+ 2011]](https://aclanthology.org/P11-1019/).|
|'cweb-g-test'|CWEB-G test set [[Flachs+ 2020]](https://aclanthology.org/2020.emnlp-main.680/).|
|'cweb-g-dev'|CWEB-G development set [[Flachs+ 2020]](https://aclanthology.org/2020.emnlp-main.680/).|
|'cweb-s-test'|CWEB-S test set [[Flachs+ 2020]](https://aclanthology.org/2020.emnlp-main.680/).|
|'cweb-s-dev'|CWEB-S development set [[Flachs+ 2020]](https://aclanthology.org/2020.emnlp-main.680/).|
|'bea19-test'|BEA-2019 shared task test set [[Bryant+ 2019]](https://aclanthology.org/W19-4406/).|
|'bea19-dev'|BEA-2019 shared task development set [[Bryant+ 2019]](https://aclanthology.org/W19-4406/). It contains only source sentences.|
|'wi-locness-train'|W&I+LOCNESS training set [[Yannakoudakis+ 2018]](https://www.cl.cam.ac.uk/~hy260/WI-cefr.pdf).|

The following is synthetic data.

|ID `.load(ID)`|Description|
|:--|:--|
|'troy-1bw-train'|Synthetic data based on the One Billion Words Benchmark for distillation [[Tarnavskyi,+ 2022]](https://aclanthology.org/2022.acl-long.266/).|
|'troy-1bw-dev'|Another split of the synthetic data based on the One Billion Words Benchmark for distillation [[Tarnavskyi,+ 2022]](https://aclanthology.org/2022.acl-long.266/).|
|'troy-blogs-train'|Synthetic data based on the Blog Authorship Corpus for distillation [[Tarnavskyi,+ 2022]](https://aclanthology.org/2022.acl-long.266/).|
|'troy-blogs-dev'|Another split of the synthetic data based on the Blog Authorship Corpus for distillation [[Tarnavskyi,+ 2022]](https://aclanthology.org/2022.acl-long.266/).|
|'pie-synthetic-a1'|Synthetic data based on the One Billion Words Benchmark [[Awasthi+ 19]](https://aclanthology.org/D19-1435/). You can also specify `a2`, `a3`, `a4`, and `a5`. [This attachment](https://aclanthology.org/attachments/D19-1435.Attachment.pdf) describes how to make synthetic errors.|

### Non-public datasets

|ID `.load(ID)`|Description|
|:--|:--|
|'nucle-train'|NUCLE training set. [[Dahlmeier+ 2013]](https://aclanthology.org/W13-1703/)|
|'lang8-train'|Lang-8 training set. [[Mizumoto+ 2012]](https://aclanthology.org/C12-2084/) [[Tajiri+ 2012]](https://aclanthology.org/P12-2039/)|

### nucle-train

- Request data from [HERE](https://www.cl.cam.ac.uk/research/nl/bea2019st/).
- You will receive an email with release3.3.tar.bz2 attached.
- `mkdir <base_path>/nucle/` and put the data as ` <base_path>/nucle/release3.3.tar.bz2`.
- You can now use the data with `.load("nucle-train")`. The data will be extracted automatically.

### lang8-train

- Request data from [HERE](https://www.cl.cam.ac.uk/research/nl/bea2019st/).
- You will receive an email titled "[NAIST Lang-8 Corpus of Learner English for the 14th BEA Shared Task]".
- `mkdir <base_path>/lang8/` and put the data as `<base_path>/lang8/lang8.bea19.tar.gz`.
- You can now use the data with `.load("lang8-train")`. The data will be extracted automatically.
