Metadata-Version: 2.4
Name: translate_package
Version: 0.3.9
Summary: Contains functions and classes to efficiently train a sequence-to-sequence model to translate between two languages.
Author: Oumar Kane
Author-email: oumar.kane@univ-thies.sn
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: accelerate==1.12.0
Requires-Dist: torch==2.7.0
Requires-Dist: torchvision==0.22.0
Requires-Dist: spacy==3.8.11
Requires-Dist: nltk==3.9.2
Requires-Dist: gensim==4.4.0
Requires-Dist: furo==2025.12.19
Requires-Dist: streamlit==1.54.0
Requires-Dist: tokenizers==0.22.2
Requires-Dist: tensorboard==2.20.0
Requires-Dist: evaluate==0.4.6
Requires-Dist: transformers==5.0.0
Requires-Dist: pandas==2.3.3
Requires-Dist: scikit-learn
Requires-Dist: matplotlib==3.10.8
Requires-Dist: plotly==6.5.2
Requires-Dist: sacrebleu==2.6.0
Requires-Dist: nlpaug==1.1.11
Requires-Dist: wandb==0.24.2
Requires-Dist: pytorch-lightning==2.6.1
Requires-Dist: selenium==4.40.0
Requires-Dist: sentencepiece
Requires-Dist: peft==0.18.1
Requires-Dist: rouge-score==0.1.2
Requires-Dist: wolof-translate==0.0.6
Requires-Dist: numpy<2.0
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# translate-package

> A Python library for French-Wolof neural machine translation using Transformer-based models (T5, BART, NLLB) with Bayesian hyperparameter optimization, custom tokenization, and a novel bucketing-truncation sampling strategy for low-resource languages.

---

## Table of Contents

- [Overview](#overview)
- [Key Contributions](#key-contributions)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Workflow](#workflow)
  - [1. Train the Tokenizer](#1-train-the-tokenizer)
  - [2. Hyperparameter Tuning](#2-hyperparameter-tuning)
  - [3. Fine-Tuning](#3-fine-tuning)
  - [4. Testing](#4-testing)
  - [5. Save to Hugging Face Hub](#5-save-to-hugging-face-hub)
- [Supported Models](#supported-models)
- [Arguments Reference](#arguments-reference)
- [Results](#results)
- [Citation](#citation)
- [License](#license)

---

## Overview

`translate-package` implements the methodology described in the paper **"Advancing Wolof-French Sentence Translation"**, which presents a comparative study of Transformer-based models for translating between French and Wolof — a low-resource language spoken in Senegal.

The library provides a full pipeline covering tokenizer training, Bayesian hyperparameter optimization, model fine-tuning, and evaluation, with a novel **bucketing and truncation** sampling strategy that reflects sentence length distribution during fine-tuning.

---

## Key Contributions

- **Novel Bucketing + Truncation Sampling**: Groups sequences of similar length into buckets, which improves computational efficiency and keeps each batch coherent in length, and combines this with a per-language maximum length for truncation that is tuned as a hyperparameter. Together, the two steps make batches reflect the sentence length distribution during fine-tuning, which is particularly effective for low-resource language pairs.

- **Bayesian Hyperparameter Optimization**: Uses a Gaussian Process framework with an Upper Confidence Bound (UCB) acquisition function, optimizing the BLEU score as the objective metric over the hyperparameter search space.

- **Custom Tokenization**: Supports Byte Pair Encoding (BPE) for BART/LSTM and SentencePiece for T5, with vocabulary sizes adapted to the Wolof-French corpus.

- **Data Augmentation**: Character-level substitutions and swaps via `nlpaug` to compensate for the scarcity of Wolof parallel corpora.

- **Multi-model Support**: Fine-tune T5, BART, NLLB, or an LSTM baseline within the same pipeline.
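
The bucketing and truncation idea from the list above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: the function names, the `bucket_width` parameter, and the choice to bucket by source length are assumptions made for illustration. The `src_max_len=97` and `tgt_max_len=70` values are borrowed from the T5 fine-tuning example in this README.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[List[int], List[int]]  # (source token ids, target token ids)


def truncate(pair: Pair, src_max_len: int, tgt_max_len: int) -> Pair:
    """Cut each side of a pair to its own per-language maximum length."""
    src, tgt = pair
    return src[:src_max_len], tgt[:tgt_max_len]


def bucket_batches(
    pairs: Sequence[Pair], batch_size: int, bucket_width: int = 8, seed: int = 0
) -> List[List[int]]:
    """Return batches of example indices whose source lengths share a bucket."""
    rng = random.Random(seed)
    buckets: Dict[int, List[int]] = {}
    for idx, (src, _) in enumerate(pairs):
        # Hypothetical bucketing rule: integer-divide the source length.
        buckets.setdefault(len(src) // bucket_width, []).append(idx)

    batches: List[List[int]] = []
    for indices in buckets.values():
        rng.shuffle(indices)  # keep some randomness inside each bucket
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    rng.shuffle(batches)  # mix short and long batches across training steps
    return batches


# Toy corpus: the token ids are placeholders, only the lengths matter here.
corpus = [([1] * n, [2] * (n + 3)) for n in (3, 4, 5, 20, 22, 23, 120)]
corpus = [truncate(p, src_max_len=97, tgt_max_len=70) for p in corpus]
batches = bucket_batches(corpus, batch_size=2)
```

Each batch then holds sentences whose source lengths differ by less than `bucket_width` tokens, so little padding is needed, while the truncation step caps the longest pair at the per-language limits.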

---

## Installation

```bash
pip install translate-package
```

After installation, extract the workflow scripts into your working directory:

```bash
translate-init --output-dir ./my_experiment
```

This copies the following scripts locally so you can inspect and run them:

```
my_experiment/
├── translate_tokenizer.py
├── translate_hyperparameter_tuning.py
├── translate_finetuning.py
├── translate_test.py
└── save_to_hub.py
```

---

## Getting Started

All scripts use `argparse` and are run from the command line. The typical workflow follows four sequential steps:

```
Train Tokenizer → Hyperparameter Tuning → Fine-Tuning → Test
```

Optionally push the best model to the Hugging Face Hub with `save_to_hub.py`.

---

## Workflow

### 1. Train the Tokenizer

Train a custom tokenizer on your parallel corpus before fine-tuning. This step is only required for **T5** (SentencePiece) and **BART/LSTM** (BPE). NLLB uses its own built-in tokenizer and skips this step.

**BPE tokenizer (for BART / LSTM):**

```bash
python translate_tokenizer.py \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --vocab_size 15000 \
  --name bpe
```

**SentencePiece tokenizer (for T5):**

```bash
python translate_tokenizer.py \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --vocab_size 15000 \
  --name sp
```

| `--name` value | Tokenizer type | Used by |
|---|---|---|
| `bpe` | Byte Pair Encoding | BART, LSTM |
| `sp` | SentencePiece | T5 |
| *(none needed)* | Model's built-in tokenizer | NLLB |
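
To make the `bpe` option more concrete, here is a toy sketch of how Byte Pair Encoding learns merge rules, using only the standard library. It is a simplified illustration, not what `translate_tokenizer.py` runs: the function name and the `num_merges` parameter are hypothetical, and a real tokenizer also handles pre-tokenization, special tokens, and a vocabulary budget such as `--vocab_size`.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
```

A larger vocabulary budget generally allows more merges, so frequent words end up as single tokens while rare words are split into smaller pieces.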

---

### 2. Hyperparameter Tuning

Run Bayesian optimization to find the best learning rate, sequence lengths, augmentation probabilities, and other hyperparameters. Results are logged to Weights & Biases.

**T5 example (French → Wolof):**

```bash
python translate_hyperparameter_tuning.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --no-save_model \
  --save_artifact \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --min_lr 1e-4 --max_lr 1e-2 \
  --min_src_max_len 72 --max_src_max_len 111 \
  --min_tgt_max_len 67 --max_tgt_max_len 85 \
  --epochs 1 \
  --batch_size 64 \
  --max_words 21 \
  --min_nts 2245 --max_nts 2245 \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>
```

**NLLB example (bidirectional):**

```bash
python translate_hyperparameter_tuning.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --no-save_artifact \
  --no-save_model \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --min_lr 7e-6 --max_lr 1e-4 \
  --min_src_max_len 76 --max_src_max_len 117 \
  --min_tgt_max_len 76 --max_tgt_max_len 97 \
  --epochs 1 \
  --batch_size 64 \
  --max_words 21 \
  --min_nts 4485 --max_nts 4485 \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
```
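
The search performed in this step can be pictured with a small Gaussian Process plus Upper Confidence Bound loop over a single hyperparameter, the base-10 logarithm of the learning rate. This is a hedged sketch in pure Python, not the code in `translate_hyperparameter_tuning.py`: the kernel, its length scale, the `kappa` value, the candidate grid, and the `fake_bleu` objective (a stand-in for "train for one epoch and return BLEU") are all assumptions. Only the bounds, `1e-4` to `1e-2`, come from the T5 example above.

```python
import math
from typing import List, Sequence, Tuple


def rbf(a: float, b: float, length_scale: float = 0.5) -> float:
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix: List[List[float]], rhs: Sequence[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise: float = 1e-6) -> Tuple[float, float]:
    """Posterior mean and standard deviation of the GP at x_new."""
    y_mean = sum(ys) / len(ys)  # center targets so a zero-mean prior fits
    centered = [y - y_mean for y in ys]
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = y_mean + sum(k * w for k, w in zip(k_star, solve(K, centered)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))


def suggest(xs, ys, candidates, kappa: float = 2.0) -> float:
    """Pick the unseen candidate with the highest UCB = mean + kappa * std."""
    pool = [c for c in candidates if c not in xs]

    def ucb(x: float) -> float:
        mean, std = gp_posterior(xs, ys, x)
        return mean + kappa * std

    return max(pool, key=ucb)


def fake_bleu(log_lr: float) -> float:
    """Hypothetical objective; a real trial would train and score a model."""
    return 7.0 - 2.0 * (log_lr + 2.5) ** 2


grid = [-4.0 + 0.1 * i for i in range(21)]  # log10(lr) from 1e-4 to 1e-2
xs = [-4.0, -2.0]                           # two initial trials at the bounds
ys = [fake_bleu(x) for x in xs]
for _ in range(5):
    x_next = suggest(xs, ys, grid)
    xs.append(x_next)
    ys.append(fake_bleu(x_next))
best_lr = 10 ** xs[ys.index(max(ys))]
```

The `kappa` term controls the trade-off: a larger value favors candidates where the model is uncertain (exploration), a smaller one favors candidates with a high predicted score (exploitation).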

---

### 3. Fine-Tuning

Fine-tune the model using the best hyperparameters found in the previous step. If the tuning run saved an artifact (`--save_artifact`), pass its W&B path via `--artifact_path`.

**T5 example:**

```bash
python translate_finetuning.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --lr 0.006254861077618995 \
  --wd 0.004978258153761286 \
  --src_max_len 97 \
  --tgt_max_len 70 \
  --p_word 0.010406709461818209 \
  --p_char 0.9269797815116728 \
  --max_epochs 5 \
  --batch_size 64 \
  --max_words 21 \
  --nts 2245 \
  --metric bleu --mode max \
  --clean_ckpt_dir \
  --run_name t5-fine-tuning \
  --new_artifact_name machine_translation_best_model_t5_fr_wf \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>
```

**NLLB example (bidirectional):**

```bash
python translate_finetuning.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --lr 0.00009877589029418412 \
  --wd 0.003711074112216225 \
  --src_max_len 96 \
  --tgt_max_len 91 \
  --p_word 0.06650106426899016 \
  --p_char 0.7166733632090903 \
  --max_epochs 5 \
  --batch_size 64 \
  --max_words 21 \
  --nts 4485 \
  --metric bleu --mode max \
  --clean_ckpt_dir \
  --run_name nllb-fine-tuning-fr-wf-bid \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
```

---

### 4. Testing

Evaluate the best model checkpoint on the test set. Reports BLEU and ROUGE-L scores.

**T5 example:**

```bash
python translate_test.py \
  --model_generation t5 \
  --model_name google-t5/t5-base \
  --tokenizer_name sp \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --no-use_peft \
  --file_name sent_tokenizer \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --src_max_len 97 \
  --tgt_max_len 70 \
  --p_word 0.010406709461818209 \
  --p_char 0.9269797815116728 \
  --batch_size 64 \
  --max_words 21 \
  --run_name t5-test-fr-wf \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-t5-truncation \
  --key <YOUR_WANDB_KEY>
```

**NLLB example:**

```bash
python translate_test.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --use_bucketing \
  --use_truncation \
  --no-bidirectional \
  --bidirectional_tr \
  --no-use_peft \
  --data_path corpus35k.csv \
  --src_label french \
  --tgt_label wolof \
  --num_workers 3 \
  --src_max_len 96 \
  --tgt_max_len 91 \
  --p_word 0.06650106426899016 \
  --p_char 0.7166733632090903 \
  --batch_size 64 \
  --max_words 21 \
  --run_name nllb-test-fr-wf-bid \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY>
```
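
For readers unfamiliar with ROUGE-L, the following standard-library sketch shows the idea behind the score: the F-measure built on the longest common subsequence (LCS) of reference and prediction tokens. It is an illustration under simplifying assumptions (lowercasing and whitespace tokenization), not the implementation used by `translate_test.py` or by the `rouge-score` package, which applies its own tokenization. The function names and the French sentences are made up for the example.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """ROUGE-L F1 on a 0-1 scale, using whitespace tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("le chat dort sur le tapis", "le chat est sur le tapis")
```

Here five of the six tokens appear in the same order in both sentences, so precision and recall are both 5/6. The values in the Results tables use the same 0-1 scale.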

---

### 5. Save to Hugging Face Hub

Push your best fine-tuned model directly to the Hugging Face Hub:

```bash
python save_to_hub.py \
  --model_generation nllb \
  --model_name facebook/nllb-200-distilled-600M \
  --no-bidirectional \
  --no-use_peft \
  --src_label french \
  --tgt_label wolof \
  --run_name nllb-save-to-hub-fr-wf-bid \
  --artifact_path <YOUR_WANDB_ARTIFACT_PATH> \
  --project wolof-french-translation-p3-nllb-truncation-bid \
  --key <YOUR_WANDB_KEY> \
  --token <YOUR_HF_TOKEN> \
  --username <YOUR_HF_USERNAME> \
  --repo_name nllb_french_wolof_bilateral
```

---

## Supported Models

| Model | `--model_generation` | `--model_name` example |
|---|---|---|
| T5 | `t5` | `google-t5/t5-base`, `google-t5/t5-small` |
| BART | `bart` | `facebook/bart-base` |
| NLLB | `nllb` | `facebook/nllb-200-distilled-600M` |
| LSTM | `lstm` | — (trained from scratch) |

---

## Arguments Reference

### Shared arguments (all scripts)

| Argument | Type | Description |
|---|---|---|
| `--model_generation` | `str` | Model family: `t5`, `bart`, `nllb`, `lstm` |
| `--model_name` | `str` | Hugging Face model identifier |
| `--data_path` | `str` | Path to the parallel corpus CSV file |
| `--src_label` | `str` | Column name for source language (e.g. `french`) |
| `--tgt_label` | `str` | Column name for target language (e.g. `wolof`) |
| `--batch_size` | `int` | Batch size for training / evaluation |
| `--num_workers` | `int` | Number of DataLoader workers |
| `--use_bucketing` | `flag` | Enable bucketing sampling strategy |
| `--use_truncation` | `flag` | Enable truncation of sequences |
| `--bidirectional_tr` | `flag` | Train in both translation directions |
| `--use_peft` | `flag` | Enable parameter-efficient fine-tuning (PEFT) |
| `--project` | `str` | W&B project name |
| `--key` | `str` | W&B API key |
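
The examples in this README pass boolean flags in both a positive and a `--no-` form (`--use_bucketing`, `--no-use_peft`). One way to obtain this behavior with `argparse` is `BooleanOptionalAction`, available from Python 3.9. The sketch below shows the pattern on a subset of the shared arguments. Whether the packaged scripts are written this way is an assumption, and the defaults shown are illustrative only.

```python
import argparse

parser = argparse.ArgumentParser(
    description="Illustrative subset of the shared arguments"
)
parser.add_argument(
    "--model_generation", choices=["t5", "bart", "nllb", "lstm"], required=True
)
parser.add_argument("--batch_size", type=int, default=64)
for flag in ("use_bucketing", "use_truncation", "bidirectional_tr", "use_peft"):
    # BooleanOptionalAction registers both --<flag> and --no-<flag>.
    parser.add_argument(
        f"--{flag}", action=argparse.BooleanOptionalAction, default=False
    )

# Parse an explicit argument list instead of sys.argv for the demonstration.
args = parser.parse_args(
    ["--model_generation", "nllb", "--use_bucketing", "--use_truncation",
     "--no-use_peft"]
)
```

With this pattern, `args.use_bucketing` is `True`, `args.use_peft` is `False`, and a flag that is not passed, such as `--bidirectional_tr` here, keeps its default.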

### Hyperparameter tuning specific

| Argument | Type | Description |
|---|---|---|
| `--min_lr` / `--max_lr` | `float` | Learning rate search bounds |
| `--min_src_max_len` / `--max_src_max_len` | `int` | Source max length search bounds |
| `--min_tgt_max_len` / `--max_tgt_max_len` | `int` | Target max length search bounds |
| `--min_nts` / `--max_nts` | `int` | Number of training steps bounds |
| `--epochs` | `int` | Epochs per Bayesian trial |
| `--save_artifact` | `flag` | Save best trial model as W&B artifact |

### Fine-tuning specific

| Argument | Type | Description |
|---|---|---|
| `--lr` | `float` | Learning rate (from tuning step) |
| `--wd` | `float` | Weight decay |
| `--src_max_len` | `int` | Source sequence max length |
| `--tgt_max_len` | `int` | Target sequence max length |
| `--p_word` | `float` | Word-level augmentation probability |
| `--p_char` | `float` | Character-level augmentation probability |
| `--max_epochs` | `int` | Maximum training epochs |
| `--metric` | `str` | Metric to monitor (`bleu`, `loss`) |
| `--mode` | `str` | Optimization direction (`max` or `min`) |
| `--artifact_path` | `str` | W&B artifact path from tuning step |
| `--new_artifact_name` | `str` | Name for the saved fine-tuned model artifact |
| `--clean_ckpt_dir` | `flag` | Remove checkpoint directory after saving |
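
The effect of `--p_word` and `--p_char` can be approximated with a short standard-library sketch. The package relies on `nlpaug` for augmentation, so the code below is not its implementation. It assumes that `p_word` is the probability of selecting a word and `p_char` the probability of altering each character of a selected word. It also assumes an even split between swaps and substitutions, and lowercase ASCII replacement letters.

```python
import random
import string
from typing import Optional


def augment_word(word: str, p_char: float, rng: random.Random) -> str:
    """Swap or substitute characters of one word with probability p_char each."""
    chars = list(word)
    i = 0
    while i < len(chars):
        if rng.random() < p_char:
            if i + 1 < len(chars) and rng.random() < 0.5:
                # Swap with the right-hand neighbor, then skip it.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 1
            else:
                # Substitute with a random lowercase letter.
                chars[i] = rng.choice(string.ascii_lowercase)
        i += 1
    return "".join(chars)


def augment(
    sentence: str, p_word: float, p_char: float, seed: Optional[int] = None
) -> str:
    """Augment each whitespace-separated word with probability p_word."""
    rng = random.Random(seed)
    return " ".join(
        augment_word(word, p_char, rng) if rng.random() < p_word else word
        for word in sentence.split()
    )


sentence = "le chat dort sur le tapis"
untouched = augment(sentence, p_word=0.0, p_char=1.0, seed=0)
noisy = augment(sentence, p_word=1.0, p_char=1.0, seed=0)
```

Both operations preserve word length, so the augmented sentence keeps the same number of words and characters as the original while introducing spelling noise.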

---

## Results

All models were fine-tuned for 5 epochs on the French-Wolof parallel corpus using the bucketing-truncation sampling strategy and Bayesian hyperparameter optimization. Evaluation metrics include BLEU, ROUGE-1, ROUGE-2, and ROUGE-L.

### French → Wolof

| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Train Loss | Eval Loss |
|---|---|---|---|---|---|---|
| **NLLB-200-distilled-600M** | **17.19** | **0.4414** | **0.2236** | **0.4051** | 1.0484 | 1.9932 |
| T5-base | 7.38 | 0.2729 | 0.1010 | 0.2454 | 1.2241 | 3.2776 |
| BART* | — | — | — | — | — | — |
| LSTM* | — | — | — | — | — | — |

### Wolof → French

| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Train Loss | Eval Loss |
|---|---|---|---|---|---|---|
| **NLLB-200-distilled-600M** | **27.41** | **0.4833** | **0.3009** | **0.4518** | 0.7981 | 1.3009 |
| BART* | — | — | — | — | — | — |
| LSTM* | — | — | — | — | — | — |

### NLLB-200 Bidirectional (Fr↔Wf joint training)

| Metric | Value |
|---|---|
| BLEU | 18.81 |
| ROUGE-1 | 0.4076 |
| ROUGE-2 | 0.2208 |
| ROUGE-L | 0.3750 |
| Train Loss | 0.9842 |
| Eval Loss | 1.7990 |
| Global Step | 4,485 |

> \* BART and LSTM results will be added upon completion. T5 fr→wf results correspond to `global_step: 2,245` (5 epochs).

NLLB-200 outperforms T5 by a wide margin on French → Wolof (BLEU 17.19 vs. 7.38); the tables above report no T5 result for Wolof → French, so no comparison is available in that direction. NLLB-200 reaches a BLEU of 27.41 on Wolof → French, the highest score across all evaluated configurations. The bidirectional NLLB model, trained jointly on both directions, reaches a BLEU of 18.81 and serves both translation directions with a single model. The combined bucketing-truncation strategy, Bayesian optimization, and character-level augmentation were applied to all models; the tables above do not isolate the contribution of each.

---

## Citation

If you use this library or the methodology in your research, please cite:

```bibtex
@INPROCEEDINGS{10747017,
  author    = {Kane, Oumar and Bousso, Mamadou and Allaya, Mouhamad M. and Samb, Dame},
  booktitle = {2024 Fifth International Conference on Intelligent Data Science Technologies and Applications (IDSTA)},
  title     = {Advancing Wolof-French Sentence Translation: Comparative Analysis of Transformer-Based Models and Methodological Insights},
  year      = {2024},
  pages     = {145--152},
  doi       = {10.1109/IDSTA62194.2024.10747017}
}
```

---

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.
