Metadata-Version: 2.4
Name: dialup
Version: 1.0.1
Summary: DialUp! Generating linguistically plausible artificial dialects; preprocessing low-resource language inputs.
Author-email: Niyati Bafna <niyatibafna13@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/niyatibafna/dialup/tree/master/dialup_pkg
Project-URL: Issues, https://github.com/niyatibafna/dialup/issues
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scipy
Dynamic: license-file

# DialUp

This package contains code for the noising and denoising techniques introduced in [DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models](https://arxiv.org/abs/2501.16581) and [Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization](https://aclanthology.org/2024.emnlp-main.1044/).

## Install
```
pip3 install dialup
```

## Noising
### Introduction
This code generates synthetic dialectal data from text in a given language.
It does so by applying linguistically motivated augmentation (noising) that simulates dialectal variation to source-language text.
We provide several kinds of noisers, introduced briefly below and described in detail in our papers.

### Noisers

- **Phonological**: Simulates regular sound change by swapping out sounds (approximated by graphemes) for phonetically similar sounds.
- **Morphological**: Noises word suffixes.
- **Lexical**: Noises function and content words separately. Content words are swapped out for non-words generated by a chargram model; function words are noised using a high dial of phonological noise.
- **Random char**: Makes random character substitutions.
- **Random word**: Makes random word substitutions.

These can also be applied in composition; a toy sketch of a single noising step is given below.
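The sketch is purely illustrative: the character map and the default theta are invented, and this is not the package's implementation.

```
import random

# Toy sketch of phonologically-inspired noising: with probability theta,
# swap a character for a hand-picked "similar" one. The map and theta
# are invented for illustration; they are not the package's values.
SIMILAR = {"e": "i", "o": "u", "b": "p", "d": "t", "s": "z"}

def toy_phon_noise(text, theta=0.07):
    return "".join(
        SIMILAR[ch] if ch in SIMILAR and random.random() < theta else ch
        for ch in text
    )
```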

### Example

To create artificial dialectal versions of your text, you will need monolingual data in your language; this is used to train a character 3-gram model that serves as a component of the noiser. Ideally, it should contain at least a few thousand sentences.
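For intuition, a character 3-gram model at its simplest just counts overlapping character triples in a corpus. Here is a minimal sketch of that idea; it is illustrative only, since the package trains its own chargram model internally from `text_file`:

```
from collections import Counter

# Minimal sketch of character 3-gram counting over a corpus file.
# Illustrative only: the package builds its chargram model for you.
def count_chargrams(path, n=3):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    return counts
```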

You can optionally configure your own composition of the above noisers (see below).

Here is an example for noising some Italian text:

```
>>> from dialup import Noiser, print_languages_with_inbuilt_noising_support
>>> print_languages_with_inbuilt_noising_support()
Supported languages for artificial related dialect / variant generation:  ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra']
For any other language, you can include support by following the steps here: https://github.com/niyatibafna/dialup/tree/master/mtd/generating_artificial_dialects.
>>> noiser_ita = Noiser(lang="ita", noiser_params=None, text_file="WikiMatrix.en-it.it") # noiser_params not set; using default parameters
Character set: {'Ì', 'Q', 'v', 'H', 'n', 'P', 'Î', 'î', 'Ó', 'å', 'C', 'O', 'Û', 'ø', 'J', 'd', 'j', 'M', 'z', 'Z', 'ð', 'ò', 'û', 'A', 'Ü', 'Õ', 'ï', 'Þ', 'u', 'V', 'K', 'Ð', 'À', 'ù', 'I', 'à', 'R', 'Ç', 'Ñ', 'Í', 'Ý', 'L', 'È', 'é', '×', 'ñ', 'D', 'æ', 'ä', 'T', 'á', 'ö', 'ý', 'Ä', 'S', 'x', 'õ', 'Ò', '÷', 'Ô', 'í', 'ß', 'p', 'Â', 'E', 'ü', 'w', 'k', 'Æ', 'ã', 'e', 'f', 't', 'y', 'ô', 'ú', 'Y', 'W', 'ó', 'Ø', 'o', 'i', 'F', 'Ù', 'â', 'g', 'Å', 'B', 'Ú', 'ç', 'ê', 'm', 'Á', 'l', 'ì', 'b', 'N', 'þ', 'É', 'q', 'è', 'Ê', 'h', 'Ë', 'c', 's', 'a', 'Ï', 'ÿ', 'U', 'X', 'Ö', 'r', 'Ã', 'G', 'ë'}
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 650576
Training chargram model with chargram length 3...
Finished training chargram model with chargram length 3!
Initializing vocabulary from WikiMatrix.en-it.it...
Finished initializing vocabulary from WikiMatrix.en-it.it!
Length of vocab: 879046
Skipping random_char_aug as all thetas are 0
Skipping random_word_aug as all thetas are 0
>>> text = "È importante rendere la traduzione automatica robusta alla variazione dialettale."
>>> noised_input = noiser_ita.apply_noise(text)
>>> noised_input
'E importomte renderi li traduziune automatica robuzta alja varieziine dialettale.'
```
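Once constructed, the same noiser can be reused across a whole corpus. For instance, to noise a file with one sentence per line (the file names here are hypothetical):

```
# Noise a corpus line by line with the noiser constructed above.
# "corpus.it.txt" is a hypothetical input file, one sentence per line.
with open("corpus.it.txt", encoding="utf-8") as fin, \
     open("corpus.it.noised.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(noiser_ita.apply_noise(line.strip()) + "\n")
```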

### Custom parametrization
You can use your own parametrization of the noisers by passing a config like the one below to `noiser_params`:

```
lang = "ita"
text_file = "WikiMatrix.en-it.it"
params = {
    "lexical_aug": {
        "lang": lang,
        "theta_content_global": 0.001,
        "theta_func_global": 0.8,
        "text_file": text_file
    },
    "morph_aug": {
        "lang": lang,
        "theta_morph_global": 0.5,
        "text_file": text_file
    },
    "phonological_aug": {
        "lang": lang,
        "theta_phon": 0.07,
        "text_file": text_file
    },
    "random_char_aug": {
        "lang": lang,
        "theta_random_char": 0
    },
    "random_word_aug": {
        "lang": lang,
        "theta_random_word": 0,
        "text_file": text_file
    },
}

noiser_ita = Noiser(lang=lang, noiser_params=params, text_file=text_file)
```
The example config given here is the default parametrization, used if none is passed.
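For example, to simulate a more distant dialect, you could start from this default configuration and turn up the dials. The values below are hypothetical; see the papers for how the theta parameters behave.

```
import copy

# Hypothetical heavier-noise variant, derived from the default config above.
far_params = copy.deepcopy(params)
far_params["morph_aug"]["theta_morph_global"] = 0.8
far_params["phonological_aug"]["theta_phon"] = 0.15

noiser_far = Noiser(lang=lang, noiser_params=far_params, text_file=text_file)
```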


## Denoising

Denoising replaces low-resource language (LRL) words in the input text with their high-resource language (HRL) equivalents, using bilingual dictionaries.
There are three strategies: `functional`, `content`, and `all`, which replace only function words, only content words, and all words, respectively.
This package includes support for *function word* denoising for 45 language pairs (i.e. no need to pass your own lexicon). 
The package can also be used with `hrl` set to any of `['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra']` and `lrl` set to any other language, provided you supply a bilingual lexicon.
Note that your lexicon can include both function and content words; the strategy you use determines which class of words is replaced in your LRL text. Make sure your lexicon has suitable coverage for your application scenario!
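Conceptually, denoising is dictionary replacement: each input word found in the lexicon is swapped for its highest-confidence translation. Here is a toy sketch of that idea, ignoring the word-class filtering that `strategy` performs; it is not the package's actual implementation:

```
# Toy sketch of dictionary-based denoising: replace every lexicon-covered
# word with its highest-confidence translation. Not the package's code.
def toy_denoise(text, lexicon):
    out = []
    for word in text.split():
        translations = lexicon.get(word.lower(), {})
        out.append(max(translations, key=translations.get) if translations else word)
    return " ".join(out)
```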

### Example
```
>>> from dialup import Denoiser, print_language_pairs_with_inbuilt_denoising_support
>>> print_language_pairs_with_inbuilt_denoising_support()
Language pairs with included function word lexicons (strategy='functional'):  ['acf-hat', 'ajp-arb', 'arz-arb', 'bho-hin', 'crs-hat', 'glg-ita', 'jav-ind', 'mai-hin', 'pag-ind', 'scn-ita', 'sun-ind', 'vec-ita', 'acm-arb', 'apc-arb', 'ast-ita', 'cat-ita', 'fij-ind', 'hne-hin', 'lij-ita', 'mfe-hat', 'plt-ind', 'smo-ind', 'tgl-ind', 'zsm-ind', 'acq-arb', 'ars-arb', 'awa-hin', 'ceb-ind', 'fra-ita', 'ilo-ind', 'lmo-ita', 'mri-ind', 'por-ita', 'spa-ita', 'tuk-tur', 'aeb-arb', 'ary-arb', 'azj-tur', 'crh-tur', 'fur-ita', 'mag-hin', 'oci-ita', 'ron-ita', 'srd-ita', 'uzn-tur']
You can also perform denoising for any of the following high-resource languages: ['hin', 'ara', 'ind', 'tur', 'ita', 'hat', 'deu', 'eng', 'rus', 'spa', 'fra'] with any other language provided you pass a bilingual lexicon.
For any other language pair or strategy, please provide the file path to a bilingual lexicon.
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy="functional") # use the included lexicon for this pair, OR
>>> denoiser = Denoiser(lrl="cat", hrl="ita", strategy="functional", bilingual_lexicon_path="cat-ita.json") # use your own lexicon
>>> text = "És important fer que la traducció automàtica sigui robusta a la variació dialectal."
>>> denoised_input = denoiser.denoise(text)
>>> denoised_input
'Sta important fer che lo traducció automàtica in robusta a i variació dialectal.'
```

### Bilingual lexicon format

Pass in a filepath to a JSON file that looks like this:

```
{
    <word in LRL>: {
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        <translated word in HRL>: <confidence score>,
        ...
    }, ...
}
```
Confidence scores are optional; if present, the translation with the highest confidence is picked.
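For instance, a purely illustrative fragment of a Catalan-Italian lexicon, with made-up entries and confidence scores, might look like:

```
{
    "és": {"sta": 0.9, "è": 0.1},
    "que": {"che": 1.0},
    "la": {"lo": 0.6, "la": 0.4}
}
```
With this fragment, `és` would be replaced by `sta`, its highest-scoring translation.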


# Cite

If you use our code, please cite:

```
@inproceedings{bafna-etal-2024-evaluating,
  title = "Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization",
  author = "Bafna, Niyati and Murray, Kenton and Yarowsky, David",
  editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, Florida, USA",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.emnlp-main.1044/",
  doi = "10.18653/v1/2024.emnlp-main.1044",
  pages = "18742--18762"
}

@article{bafna2025dialup,
  title = {DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models},
  author = {Bafna, Niyati and Chang, Emily and Robinson, Nathaniel R and Mortensen, David R and Murray, Kenton and Yarowsky, David and Sirin, Hale},
  journal = {arXiv preprint arXiv:2501.16581},
  year = {2025}
}
```
(The DialUp paper was accepted at ACL 2025.)

Contributors: Niyati Bafna, Emily Chang
