Metadata-Version: 2.4
Name: preparelist
Version: 0.1.0
Summary: Wordlist manipulation tools for non-ASCII characters
Author: kost
License: MIT
Project-URL: Homepage, https://github.com/kost/preparelist
Project-URL: Repository, https://github.com/kost/preparelist
Project-URL: Issues, https://github.com/kost/preparelist/issues
Keywords: wordlist,password,security,unicode,character-encoding
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# preparelist

Wordlist manipulation tools for non-ASCII characters, designed for security professionals and penetration testers who work with wordlists containing special characters in various character encodings.

## Features

- **splitlist**: Split wordlists into files with and without special characters
- **transformlist**: Transform special characters according to configurable rules
- Support for multiple character encodings (UTF-8, ISO-8859-2, CP852, etc.)
- Both command-line tools and Python library API
- Flexible character transformation rules via JSON configuration

## Installation

### From PyPI

```bash
pip install preparelist
```

### From source

```bash
git clone https://github.com/kost/preparelist
cd preparelist
pip install -e .
```

## Command-Line Usage

### splitlist

Split a wordlist into two files: one containing words with special characters and one without.

```bash
# Basic usage
splitlist -i wordlist.txt -s special.txt -n normal.txt

# With specific input encoding
splitlist -i wordlist.txt -s special.txt -n normal.txt --input-encoding iso-8859-2

# With verbose output
splitlist -i wordlist.txt -s special.txt -n normal.txt -v
```

**Options:**
- `-i, --input`: Input wordlist file (required)
- `-s, --special`: Output file for words with special characters (required)
- `-n, --normal`: Output file for words without special characters (required)
- `--input-encoding`: Input file character encoding (default: utf-8)
- `--output-encoding`: Output file character encoding (default: utf-8)
- `-v, --verbose`: Verbose output
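
To illustrate the idea behind the split, here is a minimal standalone sketch. It assumes that "special" means any character outside 7-bit ASCII; the actual criterion used by `splitlist` may differ. The names `contains_non_ascii` and `split_words` are hypothetical and are not part of the package.

```python
def contains_non_ascii(word):
    """Return True if any character falls outside the 7-bit ASCII range."""
    return any(ord(ch) > 127 for ch in word)


def split_words(words):
    """Partition words into (special, normal) lists, preserving input order."""
    special, normal = [], []
    for word in words:
        # Assumption: one non-ASCII character is enough to mark a word as special.
        (special if contains_non_ascii(word) else normal).append(word)
    return special, normal


special, normal = split_words(["hello", "Šime", "café", "admin"])
print("special:", special)
print("normal:", normal)
```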

### transformlist

Transform characters in a wordlist according to a configuration file.

```bash
# Basic usage
transformlist -i wordlist.txt -o output.txt -c transform_simple.json

# Case-insensitive transformations (applies to both cases)
transformlist -i wordlist.txt -o output.txt -c transform_phonetic.json --case-insensitive

# Only output lines where transformation occurred
transformlist -i wordlist.txt -o output.txt -c transform_to_unicode_digraphs.json --only-transformed

# With specific encodings
transformlist -i wordlist.txt -o output.txt -c config.json \
  --input-encoding iso-8859-2 --output-encoding ascii

# With verbose output
transformlist -i wordlist.txt -o output.txt -c config.json -v
```

**Options:**
- `-i, --input`: Input wordlist file (required)
- `-o, --output`: Output wordlist file (required)
- `-c, --config`: Transformation configuration file in JSON format (required)
- `--input-encoding`: Input file character encoding (default: utf-8)
- `--output-encoding`: Output file character encoding (default: utf-8)
- `--case-insensitive`: Apply transformations to both uppercase and lowercase
- `--handle-titlecase`: Generate titlecase variants for multi-character sequences (e.g., "nj" also matches "Nj")
- `--only-transformed`: Only output lines where transformation occurred
- `-v, --verbose`: Verbose output
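
As a rough illustration of how a rule table like the ones below can be applied to text, the following sketch replaces keys in a single pass and tries longer keys first. This is an assumption about one reasonable strategy, not a description of how `transformlist` is implemented; `apply_rules` is a hypothetical helper.

```python
import re


def apply_rules(text, rules):
    """Replace every occurrence of each key in `rules` with its value."""
    if not rules:
        return text
    # Try longer keys first so "dž" is matched before the single character "ž".
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass means replacement text is never rewritten by another rule.
    return pattern.sub(lambda match: rules[match.group(0)], text)


rules = {"Š": "S", "š": "s", "ć": "c", "ž": "z", "dž": "dz"}
print(apply_rules("džep Šarić", rules))
```

Matching in one pass avoids a subtle problem with chained `str.replace` calls, where the output of one rule can accidentally become the input of another.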

## Transformation Configuration Files

Transformation rules are defined in JSON files. The following example configurations are provided:

### transform_phonetic.json

Phonetic transformations that preserve sound:

```json
{
  "Š": "Sh",
  "š": "sh",
  "Đ": "Dj",
  "đ": "dj",
  "Č": "Ch",
  "č": "ch",
  "Ć": "Ch",
  "ć": "ch",
  "Ž": "Z",
  "ž": "z",
  "Dž": "Dz",
  "dž": "dz"
}
```

### transform_simple.json

Simple replacements that strip diacritics:

```json
{
  "Š": "S",
  "š": "s",
  "Đ": "D",
  "đ": "d",
  "Č": "C",
  "č": "c",
  "Ć": "C",
  "ć": "c",
  "Ž": "Z",
  "ž": "z",
  "Dž": "Dz",
  "dž": "dz"
}
```

### transform_to_unicode_digraphs.json

Transform two-letter digraphs to their single-character Unicode equivalents (explicit mapping for every case variant):

```json
{
  "NJ": "Ǌ",
  "Nj": "ǋ",
  "nj": "ǌ",
  "LJ": "Ǉ",
  "Lj": "ǈ",
  "lj": "ǉ",
  "DŽ": "Ǆ",
  "Dž": "ǅ",
  "dž": "ǆ"
}
```

### transform_from_unicode_digraphs.json

Simplified config for use with the `--handle-titlecase` flag (only lowercase keys are specified):

```json
{
  "nj": "ǌ",
  "lj": "ǉ",
  "dž": "ǆ"
}
```

When used with `--handle-titlecase`, this automatically handles titlecase variants like "Nj" → "ǋ".

### Titlecase Handling

The `--handle-titlecase` flag automatically generates titlecase variants for multi-character sequences. This is useful when you only want to specify lowercase mappings in your config file, and have the tool automatically handle titlecase forms.

**Example:**

```bash
# Config file only contains: "nj": "ǌ", "lj": "ǉ"
transformlist -i wordlist.txt -o output.txt \
  -c examples/transform_from_unicode_digraphs.json \
  --handle-titlecase
```

**Input:**
```
njujork
Njujork
Ljubljana
```

**Output:**
```
ǌujork    (matches "nj" from config)
ǋujork    (matches generated "Nj" → "ǋ" titlecase variant)
ǈubǉana   (matches generated "Lj" → "ǈ" titlecase variant, then "lj" from config)
```

**Note:** `--handle-titlecase` only generates titlecase variants (first char upper, rest lower). For full uppercase support, use `--case-insensitive` or specify all variants explicitly in your config.
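
The expansion described above can be sketched in a few lines. This is a hedged illustration of the idea, not the tool's actual code: `add_titlecase_variants` is a hypothetical name, and using `str.title()` on the replacement value is an assumption about how the titlecase target is chosen.

```python
def add_titlecase_variants(rules):
    """Return a copy of `rules` with titlecase variants of multi-character keys."""
    expanded = dict(rules)
    for key, value in rules.items():
        if len(key) < 2:
            continue  # single characters have no separate titlecase form here
        # First character upper, rest lower, as described in the note above.
        title_key = key[0].upper() + key[1:].lower()
        # str.title() uses Unicode titlecase mappings, so "ǌ" becomes "ǋ".
        # setdefault keeps any variant the config already defines explicitly.
        expanded.setdefault(title_key, value.title())
    return expanded


for key, value in add_titlecase_variants({"nj": "ǌ", "lj": "ǉ", "dž": "ǆ"}).items():
    print(f"{key} -> {value}")
```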

You can create your own configuration files with any character mappings you need.
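
For instance, a custom rule file could be generated from Python as sketched below. The sketch assumes the config is a flat JSON object mapping strings to strings, as in the examples above, and that it is stored as UTF-8; the file name `transform_custom.json` and the German and Spanish mappings are purely illustrative.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rules for illustration; any string-to-string mapping works.
rules = {"ß": "ss", "ä": "ae", "ö": "oe", "ü": "ue", "ñ": "n"}

# Written to a temporary directory here so the sketch leaves no files behind.
config_path = Path(tempfile.mkdtemp()) / "transform_custom.json"

# ensure_ascii=False keeps the characters readable instead of \uXXXX escapes.
config_path.write_text(
    json.dumps(rules, ensure_ascii=False, indent=2),
    encoding="utf-8",
)

# Read the file back to confirm it round-trips as a plain dict of strings.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(loaded == rules)
```

The resulting file can then be passed to `transformlist` with `-c`.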

## Python Library Usage

### Splitting wordlists

```python
from preparelist import split_wordlist

# Split wordlist
special_count, normal_count = split_wordlist(
    input_file='wordlist.txt',
    output_special='special.txt',
    output_normal='normal.txt',
    input_encoding='utf-8',
    output_encoding='utf-8'
)

print(f"Words with special chars: {special_count}")
print(f"Words without special chars: {normal_count}")
```

### Transforming wordlists

```python
from preparelist import load_transformation_config, transform_wordlist

# Load transformation rules
transformations = load_transformation_config('transform_simple.json')

# Transform wordlist
line_count = transform_wordlist(
    input_file='wordlist.txt',
    output_file='transformed.txt',
    transformations=transformations,
    input_encoding='utf-8',
    output_encoding='ascii',
    case_sensitive=False  # Apply to both cases
)

print(f"Processed {line_count} lines")
```

### Transforming individual text

```python
from preparelist import transform_text, load_transformation_config

# Load config
config = load_transformation_config('transform_phonetic.json')

# Transform text
original = "Željko Šarić"
transformed = transform_text(original, config, case_sensitive=False)
print(f"{original} -> {transformed}")
# Output: Željko Šarić -> Zeljko Sharich
```

### Checking for special characters

```python
from preparelist import has_special_chars

print(has_special_chars("hello"))     # False
print(has_special_chars("Šime"))      # True
print(has_special_chars("café"))      # True
```

## Supported Character Encodings

Common encodings include:
- `utf-8` (default)
- `iso-8859-1` (Latin-1)
- `iso-8859-2` (Latin-2, Central European)
- `cp852` (DOS Latin-2)
- `cp1250` (Windows Central European)
- `ascii` (US-ASCII, 7-bit)

For a complete list, see [Python's codec documentation](https://docs.python.org/3/library/codecs.html#standard-encodings).
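
Not every encoding can represent every character, which is why the choice of output encoding matters. The sketch below uses only Python's built-in codecs to check whether a word fits a given encoding and to re-encode raw bytes; `can_encode` and `reencode` are hypothetical helper names, not part of the package.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def reencode(raw, source_encoding, target_encoding):
    """Decode bytes from one encoding and encode them in another."""
    return raw.decode(source_encoding).encode(target_encoding)


word = "Šime"
for name in ("utf-8", "iso-8859-1", "iso-8859-2", "cp852", "cp1250", "ascii"):
    print(f"{name}: {can_encode(word, name)}")
```

Note that `Š` is absent from `iso-8859-1` and `ascii`, so words containing it need either a Central European encoding, UTF-8, or a transformation to plain letters first.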

## Use Cases

- **Password Cracking**: Transform wordlists to account for different character representations
- **Security Testing**: Generate variants of wordlists for comprehensive testing
- **Data Cleaning**: Normalize character encodings in text files
- **Localization**: Adapt wordlists for different locales and character sets

## Examples

### Example 1: Processing a Croatian wordlist

```bash
# Split into special and normal
splitlist -i croatian_words.txt -s croatian_special.txt -n croatian_normal.txt

# Transform special characters to phonetic equivalents
transformlist -i croatian_special.txt -o croatian_phonetic.txt \
  -c examples/transform_phonetic.json --case-insensitive
```

### Example 2: Converting DOS encoding to UTF-8

```bash
# Read CP852 input, apply the simple replacement rules, and write UTF-8 output
transformlist -i dos_wordlist.txt -o utf8_wordlist.txt \
  -c examples/transform_simple.json \
  --input-encoding cp852 --output-encoding utf-8
```

### Example 3: Library usage for batch processing

```python
import preparelist
from pathlib import Path

# Load transformation config once
config = preparelist.load_transformation_config('transform_simple.json')

# Process multiple files
wordlists = Path('wordlists').glob('*.txt')
for wordlist in wordlists:
    output = f"transformed_{wordlist.name}"
    preparelist.transform_wordlist(
        str(wordlist),
        output,
        config,
        case_sensitive=False
    )
    print(f"Processed {wordlist.name} -> {output}")
```

### Example 4: Filtering wordlists with --only-transformed

The `--only-transformed` flag is useful for extracting only entries that contain specific characters:

```bash
# Extract only entries with ASCII digraphs (NJ, Nj, nj, LJ, Lj, lj, etc.)
transformlist -i mixed_wordlist.txt -o digraph_entries.txt \
  -c examples/transform_to_unicode_digraphs.json --only-transformed

# Result: only words like "Njujork", "Ljubljana" are in output,
# words like "password", "admin" are skipped
```

**Input (`mixed_wordlist.txt`):**
```
password
Njujork
admin
Ljubljana
test123
```

**Output (`digraph_entries.txt`):**
```
ǋujork
ǈubǉana
```

This is particularly useful for:
- Identifying entries with specific character patterns
- Creating filtered wordlists for targeted testing
- Extracting names or terms from a specific language
- Quality control and validation
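
The filtering behaviour can be pictured as "keep a line only if transforming it changed something". The following sketch shows that idea under stated assumptions: `only_transformed` and `to_digraphs` are hypothetical names, and the chained `str.replace` calls stand in for whatever replacement logic the tool actually uses.

```python
def only_transformed(words, transform):
    """Yield the transformed form of each word that the transform changed."""
    for word in words:
        result = transform(word)
        if result != word:
            yield result


# Case-sensitive subset of the digraph rules, for illustration only.
DIGRAPHS = {"Nj": "ǋ", "nj": "ǌ", "Lj": "ǈ", "lj": "ǉ"}


def to_digraphs(word):
    """Apply each digraph rule in turn (a simplification of real rule handling)."""
    for old, new in DIGRAPHS.items():
        word = word.replace(old, new)
    return word


wordlist = ["password", "Njujork", "admin", "Ljubljana", "test123"]
for entry in only_transformed(wordlist, to_digraphs):
    print(entry)
```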

## Development

### Running tests

```bash
pip install -e ".[dev]"
pytest
```

### Building the package

```bash
python -m build
```

## License

MIT License - see LICENSE file for details

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Author

kost - https://github.com/kost

## Links

- GitHub: https://github.com/kost/preparelist
- PyPI: https://pypi.org/project/preparelist/
- Issues: https://github.com/kost/preparelist/issues
