Metadata-Version: 2.4
Name: fastnum
Version: 0.1.0
Summary: Zero-boilerplate converter from raw data (images, text, categories) to numeric NumPy arrays.
Project-URL: Repository, https://github.com/your-username/fastnum
Project-URL: Issues, https://github.com/your-username/fastnum/issues
License: MIT
License-File: LICENSE
Keywords: encoding,image,ml,numpy,one-hot,preprocessing,tokenizer
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.9
Requires-Dist: numpy>=1.24
Requires-Dist: opencv-python>=4.8
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# FastNum

> Zero-boilerplate conversion from raw data to numeric NumPy arrays.

FastNum detects the kind of data you hand it — an image path, a sentence, a category list, or a batch of any of these — and returns the right numeric representation without a single line of configuration.

---

## Why FastNum?

Most ML preprocessing pipelines repeat the same four patterns:

| Input | Desired output |
|---|---|
| Image file | Float32 pixel array normalised to `[0, 1]` |
| Text sentence | Integer token-ID sequence |
| Flat category list | One-hot matrix |
| Batch of sentences | Padded token-ID matrix |

FastNum collapses all four into one call: `fn.to_num(data)`.

---

## Installation

```bash
pip install fastnum
```

Or from source:

```bash
git clone https://github.com/your-username/fastnum.git
cd fastnum
pip install -e ".[dev]"
```

**Requirements:** Python ≥ 3.9, numpy ≥ 1.24, opencv-python ≥ 4.8.

---

## Quick start

```python
from fastnum import FastNum

fn = FastNum()

# --- Image -----------------------------------------------------------
pixels = fn.to_num("photo.jpg")          # (H, W, 3) float32, values in [0, 1]
pixels = fn.to_num("photo.jpg", image_size=(224, 224))  # resize on the fly

# Batch of images (all resized to the same shape for stacking)
batch = fn.to_num(["a.jpg", "b.jpg"], image_size=(224, 224))  # (2, 224, 224, 3)

# --- Plain text ------------------------------------------------------
tokens = fn.to_num("the cat sat on the mat")   # int32 array of token IDs
print(fn.decode(tokens))                        # → "the cat sat on the mat"

# --- Category list ---------------------------------------------------
labels = ["dog", "cat", "dog", "bird"]
one_hot = fn.to_num(labels)
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]], dtype=float32)

# --- Sentence batch --------------------------------------------------
matrix = fn.to_num(["hello world", "foo bar baz"])
# int32 matrix (2, 3), shorter rows are right-padded with pad_token_id

# --- Raw NumPy array -------------------------------------------------
import numpy as np
fn.to_num(np.array([1, 2, 3]))              # cast to float32, no-op otherwise
```

---

## API reference

### `FastNum(pad_token_id=0)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pad_token_id` | `int` | `0` | ID reserved for the `[PAD]` token. The special token is inserted into the vocabulary at construction time so real words are always assigned different IDs. |

### `to_num(data, image_size=None) → np.ndarray`

| Parameter | Type | Description |
|---|---|---|
| `data` | `str \| list[str] \| np.ndarray` | Input to convert. |
| `image_size` | `tuple[int, int] \| None` | Target `(H, W)` for image resizing. |

**Return type depends on input:**

| Input | dtype | Shape |
|---|---|---|
| Image path / list of paths | `float32` | `(H, W, C)` / `(N, H, W, C)` |
| Sentence | `int32` | `(T,)` |
| Category list | `float32` | `(N, num_classes)` |
| Sentence batch | `int32` | `(N, max_len)` |
| `np.ndarray` | `float32` | same as input |

### `decode(token_ids) → str`

Converts a token-ID sequence back to whitespace-separated text. Padding tokens are silently dropped.

### `vocab_size → int`

Number of entries currently in the vocabulary, including `[PAD]`.

---

## The `[PAD]` token and collision safety

`FastNum` reserves `pad_token_id` inside the vocabulary at construction time:

```python
self.vocab        = {"[PAD]": pad_token_id}
self.inverse_vocab = {pad_token_id: "[PAD]"}
```

Because `[PAD]` occupies a slot before any text is tokenised, `_get_or_add` assigns new words IDs equal to `len(self.vocab)`, which can never equal `pad_token_id` again.  This means:

* A padded cell in a token matrix will **never** decode to a real word.
* `decode()` does not need a special-case filter beyond `i != self.pad_token_id` — the two sets are disjoint by construction.

---

## Development

```bash
# Run tests with coverage
pytest

# Lint
ruff check fastnum

# Type-check
mypy fastnum
```

---

## License

MIT © your-username