Metadata-Version: 2.4
Name: charex-numba
Version: 0.5.2
Summary: Numba overloads for NumPy string operations
License-Expression: BSD-2-Clause
Project-URL: Homepage, https://github.com/nmehran/charex
Project-URL: Source, https://github.com/nmehran/charex
Project-URL: Issues, https://github.com/nmehran/charex/issues
Keywords: numba,numpy,strings,stringdtype,np.char,np.strings
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Compilers
Requires-Python: <3.15,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numba<0.66,>=0.65.1
Requires-Dist: numpy<2.5,>=1.22
Provides-Extra: test
Requires-Dist: pytest>=8; extra == "test"
Provides-Extra: bench
Requires-Dist: matplotlib>=3.8; extra == "bench"
Dynamic: license-file

# charex

Use NumPy string functions inside Numba-compiled code.

`charex` lets `@njit` functions call common NumPy string operations such as
comparisons, `find`, `startswith`, `endswith`, `str_len`, and string predicates.
It works with fixed-width NumPy string arrays and, on NumPy 2.x, variable-width
`StringDType` arrays.

## Installation

The PyPI distribution is named `charex-numba`; the import package is `charex`:

```bash
python -m pip install charex-numba
```

## Quick Start

```python
import charex
import numpy as np
from numba import njit


@njit
def count_long(values, min_length):
    # charex enables this NumPy string operation inside nopython mode.
    lengths = np.strings.str_len(values)
    return np.count_nonzero(lengths >= min_length)
```

On NumPy 1.x, use the same pattern with `np.char` and fixed-width `S` or `U`
arrays.

## Behavior

NumPy behavior is the contract. Supported operations aim to match NumPy's return
values, output shapes, dtypes, exception behavior, broadcasting, and input
immutability.

## Supported Operations

Comparisons:

- `equal`
- `not_equal`
- `greater`
- `greater_equal`
- `less`
- `less_equal`

Occurrence and search:

- `count`
- `startswith`
- `endswith`
- `find`
- `rfind`
- `index`
- `rindex`

Information and predicates:

- `str_len`
- `isalpha`
- `isalnum`
- `isdigit`
- `isdecimal`
- `isnumeric`
- `isspace`
- `islower`
- `isupper`
- `istitle`

Additional `np.char` operation:

- `compare_chararrays`

## Supported APIs And Dtypes

- `np.char` fixed-width byte strings: `S`
- `np.char` fixed-width Unicode strings: `U`
- `np.strings` fixed-width byte strings: `S`
- `np.strings` fixed-width Unicode strings: `U`
- `np.strings` variable-width Unicode strings: `StringDType`

NumPy stores `S` and `U` arrays in fixed-size records. `StringDType` is
variable-width and stores string payloads separately from the array's fixed-size
metadata records. `charex` supports both storage models.
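
The difference between the two storage models is visible from plain NumPy. The sketch below is illustrative only: it uses no `charex` API, and it looks `StringDType` up defensively because that dtype exists on NumPy 2.x only.

```python
import numpy as np

# Fixed-width arrays: every element occupies the same number of bytes.
fixed_bytes = np.array([b"a", b"abc"])    # dtype S3, 1 byte per character
fixed_unicode = np.array(["a", "abc"])    # dtype U3, 4 bytes per code point

print(fixed_bytes.dtype, fixed_bytes.dtype.itemsize)
print(fixed_unicode.dtype, fixed_unicode.dtype.itemsize)

# StringDType is NumPy 2.x only, so fetch it without assuming it exists.
StringDType = getattr(getattr(np, "dtypes", None), "StringDType", None)
if StringDType is not None:
    variable = np.array(["a", "a much longer string"], dtype=StringDType())
    # itemsize is the size of the per-element metadata record,
    # not the length of the longest string.
    print(variable.dtype, variable.dtype.itemsize)
```

For `S` and `U` arrays the item size grows with the longest element, while for `StringDType` it stays constant regardless of payload length.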

The native `charex._stringdtype` extension is required for `StringDType` support
and is built when the package is installed.

## Shapes

Supported inputs include scalars, 0-D arrays, 1-D arrays, N-D arrays, and
broadcast-compatible shapes for both fixed-width `S`/`U` and variable-width
`StringDType`. Array inputs may be contiguous, read-only, positively strided,
negatively strided, zero-stride, or empty views.
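
As a rough illustration of those layouts, the plain-NumPy sketch below builds one view of each kind and passes it to `np.char.str_len`. It shows the NumPy reference behavior only; the variable names are illustrative and no `@njit` function is involved.

```python
import numpy as np

base = np.array(["a", "bb", "ccc"])

reversed_view = base[::-1]                   # negative stride
every_other = base[::2]                      # positive, non-unit stride
repeated = np.broadcast_to(base[:1], (4,))   # zero stride, read-only
empty = base[:0]                             # empty view
zero_dim = np.array("dddd")                  # 0-D array

for view in (reversed_view, every_other, repeated, empty, zero_dim):
    # Print the strides next to the per-element lengths NumPy reports.
    print(view.strides, np.char.str_len(view))
```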

`StringDType()` and `StringDType(na_object=...)` variants are supported for the
listed `np.strings` operations. Null handling follows NumPy's behavior for each
operation.

`np.char` and `np.strings` are not treated as aliases. For example, `np.char`
comparison semantics strip trailing whitespace/NULs, while `np.strings`
comparison semantics do not.
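
The difference can be reproduced in plain NumPy. This is a minimal sketch of the reference behavior, with no `charex` call involved; the `np.strings` branch is guarded because that namespace exists on NumPy 2.x only.

```python
import numpy as np

left = np.array(["abc "])   # note the trailing space
right = np.array(["abc"])

# np.char comparisons strip trailing whitespace before comparing.
char_result = np.char.equal(left, right)
print("np.char.equal:", char_result)

# np.strings comparisons keep the trailing space significant.
if hasattr(np, "strings"):
    strings_result = np.strings.equal(left, right)
    print("np.strings.equal:", strings_result)
```

The two namespaces give different answers for the same inputs, which is why each needs its own overload.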

Byte inputs to Unicode-only predicates such as `isdecimal` and `isnumeric`
follow NumPy and raise unsupported-loop errors.
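
A short plain-NumPy sketch of that reference behavior follows. The exact exception class and message differ between NumPy versions, so the example catches only `TypeError`, which is assumed here to be the common base class.

```python
import numpy as np

digits_unicode = np.array(["123", "12a"])
digits_bytes = np.array([b"123", b"12a"])

# Unicode input is accepted.
unicode_result = np.char.isdecimal(digits_unicode)
print(unicode_result)

# Byte input is rejected because isdecimal is defined for Unicode only.
error = None
try:
    np.char.isdecimal(digits_bytes)
except TypeError as exc:
    error = exc
    print(type(exc).__name__, exc)
```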

## Not Supported

`charex` does not yet implement transformation or output-producing string
operations such as replace, case conversion, strip, pad, join, split, encode, or
decode. Object arrays and object-scalar bridges are also outside the current
nopython string path.

## Performance

On the current Numba 0.65.1 benchmark matrix, `charex` runs `1.02x` to `6.51x`
as fast as NumPy across 135 fixed-width and `StringDType` cases, with a median
speedup of `1.60x`.

Benchmark artifacts are in
[docs/benchmarks/numba-v-0.65.1](docs/benchmarks/numba-v-0.65.1/).

### Comparison Operators

![comparison-operators-bytes.png](docs/benchmarks/numba-v-0.65.1/comparison-operators-bytes.png)
![comparison-operators-strings.png](docs/benchmarks/numba-v-0.65.1/comparison-operators-strings.png)
![stringdtype-comparison.png](docs/benchmarks/numba-v-0.65.1/stringdtype-comparison.png)

### Occurrence Information

![char-occurrence-bytes.png](docs/benchmarks/numba-v-0.65.1/char-occurrence-bytes.png)
![char-occurrence-strings.png](docs/benchmarks/numba-v-0.65.1/char-occurrence-strings.png)
![stringdtype-occurrence.png](docs/benchmarks/numba-v-0.65.1/stringdtype-occurrence.png)

### Property Information

![char-properties-bytes.png](docs/benchmarks/numba-v-0.65.1/char-properties-bytes.png)
![char-properties-strings.png](docs/benchmarks/numba-v-0.65.1/char-properties-strings.png)
![char-numerics-strings.png](docs/benchmarks/numba-v-0.65.1/char-numerics-strings.png)
![stringdtype-properties.png](docs/benchmarks/numba-v-0.65.1/stringdtype-properties.png)
![stringdtype-numerics.png](docs/benchmarks/numba-v-0.65.1/stringdtype-numerics.png)

The previous Numba 0.59 matrix is archived under
[benchmarks/numba-v-0.59](benchmarks/numba-v-0.59/).

## Compatibility

`charex` targets Numba 0.65.1 and the NumPy ranges tested by that Numba release:

- Python `>=3.10,<3.15`
- Numba `>=0.65.1,<0.66`
- NumPy `>=1.22,<1.27` or `>=2.0,<2.5`
- llvmlite `0.47.x`
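
To check a version string against these ranges before installing, something like the following standard-library sketch may help. The helper names `parse_version` and `in_ranges` are hypothetical and not part of `charex`. The parser reads only the leading numeric components, so pre-release suffixes are ignored.

```python
import re

# Ranges copied from the list above as (inclusive lower, exclusive upper).
NUMPY_RANGES = [((1, 22), (1, 27)), ((2, 0), (2, 5))]
NUMBA_RANGES = [((0, 65, 1), (0, 66))]


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def in_ranges(version, ranges):
    """True if lower <= version < upper for any (lower, upper) pair."""
    parsed = parse_version(version)
    return any(lower <= parsed < upper for lower, upper in ranges)


print(in_ranges("2.1.3", NUMPY_RANGES))
print(in_ranges("1.21.6", NUMPY_RANGES))
print(in_ranges("0.65.1", NUMBA_RANGES))
print(in_ranges("0.66.0", NUMBA_RANGES))
```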

`np.strings` is available on NumPy 2.x only. On NumPy 1.x, `charex` registers the
`np.char` overloads and skips `np.strings`.
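
Code that must run on both NumPy major versions can pick the namespace at import time. The sketch below is plain NumPy and deliberately omits `@njit`: whether a module-level alias such as the hypothetical `string_api` resolves to the `charex` overloads inside compiled code is an assumption that would need to be verified.

```python
import numpy as np

# np.strings was added in NumPy 2.0; fall back to np.char on NumPy 1.x.
string_api = np.strings if hasattr(np, "strings") else np.char


def count_long_reference(values, min_length):
    # Plain-NumPy counterpart of the Quick Start function, without @njit.
    lengths = string_api.str_len(values)
    return np.count_nonzero(lengths >= min_length)


sample = np.array(["a", "abcd", "abcdef"])
print(count_long_reference(sample, 4))
```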

## Development

Install test dependencies:

```bash
python -m pip install -e ".[test]"
```

Run tests:

```bash
pytest -q
```

Run the representative behavior audit:

```bash
python docs/exploration/string_array_shape_audit.py --methods representative --api all --dtype all
```

Run the benchmark smoke test:

```bash
python benchmarks/benchmark.py --size 50000 --repeat 5
```

Install benchmark plotting dependencies and write CSV/PNG output:

```bash
python -m pip install -e ".[bench]"
python benchmarks/benchmark.py --size 50000 --repeat 5 --plot
```

Regenerate the full benchmark matrix from the repository root:

```bash
python -m pip install -e ".[bench]"
# Use a fresh NUMBA_CACHE_DIR for release matrices
CACHE_DIR=$(mktemp -d /tmp/charex-numba-cache.XXXXXX)
NUMBA_CACHE_DIR="$CACHE_DIR" PYTHONPATH=. \
  python benchmarks/matrix.py --size 250000 --repeat 15
```

CI runs Python 3.10-3.14 across representative NumPy 1.x and 2.x jobs with
Numba 0.65.1. The benchmark matrix above was generated on Python 3.12.8,
NumPy 2.4.6, Numba 0.65.1, and llvmlite 0.47.0.
