Metadata-Version: 2.4
Name: quill-sort
Version: 6.0.5
Summary: Adaptive ultra-sort: auto-dispatching, correctness-first sorting that beats np.sort by a mile on the array fast path
Author: Isaiah Tucker
License: MIT
Project-URL: Homepage, https://github.com/invariant-games/quill
Project-URL: Repository, https://github.com/invariant-games/quill
Project-URL: Changelog, https://github.com/invariant-games/quill/blob/main/CHANGELOG.md
Keywords: sorting,sort,algorithm,radix sort,parallel sort,fast sort,high performance,numpy,gpu,cupy,polars
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: fast
Requires-Dist: numpy>=1.21; extra == "fast"
Requires-Dist: psutil>=5.9; extra == "fast"
Provides-Extra: polars
Requires-Dist: numpy>=1.21; extra == "polars"
Requires-Dist: polars>=0.20; extra == "polars"
Provides-Extra: gpu
Requires-Dist: numpy>=1.21; extra == "gpu"
Requires-Dist: cupy-cuda12x>=13; platform_system != "Darwin" and extra == "gpu"
Provides-Extra: pandas
Requires-Dist: numpy>=1.21; extra == "pandas"
Requires-Dist: pandas>=1.3; extra == "pandas"
Provides-Extra: all
Requires-Dist: numpy>=1.21; extra == "all"
Requires-Dist: pandas>=1.3; extra == "all"
Requires-Dist: psutil>=5.9; extra == "all"
Requires-Dist: polars>=0.20; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.1; extra == "dev"
Requires-Dist: hypothesis>=6.0; extra == "dev"
Dynamic: license-file

# quill-sort

**Adaptive Ultra-Sort** — Quill auto-dispatches to the *fastest correct* sort
for your data and your machine, and **never loses to the baseline**.

```python
import numpy as np
import quill

# The list API: a drop-in for sorted() / list.sort(), ~3-4x faster on numeric.
quill.quill_sort([3, 1, 4, 1, 5, 9])              # → [1, 1, 3, 4, 5, 9]
quill.quill_sort(records, key=lambda r: r['age']) # sort objects by a key

# The array API: the "by a mile" path — dispatches to Rust radix / GPU / polars.
a = np.random.randint(0, 2**40, 20_000_000)
quill.sort_array(a)                                # ~5x faster than np.sort
```

---

## The one thing to know: list API vs. array API

Quill has two doors, and which one you use determines how big the win is.

| API | Input / output | Speed (numeric, this machine) | Why |
|-----|----------------|-------------------------------|-----|
| `quill_sort(list)` | `list` in, `list` out | **matches `np.sort`**, ~**3.8x** vs `sorted()` / `list.sort()` | It does `np.asarray(...)` then `.tolist()` around a fast kernel. Those two conversions dominate the runtime (Amdahl's law), so the *kernel* speedup is mostly hidden — you get np.sort-class throughput, not the "by a mile" number. |
| `quill.sort_array(ndarray)` | `ndarray` in, `ndarray` out | **~5.3x int64, ~3.3x float64, ~4x GPU** vs `np.sort` | **Zero conversions.** It dispatches the raw buffer straight to the fastest installed backend. This is where Quill actually pulls away from numpy. |

**Plain version:** if your data starts and ends as a Python `list`, the
conversion wall caps the win — `quill_sort` is np.sort-fast and ~4x faster than
the built-in `sorted()`, which is already a great deal, but it is *not* 5x. If
you keep your data in numpy arrays and call `sort_array`, you get the full
multi-threaded / GPU speedup. Don't believe any library that claims a 5x
*list-in / list-out* sort — the conversion alone forbids it.

All numbers below are measured on this development box (Windows 11, 28-core CPU,
RTX 4060 Ti, numpy 2.4.6) with a freshly-shuffled array on every timed run, so
they reflect real cold-sort throughput, not warm/already-sorted artifacts. Your
mileage will vary with CPU, GPU, array size, and dtype.

---

## Installation

```bash
pip install quill-sort              # core — ZERO required deps, always correct
pip install quill-sort[fast]        # + numpy + psutil (the baseline fast path)
pip install quill-sort[polars]      # + polars  (no-compile multi-threaded sort)
pip install quill-sort[gpu]         # + cupy    (NVIDIA GPU sort)
pip install quill-sort[all]         # numpy + pandas + psutil + polars
```

The compiled Rust backend (the ~5x path) ships as a **pre-built binary wheel**
for common platforms — `pip install quill-sort` just gets it, no Rust toolchain
needed. If no wheel matches your platform, pip falls back to the pure-Python
sdist and Quill transparently uses the next-fastest backend (GPU → polars →
numpy → np.sort). **The install never fails for lack of a compiler.**

---

## How it picks (the dispatch chain)

`sort_array()` profiles the array and walks a priority chain, taking the first
backend that is installed *and* supports the array. Every step has a measured
crossover (`min_n`) below which it isn't worth it, and **any backend error
falls straight back to `np.sort`** — correctness is never at risk.

```
sort_array(ndarray)
        │
        ▼
  eligible?  (1-D, dtype kind i/u/f, itemsize ≤ 8, value-only, C-contiguous)
        │ no ──────────────────────────────────────────────► np.sort
        │ yes
        ▼
  dense bounded int64/uint64?  ──► counting sort (np.bincount)   ~1.7-2.8x, single-thread
        │ no
        ▼
  ┌──────────────────────────────────────────────────────────┐
  │  try, in priority order, the first available + supporting │
  │    rust_voracious   (compiled radix, n ≥ 1M)              │
  │    cupy_gpu         (GPU, n ≥ 2M, fits in free VRAM)      │
  │    polars           (no-compile MT sort, n ≥ 200k)        │
  │    numpy_parallel   (int-only partition sort, n ≥ 5M)     │
  └──────────────────────────────────────────────────────────┘
        │ any error / nothing eligible
        ▼
  np.sort   ← the floor; always correct, never slower than numpy
```

NaN is stripped before any backend runs and re-appended at the end (numpy
convention), so a backend that can't order NaN never sees one. Descending is a
post-sort reverse. `quill.available_backends()` shows what your machine will
actually use, in this order.

---

## Backends

| Backend | How to get it | Measured (int64, vs `np.sort`) | Notes |
|---------|---------------|--------------------------------|-------|
| `rust_voracious` | `pip install quill-sort` (bundled wheel) | **~5.3x** (float64 ~3.3x) | Compiled PyO3 + `voracious_radix_sort`, multi-threaded radix. Fastest measured path; leads the chain. |
| `cupy_gpu` | `pip install quill-sort[gpu]` | **~4x** (float64 ~2.5x) | GPU radix via CuPy. Counts the host→device→host round trip and still wins for large arrays that fit in VRAM. |
| `polars` | `pip install quill-sort[polars]` | **~2.3x** (float64 ~1.7x) | Delegates to polars' multi-threaded Rust sort. No compiler needed — the "fast for everyone" fallback. |
| `numpy_parallel` | `pip install quill-sort[fast]` | **~1.1x** (int only) | Thread-parallel `np.partition` sample-sort. Small but reliable int win; deliberately uses few workers (the gain saturates at memory bandwidth). |
| counting sort | `pip install quill-sort[fast]` | **~1.7-2.8x** | `np.bincount` for dense bounded int64/uint64. O(n+k), single-threaded, beats every comparison backend in its niche. |
| `np.sort` | always (with numpy) | 1.0x (the floor) | The never-lose baseline. Reached on any error or ineligible dtype. |

Without numpy at all, Quill still sorts correctly via the stdlib Timsort — it is
simply slower. Correctness has **zero** required dependencies.

---

## The array API — `sort_array()`

```python
import numpy as np
import quill

a = np.random.randint(0, 2**40, 20_000_000)

s = quill.sort_array(a)                      # sorted COPY (a untouched), fastest backend
quill.sort_array(a, inplace=True)            # sort a IN PLACE, returns a
d = quill.sort_array(a, descending=True)     # reverse order

quill.available_backends()
# → ['rust_voracious', 'cupy_gpu', 'polars', 'numpy_parallel']  (priority order)
```

`sort_array` matches `np.sort` exactly — including negatives, mixed int/float
promotion, and NaN-to-end — and is **never slower than `np.sort`** on any input.

### Top-k — `quill_topk()`

When you only need the `k` smallest (or largest), don't sort the whole array.
`quill_topk` uses numpy's `argpartition` (introselect, O(n)) instead of a full
O(n log n) sort:

```python
quill.quill_topk(scores, 10)                  # 10 smallest, sorted ascending
quill.quill_topk(scores, 10, largest=True)    # 10 largest,  sorted descending
quill.quill_topk(rows, 5, key=lambda r: r.size)
```

Measured ~8x faster than `sorted(data, reverse=True)[:10]` for k=10 over 10M
elements on this machine.

### `analyze()` — inspect the profile

```python
quill.analyze([3, 1, 4, 1, 5, 9])
# {'n': 6, 'dtype': 'int_pos', 'presorted': False, 'dense': True, ...}
```

---

## The list API — `quill_sort()` / `quill_sorted()`

A drop-in replacement for `sorted()` / `list.sort()` that routes numeric lists
through the fast numeric kernel and everything else through Timsort. It returns
a `list`, so it pays the conversion tax described above — expect np.sort-class
speed and ~3-4x over the built-ins on numeric data, *not* the array-API number.

```python
quill.quill_sort([3, 1, 4, 1, 5, 9])              # in-place, returns the list
quill.quill_sort(data, key=lambda x: x['score'])  # objects via key
quill.quill_sort(data, reverse=True)              # descending
quill.quill_sort(data, inplace=False)             # return a new list, leave input alone
quill.quill_sort(data, parallel=True)             # force multi-core
quill.quill_sort(data, stable=False)              # unstable, max speed on numeric
result = quill.quill_sorted(iterable)             # non-mutating, mirrors sorted()
```

```python
quill.quill_sort(
    data,                        # list, generator, range, ndarray, Series, DataFrame
    key=None,                    # sort key function (like sorted())
    reverse=False,               # descending order
    inplace=True,                # mutate in place (False → return a new list)
    parallel=False,              # use multiple cores (auto on big numeric data)
    high_performance_mode=False, # skip the interactive prompt on the external-sort path
    silent=False,                # suppress status output
    stable=True,                 # True = matches sorted() exactly; False = max speed
    stats=False,                 # return (sorted_list, stats_dict)
)
```

### `stable` parameter

`stable=True` (default) guarantees equal elements keep their original relative
order — identical to Python's `sorted()`. `stable=False` allows the
benchmark-optimal unstable kernels on numeric data for extra speed when order
among equal elements doesn't matter.

```python
quill.quill_sort(data)                 # stable, safe for all use cases
quill.quill_sort(data, stable=False)   # unstable, faster for large int32/int64
```

### `stats` — timing

```python
result, stats = quill.quill_sort(data, stats=True)
# stats = {'time_ms': 12.3, 'n': 1000000}
```

---

## The never-lose guarantee

Every fast path in Quill is wrapped so that **any** failure — a missing backend,
a GPU OOM, a compiled-extension panic, an unsupported dtype — falls back to
`np.sort` (or stdlib Timsort if numpy is absent). The Rust extension is built
with `panic = "unwind"`, so even a native panic becomes a catchable Python
exception and the process never aborts. You can never get a result that is wrong
or a sort that is slower than the numpy baseline.

```bash
# To debug a backend instead of silently falling back, surface the error:
QUILL_BACKEND_DEBUG=1 python your_script.py
```

---

## Correctness contract

Quill's output matches `sorted()` / `np.sort` on every supported input:

**None values** sort to the end (start with `reverse=True`):

```python
quill.quill_sort([3, None, 1, None, 2])
# → [1, 2, 3, None, None]
```

**NaN values** in float data go to the end (start with `reverse=True`),
matching numpy's convention:

```python
quill.quill_sort([3.0, float('nan'), 1.0, 2.0])
# → [1.0, 2.0, 3.0, nan]
```

**Mixed int/float** lists are promoted to float64 and take the fast float path
(not the slow object fallback):

```python
quill.quill_sort([1, 2.5, 3, 4.0])
# → [1, 2.5, 3, 4.0]
```

**Negative integers** are handled natively (two's-complement radix; no object
fallback), and keyed sorts are **stable** by default.

---

## Supported types

- `int`, `float`, `str`, `bytes` — native fast paths
- Mixed `int` + `float` — promoted to float64
- Negative integers — handled natively
- `None` values — sorted to the end (start with `reverse=True`)
- `float('nan')` — sorted to the end (start with `reverse=True`)
- `numpy.ndarray` — sorted via the backend chain (use `sort_array` for the full win)
- `pandas.Series` — sorted, returned as a new Series
- `pandas.DataFrame` — sorted by column(s) via `key='column_name'`
- Any generator / iterator — materialised to a list, sorted, returned

---

## Plugin system

Teach Quill to sort your own types. A plugin declares which types it `handles`
and a `prepare()` that converts them to something Quill sorts natively, plus an
optional `postprocess` to rebuild your objects.

```python
from quill import register_plugin, QuillPlugin

class MyPlugin(QuillPlugin):
    handles = (MyCustomClass,)
    name    = "my_custom_class"

    @staticmethod
    def prepare(data, key, reverse):
        items       = [x.value for x in data]
        postprocess = lambda sorted_vals: [MyCustomClass(v) for v in sorted_vals]
        return items, key, postprocess   # (list_to_sort, key_fn, postprocess_fn)

register_plugin(MyPlugin)
quill.quill_sort(list_of_my_objects)
```

Built-in plugins cover `numpy.ndarray`, `pandas.Series`, `pandas.DataFrame`,
`range`, and any generator/iterator. You can also drop in a custom *backend*
(not just a type plugin) with `quill._backends.register_backend(...)` if you
have a faster kernel for some dtype.

---

## CLI / demo / setup

```bash
python -m quill       # animated benchmark demo across data types
quill                 # same, if installed via pip
quill setup           # calibration wizard: measures the parallel/GPU crossovers
                      #   on YOUR box and persists them to ~/.quill/config.json
```

Quill also ships a small visualizer/wizard and add-on helpers layered on the
core engine (the calibration wizard above, profile inspection via `analyze()`,
and the `quill_topk` add-on). They are optional conveniences — the core sort API
needs none of them.

---

## How it works (the list path, in detail)

For a `list`, Quill profiles the data at intake and routes to the best strategy:

| Data shape | Strategy | Complexity |
|------------|----------|------------|
| Dense bounded integer range | `np.bincount` counting sort | O(n + k) |
| int64 / float64 (value-only) | fastest backend via the array chain | ~O(n) radix / O(n log n) |
| Mixed int/float | promote to float64, then numeric path | O(n log n) |
| Strings / bytes | Python Timsort | O(n log n) |
| Objects with a `key` | extract keys → numpy argsort (stable) | O(n log n) |
| Nearly sorted (asc_ratio > 0.90) | Timsort bypass | ~O(n) |
| Larger than RAM | external radix-partition + parallel bucket sort | disk-bound |

The conversion to/from numpy is the fixed cost that caps the list-path speedup —
see the very first table. For maximum throughput, keep your data in arrays and
use `sort_array`.

---

## Performance tuning

- **Install the right extra.** `[fast]` (numpy + psutil) is the baseline; add
  `[polars]` for a no-compile multi-threaded sort, or `[gpu]` if you have an
  NVIDIA card. The bundled Rust wheel already gives you the top path.
- **Use `sort_array` on arrays.** It skips the conversion wall entirely.
- **`stable=False`** on the list path for extra speed on numeric data when you
  don't need stable ordering.
- **`quill setup`** measures your machine's crossovers so auto-parallel and GPU
  engage exactly where they win.
- **Install psutil** for accurate RAM sensing (avoids spurious external-sort
  triggers); without it, Quill assumes 2 GB available.

---

## Troubleshooting

- **"`quill_sort(list)` isn't 5x faster."** It can't be — the `asarray`/`tolist`
  round trip dominates (Amdahl). It *is* np.sort-fast and ~3-4x over `sorted()`.
  For the big win, keep arrays and call `sort_array`.
- **"`sort_array` isn't using the GPU / Rust path."** Check
  `quill.available_backends()`. The backend may not be installed, the array may
  be below its crossover (`min_n`), or the dtype/itemsize may be ineligible. Set
  `QUILL_BACKEND_DEBUG=1` to surface the real reason instead of falling back
  silently.
- **"GPU path errors on a huge array."** It only engages when the array fits in
  free VRAM with headroom; otherwise it falls back to the CPU radix or np.sort.
- **"Equal elements reordered."** Use `stable=True` (the default). `stable=False`
  is explicitly allowed to reorder equal elements.
- **"External sort triggered unexpectedly."** Install psutil for accurate RAM
  sensing; pass `high_performance_mode=True` to skip the interactive prompt.

---

## Development

```bash
git clone https://github.com/invariant-games/quill.git
cd quill
pip install -e ".[all,dev]"
pytest                          # full suite
pytest -m slow                  # parallel / large-data tests
python -m quill                 # animated benchmark demo

# Build the compiled Rust backend locally (optional; CI ships wheels):
cd rustext && maturin develop --release
```

The CI workflow `.github/workflows/wheels.yml` builds the `rustext` crate into
abi3 binary wheels (one wheel per platform covers Python ≥ 3.8) for Windows,
manylinux x86_64/aarch64, and macOS x86_64/arm64, plus a pure-Python sdist as
the never-fail-install fallback.

---

## Requirements

- Python 3.8+
- `numpy` — optional but strongly recommended (`pip install quill-sort[fast]`)
- `psutil` — optional, for accurate RAM sensing
- `polars` — optional backend (`pip install quill-sort[polars]`)
- `cupy` — optional GPU backend (`pip install quill-sort[gpu]`, NVIDIA only)
- `pandas` — optional (Series/DataFrame support)

A C/Rust toolchain is **not** required to install — the compiled backend is
shipped as a pre-built wheel, with a pure-Python sdist fallback.

---

## License

MIT — by Isaiah Tucker
