Metadata-Version: 2.4
Name: gpuhost-client
Version: 0.1.2
Summary: Python client for the ComputeGateway GPU Model Host (embeddings, entities, rerank, translate, OCR, LLM, search, fetch).
Project-URL: Homepage, https://github.com/computegateway/gpuhost-client
Project-URL: Source, https://github.com/computegateway/gpuhost-client
Project-URL: Issues, https://github.com/computegateway/gpuhost-client/issues
Author: ComputeGateway
License-Expression: MIT
License-File: LICENSE
Keywords: client,compute-gateway,embeddings,gpu,llm,ocr,rerank,translate
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: httpx>=0.27
Provides-Extra: async
Requires-Dist: anyio>=4.0; extra == 'async'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# gpuhost-client

Python client for the **ComputeGateway GPU Model Host** — embeddings, entity
extraction, reranking, translation, OCR, LLM completion, search and fetch —
all exposed via a single typed client with **production-tuned default batch
sizes baked in**.

```bash
pip install gpuhost-client
```

📘 **Full user guide:** [docs/USER_GUIDE.md](docs/USER_GUIDE.md) — every
endpoint, batching strategy, error handling, retries, recipes, and FAQ.

## Quick start

```python
from gpuhost_client import GPUHostClient

client = GPUHostClient(
    host="https://",
    api_key="...",
)

# Health
print(client.health())

# Embeddings — single
vec = client.embed("hello world")

# Embeddings — batch (auto-chunked at the optimal batch size)
vecs = client.embed(["a", "b", "c", "d", ...])

# Translate
en = client.translate("صباح الخير", src="ar")

# Entity extraction with zero-shot labels
ents = client.entities(
    "Acme Corp acquired Globex on 2025-01-15.",
    labels=["company", "date"],
)

# Rerank — keep documents <= 32 for best p95
ranked = client.rerank(query="what is RAG?", documents=docs, top_k=5)

# OCR — accepts bytes, file paths, or base64
text = client.ocr("./scan.png")

# LLM
out = client.llm(
    [{"role": "user", "content": "say hi"}],
    provider="auto",
    task_type="general",
)

# Streaming LLM
for chunk in client.llm_stream([{"role": "user", "content": "tell a joke"}]):
    print(chunk)

client.close()
```

The client is also a context manager:

```python
with GPUHostClient(host=..., api_key=...) as client:
    vecs = client.embed(many_texts)
```

## Single vs batch — the same method handles both

Every inference method accepts either a scalar or a sequence:

| Method            | Scalar input → returns       | List input → returns                    |
|-------------------|------------------------------|----------------------------------------|
| `embed(texts)`    | `list[float]` (one vector)   | `list[list[float]]` aligned to input    |
| `entities(text)`  | `list[entity-dict]`          | `list[list[entity-dict]]`               |
| `translate(text)` | `str`                        | `list[str]`                             |
| `ocr(images)`     | `dict` (one OCR result)      | `list[dict]`                            |
| `search(query)`   | `dict`                       | `list[dict]`                            |
| `fetch(url)`      | `dict`                       | `list[dict]`                            |
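
For example, `embed` keeps the same call shape either way; only the return shape changes, per the table above:

```python
# Same method, different return shape (see the table above):
one_vec   = client.embed("hello world")        # list[float]
many_vecs = client.embed(["hello", "world"])   # list[list[float]], input order

assert len(many_vecs) == 2
```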

When you pass a list, the client:

1. **Auto-chunks** into requests of `OPTIMAL_BATCH_SIZE[endpoint]` items.
2. Calls the dedicated `/v1/<endpoint>/batch` alias — guaranteed batch
   envelope, indexed per-item results.
3. **Reassembles** results in input order before returning.
4. Raises `GPUHostHTTPError` on the first failed item, unless you opt in
   to per-item error inclusion on the batch methods (e.g. `client.llm_batch(...)`);
   the sketch below shows the chunk/reassemble flow of steps 1 and 3.
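
The manual equivalent of steps 1 and 3 looks roughly like this (the client does this for you; `chunked` here is just a local helper, not part of the package):

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Roughly the equivalent of passing the whole list at once:
all_vecs = []
for chunk in chunked(many_texts, 128):      # 128 = the embed default below
    all_vecs.extend(client.embed(chunk))    # each chunk goes to /v1/embed/batch
```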

## Recommended batch sizes (sweet spots)

Defaults are sourced from the production T4 sweep at rev
`ca-nas-prd-wus3--0000078`. They are the **balanced** point: the lowest p95
that still reaches ~90% of peak items/s, so they are safe for interactive callers.

| Endpoint / model           | Default `batch_size` | Speed-up over `bs=1` |
|---------------------------|---------------------:|---------------------:|
| `embed-baseline` (Qwen3)  | **128**              | 23.9 ×              |
| `embed-mpnet-legacy`      | **512**              | 22.3 ×              |
| `entity-gliner`           | **128**              | 8.6 ×               |
| `rerank` (bge-v2-m3)      | **32**               | 5.6 ×               |
| `translate` (any pair)    | **256**              | ≈34 ×               |
| `ocr-paddle`              | **8**                | 1.4 ×               |

Override per call when steady-state throughput matters more than tail
latency:

```python
vecs = client.embed(big_list, batch_size=256)        # peak embed throughput
out  = client.translate(many_sents, src="zh",
                        batch_size=512, max_parallel=4)
```

The constants are exposed for inspection / overrides:

```python
from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP
```

`SERVER_BATCH_CAP` reflects the gateway's hard-rejection thresholds
(T-037/T-038). Anything above the cap is rejected with HTTP 400 — the
client clamps user-supplied `batch_size` to the cap as a safety net, but
the gateway is the source of truth.
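
The clamping reduces to something like the sketch below, assuming both constants are dicts keyed by endpoint name (as the `OPTIMAL_BATCH_SIZE[endpoint]` lookup above suggests):

```python
from gpuhost_client import OPTIMAL_BATCH_SIZE, SERVER_BATCH_CAP

def effective_batch_size(endpoint, requested=None):
    """Sketch: clamp a user-supplied batch_size to the gateway's hard cap."""
    size = requested or OPTIMAL_BATCH_SIZE[endpoint]
    return min(size, SERVER_BATCH_CAP[endpoint])  # above the cap, the gateway returns HTTP 400
```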

## Errors

All errors derive from `GPUHostError`:

- `GPUHostHTTPError` — non-2xx response. Carries `status_code`, `code`
  (e.g. `BadRequest`, `ModelLoadRejected`), `message`, `retryable`,
  `retry_after_ms`, `request_id`.
- `GPUHostQuotaError` — HTTP 429 specifically (subclass of HTTP error).
- `GPUHostTimeoutError` — transport-level timeout.

```python
from gpuhost_client import GPUHostHTTPError, GPUHostQuotaError

try:
    out = client.embed(texts)
except GPUHostQuotaError as e:
    print("rate limited; retry after", e.retry_after_ms, "ms")
except GPUHostHTTPError as e:
    print(e.status_code, e.code, e.message)
```
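
Because `GPUHostHTTPError` carries `retryable` and `retry_after_ms`, a simple retry wrapper can be layered on top (a sketch, not a built-in helper):

```python
import time

from gpuhost_client import GPUHostHTTPError

def with_retries(call, attempts=3):
    """Retry a call while the gateway marks the failure as retryable."""
    for attempt in range(attempts):
        try:
            return call()
        except GPUHostHTTPError as e:
            if not e.retryable or attempt == attempts - 1:
                raise
            time.sleep((e.retry_after_ms or 1000) / 1000)

vecs = with_retries(lambda: client.embed(texts))
```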

## OCR input flexibility

`ocr(...)` accepts:

- raw bytes (`bytes`/`bytearray`),
- a filesystem path (`str` or `pathlib.Path`),
- an already-encoded base64 string.

MIME type is sniffed from the magic bytes (PNG/JPEG/GIF/WebP). Pass a list
to OCR many images in one call — the client uses the
`/v1/ocr/batch` alias and reassembles results.
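
The magic-byte detection amounts to something like this (illustrative only; the client's own lookup table may differ in detail):

```python
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF8": "image/gif",
    b"RIFF": "image/webp",   # RIFF container; WebP also carries "WEBP" at offset 8
}

def sniff_mime(data: bytes) -> str:
    """Guess an image MIME type from its leading magic bytes."""
    for magic, mime in MAGIC.items():
        if data.startswith(magic):
            return mime
    raise ValueError("unrecognised image format")
```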

## LLM streaming

`llm_stream(...)` yields one chunk dict per SSE `data:` line. The iterator
exits when the gateway emits `data: [DONE]`. Network/timeout errors
propagate as exceptions, not as in-stream events.
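
Conceptually, the stream maps onto the SSE wire format like this (a sketch of the framing, not the client's own parser; it assumes each `data:` payload is a JSON object):

```python
import json

def parse_sse(lines):
    """Sketch of the SSE framing llm_stream consumes."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alives / comments
        payload = line[len("data: "):]
        if payload == "[DONE]":           # gateway's end-of-stream sentinel
            return
        yield json.loads(payload)         # one chunk dict per data: line
```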

## Async?

Out of scope for v0.1. The synchronous client is built on `httpx.Client`,
which pools connections, so concurrent callers should construct one client
per process/thread group rather than one per request. An `asyncio`-based
sibling is on the roadmap and will share the same method names.
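
Until then, one workable pattern under that guidance is to share a single client across a thread pool (illustrative; the host below is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

from gpuhost_client import GPUHostClient

with GPUHostClient(host="https://gateway.example.com", api_key="...") as client:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # One shared client: httpx.Client pools connections across threads.
        vec_lists = list(pool.map(client.embed, ["a", "b", "c", "d"]))
```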

## Compatibility

| Client `0.1.x` | GPU Model Host | API surface |
|----------------|---------------|-------------|
| ✓              | rev ≥ 76      | `/v1/...` (T-037 caps + T-038 rerank cap) |

## License

MIT. See [LICENSE](LICENSE).
