Metadata-Version: 2.4
Name: pyxislm
Version: 0.1.1
Summary: High-performance, vendor-agnostic LLM inference library
Author: Pyxis Contributors
License: Proprietary
Project-URL: Homepage, https://pypi.org/project/pyxislm/
Project-URL: Documentation, https://pypi.org/project/pyxislm/
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Classifier: License :: Other/Proprietary License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: msgspec>=0.18.6
Requires-Dist: pynng>=0.8.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: fastapi<1,>=0.110.0
Requires-Dist: uvicorn<1,>=0.23.0
Requires-Dist: torch>=2.1.0
Provides-Extra: dev
Requires-Dist: ruff>=0.6.9; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"

# Pyxis

High-performance, vendor-agnostic LLM inference library.

## Status snapshot (2026-02-21)

- Sprints 1–5 are complete (see `docs/SPRINT_CHECKLIST.md`).
- Core worker now supports pluggable executor backends (`hf`, `echo`).

## Quick start (local, 3 processes)

1. Install deps (in a venv):

   - `pip install -e .`

   Model downloads follow your local Hugging Face/Transformers cache settings.

2. Start the core worker:

   - `python scripts/run_core.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --backend hf`

3. Start the tokenizer + detokenizer worker:

   - `python scripts/run_tokenizer.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0`

4. Start the HTTP API:

   - `python scripts/run_api.py --host 127.0.0.1 --port 8000`

5. Verify streaming:

   - `python scripts/verify_api_real.py`
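
The three workers can also be launched from a single Python script. The sketch below is only a convenience wrapper around the exact commands from steps 2–4; the startup stagger and shutdown handling are assumptions, not project behavior.

```python
# Convenience sketch (not part of the repo): spawn the core worker,
# tokenizer worker, and HTTP API using the quick-start commands above.
import subprocess
import sys
import time

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
COMMANDS = [
    [sys.executable, "scripts/run_core.py", "--model", MODEL, "--backend", "hf"],
    [sys.executable, "scripts/run_tokenizer.py", "--model", MODEL],
    [sys.executable, "scripts/run_api.py", "--host", "127.0.0.1", "--port", "8000"],
]

procs = []
try:
    for cmd in COMMANDS:
        procs.append(subprocess.Popen(cmd))
        time.sleep(2)  # crude startup stagger; a real launcher would poll /health
    for proc in procs:
        proc.wait()
finally:
    for proc in procs:
        proc.terminate()
```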

The installed package also provides CLI entry points:

- `pyxislm-core`
- `pyxislm-tokenizer`
- `pyxislm-api`

List executor backends:

- `python -m pyxis.cli.core --list-backends`

## Useful environment variables

- `PYXIS_MODEL_PATH`: model name/path for `scripts/run_core.py` (defaults to `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- `PYXIS_MODEL_BACKEND`: executor backend for core worker (`hf` default, `echo` built-in)
- `PYXIS_TOKENIZER_PATH`: tokenizer name/path for `scripts/run_tokenizer.py` (defaults to `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- `PYXIS_TOKENIZER_INGRESS`: API → tokenizer IPC address override
- `PYXIS_DETOK_TO_API`: detok → API IPC address override
- `PYXIS_CORE_REQUEST_QUEUE_SIZE`: max queued generation requests inside core worker (default `1024`)
- `PYXIS_MAX_INFLIGHT_REQUESTS`: max concurrent API streaming requests before `429 overloaded` (default `128`)
- `PYXIS_PER_REQUEST_QUEUE_MAXSIZE`: per-request detok queue bound in API (default `128`)
- `PYXIS_STREAM_IDLE_TIMEOUT_S`: stream idle timeout before the API emits an error chunk (seconds; default `30`)
- `PYXIS_TOKENIZER_READY_WAIT_S`: how long each enqueue waits for tokenizer ingress readiness (seconds; default `1.0`)
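
The same settings can be overridden programmatically before spawning a worker. A minimal sketch, assuming only the variable names and defaults documented above:

```python
# Sketch: run the core worker with a few settings overridden via the
# environment. Variable names/defaults come from the list above; the
# launch pattern itself is illustrative, not a project API.
import os
import subprocess
import sys

env = dict(
    os.environ,
    PYXIS_MODEL_PATH="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    PYXIS_MODEL_BACKEND="echo",           # built-in backend, no model weights needed
    PYXIS_CORE_REQUEST_QUEUE_SIZE="256",  # shrink the default 1024 request queue
)
subprocess.run([sys.executable, "scripts/run_core.py"], env=env, check=True)
```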

## Smoke / integration scripts

- `python scripts/verify_ingestion.py`: API-less ingestion test (tokenizer → core request)
- `python scripts/verify_api_ingress.py`: API streaming test with a mocked core
- `python scripts/verify_api_real.py`: API streaming test with real core+tokenizer

## Realtime usage

- Interactive chat REPL:
  - `python scripts/chat_repl.py`
- End-to-end realtime harness (starts services, checks streaming/cancel/backpressure):
  - `powershell -ExecutionPolicy Bypass -File scripts/test_realtime.ps1 -SkipInstall`

## Developer workflow

- Quick checks: `python scripts/dev.py test-quick`
- Full checks: `python scripts/dev.py test-all`
- Optional lint/type checks: `python scripts/dev.py lint`

Contributor docs:

- `CONTRIBUTING.md`
- `docs/HACKING.md`

Benchmark harness:

- `python benchmarks/api_stream_bench.py --requests 20 --concurrency 4`
- See `benchmarks/README.md`

## Architecture (high level)

`HTTP API` → `TokenizerWorker` → `CoreWorker` → `TokenizerWorker (detok)` → `HTTP streaming response`

`POST /v1/chat/completions` streams SSE (`text/event-stream`) with OpenAI-style `chat.completion.chunk` payloads and a final `[DONE]`.
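
A minimal stdlib-only client sketch for consuming that stream. The endpoint, the `chat.completion.chunk` payloads, and the `[DONE]` sentinel are documented above; the exact request fields (`model`, `messages`, `stream`) are assumed to follow the OpenAI chat schema.

```python
# Stream a chat completion over SSE and print tokens as they arrive.
import json
import urllib.request

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:  # SSE: one "data: ..." line per event
        line = raw.decode().strip()
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # final sentinel per the SSE contract above
            break
        chunk = json.loads(data)
        # OpenAI-style chunk: choices[0].delta.content carries the token text
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
print()
```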

`GET /health` reports readiness plus per-stage API latency snapshots (`stage_latency_ms`).
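
A quick readiness probe along the same lines; only the `stage_latency_ms` field is documented above, so the rest of the payload shape is not assumed here.

```python
# Poll the health endpoint and print the documented latency snapshot.
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8000/health") as resp:
    health = json.load(resp)
print(health.get("stage_latency_ms"))
```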

See `docs/ARCHITECTURE.md` for details.

Session notes and recent implementation memory are tracked in `docs/MEMORY.md`.
