Metadata-Version: 2.4
Name: skylar
Version: 0.3.0
Summary: Skylar — local, sovereign, from-scratch LLMs. CLI + loader for the Skylar model family: generative chat, embeddings, and a COBOL specialist.
Author: A. Ivanovitch
License: Apache-2.0
Project-URL: Models, https://huggingface.co/Sophia-AI
Keywords: llm,cobol,code-generation,sovereign-ai,from-scratch
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.1
Requires-Dist: transformers>=4.40
Requires-Dist: tokenizers>=0.15
Requires-Dist: huggingface_hub>=0.20
Requires-Dist: rich>=13.0
Provides-Extra: serve
Requires-Dist: fastapi>=0.100; extra == "serve"
Requires-Dist: uvicorn>=0.23; extra == "serve"
Requires-Dist: pydantic>=2.0; extra == "serve"

# skylar

A tiny **runtime + CLI for the Skylar model family**: local, sovereign, from-scratch LLMs you
load, run, and serve after a single `pip install`. It covers **generative chat**, **embeddings /
retrieval**, and a **COBOL code specialist**. The models are in the 236M–390M parameter class and
run on a single GPU or on CPU, with no data leaving your machine.

Models live under [`Sophia-AI` on HuggingFace](https://huggingface.co/Sophia-AI):
`Skylar-236M-Base` · `Skylar-236M-Chat` · `Skylar-236M-Embed` · `Skylar-390M-Cobol`.

## Install

```bash
pip install skylar
# optional HTTP server:
pip install "skylar[serve]"
```

## Use it — CLI

```bash
# chat with any Skylar generative model (no forced persona — steer it with --system)
skylar chat --model Sophia-AI/Skylar-236M-Chat --system "Sei un assistente che risponde dal contesto."

# embeddings / retrieval (any SkylarEmbedder model)
skylar embed --model Sophia-AI/Skylar-236M-Embed --query "prestito casa" --docs "mutuo" "meteo"

# one-shot generation (HF repo id or a local checkpoint dir)
skylar generate --model Sophia-AI/Skylar-236M-Chat --prompt "..."

# the COBOL specialist — completes a COBOL stub into a full, compilable program
#   (auto-downloads Skylar-390M-Cobol; it's a stub completer, not a chatbot)
skylar cobol --example
skylar cobol --stub-file my_task.cbl --compile        # your own stub + GnuCOBOL check

# multi-user OpenAI-compatible server (needs the [serve] extra) — full details in "Serve it" below
skylar serve --model <any-skylar-model> --port 8000     # interactive docs at http://localhost:8000/docs
```

Decoding is greedy by default (`--temperature 0.0`); there is **no forced system prompt** — pass
`--system "..."` to steer a chat model. (The `skylar cobol` subcommand handles the COBOL prompt
format for you.)

## Use it — Python

```python
import skylar

# generative chat — pass your own system prompt (no forced persona)
m = skylar.load("Sophia-AI/Skylar-236M-Chat")          # HF repo id or a local dir
print(m.generate("Domanda: dove ha sede la Banca d'Italia?",
                 system="Rispondi solo dal contesto fornito."))
for delta in m.stream("..."):                          # streaming
    print(delta, end="", flush=True)

# embeddings / retrieval
e = skylar.load_embedder("Sophia-AI/Skylar-236M-Embed")
ranked = e.rank("costo del denaro", ["la BCE alza i tassi", "ricetta pizza"])

# the COBOL specialist — a stub completer (not a chatbot)
c = skylar.load("Sophia-AI/Skylar-390M-Cobol")
with open("my_task.cbl") as f:                         # your COBOL stub, as in the CLI example
    my_stub = f.read()
print(c.complete_cobol(my_stub))                       # -> full, compilable COBOL program
```
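
For intuition about what a retrieval call such as `e.rank(...)` returns, the sketch below ranks
documents by cosine similarity between embedding vectors. That Skylar's embedder scores this way is
an assumption; the vectors are hand-made toy values, and `cosine` / `rank_by_cosine` are
hypothetical helpers, not part of the `skylar` API.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (illustrative values only, not real model output).
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9]]
ranking = rank_by_cosine(query, docs)
print(ranking)
```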

`skylar` also registers the architecture with 🤗 Transformers, so this works too:

```python
import skylar  # registers nano-transformer
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Sophia-AI/Skylar-236M-Chat")
```

## Serve it — multi-user HTTP API (`skylar serve`)

`skylar serve --model <id>` turns any Skylar generative model into an **OpenAI-compatible HTTP
server** built for concurrent users. Requests from many clients are fused into **dynamic
micro-batches** on a single worker that owns the model, so one GPU (or CPU) can serve a whole demo
without per-request OOM or GPU contention. Each request can still **stream** its tokens.

```bash
pip install "skylar[serve]"
skylar serve --model Sophia-AI/Skylar-236M-Chat          # swap the id for ANY Skylar model
#  → http://127.0.0.1:8000   ·   interactive docs: http://127.0.0.1:8000/docs
```

Open **`/docs`** for the auto-generated Swagger UI — every endpoint, schema, and example is
described there (or **`/redoc`** for ReDoc). The model is whatever you pass to `--model` (an HF
repo id or a local checkpoint dir); an embedder model is auto-detected and served at
`/v1/embeddings` instead.

| Method & path | What it does |
|---|---|
| `POST /v1/chat/completions` | OpenAI chat format. `"stream": true` → Server-Sent Events. |
| `POST /generate` | One `prompt` → one `completion`. |
| `GET /health` | Liveness + which model/device is loaded. |
| `GET /metrics` | Throughput, batch sizes, queue depth. |

```bash
# one-shot completion
curl localhost:8000/generate -H 'content-type: application/json' \
  -d '{"prompt": "Dove ha sede la Banca d'\''Italia?", "max_new_tokens": 64}'

# OpenAI chat format (+ "stream": true for SSE)
curl -N localhost:8000/v1/chat/completions -H 'content-type: application/json' -d '{
  "messages": [{"role":"system","content":"Sei un esperto COBOL."},
               {"role":"user","content":"Somma due campi PIC 9(4)."}],
  "max_tokens": 256, "stream": true
}'
```
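
If you consume the stream without an OpenAI client, you have to parse the Server-Sent Events
yourself. The sketch below assumes the server follows the usual OpenAI streaming convention:
`data: {json}` lines whose chunks carry text in `choices[0].delta.content`, ended by a
`data: [DONE]` sentinel. Whether Skylar emits the sentinel is an assumption, and `iter_sse_deltas`
is a hypothetical helper.

```python
import json

def iter_sse_deltas(lines):
    """Yield text deltas from an iterable of decoded SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":           # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text

# Hand-written sample stream (not captured from a real server).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "ADD A "}}]}',
    'data: {"choices": [{"delta": {"content": "TO B."}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))
```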

The server works as a drop-in backend for the official OpenAI Python client (`pip install openai`);
just point `base_url` at it:

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="Sophia-AI/Skylar-236M-Chat",
    messages=[{"role": "user", "content": "Spiega cosa fa questo COBOL ..."}],
    stream=True,
)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

### Tuning concurrency

| Flag | Default | Meaning |
|---|---|---|
| `--max-batch` | `8` | Max requests fused into one forward pass. Raise for more throughput until VRAM/latency says stop. |
| `--max-wait-ms` | `15` | How long to wait for stragglers before launching a batch. Higher = bigger batches, slightly more latency. |
| `--max-queue` | `256` | Input-queue depth; beyond it the server returns **503** (backpressure instead of OOM). |
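
Because a full queue is reported as **503**, clients can back off and retry instead of failing
outright. The sketch below shows one way to do that with exponential backoff. The retry count and
delays are illustrative, and `call_with_backoff` is a hypothetical helper that takes any
zero-argument callable returning `(status_code, body)`.

```python
import time

def call_with_backoff(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503 or attempt == max_attempts:
            return status, body
        sleep(delay)   # server queue is full: wait before retrying
        delay *= 2     # exponential backoff

# Demo against a fake server: two 503s, then success. No real sleeping.
responses = iter([(503, "busy"), (503, "busy"), (200, "ok")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```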

### How it works (for implementers / a future maintainer)

The server is **`skylar/serve.py`** — the model code (`decoder.py` / `attention.py`) is left
untouched:

- **One worker owns the model.** Async routes enqueue requests; a single background thread pulls a
  micro-batch (up to `--max-batch`, waiting `--max-wait-ms`) and runs it. No two CUDA calls race,
  and there is exactly one set of KV-caches in flight — so concurrency can't OOM the box.
- **True batched decoding.** `generate_batch()` left-pads ragged prompts and builds a 4D additive
  mask (causal + pad) that `NanoTransformer.forward` already accepts (its dense-mask SDPA path), so
  prompts of different lengths decode together with a shared KV-cache and per-row EOS stop. Per-row
  sampling mirrors `NanoTransformer.generate` exactly, so **batched output is token-for-token
  identical to single-stream** (verified by `tests/test_batch_equiv.py`).
- **Batching + streaming coexist.** Each request carries its own queue; the worker pushes text
  deltas into it as tokens are produced, so every request in a batch streams independently.
- **Current limits (PoC).** Batching is *static*: a micro-batch starts and finishes together. For
  heavy, time-skewed load the next step is *continuous* batching (adding requests to an in-flight
  batch). The server runs one model per process; decoding is greedy by default, and sampling
  parameters are set per request.
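
The batching policy from the first bullet (take up to `--max-batch` requests, waiting at most
`--max-wait-ms` for stragglers) can be sketched with the standard library. This is not the code in
`skylar/serve.py`; `collect_batch` is a hypothetical helper that only illustrates the idea.

```python
import queue
import time

def collect_batch(q, max_batch=8, max_wait_ms=15):
    """Block for one request, then gather more until the batch is full
    or max_wait_ms has elapsed since the first one arrived."""
    batch = [q.get()]                                   # wait for the first request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                                       # no stragglers arrived in time
    return batch

# Bounded queue, like --max-queue; a server would map queue.Full to HTTP 503.
requests = queue.Queue(maxsize=256)
for i in range(11):
    requests.put_nowait(f"prompt-{i}")

first = collect_batch(requests)    # fills up to max_batch immediately
second = collect_batch(requests)   # takes the rest, then times out waiting
print(len(first), len(second))
```

In a server, a single worker thread would call something like `collect_batch` in a loop and hand
each batch to the model, which is what keeps two forward passes from ever running at once.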

```bash
python tests/test_batch_equiv.py     # run after touching batching/masking: batched == single-stream
```
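
The left-padding and "causal + pad" mask from the "True batched decoding" bullet can be illustrated
in plain Python. This is a sketch of the idea, not the code in `generate_batch()`: a real
implementation builds a tensor (here each batch row gets a nested list), `PAD_ID = 0` is a
hypothetical pad token id, and letting padded positions attend to themselves is one common way to
avoid fully masked rows, assumed here rather than taken from the source.

```python
NEG_INF = float("-inf")
PAD_ID = 0  # hypothetical pad token id

def left_pad(prompts, pad_id=PAD_ID):
    """Left-pad token-id lists to a common length; also return a keep-mask (1 = real token)."""
    width = max(len(p) for p in prompts)
    padded = [[pad_id] * (width - len(p)) + list(p) for p in prompts]
    keep = [[0] * (width - len(p)) + [1] * len(p) for p in prompts]
    return padded, keep

def additive_mask(keep_row):
    """mask[i][j] is 0.0 if query i may attend to key j, else -inf.

    Allowed: key j is not in the future (causal) and is a real token (pad).
    The diagonal is always allowed so no row is entirely -inf.
    """
    n = len(keep_row)
    return [
        [0.0 if (j == i or (j <= i and keep_row[j])) else NEG_INF for j in range(n)]
        for i in range(n)
    ]

padded, keep = left_pad([[5, 6, 7], [9]])
masks = [additive_mask(row) for row in keep]   # one (seq x seq) mask per batch row
for row in masks[1]:
    print(row)
```

The mask is added to the attention scores before the softmax, so `-inf` entries receive zero
attention weight.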

## What's inside

The Skylar models use a custom decoder (`NanoTransformer`, Qwen3-style: RMSNorm + RoPE + GQA +
QK-Norm + SwiGLU), trained 100% from scratch (no third-party pretrained weights). This package
vendors the architecture so the published weights load anywhere — no private framework needed.
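
For readers unfamiliar with two of the building blocks named above, here is a plain-Python sketch
of RMSNorm and the SwiGLU gating step. It uses the standard textbook definitions, not the vendored
`NanoTransformer` code; the `eps` value is an assumption, and the linear projections around the
SwiGLU gate are omitted.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x to (approximately) unit root-mean-square, then apply a per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU / swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, up):
    """Elementwise SwiGLU gating: silu(gate) * up (input/output projections omitted)."""
    return [silu(g) * u for g, u in zip(gate, up)]

x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
gated = swiglu_gate(gate=[0.0, 2.0], up=[5.0, 1.0])
print(normed, gated)
```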

## License

Apache-2.0. Models & code IP: A. Ivanovitch (Sophia AI).
