Metadata-Version: 2.4
Name: semqa
Version: 0.1.0
Summary: Open-source, self-hostable, framework-agnostic, multi-modal governed answer layer for agentic Q&A.
Project-URL: Homepage, https://github.com/pankajniet/semqa
Project-URL: Repository, https://github.com/pankajniet/semqa
Project-URL: Issues, https://github.com/pankajniet/semqa/issues
Author: The semqa Authors
License-Expression: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: agents,governance,llm,mcp,rag,semantic-layer,text-to-sql,trust
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.6
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

# semqa

### A governed Semantic Answer Layer for agentic Q&A

Open-source infrastructure that sits *above* your data, documents, and tools, and lets any chatbot or agent
answer natural-language questions — **only when it can stand behind the answer**, with citations and
permissions enforced, and otherwise clarifying or abstaining.

[![License](https://img.shields.io/badge/license-Apache_2.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.11_%7C_3.12-blue.svg)](pyproject.toml)
[![Status](https://img.shields.io/badge/status-alpha-orange.svg)](#status)
[![Tests](https://img.shields.io/badge/tests-152_passing-brightgreen.svg)](#status)
[![Runtime deps](https://img.shields.io/badge/runtime_deps-pydantic_only-blue.svg)](pyproject.toml)

[Why it exists](#why-it-exists-the-thinking) · [How it works](#how-it-works) · [Install](#installation) · [Examples](examples/) · [The evidence](docs/validated-problems.md)

</div>

> [!NOTE]
> **Alpha (v0.1.0) — an open-source reference implementation.** The trust spine and all five answer modes
> are implemented and tested (152 passing). See [Status](#status) for exactly what works today and what it
> deliberately does *not* try to solve.

semqa is not a chatbot, not a RAG wrapper, and not a text-to-SQL tool. The way to think about it: *as Cube is to metrics, semqa is to governed agentic Q&A* — a horizontal layer you point at your own world, callable from any framework.

---

## Why it exists (the thinking)

The naive way to build "chat with your data" — let an LLM write SQL, or retrieve some chunks and summarize — produces impressive demos and unreliable products. It fails in production not because the model is unintelligent, but because of *how* it fails: it returns **confident, silently-wrong answers.** The query runs, a number comes back, and the number is quietly incorrect. A stale policy gets cited as current. A user sees data they shouldn't. The system answers a question it had no business answering.

The expensive realization is that **the hard part of enterprise Q&A is not generating an answer — it's knowing whether you *should*.** Which source is authoritative? Is it stale? Is this user allowed to see it? Is there actually enough evidence, or is the model guessing? Should the honest response be a clarifying question, or "I don't know"?

Almost every tool in this space optimizes for *generating* an answer (SQL, a chart, a dashboard, a paragraph). Very few make **abstention, sufficiency, verification, and citation quality the core product primitive.** That gap is the entire point of semqa:

> **Trust is the primary product surface — not accuracy as a supporting feature.** The differentiator is *boundary behavior:* knowing what it cannot answer.

---

## The core ideas

A few principles shape every decision:

1. **The governed answer layer is the show-runner; a query mechanism is never the star.** semqa decides *what kind* of question this is, *whether it can be answered*, and *how to govern it*. **Text-to-SQL is one mode — a handler** for the structured-data slice — and even there a *governed semantic query* is preferred over raw free-form SQL. Execution is **delegated** (to Cube, Wren, SQLite, …), never rebuilt.

2. **The LLM proposes; deterministic code disposes.** The LLM is constrained to the smallest fuzzy job — mapping a natural-language question to *governed concept names*. Everything that must be correct or governed — routing, authorization, sufficiency, verification, grounding, citation — is deterministic, typed, testable code. (We verified this matters: a small local model gave sloppy, sometimes wrong selections, yet outcomes were correct *because the typed governance layer rejected its bad guesses.*)

3. **Multi-modal by construction.** A real user asks metric questions, policy questions, how-to questions, and diagnostic questions — most of which are not SQL at all. semqa routes each to the right mode behind one interface.

4. **Open, self-hostable, and provider-neutral.** No proprietary lock-in: run it with a local model, a cloud model, or a gateway — your choice, your data residency.

---

## How it works

Every question flows through a deterministic **trust spine**:

```
intake → route → authorize → collect evidence → sufficiency gate → verify → ground / clarify / abstain (with citations)
```

- **route** — an intake step maps the question onto the governed model (today rule-based or LLM-backed; the LLM only picks concept names).
- **authorize** — identity comes from a verified, signed token (never the prompt); restricted concepts are never even shown to the model, and row-level security is enforced at the source.
- **collect** — a pluggable `SemanticSource` returns evidence: a **metric source** compiling a typed request to real SQL, a **document source** doing authority- and freshness-aware retrieval, and more later — all behind one interface, routed by mode.
- **sufficiency / verify** — deterministic checks decide whether there is enough trustworthy evidence; if not, the system **clarifies or abstains** instead of fabricating.
- **ground** — the answer is built strictly from the evidence, with citations and an explicit interpretation of what was measured.

The result is one of four first-class outcomes: **answered (cited), clarify, abstained, or refused** — never a confident guess.

---

## Installation

```bash
# from source (today)
git clone https://github.com/pankajniet/semqa && cd semqa
uv sync                 # or: pip install -e .

# from PyPI (after the first release)
pip install semqa
```

The core depends only on `pydantic`. A model is optional — with no API key and no network, semqa falls back to a deterministic resolver, so you can run everything below (and the demos) entirely offline.

---

## Provider & model neutrality

semqa leads with the **open OpenAI-compatible API** as the common surface, so the same adapter reaches OpenAI cloud *and* every local server (Ollama, vLLM, llama.cpp, LM Studio) with just a `base_url`. **Local and cloud are equally first-class; the adopter chooses.** Per-stage hybrid (a cheap/local model to route, a stronger model to ground) is the cost sweet spot.

Gateways like **LiteLLM** (self-hosted), OpenRouter, or Portkey are *configuration, not code* — point `LLM_BASE_URL` at the gateway and you get 100+ providers plus routing, fallbacks, and budgets, with no per-provider SDKs baked into semqa.

```bash
# pick any: a cloud key, a local model, a gateway — or nothing (deterministic fallback)
export OPENAI_API_KEY=sk-...                              # cloud
export LLM_BASE_URL=http://localhost:11434/v1            # local Ollama / vLLM / gateway
export LLM_MODEL=llama3.2
```

```python
from semqa import Engine, auto_resolver, demo_source, context_for

engine = Engine(demo_source(), secret="...", resolver=auto_resolver())
answer = engine.ask("how are active users trending this month?",
                    context_for("...", subject="alice", tenant="acme", roles=["analyst"]))
print(answer.status, answer.text, answer.citations)
```

Run the demos across all modes and outcomes:

```bash
uv run python examples/use_cases.py            # six real-world use cases, real governed outputs
uv run python examples/quickstart.py           # smaller; uses your LLM if configured, else deterministic
uv run python -m semqa.eval.scenario_live      # a realistic SaaS scenario through a live local model
```

---

## What makes it different

The strong players — Cube, Wren, dbt, Snowflake Cortex, Databricks Genie — are excellent at structured-data analytics, and semqa sits above and delegates to them rather than competing. But they are largely structured-data only and platform-locked, and they optimize for generating an answer. The gap semqa aims at is the intersection none of them occupy: open, self-hostable, vendor-neutral, multi-modal (including graph), and trust-first, with abstention, verification, and citation as the core primitive. The bet is that **governance, provenance, and calibrated abstention** — the trust layer, not autonomy and not raw accuracy alone — are what make agentic Q&A adoptable in the enterprise. That bet is grounded in real, sourced production failures — Microsoft Copilot oversharing, Uber QueryGPT hallucinations, the Air Canada chatbot ruling, the ~11% real RCA solve rate, buyer surveys of 600–1,006 orgs — collected in [docs/validated-problems.md](docs/validated-problems.md).

---

## Design principles (in the code)

- **LLM proposes, code disposes** — minimize the model's surface; deterministic code owns correctness.
- **Closed-vocabulary, validated output** — the model can only reference defined concepts; anything off-list is rejected.
- **Ports & adapters** — narrow `Resolver` / `SemanticSource` / `LLMClient` seams; swap provider or backend without touching the spine.
- **Trust-first, safe-by-default** — abstain/clarify are first-class; helpfulness is the opt-in, never the default.
- **Defense in depth** — never trust the prompt for identity or authorization; enforce at every layer.
- **Bounded, verified loops** — deterministic checks first; bounded retries; no unbounded autonomy.
- **Delegate, don't rebuild** — real engines behind `SemanticSource`; gateways for providers.
- **Evals first-class** — measure outcomes (execution, abstention, RLS correctness), not vibes.

---

## Notes on the thinking

A few convictions behind the design — and a few things we deliberately *don't* do:

- **We reframed the question.** "Can we build a chatbot over data?" is the wrong question — it leads to demos. The right one is: "can we build a governed layer that decides *what* to answer, from *which authoritative source*, for *which user*, and *when to say no*?" Once the question is about governance and evidence rather than generation, most of the design falls out on its own.

- **We bet on trust over autonomy.** A lot of the agentic-AI energy is about giving models *more* freedom. For the enterprise we bet the opposite: bounded loops, abstention over guessing, and a governed substrate the model cannot escape. A confident wrong answer is worse than an honest "I don't know" — so "I don't know" is a first-class outcome, not a failure.

- **We let the evidence correct us.** The first time we ran a real local model end-to-end, it was *sloppy* — it over-added fields, mis-filed a policy as a metric, and once picked "active users" for a question about the *weather*. The outcomes were still correct, because the typed governance layer rejected the junk. That was the most useful result we got: it showed exactly where to place trust (deterministic code) and where not to (the model's raw output). We try to **observe what the system actually does, not infer it from the outcome.**

- **We delegate instead of rebuilding.** The structured-query problem is already well-solved by Cube, Wren, dbt, and a plain SQL engine; rebuilding it would only produce a worse version. Our value is the governed, multi-modal, trust-first layer *above* them — so the metric mode delegates execution, and we spend our effort on routing, modes, and the trust layer.

- **We don't privilege a provider — in either direction.** Defaulting to one cloud model is a bias; swinging to "local only" is the same bias inverted. The neutral truth is an open interface where local and cloud are equally first-class and the adopter chooses — so we lead with the open standard everyone already implements.

- **We try to stay calibrated.** We keep what we *measured* separate from what we *modeled* separate from what we *assumed*; we verify claims against primary sources; and we treat "it worked five times" as a smoke test, not validation. The honest open question here isn't technical — it's whether a real team needs this enough to adopt it, and only real conversations answer that. We'd rather say that plainly than oversell.

---

## Status

Early but real, and honestly scoped — an **open-source reference implementation**, not a battle-tested product.

**Working today** (152 tests green; builds + installs as a `pydantic`-only package):

- The full **trust spine** with per-stage observability, an explicit **tunable sufficiency/abstain gate**, and a typed reason on every non-answer.
- **Five governed modes:** **metric** (real SQL via SQLite + a Cube delegate adapter), **document/policy** (lexical *and* dense hybrid retrieval, authority→relevance→freshness ranking, staleness abstention, competing-source surfacing, a verified-answer repository), **tool/live-status** (read-only, freshness-stamped, read/write partition), **graph** (multi-hop, adaptively gated, node-level authz), and **diagnostic/RCA** (bounded, correlational, confidence-capped — never a confident root cause).
- A **framework-agnostic surface** — a pure handler + a contract-derived tool schema + a zero-dependency HTTP server — so any chatbot can call it over the wire.
- Provider-neutral resolver run **live end-to-end against a local model**; a realistic SaaS scenario passing **21/21 deterministically and 21/21 live**.

**What it does NOT solve (be clear):** it does not fix the raw text-to-SQL accuracy cliff (it converts confident-wrong answers into *abstentions*); it does not stop prompt injection (it bounds the blast radius via read-only / Rule-of-Two); it inherits the curation/governance setup cost; and it depends on — but does not itself fix — upstream index freshness and source-data quality.

**The honest open question is not technical — it is whether a real team needs this enough to adopt it.** The build is validated against a realistic scenario and a real local model, **not yet against a real adopter's data.** The problems it targets are real and sourced — see **[docs/validated-problems.md](docs/validated-problems.md)**.

**Deliberately deferred** (not built ahead of demand): production MCP/FastAPI server wrappers; identity → OAuth 2.1 + on-behalf-of; an LLM grounder + citation-faithfulness; more backend adapters (Wren/dbt/Neo4j). See [`TODO.md`](TODO.md).

Stack: Python 3.11+, `pydantic` v2, **zero other runtime dependencies**. Licensed **Apache-2.0**.
