Metadata-Version: 2.4
Name: seamless-rag
Version: 0.1.6
Summary: TOON-Native Auto-Embedding & RAG Toolkit for MariaDB — VECTOR(N), HNSW, VEC_DISTANCE_COSINE, with TOON tabular output that saves 10-55% LLM tokens vs JSON.
Project-URL: Homepage, https://mariadb-hackathon-my-2026.github.io/seamless-rag/
Project-URL: Documentation, https://mariadb-hackathon-my-2026.github.io/seamless-rag/
Project-URL: Repository, https://github.com/MariaDB-Hackathon-MY-2026/seamless-rag
Project-URL: Issues, https://github.com/MariaDB-Hackathon-MY-2026/seamless-rag/issues
Project-URL: Changelog, https://github.com/MariaDB-Hackathon-MY-2026/seamless-rag/releases
Project-URL: Judges' Testing Guide, https://mariadb-hackathon-my-2026.github.io/seamless-rag/judges-testing-guide/
Author-email: LiuWei <TP085412@mail.apu.edu.my>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: auto-embedding,embeddings,hnsw,llm,mariadb,rag,retrieval-augmented-generation,semantic-search,toon,vector-database,vector-search
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Database :: Database Engines/Servers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: sqlglot>=26.0
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: typer>=0.9
Provides-Extra: all
Requires-Dist: google-genai>=1.0; extra == 'all'
Requires-Dist: gradio>=4.0; extra == 'all'
Requires-Dist: mariadb>=1.1.14; extra == 'all'
Requires-Dist: ollama>=0.1; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: sentence-transformers>=5.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.2; extra == 'dev'
Requires-Dist: pytest-xdist>=3.5; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.8; extra == 'dev'
Requires-Dist: syrupy>=4.6; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=5.0; extra == 'embeddings'
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0; extra == 'gemini'
Provides-Extra: mariadb
Requires-Dist: mariadb>=1.1.14; extra == 'mariadb'
Provides-Extra: ollama
Requires-Dist: ollama>=0.1; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: web
Requires-Dist: gradio>=4.0; extra == 'web'
Description-Content-Type: text/markdown

![Seamless-RAG](docs/assets/logo.svg)

# Seamless-RAG

**Vector Search & Structured-Data RAG Toolkit for MariaDB**

> Turn any MariaDB table into a searchable vector store. Query results come back in TOON v3 tabular format — a compact wire format that saves 10-55% of tokens (vs compact JSON) when feeding structured data to LLMs or agents.

![Powered by MariaDB](docs/assets/badge-mariadb.svg)
[![PyPI](https://img.shields.io/pypi/v/seamless-rag.svg)](https://pypi.org/project/seamless-rag/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://python.org)
[![TOON v3](https://img.shields.io/badge/TOON%20v3-166%2F166%20conformance-blue)]()
[![License](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-525%2F525%20passing-brightgreen)]()

---

## One-Line Agent Setup

If you use Claude Code, Codex, Cursor, Windsurf, or another coding agent, you can hand it the setup instructions in one sentence:

> Help me clone Seamless-RAG if needed, then bootstrap it for local development by following https://raw.githubusercontent.com/MariaDB-Hackathon-MY-2026/seamless-rag/main/Install.md

That prompt is intended for coding agents. It tells the agent to clone the repo if needed, prefer Docker when available (MariaDB 11.7.2+ is required for VECTOR support), and stop with the exact next command plus any missing config the user still needs to provide. The agent will not run the demo, pull models, or start long-running processes on its own — those are your call.

The repo also ships **agent skills** under [`skills/public/`](skills/public/) — Anthropic-format `SKILL.md` files for `seamless-rag` (CLI + Python API + TOON format) and `text-to-sql` (turning natural-language questions into safe SELECTs over MariaDB). Drop them into your agent's skills directory if you want richer guidance after bootstrap.

---

## Quick Start for Judges

> Evaluating this submission? Start with the **[Judges' Testing Guide](https://mariadb-hackathon-my-2026.github.io/seamless-rag/judges-testing-guide/)** — four progressive paths from inspect-only (5 min, no install) to full test suite (15 min). The full docs site is at <https://mariadb-hackathon-my-2026.github.io/seamless-rag/>.

90-second verification:

```bash
git clone https://github.com/MariaDB-Hackathon-MY-2026/seamless-rag.git
cd seamless-rag
docker compose up -d --wait
docker compose exec app seamless-rag demo
```

A 90-second screencast of the same flow lives at [`docs/assets/demo.gif`](docs/assets/demo.gif) (also [demo.mp4](docs/assets/demo.mp4)):

![Seamless-RAG demo](docs/assets/demo.gif)

---

## MariaDB Features We Use

This project is MariaDB-native end-to-end. The pipeline depends on features that landed in MariaDB 11.7.2 and is explicitly tuned for them: no external vector store, no shadow index, no application-side ANN.

| Feature | What we use it for | Where in the code |
|---|---|---|
| **`VECTOR(N)` column type** | First-class storage for 384-dim float32 embeddings, no `BLOB` workaround | [`storage/mariadb.py:112`](src/seamless_rag/storage/mariadb.py#L112) (schema), [`storage/mariadb.py:227`](src/seamless_rag/storage/mariadb.py#L227) (auto-add column for arbitrary tables) |
| **`VECTOR INDEX … DISTANCE=cosine`** (HNSW) | Sub-linear similarity search; we tune `mhnsw_ef_search = 100` per session for recall/latency trade-off | [`storage/mariadb.py:114`](src/seamless_rag/storage/mariadb.py#L114), [`storage/mariadb.py:337`](src/seamless_rag/storage/mariadb.py#L337) |
| **`VEC_DISTANCE_COSINE`** | Distance function in `ORDER BY` so the planner picks the HNSW index | [`storage/mariadb.py:343`](src/seamless_rag/storage/mariadb.py#L343), [`storage/mariadb.py:362`](src/seamless_rag/storage/mariadb.py#L362) |
| **Native binary protocol** via `mariadb-connector-python` | `array.array('f', embedding)` is sent verbatim — no `VEC_FromText` round-trip, no string parsing | [`storage/mariadb.py:254`](src/seamless_rag/storage/mariadb.py#L254), [`storage/mariadb.py:288`](src/seamless_rag/storage/mariadb.py#L288) (batch insert) |
| **CTE for context windowing** | Single round-trip retrieval: closest chunks plus their neighbours by `chunk_order` | [`storage/mariadb.py:341`](src/seamless_rag/storage/mariadb.py#L341) (`WITH closest AS …`) |
| **Hybrid SQL filter + vector ORDER BY** | `seamless-rag ask "waterproof watches" --where "price < 50"` — SQL pre-filter narrows the candidate set, vector ranks within | [`storage/mariadb.py:315`](src/seamless_rag/storage/mariadb.py#L315) (validated WHERE), [`storage/mariadb.py:362`](src/seamless_rag/storage/mariadb.py#L362) (combined query) |
| **Connection pool + autocommit** | `mariadb.ConnectionPool` with per-call lease, isolation-aware so the watcher never sees stale snapshots | [`storage/mariadb.py:158`](src/seamless_rag/storage/mariadb.py#L158), [`storage/mariadb.py:178`](src/seamless_rag/storage/mariadb.py#L178) |
| **Foreign keys + composite index** | `chunks.document_id REFERENCES documents(id)` plus `INDEX idx_doc_order(document_id, chunk_order)` so the CTE neighbour-join stays index-only | [`storage/mariadb.py:117-118`](src/seamless_rag/storage/mariadb.py#L117) |
| **Auto-schema for arbitrary tables** | `seamless-rag embed --table products --columns name,category` adds a `VECTOR(N)` column and HNSW index to your existing table without touching its other columns | [`storage/mariadb.py:227-232`](src/seamless_rag/storage/mariadb.py#L227) |
| **Bare `VEC_DISTANCE()` auto-pick** (MariaDB-only) | When the index has `DISTANCE=cosine`, plain `VEC_DISTANCE(...)` reads it from the index and applies cosine — no other RDBMS does this. Demonstrated live by `seamless-rag schema` | [`storage/mariadb.py:431`](src/seamless_rag/storage/mariadb.py#L431) (`compare_vec_distance`), [`tests/integration/test_vector_operations.py`](tests/integration/test_vector_operations.py) (1e-6 equivalence assertion) |

**See it for yourself in 5 seconds:** `seamless-rag schema` pretty-prints `SHOW CREATE TABLE chunks` (highlighting `vector(384)` and `VECTOR KEY ... DISTANCE=cosine`), `SHOW INDEX FROM chunks` (with the `VECTOR` row called out), and runs a side-by-side `VEC_DISTANCE()` vs `VEC_DISTANCE_COSINE()` query so you can verify auto-pick parity yourself.
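
As a rough illustration of the binary-protocol and hybrid-filter rows in the table above, the sketch below packs an embedding with `array.array('f', ...)` and builds a filtered nearest-neighbour query string. The table `products`, the columns `id`, `name`, and `embedding`, and the helpers `pack_embedding` and `build_hybrid_query` are hypothetical names chosen for this example; the project's actual SQL lives in `storage/mariadb.py` and may differ. Nothing here connects to a server.

```python
from array import array

EMBEDDING_DIM = 384  # dimension quoted in the feature table


def pack_embedding(values):
    """Pack floats as float32 so they can be bound as a query parameter."""
    if len(values) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(values)}")
    return array("f", values)


def build_hybrid_query(where_sql, top_k):
    """Build a pre-filtered nearest-neighbour query with one `?` placeholder.

    `where_sql` is assumed to be validated already; this sketch does not check it.
    """
    return (
        "SELECT id, name FROM products "  # hypothetical table and columns
        f"WHERE {where_sql} "
        "ORDER BY VEC_DISTANCE_COSINE(embedding, ?) "
        f"LIMIT {int(top_k)}"
    )


vector = pack_embedding([0.0] * EMBEDDING_DIM)
sql = build_hybrid_query("price < 50", top_k=5)
print(sql)
print(len(vector.tobytes()), "bytes for one embedding")  # 384 floats * 4 bytes each
```

With a live connection, the pair would presumably be handed to the driver as `cursor.execute(sql, (vector,))`, so the vector travels as a bound parameter and never as text.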

**Tested against MariaDB 11.8** (the version shipped in the official `mariadb:11.8` Docker image). 11/11 integration tests pass against the real server, exercising every feature above — see [`tests/integration/test_vector_operations.py`](tests/integration/test_vector_operations.py).

Without MariaDB's VECTOR + HNSW, this project would need either a sidecar vector DB (Chroma/Qdrant/pgvector) or a from-scratch ANN implementation. Neither would be MariaDB-native, and neither would benefit from the same indexes that already serve OLTP traffic.

---

## Why

LLMs and agents consume structured data as context. The standard approach — dumping JSON — wastes tokens on repeated field names and structural characters:

```json
[{"id":1,"name":"Widget","category":"Tools","price":29.99,"stock":150,"supplier":"Acme","rating":4.5},
 {"id":2,"name":"Gadget","category":"Tools","price":19.99,"stock":300,"supplier":"Acme","rating":4.2}]
```

TOON's tabular form writes the field names once and the values as compact rows:

```
[2,]{id,name,category,price,stock,supplier,rating}:
  1,Widget,Tools,29.99,150,Acme,4.5
  2,Gadget,Tools,19.99,300,Acme,4.2
```
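
To make the layout concrete, here is a minimal sketch of an encoder that produces the tabular form shown above from a list of flat dicts. It is not the project's `TOONEncoder`: the function name `to_toon_table` is made up for this example, and quoting, escaping, and nesting are deliberately left out, so values must not contain commas or newlines.

```python
import json


def to_toon_table(rows):
    """Encode uniform, flat dict rows in the tabular layout shown above.

    Simplified on purpose: no quoting or escaping of values.
    """
    if not rows:
        raise ValueError("need at least one row")
    fields = list(rows[0])
    for row in rows:
        if list(row) != fields:
            raise ValueError("all rows must share the same fields in the same order")
    # Header: row count, then the field names written exactly once
    lines = [f"[{len(rows)},]{{{','.join(fields)}}}:"]
    for row in rows:
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)


rows = [
    {"id": 1, "name": "Widget", "category": "Tools", "price": 29.99,
     "stock": 150, "supplier": "Acme", "rating": 4.5},
    {"id": 2, "name": "Gadget", "category": "Tools", "price": 19.99,
     "stock": 300, "supplier": "Acme", "rating": 4.2},
]

toon = to_toon_table(rows)
compact_json = json.dumps(rows, separators=(",", ":"))  # the baseline used in the benchmarks
print(toon)
print(len(compact_json), "characters as compact JSON,", len(toon), "as TOON")
```

Character counts are only a proxy here; the benchmarks below count tokens.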

**Measured on real public datasets** ([full benchmark](https://mariadb-hackathon-my-2026.github.io/seamless-rag/BENCHMARK_REAL_DATA/)):

| Dataset (query type) | Rows | JSON Tokens | TOON Tokens | Savings |
|---------------------|------|-------------|-------------|---------|
| MovieLens — top rated movies (7 cols) | 100 | 6,540 | 5,019 | **23.3%** |
| MovieLens — metadata only (4 cols) | 100 | 2,258 | 1,364 | **39.6%** |
| SF Restaurant — violations (9 cols) | 100 | 7,071 | 4,326 | **38.8%** |
| SF Restaurant — high risk (9 cols) | 50 | 3,437 | 2,076 | **39.6%** |

Savings scale with row count and stabilize at the dataset's natural ceiling:

| Rows | MovieLens (7 cols) | Restaurant (9 cols) |
|------|--------------------|---------------------|
| 10 | 21.7% | 34.6% |
| 50 | 22.0% | 38.2% |
| 100 | 24.1% | 38.8% |
| 500 | **29.0%** | **38.9%** |

TOON is not magic — it shines on **structured tabular data with many columns and short values**, which is exactly what comes out of database queries. All measurements use compact JSON (`separators=(",",":")`) as baseline.
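
The Savings column can be reproduced from the two token columns. The sketch below assumes savings are defined as the relative reduction against the compact-JSON token count; `savings_pct` is a name chosen for this example, and the token counts are copied from the first table above.

```python
def savings_pct(json_tokens, toon_tokens):
    """Relative token reduction of TOON against the compact-JSON baseline."""
    if json_tokens <= 0:
        raise ValueError("json_tokens must be positive")
    return 100.0 * (json_tokens - toon_tokens) / json_tokens


# (JSON tokens, TOON tokens) copied from the benchmark table above
measurements = {
    "MovieLens, top rated movies": (6540, 5019),
    "MovieLens, metadata only": (2258, 1364),
    "SF Restaurant, violations": (7071, 4326),
    "SF Restaurant, high risk": (3437, 2076),
}

for name, (json_tokens, toon_tokens) in measurements.items():
    print(f"{name}: {savings_pct(json_tokens, toon_tokens):.1f}%")
```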

### Where TOON does NOT help

TOON gets its savings by writing field names once in the header and one value-row per record. In two scenarios this mechanic doesn't apply, and the cost table will report ~0% savings:

- **`seamless-rag ask` over prose / blob chunks.** When ingested chunks are free-form text stored in a single `content` column, the retrieved context is essentially one long string per chunk. There are no repeated field names to deduplicate, so TOON-encoding the result list (id, content, distance) saves only the structural overhead, which is typically negligible. Use TOON for the **query result shape**, not the chunk text itself.
- **Single-row results.** `[1,]{...}:` plus one value row is roughly the same size as `[{...}]`. The break-even point is around 3–5 rows.

If you want the real TOON win on data that already lives in MariaDB, use `seamless-rag export "SELECT …"` directly — that path is row-level structured by construction, and the benchmarks in this section reflect it. CSV is a fair competitor on uniform numeric columns; TOON pulls ahead as schemas get wider, values get more varied, and you start nesting (e.g. `JSON_EXTRACT` columns).

## Where It Fits

For structured database data, the industry uses two retrieval approaches. Seamless-RAG bridges both to LLMs:

```
"Q3 revenue by region?"           "Find products similar to X"
        │                                    │
   Text-to-SQL                        Vector Search
   (LLM generates SQL)              (cosine similarity)
        │                                    │
        └──────────┬─────────────────────────┘
                   ▼
           MariaDB executes
                   ▼
           list[dict] results
                   ▼
        Seamless-RAG → TOON format     ← saves 20-40% tokens
                   ▼
           LLM / Agent consumes
```

- **Precise queries** ("revenue > 1M"): write SQL directly, use `seamless-rag export` to TOON-format the results
- **Semantic queries** ("similar products"): use `seamless-rag ask` for vector search on text columns
- **Hybrid** ("waterproof watches under $50"): `seamless-rag ask --where "price < 50"` combines both

Seamless-RAG is a **format + embedding bridge**, not a replacement for SQL.
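
For readers new to the vector-search branch of the diagram, this is a pure-Python sketch of what ranking by cosine distance means. It assumes the usual definition (1 minus cosine similarity) and scans every candidate; an HNSW index approximates the same ranking without the full scan. The function names are illustrative and not part of Seamless-RAG.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, candidates, k):
    """Brute-force nearest neighbours over (label, vector) pairs."""
    ranked = sorted(candidates, key=lambda item: cosine_distance(query, item[1]))
    return [label for label, _ in ranked[:k]]


candidates = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], candidates, k=2))
```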

## Quick Start

Your data is already in MariaDB. Seamless-RAG adds vectors and TOON.

**Prerequisites** — the `mariadb` Python connector is a C extension. Install the system library first:
- **macOS**: `brew install mariadb-connector-c`
- **Debian/Ubuntu**: `sudo apt install libmariadb-dev`
- **Or** use Docker (see [Quick Start for Judges](#quick-start-for-judges)) and skip pip entirely

```bash
pip install "seamless-rag[mariadb,embeddings]"  # from PyPI
# or, from source:
# pip install -e ".[mariadb,embeddings]"

docker compose up -d                            # MariaDB 11.8

seamless-rag init                               # create VECTOR columns + HNSW index
seamless-rag embed --table products --column description  # embed existing rows
seamless-rag ask "Which products are most relevant?"      # vector search → TOON → LLM
seamless-rag export "SELECT id, name, price FROM products LIMIT 20"  # SQL → TOON
```

Optional extras: `[gemini]`, `[openai]`, `[ollama]`, `[web]` (Gradio UI), or `[all]` for everything.

No file loading, no document chunking — data lives in MariaDB, Seamless-RAG bridges it to vectors and LLMs.

## CLI Commands

```
seamless-rag init              Create VECTOR columns + HNSW index
seamless-rag embed             Bulk-embed existing table rows (core workflow)
seamless-rag watch             Auto-embed new inserts in real time (Rich live)
seamless-rag ask <question>    Vector search → TOON context → LLM answer
seamless-rag export <sql>      Any SELECT → TOON format
seamless-rag benchmark         JSON vs TOON token/cost comparison
seamless-rag schema            Show table/index DDL and compare VEC_DISTANCE() with VEC_DISTANCE_COSINE()
seamless-rag web               Gradio web UI (localhost-only by default)
seamless-rag demo              End-to-end demo with sample data
seamless-rag ingest <path>     Convenience: load text files for quick testing
```

**Multi-column embedding** — embed multiple columns for richer semantics:

```bash
# Single column (default)
seamless-rag embed --table products --column description

# Multi-column — values concatenated for richer vector search
seamless-rag embed --table products --columns "name,category,price,rating"
# Internally: "Widget — Tools — 29.99 — 4.5"

# Now "cheap high-rated tools" matches on price AND rating, not just description
seamless-rag ask "cheap high-rated tools" --where "price < 50"
```
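
The concatenation step can be pictured as follows. This is a sketch based only on the `Internally:` comment above; the helper name `row_to_text` is hypothetical, and the real embedder may handle `NULL` values or formatting differently.

```python
SEPARATOR = " \u2014 "  # space, em dash, space, matching the "Internally" comment above


def row_to_text(row, columns):
    """Join the selected column values into the single string that gets embedded."""
    missing = [c for c in columns if c not in row]
    if missing:
        raise KeyError(f"row has no column(s): {', '.join(missing)}")
    # str() is a simplification: None would become "None" in this sketch
    return SEPARATOR.join(str(row[c]) for c in columns)


row = {"id": 1, "name": "Widget", "category": "Tools", "price": 29.99, "rating": 4.5}
text = row_to_text(row, ["name", "category", "price", "rating"])
print(text)
```

Because price and rating end up inside the embedded text, a query such as "cheap high-rated tools" has something to match against beyond the description.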

Global options: `--host`, `--port`, `--database`, `--provider`, `--model`, `--log-level`

## As Agent Tools

Seamless-RAG commands work as agent tools. An LLM agent can call these to interact with MariaDB:

```python
from seamless_rag import SeamlessRAG

with SeamlessRAG(host="localhost", database="mydb") as rag:
    # Agent tool: search MariaDB and get compact context
    result = rag.ask("quarterly revenue by region", top_k=10)
    # result.context_toon → compact tabular format for next LLM call
    # result.savings_pct → token savings vs compact JSON

    # Agent tool: export any SQL query as TOON
    toon = rag.export("SELECT region, revenue, quarter FROM sales")
    # Feed to next agent step with minimal token overhead

    # Agent tool: multi-column embed for richer search
    rag.embed_table("products", text_column=["name", "category", "price"])
    # The values are concatenated into one string, so vector search matches name AND price
```

In a 20-step agent workflow that queries the database at each step, the totals measured on real data are:

| Dataset | JSON (20 steps) | TOON (20 steps) | Tokens Saved | Cost Saved |
|---------|-----------------|-----------------|--------------|------------|
| MovieLens (7 cols, 50 rows/step) | 73,680 | 58,760 | **14,920** | $0.037 |
| Restaurant (9 cols, 50 rows/step) | 69,640 | 42,640 | **27,000** | $0.068 |
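
The last two columns follow from simple arithmetic on the token totals. The sketch below assumes an input price of $2.50 per million tokens; this README does not state the rate, and the value is chosen only because it reproduces the Cost Saved column. Substitute your model's actual price.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD; an assumption, not stated in this README


def tokens_saved(json_tokens, toon_tokens):
    """Absolute token reduction over the whole workflow."""
    return json_tokens - toon_tokens


def cost_saved(json_tokens, toon_tokens, price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Dollar value of the saved input tokens at the given price."""
    return tokens_saved(json_tokens, toon_tokens) * price_per_million / 1_000_000


# 20-step totals copied from the table above
workflows = {
    "MovieLens": (73_680, 58_760),
    "Restaurant": (69_640, 42_640),
}

for name, (json_tokens, toon_tokens) in workflows.items():
    saved = tokens_saved(json_tokens, toon_tokens)
    print(f"{name}: {saved} tokens saved, about ${cost_saved(json_tokens, toon_tokens):.4f}")
```

Per-run savings are small in absolute terms; they matter when the same workflow runs many times or when context-window space is the tighter constraint.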

## Python API

```python
from seamless_rag import SeamlessRAG

with SeamlessRAG(host="localhost", database="mydb") as rag:
    rag.init()
    rag.ingest("research.txt", ["chunk1...", "chunk2..."])

    # Single-column embed (default)
    rag.embed_table("articles", text_column="content")

    # Multi-column embed — richer semantics
    rag.embed_table("products", text_column=["name", "category", "price"])

    # Semantic search with hybrid filter
    result = rag.ask("affordable tools", where="price < 50", mmr=True)
    print(result.answer)           # LLM-generated answer
    print(result.context_toon)     # compact context
    print(f"Saved {result.savings_pct:.0f}% tokens")
```

## Pluggable Providers

Both embedding and LLM layers use `typing.Protocol` — no base class needed:

| Layer | Providers | Default |
|-------|-----------|---------|
| Embedding | SentenceTransformers, Gemini, OpenAI, Ollama | SentenceTransformers (local, free) |
| LLM | Ollama, Gemini, OpenAI | Ollama (local, free) |

Switch via env vars: `EMBEDDING_PROVIDER=gemini LLM_PROVIDER=openai seamless-rag ask "..."`

See the [Providers guide](https://mariadb-hackathon-my-2026.github.io/seamless-rag/providers/) for adding custom providers.
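
As a sketch of what "no base class needed" means in practice: any object with the right methods satisfies a `typing.Protocol`. The protocol shape below (a single `embed` method) and the toy `HashEmbedder` are assumptions made for illustration; check the Providers guide for the method names Seamless-RAG actually expects.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    """Hypothetical protocol shape; the project's real protocol may differ."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Toy provider: deterministic pseudo-embeddings derived from SHA-256 bytes.

    Note that it does not inherit from EmbeddingProvider.
    """

    def __init__(self, dimension: int = 8) -> None:
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self.dimension = dimension

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[: self.dimension]])
        return vectors


def embed_all(provider: EmbeddingProvider, texts: list[str]) -> list[list[float]]:
    """Accepts anything that structurally matches the protocol."""
    return provider.embed(texts)


vectors = embed_all(HashEmbedder(dimension=4), ["widget", "gadget"])
print(len(vectors), "vectors of length", len(vectors[0]))
```

The hash-based vectors carry no semantic meaning; the point is only that structural typing lets a custom provider plug in without subclassing.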

## Architecture

```
seamless-rag CLI / Python API / Agent Tools
    │
    ├── EmbeddingProvider (Protocol)     ← 4 built-in, add your own
    ├── LLMProvider (Protocol)           ← 3 built-in, add your own
    ├── VectorStore (Protocol)           ← MariaDB with connection pool
    │     └── VECTOR(N) + HNSW index + VEC_DISTANCE_COSINE
    ├── AutoEmbedder                     ← batch + watch, multi-column concat
    ├── RAGEngine                        ← search → TOON → LLM (retry) → benchmark
    ├── TOONEncoder                      ← full v3 spec (166/166)
    └── TokenBenchmark                   ← tiktoken + GPT-4o cost calc
```

## Test Results

```
538 tests passing (100%)
  lint:        100%
  unit:        100% (338/338)
  spec:        100% (166/166 TOON v3 conformance)
  integration: 100% (17/17)
  eval:        100%
```

## Security

- **SQL injection prevention**: WHERE filters and SELECT queries validated via [sqlglot](https://github.com/tobymao/sqlglot) AST parsing — blocks writes, DDL, subqueries, and dangerous functions (SLEEP, BENCHMARK, LOAD_FILE)
- **Web UI**: binds `127.0.0.1` by default; `--share` requires auth via `SEAMLESS_WEB_USER` / `SEAMLESS_WEB_PASSWORD`; error messages never leak server internals
- **LLM calls**: context truncated to 20K chars; retry with jitter for transient errors; rate-limit detection
- **Identifiers**: all table/column names validated against `^[A-Za-z_][A-Za-z0-9_]*$`
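
The identifier rule is easy to reproduce in isolation. This is a minimal sketch using the pattern quoted above, not the project's own validator; `validate_identifier` is a name chosen for the example. It uses `fullmatch` because in Python a `$` anchor also matches just before a trailing newline.

```python
import re

# Same character classes as the pattern quoted above; fullmatch supplies the anchoring
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name):
    """Return the name unchanged if it is a safe table/column identifier, else raise."""
    if not isinstance(name, str) or IDENTIFIER_RE.fullmatch(name) is None:
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


for candidate in ["products", "chunk_order", "1abc", "products; DROP TABLE x", "name\n"]:
    try:
        validate_identifier(candidate)
        print(f"{candidate!r}: ok")
    except ValueError as exc:
        print(f"{candidate!r}: rejected ({exc})")
```

Identifiers cannot be bound as query parameters, which is why they need an allow-list check like this before being interpolated into SQL.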

## Built for the MariaDB Ecosystem

<p align="center">
  <img src="docs/assets/mariadb-logo.svg" alt="MariaDB" width="200"/>
</p>

- **MariaDB 11.7.2+** VECTOR columns, HNSW indexes, VEC_DISTANCE_COSINE
- **Native binary protocol** via [`mariadb-connector-python`](https://mariadb.com/docs/connectors/mariadb-python/) (array.array float32, no `VEC_FromText` round-trip)
- **Connection pooling** with unique pool names for concurrent instances
- **Version validation** (>= 11.7.2) on init

**Where this fits in the MariaDB ecosystem:**

| Layer | Project | Relationship |
|---|---|---|
| Database | [MariaDB Server 11.7+](https://mariadb.com/kb/en/vector-overview/) | Required runtime — VECTOR/HNSW lands in 11.7.2 |
| Driver | [mariadb-connector-python](https://github.com/mariadb-corporation/mariadb-connector-python) | Used directly via `mariadb.ConnectionPool` |
| Knowledge Base | [VEC_DISTANCE_COSINE](https://mariadb.com/kb/en/vec_distance_cosine/), [VECTOR INDEX](https://mariadb.com/kb/en/create-table-with-vectors/) | Reference docs for the SQL we generate |
| Hackathons | [MariaDB Python Hackathon 2025 winners](https://mariadb.org/adaptive-query-optimizer-for-mariadb-vector-innovation-winner-of-mariadb-python-hackathon-2025/) | Sister projects (Adaptive Query Optimizer, Apache Airflow integration) |
| Distribution | [PyPI: `seamless-rag`](https://pypi.org/project/seamless-rag/) | `pip install "seamless-rag[mariadb]"` |

## License

```
Copyright 2026 LiuWei (SunflowersLwtech)
Licensed under the Apache License, Version 2.0
```

See [LICENSE](LICENSE) | [CONTRIBUTING](CONTRIBUTING.md) | [Documentation](https://mariadb-hackathon-my-2026.github.io/seamless-rag/)
