Metadata-Version: 2.4
Name: datus-agent
Version: 0.3.0rc3
Summary: AI-powered SQL Agent for data engineering (Compiled Version)
Author-email: Datus Team <harrison.zhao@datus.ai>
Maintainer-email: Datus Team <harrison.zhao@datus.ai>
License-Expression: Apache-2.0
Project-URL: Homepage, https://datus.ai/
Project-URL: Documentation, https://docs.datus.ai
Project-URL: Repository, https://github.com/datus-ai/datus-agent
Project-URL: Bug Tracker, https://github.com/datus-ai/datus-agent/issues
Keywords: sql,ai,agent,database,nlp,natural-language
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datus-db-core>=0.1.3
Requires-Dist: datus-storage-base<0.2.0,>=0.1.2
Requires-Dist: python-dotenv==1.0.0
Requires-Dist: pandas==2.1.4
Requires-Dist: sqlalchemy==2.0.23
Requires-Dist: sqlglot>=26.12.0
Requires-Dist: pyyaml==6.0.1
Requires-Dist: structlog>=23.1.0
Requires-Dist: openai>=2.8.0
Requires-Dist: httpx[socks]==0.27.2
Requires-Dist: tantivy>=0.22.2
Requires-Dist: aiohttp>=3.11.16
Requires-Dist: xlsxwriter>=3.2.2
Requires-Dist: openai-agents[litellm]==0.7.0
Requires-Dist: litellm<2,>=1.67.4.post1
Requires-Dist: pydantic<3.0,>=2.11.7
Requires-Dist: lancedb==0.18.0
Requires-Dist: pyarrow<19.0.0
Requires-Dist: rich==14.0.0
Requires-Dist: prompt_toolkit>=3.0.51
Requires-Dist: pygments>=2.18.0
Requires-Dist: textual[syntax]==5.1.1
Requires-Dist: anthropic==0.51.0
Requires-Dist: duckdb-engine>=0.17.0
Requires-Dist: duckdb==1.3.0
Requires-Dist: mcp>=1.11.0
Requires-Dist: anyio>=4.9.0
Requires-Dist: json-repair>=0.47.6
Requires-Dist: fastapi<1.0,>=0.104.0
Requires-Dist: uvicorn>=0.24.0
Requires-Dist: pyperclip==1.9.0
Requires-Dist: wcmatch>=10.0
Requires-Dist: fastembed==0.4.2
Requires-Dist: charset-normalizer>=3.4.2
Requires-Dist: defusedxml>=0.7.1
Requires-Dist: PyGithub>=2.1.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=5.0.0
Requires-Dist: markdown-it-py>=3.0.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: datus-semantic-core>=0.2.0
Requires-Dist: datus-bi-core>=0.1.2
Requires-Dist: datus-scheduler-core>=0.1.1
Dynamic: license-file

<p align="center">
  <strong>Datus — Open-Source Data Engineering Agent</strong>
</p>

<p align="center">
  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/license-Apache%202.0-blueviolet?style=for-the-badge" alt="License"></a>
  <a href="https://datus.ai"><img src="https://img.shields.io/badge/Website-5A0FC8?style=for-the-badge" alt="Website"></a>
  <a href="https://docs.datus.ai/"><img src="https://img.shields.io/badge/Docs-654FF0?style=for-the-badge" alt="Docs"></a>
  <a href="https://docs.datus.ai/getting_started/Quickstart/"><img src="https://img.shields.io/badge/Quick%20Start-3423A6?style=for-the-badge" alt="Quick Start"></a>
  <a href="https://docs.datus.ai/release_notes/"><img src="https://img.shields.io/badge/Release%20Notes-092540?style=for-the-badge" alt="Release Notes"></a>
  <a href="https://join.slack.com/t/datus-ai/shared_invite/zt-3g6h4fsdg-iOl5uNoz6A4GOc4xKKWUYg"><img src="https://img.shields.io/badge/Slack-4A154B?style=for-the-badge&logo=slack&logoColor=white" alt="Slack"></a>
</p>

---

## What is Datus?

**Datus** is an open-source data engineering agent that builds **evolvable context** for your data system — turning natural language into accurate SQL through domain-aware reasoning, semantic search, and continuous learning.

Data engineering is shifting from "building tables and pipelines" to "delivering scoped, domain-aware agents for analysts and business users." Datus makes that shift concrete.

![Datus Architecture](docs/assets/datus_architecture.svg)

## Key Features

### Build Evolvable Context, Not Static Pipelines

Traditional data engineering ends at data delivery. Datus goes further — it builds a **living knowledge base** that captures schema metadata, reference SQL, semantic models, metrics, and domain knowledge into a unified context layer. This context is what makes LLM-generated SQL accurate and trustworthy, and it improves with every interaction through a continuous learning loop. → [Contextual Data Engineering](https://docs.datus.ai/getting_started/contextual_data_engineering/)

### From Exploration to Domain-Specific Agents

Datus provides a complete journey for data engineers: start with a **Claude-Code-like CLI** to explore your data interactively, use [Plan Mode](https://docs.datus.ai/cli/plan_mode/) to review before executing, and build up context over time. When a domain matures, package it into a **Subagent** — a scoped chatbot with curated context, tools, and business rules — and deliver it to analysts via web, API, or MCP. → [Subagent docs](https://docs.datus.ai/subagent/introduction/)

### Metrics and Semantic Layer

Go beyond raw SQL with pluggable **semantic adapters**. Define business metrics in YAML via [MetricFlow](https://docs.datus.ai/metricflow/introduction/) integration, and let Datus generate SQL from metric queries — bridging the gap between business language and database dialect. Use [Dashboard Copilot](https://docs.datus.ai/getting_started/dashboard_copilot/) to turn existing BI dashboards into conversational analytics. → [Semantic Adapters docs](https://docs.datus.ai/adapters/semantic_adapters/)

### Measure and Improve

Built-in evaluation framework supporting **BIRD** and **Spider 2.0-Snow** datasets. Benchmark your agent's SQL accuracy, compare configurations, and track improvements as context evolves. → [Benchmark docs](https://docs.datus.ai/benchmark/benchmark_manual/)

### Open Platform

- **10+ LLM providers** (OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi, OpenRouter, and more) with per-node model assignment — mix models within a single workflow
- **11 databases** — Built-in SQLite & DuckDB, plus pluggable adapters for PostgreSQL, MySQL, Snowflake, StarRocks, ClickHouse, and more
- **MCP Protocol** — Both an MCP server (exposing Datus tools to Claude Desktop, Cursor, etc.) and an MCP client (consuming external tools via `.mcp` in the CLI). → [MCP docs](https://docs.datus.ai/integration/mcp/)
- **Skills** — Extend Datus with [agentskills.io](https://agentskills.io)-style packaged tools, configurable permissions, and marketplace support. → [Skills docs](https://docs.datus.ai/integration/skills/)

## Getting Started

### Install

**Requirements:** Linux or macOS. Python 3.12 is installed automatically when you use the one-liner.

#### One-liner (Linux / macOS)

Stable install from PyPI:

```bash
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | sh
```

This creates a dedicated venv at `~/.datus/venv`, installs `datus-agent` from PyPI into it, and drops `datus`, `datus-cli`, `datus-api`, `datus-mcp`, `datus-agent`, `datus-gateway`, and `datus-pip` shims into `~/.local/bin`. Open a new shell (or `source ~/.zshrc`) to pick up PATH, then run `datus` to launch the REPL — use `/model` to configure an LLM, `/datasource` to add a datasource, and (optionally) `/init` to generate `AGENTS.md` for the current project.

To install additional Python packages into the global venv later, use `datus-pip install <package>` (it is a shim for `~/.datus/venv/bin/pip`).

Dev install from GitHub source (picks up unreleased changes):

```bash
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | sh
# or pin to a branch / tag / commit
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install-dev.sh | DATUS_REF=feature/foo sh
```

Pin a PyPI version (stable installer only):

```bash
curl -fsSL https://raw.githubusercontent.com/datus-ai/datus-agent/main/install.sh | DATUS_VERSION=0.2.6 sh
```

Other variables supported by both installers: `DATUS_HOME` (default `~/.datus`), `DATUS_BIN_DIR` (default `~/.local/bin`), `DATUS_FORCE=1` to recreate the venv, `DATUS_NO_MODIFY_PATH=1` to skip shell rc edits.

#### Manual install

```bash
pip install datus-agent
datus
```

After the REPL starts, run `/model` to configure an LLM, `/datasource` to add a datasource, and (optionally) `/init` to generate `AGENTS.md` for the current project. For detailed guidance, see the [Quickstart Guide](https://docs.datus.ai/getting_started/Quickstart/).

### Four Ways to Use Datus

| Interface | Command | Use Case |
|-----------|---------|----------|
| **CLI** (Interactive REPL) | `datus-cli --datasource demo` | Data engineers exploring data, building context, creating subagents |
| **Web Chatbot** (Streamlit) | `datus-cli --web --datasource demo` | Analysts chatting with subagents via browser (`http://localhost:8501`) |
| **API Server** (FastAPI) | `datus-api --datasource demo` | Applications consuming data services via REST (`http://localhost:8000`) |
| **MCP Server** | `datus-mcp --datasource demo` | MCP-compatible clients (Claude Desktop, Cursor, etc.) |

> **Tip:** Use `datus-cli --print --datasource demo` for JSON streaming to stdout — useful for piping into other tools.

## Architecture

### Workflow Engine

Datus uses a configurable **node-based workflow engine**. Each workflow is a plan of nodes executed in sequence, parallel, or as sub-workflows:

```yaml
workflow:
  plan: planA
  planA:
    - schema_linking     # Find relevant tables
    - parallel:          # Run in parallel
      - generate_sql     # SQL generation
      - reasoning        # Chain-of-thought reasoning
    - selection          # Pick the best result
    - execute_sql        # Run the query
    - output             # Format and return
```

### Node Types

| Category | Nodes |
|----------|-------|
| **Core** | `schema_linking`, `generate_sql`, `execute_sql`, `reasoning`, `reflect`, `output` |
| **Agentic** | `chat`, `explore`, `gen_semantic_model`, `gen_metrics`, `gen_ext_knowledge`, `gen_sql_summary`, `gen_skill`, `gen_table`, `compare` |
| **Control Flow** | `parallel`, `selection`, `subworkflow` |
| **Utility** | `date_parser`, `doc_search`, `fix` |

### RAG Knowledge Base

The knowledge base is powered by **LanceDB** and organizes context into multiple layers:

- **Schema Metadata** — Table and column descriptions, relationships
- **Reference SQL** — Curated query examples with summaries
- **Reference Templates** — Parameterized Jinja2 SQL templates for stable, reusable queries
- **Semantic Models** — Business logic and metric definitions
- **Metrics** — Executable business metrics via semantic layer integration
- **External Knowledge** — Domain rules and concepts beyond raw schema
- **Platform Docs** — Ingested from GitHub repos, websites, or local files

Build the knowledge base with:

```bash
datus-agent bootstrap-kb --datasource demo --components metadata,reference_sql,ext_knowledge
```

## Configuration

Datus is configured via `agent.yml`. Launch `datus` and use `/model` plus `/datasource` to populate it interactively, or copy [`conf/agent.yml.example`](conf/agent.yml.example) and edit it by hand.

| Section | Purpose |
|---------|---------|
| `agent.models` | LLM provider definitions (API keys, model IDs, base URLs) |
| `agent.nodes` | Per-node model assignment and tuning parameters |
| `agent.services.datasources` | Database connections (SQLite, DuckDB, Snowflake, etc.) |
| `agent.storage` | Embedding models, vector DB, and RAG configuration |
| `agent.workflow` | Execution plans with sequential, parallel, and sub-workflow steps |
| `agent.agentic_nodes` | Configuration for agentic nodes (semantic model gen, metrics gen) |
| `agent.document` | Platform documentation sources (GitHub repos, websites, local files) |

API keys are injected via environment variables using `${ENV_VAR}` syntax.

## Supported LLM Providers

| Provider | Type | Notes |
|----------|------|-------|
| OpenAI | `openai` | GPT-4o, GPT-4, etc. |
| Anthropic Claude | `claude` | Direct API |
| Google Gemini | `gemini` | Gemini 2.0+ |
| DeepSeek | `deepseek` | DeepSeek-Chat, DeepSeek-Coder |
| Alibaba Qwen | `qwen` | Qwen series |
| Moonshot Kimi | `kimi` | Kimi models |
| MiniMax | `minimax` | MiniMax models |
| GLM (Zhipu) | `glm` | GLM-4 series |
| OpenAI Codex | `codex` | OAuth-based Codex models (gpt-5.3-codex, o3-codex) |
| OpenRouter | `openrouter` | 300+ models via a single API key |

**Embedding models:** OpenAI, Sentence-Transformers, FastEmbed, Hugging Face.

Per-node model assignment lets you use different providers for different workflow steps (e.g., a cheaper model for schema linking, a stronger model for SQL generation).

## Supported Databases

| Database | Type | Package |
|----------|------|---------|
| SQLite | `sqlite` | Built-in |
| DuckDB | `duckdb` | Built-in |
| PostgreSQL | `postgresql` | [`datus-postgresql`](https://github.com/Datus-ai/Datus-adapters) |
| MySQL | `mysql` | [`datus-mysql`](https://github.com/Datus-ai/Datus-adapters) |
| Snowflake | `snowflake` | [`datus-snowflake`](https://github.com/Datus-ai/Datus-adapters) |
| StarRocks | `starrocks` | [`datus-starrocks`](https://github.com/Datus-ai/Datus-adapters) |
| ClickHouse | `clickhouse` | [`datus-clickhouse`](https://github.com/Datus-ai/Datus-adapters) |
| ClickZetta | `clickzetta` | [`datus-clickzetta`](https://github.com/Datus-ai/Datus-adapters) |
| Hive | `hive` | [`datus-hive`](https://github.com/Datus-ai/Datus-adapters) |
| Spark | `spark` | [`datus-spark`](https://github.com/Datus-ai/Datus-adapters) |
| Trino | `trino` | [`datus-trino`](https://github.com/Datus-ai/Datus-adapters) |

See [Database Adapters documentation](https://docs.datus.ai/adapters/db_adapters/) for details.

## How It Works

![How It Works](docs/assets/how_it_works.svg)

**Explore** — Chat with your database, test queries, and ground prompts with `@table` or `@file` references.

```bash
datus-cli --datasource demo
/Check the top 10 banks by assets lost @table duckdb-demo.main.bank_failures
```

**Build Context** — Generate semantic models, import SQL history, define metrics. Each piece becomes reusable context for future queries.

```bash
/gen_semantic_model xxx        # Generate semantic model from tables
/gen_sql_summary               # Index SQL history for retrieval
```

**Create a Subagent** — Package mature context into a scoped, domain-aware chatbot with curated tools and business rules.

```bash
.subagent add mychatbot        # Create a new subagent
```

**Deliver** — Serve the subagent to analysts via web (`localhost:8501/?subagent=mychatbot`), REST API, or MCP — with feedback collection (upvotes, issue reports) built in.

**Measure** — Run benchmarks against BIRD or Spider 2.0-Snow to track SQL accuracy as context evolves.

**Iterate** — Analyst feedback loops back: engineers fix SQL, add rules, refine semantic models, and extend with Skills or MCP tools. The agent gets more accurate over time.

→ [End-to-end tutorial](https://docs.datus.ai/getting_started/contextual_data_engineering/#part-2--hands-on-tutorial-california-schools) · [CLI docs](https://docs.datus.ai/cli/introduction/) · [Knowledge Base docs](https://docs.datus.ai/knowledge_base/introduction/) · [Subagent docs](https://docs.datus.ai/subagent/introduction/)

## Development

```bash
uv sync                                           # Install dependencies
uv run pytest tests/unit_tests/ -q                # Run CI tests (no external deps)
uv run ruff format . && uv run ruff check --fix . # Lint & format
```

Enable `--save_llm_trace` on CLI commands or set `save_llm_trace: true` per model in `agent.yml` to persist LLM inputs/outputs for debugging. → [LLM Trace docs](https://docs.datus.ai/training/llm_trace_usage/)

See [CLAUDE.md](CLAUDE.md) for full development conventions, architecture patterns, and testing rules.

## License

[Apache 2.0](LICENSE)
