Metadata-Version: 2.4
Name: databridge-ai
Version: 0.44.0
Summary: DataBridge AI - An open-source, MCP-native data reconciliation engine with tools for hierarchy management, data quality, and analytics
Project-URL: Homepage, https://github.com/datanexum/DATABRIDGE_AI
Project-URL: Documentation, https://github.com/datanexum/DATABRIDGE_AI#readme
Project-URL: Repository, https://github.com/datanexum/DATABRIDGE_AI.git
Project-URL: Issues, https://github.com/datanexum/DATABRIDGE_AI/issues
Author-email: DataBridge AI Team <support@databridge.ai>
License: MIT License
        
        Copyright (c) 2024-2026 DataBridge AI Team
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: ai,analytics,blce,business-logic,cortex,data,data-catalog,data-mart,data-quality,dbt,etl,hierarchy,lineage,mcp,reconciliation,snowflake
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Requires-Dist: cryptography>=42.0.0
Requires-Dist: fastmcp>=2.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: rapidfuzz>=3.0.0
Requires-Dist: sqlalchemy>=2.0.0
Provides-Extra: ai-agents
Requires-Dist: pydantic-ai>=0.1.0; extra == 'ai-agents'
Requires-Dist: vanna[anthropic,chromadb]>=0.7.0; extra == 'ai-agents'
Provides-Extra: all
Requires-Dist: duckdb>=1.0.0; extra == 'all'
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: polars>=1.0.0; extra == 'all'
Requires-Dist: pydantic-ai>=0.1.0; extra == 'all'
Requires-Dist: pypdf>=3.0.0; extra == 'all'
Requires-Dist: pytesseract>=0.3.10; extra == 'all'
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'all'
Requires-Dist: vanna[anthropic,chromadb]>=0.7.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: build>=1.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: twine>=4.0.0; extra == 'dev'
Provides-Extra: ocr
Requires-Dist: pillow>=10.0.0; extra == 'ocr'
Requires-Dist: pytesseract>=0.3.10; extra == 'ocr'
Provides-Extra: pdf
Requires-Dist: pypdf>=3.0.0; extra == 'pdf'
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'snowflake'
Provides-Extra: turbo
Requires-Dist: duckdb>=1.0.0; extra == 'turbo'
Requires-Dist: polars>=1.0.0; extra == 'turbo'
Description-Content-Type: text/markdown

# DataBridge AI

[![PyPI version](https://badge.fury.io/py/databridge-ai.svg)](https://pypi.org/project/databridge-ai/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

**DataBridge AI** is a headless, MCP-native data and implementation engine. Tool availability is license-dependent (Community/Pro/Enterprise), with platform-wide coverage up to **267 tools**.

## Features

| Module | Tools | Description |
|--------|-------|-------------|
| **Hierarchy Builder** | 49 | Create and manage multi-level hierarchy projects (up to 15 levels) |
| **Data Reconciliation** | 38 | Compare and validate data from CSV, SQL, PDF, JSON sources |
| **Wright Module** | 31 | Hierarchy-driven data mart generation with 4-object pipeline |
| **Cortex AI** | 26 | Snowflake Cortex AI with natural language to SQL |
| **Data Catalog** | 19 | Centralized metadata registry with business glossary |
| **Templates & Skills** | 16 | Pre-built templates and AI expertise definitions |
| **Data Observability** | 15 | Real-time metrics, alerting, anomaly detection, health scoring |
| **Data Versioning** | 12 | Semantic versioning, snapshots, rollback, and diff |
| **Git/CI-CD** | 12 | Automated workflows and GitHub integration |
| **Lineage Tracking** | 11 | Column-level lineage and impact analysis |
| **PlannerAgent** | 11 | AI-powered workflow planning and agent suggestions |
| **ScriptGenerator** | 3 | Plan → executable Python script (E2E, BLCE, or Custom strategies) |
| **Workflow Controller** | 2 | Subprocess management, phase tracking, live monitoring |
| **GraphRAG Engine** | 10 | Anti-hallucination layer with graph + vector retrieval |
| **Unified AI Agent** | 10 | Cross-system operations (Book/Librarian/Researcher) |
| **dbt Integration** | 8 | Generate dbt projects from hierarchies |
| **Data Quality** | 7 | Expectation suites and data contracts |
| **DataShield** | 12 | Offline pre-AI masking with preserved table purpose and analytical patterns |
| **Diff Utilities** | 6 | Character-level text and data comparison |
| **Hierarchy-Graph Bridge** | 5 | Event-driven sync between hierarchies, GraphRAG, and lineage |
| **AI Relationship Discovery** | 8 | Schema analysis, naming pattern detection, relationship inference |
| **Cortex Table Understanding** | 5 | AI-powered table documentation and graph sync |
| **BLCE** | 84 | Business Logic Comprehension Engine — extract, normalize, govern, and deploy logic from SQL, Python, Excel, DAX, MDX, PDF |
| **Mart Factory** | 10 | Hierarchy-driven data mart generation with automated 4-object DDL pipeline |
| **And more...** | 44 | Console dashboard, recommendations, orchestrator, etc. |

### Hierarchy-Centric Architecture

DataBridge AI v0.43.0 introduces a hierarchy-centric architecture where the **Hierarchy Builder** serves as the central data structure. The new **Hierarchy-Graph Bridge** (`hierarchy/graph_bridge.py`) uses event-driven propagation to keep all downstream systems in sync:

- **Hierarchy changes** trigger automatic reindexing in the GraphRAG vector store
- **Lineage graphs** are rebuilt to reflect updated hierarchy relationships
- **Impact analysis** propagates through the catalog and observability subsystems
- **Wright Pipeline** can generate data marts directly from hierarchy projects via `wright_from_hierarchy`

### GraphRAG Runtime Health Monitoring

GraphRAG now includes startup bootstrap and runtime health logging so graph/vector/RAG synchronization can be verified during operation.

- **Startup bootstrap (automatic):**
  - Initializes embedding provider + vector store
  - Performs best-effort indexing for catalog and hierarchies
- **Runtime monitor (automatic):**
  - Writes periodic status snapshots every ~120 seconds
  - Confirms vector/embedder/retriever readiness and hierarchy bridge wiring

Use unified ops action to check runtime state:

```python
rag_ops(action="runtime_status", log_limit=50)
```

Runtime health log file:

```text
data/graphrag/runtime_health.jsonl
```

Codex GraphRAG index behavior:
- `codex_graphrag` auto-builds index if missing
- auto-refreshes index when `docs/Codex/*.json` is newer than `data/codex_graphrag/index.json`

### DataShield Positioning

DataShield is not positioned as a warehouse security boundary. It is an **offline Python-based pre-AI masking layer** for clients who do not want raw data sent to AI.

- Raw client data is masked before any AI processing in DataBridge.
- AI sees only post-processed scrambled/synthetic data.
- DataShield stores table summary and purpose metadata before masking so semantic context is preserved after scrambling.
- Post-processed Snowflake data is used for relationship discovery, profiling, and warehouse design, and is intentionally de-correlated from client source data.

## Editions

DataBridge AI is available in four editions:

| | **Community (CE)** | **Pro** | **Pro Examples** | **Enterprise** |
|---|:---:|:---:|:---:|:---:|
| **Tools** | ~106 | ~247 | Tests & Tutorials | 257+ |
| **Distribution** | Public PyPI | GitHub Packages | GitHub Packages | Private Deploy |
| **License** | MIT (Free) | License Key | Requires Pro Key | Dedicated Key |
| Data Reconciliation | ✅ | ✅ | | ✅ |
| Fuzzy Matching | ✅ | ✅ | | ✅ |
| PDF/OCR Extraction | ✅ | ✅ | | ✅ |
| dbt Basic | ✅ | ✅ | | ✅ |
| Data Quality | ✅ | ✅ | | ✅ |
| UI Dashboard | ✅ | ✅ | | ✅ |
| Cortex AI Agent | | ✅ | | ✅ |
| Wright Pipeline | | ✅ | | ✅ |
| GraphRAG Engine | | ✅ | | ✅ |
| Data Observability | | ✅ | | ✅ |
| Full Data Catalog | | ✅ | | ✅ |
| Column Lineage | | ✅ | | ✅ |
| Hierarchy Builder | | ✅ | | ✅ |
| AI Orchestrator | | ✅ | | ✅ |
| Data Versioning | | ✅ | | ✅ |
| Git/CI-CD Automation | | ✅ | | ✅ |
| Table Understanding | | ✅ | | ✅ |
| AI Relationship Discovery | | ✅ | | ✅ |
| 47 Tests + 29 Tutorials | | | ✅ | |
| Custom Agents | | | | ✅ |
| White-label | | | | ✅ |
| SLA Support | | | | ✅ |

See [Commercialization Guide](docs/COMMERCIALIZATION.md) for full details.

## DataBridge AI Pro

Pro adds **19 modules** with ~213 additional tools on top of Community Edition:

| Module | Tools | Description |
|--------|-------|-------------|
| **Hierarchy Builder** | 49 | Multi-level hierarchy projects (up to 15 levels) |
| **Wright Pipeline** | 31 | 4-object data mart factory (VW_1 → DT_2 → DT_3A → DT_3) with hierarchy integration |
| **Cortex AI** | 26 | Snowflake Cortex natural language to SQL and AI reasoning |
| **Data Catalog** | 19 | Centralized metadata registry with automatic lineage |
| **Faux Objects** | 18 | Domain persona-based hierarchy generation |
| **Connections** | 16 | Multi-database connectivity management |
| **AI Orchestrator** | 16 | Multi-agent task coordination and workflow management |
| **Data Observability** | 15 | Real-time metrics, alerting, anomaly detection |
| **Data Versioning** | 12 | Semantic versioning, snapshots, rollback, and diff |
| **Git/CI-CD** | 12 | Automated workflows and GitHub integration |
| **Lineage Tracking** | 11 | Column-level lineage and impact analysis |
| **PlannerAgent** | 11 | AI-powered workflow planning and agent suggestions |
| **GraphRAG Engine** | 10 | Anti-hallucination layer with graph + vector retrieval |
| **Unified AI Agent** | 10 | Cross-system operations (Book/Librarian/Researcher) |
| **Hierarchy-Graph Bridge** | 5 | Event-driven sync between hierarchies, GraphRAG, and lineage |
| **Console Dashboard** | 5 | Real-time broadcast messaging and monitoring |
| **Schema Matcher** | 5 | Cross-database schema comparison and mapping |
| **AI Relationship Discovery** | 8 | Schema analysis, naming patterns, relationship inference |
| **Cortex Table Understanding** | 5 | AI-powered table documentation and graph sync |
| **Data Matcher** | 4 | Row-level data comparison across connections |
| **BLCE** | 72 | Business Logic Comprehension Engine — extract, normalize, govern, deploy business logic |
| **DataShield** | 12 | Offline pre-AI masking; AI only sees post-processed scrambled data |

## Pro Examples

The **Pro Examples** add-on (`databridge-ai-examples`) provides a comprehensive tests and tutorials package:

- **29 use-case tutorials**: Beginner (4), Financial (7), Faux Objects (8), ERP DW (10)
- **47 test files**: CE test suite + Pro test suite + shared fixtures
- Requires a valid Pro license key

## Installation

```bash
# Community Edition (free, from PyPI)
pip install databridge-ai

# With optional dependencies
pip install databridge-ai[pdf]       # PDF support
pip install databridge-ai[snowflake] # Snowflake support
pip install databridge-ai[all]       # All dependencies

# Pro Edition (requires license key)
export DATABRIDGE_LICENSE_KEY="DB-PRO-YOURKEY-20260101-signature"
pip install databridge-ai-pro --extra-index-url https://ghp_TOKEN@raw.githubusercontent.com/datanexum/DATABRIDGE_AI/main/

# Pro Examples (requires Pro license key)
pip install databridge-ai-examples                # CE tests + beginner tutorials
pip install databridge-ai-examples[pro]           # + Pro tests + advanced tutorials
```

## Licensing

DataBridge AI uses a key-based licensing system for Pro and Enterprise editions.

**License Key Format:**
```
DB-{TIER}-{CUSTOMER_ID}-{EXPIRY}-{SIGNATURE}
Example: DB-PRO-ACME001-20270209-a1b2c3d4e5f6
```

```bash
# Generate a license key (admin)
python scripts/generate_license.py PRO CUSTOMER01 365

# Check license status (MCP tool)
get_license_status()
```

See [docs/COMMERCIALIZATION.md](docs/COMMERCIALIZATION.md) for key management, distribution, and API reference.

## Quick Start

### As MCP Server (Claude Desktop)

```bash
python run_server.py          # Slim mode (default, ~15 tools)
python run_server.py --full   # Full mode (all tools)
```

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "DataBridge_AI": {
      "command": "python",
      "args": ["run_server.py"]
    }
  }
}
```

### Web UI Dashboard

```bash
cd databridge-ce/ui
python server.py
# Open http://127.0.0.1:5050
```

### Workbench UI Features

The Web UI includes several guided workflows:

- **AI Planner** — Conversational workflow builder powered by Claude. Describe what you want in natural language and get an executable multi-step plan with agent assignments and confidence scores.
- **Execution Wizard** — 3-step guided modal (Connection → Options → Review) for configuring plan execution. Discovers Snowflake connections, lets you set row limits, ERP type, and skip phases before generating a script.
- **Workflow Controller** — Live monitoring of running workflows via vis.js flow graph and Mermaid flowchart with 4 layout modes (LR, TB, Swimlane, Freeform).
- **Onboarding Banner** — First-run welcome card with guided actions to help new users get started.

### Programmatic Usage

```python
from src.server import mcp

# Run as MCP server
mcp.run()
```

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│  Workbench UI (:5050)           │  Claude / LLM Client      │
│  AI Planner · Workflow Editor   │  MCP Protocol             │
│  Execution Wizard · Dashboard   │                           │
└──────────────┬──────────────────┴──────────┬────────────────┘
               │ HTTP/WS                     │ MCP
               ▼                             ▼
┌─────────────────────────────────────────────────────────────┐
│            Flask Server + MCP Server (267 Tools)            │
│                                                             │
│  AI Layer                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Planner    │  │  Script     │  │  Workflow   │         │
│  │  Agent      │  │  Generator  │  │  Controller │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                             │
│  Plugin Modules                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Hierarchy  │  │   Cortex    │  │   Wright    │         │
│  │   Builder   │  │   Agent     │  │   Module    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │    Data     │  │   Lineage   │  │    Data     │         │
│  │   Catalog   │  │   Tracker   │  │   Quality   │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   GraphRAG  │  │ Observabil- │  │    Data     │         │
│  │   Engine    │  │    ity      │  │ Versioning  │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│  ┌─────────────┐                                            │
│  │    BLCE     │  Business Logic Comprehension Engine       │
│  │  (72 tools) │  SQL/Python/Excel/DAX/MDX/PDF parsing      │
│  └─────────────┘                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Snowflake / CSV / SQL / PDF                     │
└─────────────────────────────────────────────────────────────┘
```

## Core Modules

### Hierarchy Builder
Create and manage multi-level hierarchy projects for financial reporting, organizational structures, and data classification.

```python
# Create a hierarchy project
create_hierarchy_project(name="Revenue P&L", description="Revenue hierarchy")

# Add hierarchies with source mappings
create_hierarchy(project_id="...", name="Product Revenue", parent_id="...")
add_source_mapping(hierarchy_id="...", source_column="ACCOUNT_CODE", source_uid="41%")

# Export and deploy
export_hierarchy_csv(project_id="...")
generate_hierarchy_scripts(project_id="...")
```

### Cortex AI Integration
Natural language to SQL and AI-powered data operations using Snowflake Cortex.

```python
# Configure Cortex
configure_cortex_agent(connection_id="snowflake-prod", cortex_model="mistral-large")

# Natural language query
analyst_ask(question="What was total revenue by region last quarter?",
            semantic_model_file="@ANALYTICS.PUBLIC.MODELS/sales.yaml")

# AI reasoning loop
cortex_reason(goal="Analyze data quality in PRODUCTS table")
```

### Wright Module (Data Mart Factory)
Generate data marts using the 4-object pipeline pattern: VW_1 → DT_2 → DT_3A → DT_3

```python
# Create mart configuration
create_mart_config(project_name="upstream_gross", report_type="GROSS",
                   hierarchy_table="TBL_0_GROSS_LOS_REPORT_HIERARCHY")

# Generate pipeline
generate_mart_pipeline(config_name="upstream_gross")
```

### Data Versioning
Track changes to hierarchies, catalog assets, and semantic models with semantic versioning.

```python
# Create version snapshot
version_create(object_type="hierarchy", object_id="revenue-pl",
               description="Added new cost centers", bump="minor")

# Compare versions
version_diff(object_type="hierarchy", object_id="revenue-pl",
             from_version="1.0.0", to_version="1.1.0")

# Rollback
version_rollback(object_type="hierarchy", object_id="revenue-pl", to_version="1.0.0")
```

### Data Observability
Real-time metrics collection, alerting, anomaly detection, and health scoring.

```python
# Record metrics
obs_record_metric(name="hierarchy.validation.success_rate", value=98.5,
                  type="gauge", tags='{"project_id": "revenue-pl"}')

# Create alert rules
obs_create_alert_rule(name="Low success rate",
                      metric_name="hierarchy.validation.success_rate",
                      threshold=95.0, comparison="<", severity="warning")

# Detect anomalies
obs_detect_anomaly(metric_name="hierarchy.validation.success_rate", value=72.0)

# Get asset health score
obs_get_asset_health(asset_id="revenue-pl", asset_type="hierarchy_project")

# System health dashboard
obs_get_system_health()
```

### GraphRAG Engine
Anti-hallucination layer using graph + vector retrieval-augmented generation.

```python
# Unified search action
rag_search(action="search", query="What hierarchies use ACCOUNT_CODE?", max_results=5)

# Validate AI output against knowledge graph
rag_ops(action="validate_output", content="SELECT * FROM REVENUE", content_type="sql")

# Runtime health (bootstrap + bridge wiring + vector readiness)
rag_ops(action="runtime_status", log_limit=25)
```

## Configuration

Create a `.env` file:

```env
# Data directory
DATA_DIR=./data

# License key (for Pro/Enterprise)
DATABRIDGE_LICENSE_KEY=DB-PRO-YOURKEY-20260101-signature

# NestJS backend (optional)
NESTJS_BACKEND_URL=http://localhost:8001
NESTJS_API_KEY=your-api-key

# Snowflake (optional)
SNOWFLAKE_ACCOUNT=your-account
SNOWFLAKE_USER=your-user
SNOWFLAKE_PASSWORD=your-password

# Cortex AI
CORTEX_DEFAULT_MODEL=mistral-large
```

## Documentation

- **[CLAUDE.md](CLAUDE.md)** - Complete tool reference and usage guide for Claude
- **[GEMINI.md](GEMINI.md)** - Gemini integration guide and quick reference
- **[docs/MANIFEST.md](docs/MANIFEST.md)** - Auto-generated tool manifest
- **[docs/COMMERCIALIZATION.md](docs/COMMERCIALIZATION.md)** - Licensing, tiers, and distribution
- **[docs/GRAPHRAG_RUNTIME_OPERATIONS.md](docs/GRAPHRAG_RUNTIME_OPERATIONS.md)** - Runtime health, auto-updates, and troubleshooting for graph/vector/RAG sync
- **[docs/compliance/OPEN_SOURCE_COMPLIANCE.md](docs/compliance/OPEN_SOURCE_COMPLIANCE.md)** - OSS compliance process and governance
- **[docs/compliance/THIRD_PARTY_NOTICES.md](docs/compliance/THIRD_PARTY_NOTICES.md)** - Third-party package notices and license review flags
- **[Wiki](../../wiki)** - Architecture, getting started, and tutorials

## Community Edition

The `databridge-ce/` folder contains the open-source Community Edition with:
- Plugin architecture for custom tools
- Web UI dashboard
- Starter templates

## Contributing

Contributions are welcome! Please read our contributing guidelines before submitting PRs.

## License

MIT License - See [LICENSE](LICENSE) for details.

Copyright (c) 2024-2026 DataBridge AI Team
