Metadata-Version: 2.4
Name: nlp2sql
Version: 0.2.0rc6
Summary: Enterprise-ready Natural Language to SQL converter with multi-provider support. Built for production scale (1000+ tables) with Clean Architecture.
Project-URL: Homepage, https://github.com/luiscarbonel1991/nlp2sql
Project-URL: Bug Tracker, https://github.com/luiscarbonel1991/nlp2sql/issues
Project-URL: Documentation, https://nlp2sql.readthedocs.io
Author-email: luiscarbonel1991 <devhighlevel@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,clean-architecture,database,enterprise,llm,multi-provider,mysql,natural-language,nlp,postgres,production,query-generation,schema-filtering,sql
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Requires-Dist: anthropic>=0.18.0
Requires-Dist: asyncpg>=0.29.0
Requires-Dist: azure-identity>=1.15.0
Requires-Dist: boto3>=1.34.0
Requires-Dist: click>=8.0.0
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: google-generativeai>=0.3.0
Requires-Dist: greenlet>=3.2.3
Requires-Dist: httpx>=0.25.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: openai>=1.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: psycopg2-binary>=2.9.0
Requires-Dist: psycopg[binary]>=3.2.13
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: redis>=5.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: sqlalchemy-redshift>=0.7.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: structlog>=24.1.0
Requires-Dist: tenacity>=8.2.0
Requires-Dist: tiktoken>=0.5.0
Provides-Extra: all-embeddings
Requires-Dist: sentence-transformers>=2.5.0; extra == 'all-embeddings'
Provides-Extra: all-providers
Requires-Dist: anthropic>=0.18.0; extra == 'all-providers'
Requires-Dist: google-generativeai>=0.3.0; extra == 'all-providers'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.18.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: black>=23.7.0; extra == 'dev'
Requires-Dist: isort>=5.12.0; extra == 'dev'
Requires-Dist: mypy>=1.7.0; extra == 'dev'
Requires-Dist: myst-parser>=2.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.11.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == 'dev'
Requires-Dist: sphinx>=7.0.0; extra == 'dev'
Provides-Extra: embeddings-local
Requires-Dist: sentence-transformers>=2.5.0; extra == 'embeddings-local'
Provides-Extra: embeddings-openai
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.3.0; extra == 'gemini'
Provides-Extra: mssql
Requires-Dist: pymssql>=2.2.0; extra == 'mssql'
Provides-Extra: mysql
Requires-Dist: mysql-connector-python>=8.0.0; extra == 'mysql'
Requires-Dist: pymysql>=1.1.0; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: oracledb>=2.0.0; extra == 'oracle'
Description-Content-Type: text/markdown

<p align="center">
  <img src="assets/nlp2sql-logo.png" alt="nlp2sql logo" width="400">
</p>

<p align="center">
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.9+-blue.svg" alt="Python 3.9+"></a>
  <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg" alt="Code style: black"></a>
</p>

# nlp2sql

**Enterprise-ready Natural Language to SQL converter with multi-provider support**

Convert natural language queries to optimized SQL using multiple AI providers. Built with Clean Architecture principles for enterprise-scale applications handling 1000+ table databases.

## Features

- **Multiple AI Providers**: OpenAI, Anthropic Claude, Google Gemini - no vendor lock-in
- **Database Support**: PostgreSQL, Amazon Redshift
- **Large Schema Handling**: Vector embeddings and intelligent filtering for 1000+ tables
- **Smart Caching**: Query and schema embedding caching for improved performance
- **Async Support**: Full async/await support
- **Clean Architecture**: Ports & Adapters pattern for maintainability

## Documentation

| Document | Description |
|----------|-------------|
| [Architecture](docs/ARCHITECTURE.md) | Component diagram and data flow |
| [API Reference](docs/API.md) | Python API and CLI command reference |
| [Configuration](docs/CONFIGURATION.md) | Environment variables and schema filters |
| [Enterprise Guide](docs/ENTERPRISE.md) | Large-scale deployment and migration |
| [Redshift Support](docs/Redshift.md) | Amazon Redshift setup and examples |
| [Contributing](CONTRIBUTING.md) | Contribution guidelines |

## Installation

```bash
# With UV (recommended)
uv add nlp2sql

# With pip
pip install nlp2sql

# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]

# With embeddings
pip install nlp2sql[embeddings-local]   # Local embeddings (free)
pip install nlp2sql[embeddings-openai]  # OpenAI embeddings
```

## Quick Start

### 1. Set Environment Variables

```bash
# At least one AI provider key required
export OPENAI_API_KEY="your-openai-key"
# export ANTHROPIC_API_KEY="your-anthropic-key"
# export GOOGLE_API_KEY="your-google-key"
```

### 2. One-Line Usage

```python
import asyncio
import os
from nlp2sql import generate_sql_from_db

async def main():
    result = await generate_sql_from_db(
        database_url="postgresql://user:pass@localhost:5432/mydb",
        question="Show me all active users",
        ai_provider="openai",
        api_key=os.getenv("OPENAI_API_KEY")
    )
    print(result['sql'])
    print(f"Confidence: {result['confidence']}")

asyncio.run(main())
```

### 3. Pre-Initialized Service (Better Performance)

```python
from nlp2sql import create_and_initialize_service

async def main():
    # Initialize once
    service = await create_and_initialize_service(
        database_url="postgresql://user:pass@localhost:5432/mydb",
        ai_provider="openai",
        api_key=os.getenv("OPENAI_API_KEY")
    )

    # Use multiple times
    result1 = await service.generate_sql("Count total users")
    result2 = await service.generate_sql("Show recent orders")
```

### 4. Large Database with Schema Filtering

```python
from nlp2sql import create_and_initialize_service

service = await create_and_initialize_service(
    database_url="postgresql://localhost/enterprise",
    ai_provider="anthropic",  # Best for large schemas (200K context)
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    schema_filters={
        "include_schemas": ["sales", "finance"],
        "exclude_system_tables": True
    }
)

result = await service.generate_sql("Show revenue by month")
```

### 5. CLI Usage

```bash
# Generate SQL
nlp2sql query \
  --database-url postgresql://user:pass@localhost:5432/mydb \
  --question "Show all active users" \
  --explain

# Inspect schema
nlp2sql inspect --database-url postgresql://localhost/mydb

# Benchmark providers
nlp2sql benchmark --database-url postgresql://localhost/mydb
```

## How It Works

```
Question ──► Cache check ──► Schema retrieval ──► Relevance filtering ──► Context building ──► AI generation ──► Validation
                                    │                     │                      │
                              SchemaRepository    FAISS + TF-IDF hybrid   Reuses precomputed
                              (+ disk cache)      + batch scoring          relevance scores
```

1. **Schema retrieval** -- Fetches tables from database via `SchemaRepository` (with disk cache for Redshift)
2. **Relevance filtering** -- FAISS dense search + TF-IDF sparse search (50/50 hybrid) finds candidate tables; batch scoring refines with precomputed embeddings
3. **Context building** -- Builds optimized schema context within token limits, reusing scores from step 2 (zero additional embedding calls)
4. **SQL generation** -- AI provider (OpenAI, Anthropic, or Gemini) generates SQL from question + schema context
5. **Validation** -- SQL syntax and safety checks before returning results

See [Architecture](docs/ARCHITECTURE.md) for the detailed flow with method references and design decisions.

## Provider Comparison

| Provider | Context Size | Best For |
|----------|-------------|----------|
| OpenAI GPT-4 | 128K | Complex reasoning |
| Anthropic Claude | 200K | Large schemas |
| Google Gemini | 1M | High volume, cost efficiency |

See [Configuration](docs/CONFIGURATION.md) for detailed provider setup.

## Architecture

Clean Architecture (Ports & Adapters) with three layers: core entities, port interfaces, and adapter implementations. The schema management layer uses FAISS + TF-IDF hybrid search for relevance filtering at scale.

```
nlp2sql/
├── core/           # Business entities (pure Python, no dependencies)
├── ports/          # Interfaces (AIProviderPort, SchemaRepositoryPort, EmbeddingProviderPort)
├── adapters/       # Implementations (OpenAI, Anthropic, Gemini, PostgreSQL, Redshift)
├── services/       # Orchestration (QueryGenerationService)
├── schema/         # Schema management (SchemaManager, SchemaAnalyzer, SchemaEmbeddingManager)
├── config/         # Pydantic Settings configuration
└── exceptions/     # Custom exception hierarchy
```

See [Architecture](docs/ARCHITECTURE.md) for the full component diagram, data flow, and design decisions.

## Development

```bash
# Clone and install
git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync

# Start test databases
cd docker && docker-compose up -d

# Run tests
uv run pytest

# Code quality
uv run ruff format .
uv run ruff check .
uv run mypy src/
```

## MCP Server

nlp2sql includes a Model Context Protocol server for AI assistant integration.

```json
{
  "mcpServers": {
    "nlp2sql": {
      "command": "python",
      "args": ["/path/to/nlp2sql/mcp_server/server.py"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "NLP2SQL_DEFAULT_DB_URL": "postgresql://user:pass@localhost:5432/mydb"
      }
    }
  }
}
```

Tools: `ask_database`, `explore_schema`, `run_sql`, `list_databases`, `explain_sql`

See [mcp_server/README.md](mcp_server/README.md) for complete setup.

## Contributing

We welcome contributions. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

MIT License - see [LICENSE](LICENSE).

## Author

**Luis Carbonel** - [@luiscarbonel1991](https://github.com/luiscarbonel1991)
