Metadata-Version: 2.4
Name: langchain-clickzetta
Version: 0.1.2
Summary: An integration package connecting ClickZetta and LangChain
Author-email: ClickZetta LangChain Integration Team <support@clickzetta.com>
License: MIT
License-File: LICENSE
Keywords: clickzetta,embeddings,full-text-search,langchain,llm,sql,vector-database
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Requires-Dist: clickzetta-connector-python>=0.8.92
Requires-Dist: clickzetta-zettapark-python>=0.1.3
Requires-Dist: langchain-core>=0.1.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.10.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: lint
Requires-Dist: black>=23.0.0; extra == 'lint'
Requires-Dist: mypy>=1.0.0; extra == 'lint'
Requires-Dist: ruff>=0.1.0; extra == 'lint'
Provides-Extra: test
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'test'
Requires-Dist: pytest-cov>=4.0.0; extra == 'test'
Requires-Dist: pytest-mock>=3.10.0; extra == 'test'
Requires-Dist: pytest>=7.0.0; extra == 'test'
Description-Content-Type: text/markdown

# LangChain ClickZetta

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

🚀 **Enterprise-grade LangChain integration for ClickZetta** - Unlock the power of cloud-native lakehouse with AI-driven SQL queries, high-performance vector search, and intelligent full-text retrieval in a unified platform.

## 📖 Table of Contents

- [Why ClickZetta + LangChain?](#-why-clickzetta--langchain)
- [Core Features](#️-core-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Storage Services](#-storage-services)
- [Comparison with Alternatives](#-comparison-with-alternatives)
- [Advanced Usage](#advanced-usage)
- [Testing](#testing)
- [Development](#development)
- [Contributing](#contributing)

## 🚀 Why ClickZetta + LangChain?

### 🏆 Unique Advantages

**1. Native Lakehouse Architecture**
- ClickZetta's cloud-native lakehouse provides 10x performance improvement over traditional Spark-based architectures
- Unified storage and compute for all data types (structured, semi-structured, unstructured)
- Real-time incremental processing capabilities

**2. True Hybrid Search in Single Table**
- Industry-first single-table hybrid search combining vector and full-text indexes
- No complex joins or multiple tables needed - everything in one place
- Atomic MERGE operations for consistent data updates

**3. Enterprise-Grade Storage Services**
- Complete LangChain BaseStore implementation with sync/async support
- Native Volume integration for binary file storage (models, embeddings)
- SQL-queryable document storage with JSON metadata
- Atomic UPSERT operations using ClickZetta's MERGE INTO

**4. Advanced Chinese Language Support**
- Built-in Chinese text analyzers (IK, standard, keyword)
- Optimized for bilingual (Chinese/English) AI applications
- DashScope integration for state-of-the-art Chinese embeddings

**5. Production-Ready Features**
- Connection pooling and query optimization
- Comprehensive error handling and logging
- Full test coverage (unit + integration)
- Type-safe operations throughout

## 🛠️ Core Features

### 🧠 AI-Powered Query Interface
- **Natural Language to SQL**: Convert questions to optimized ClickZetta SQL
- **Context-Aware**: Understands table schemas and relationships
- **Bilingual Support**: Works seamlessly with Chinese and English queries

### 🔍 Advanced Search Capabilities
- **Vector Search**: High-performance embedding-based similarity search
- **Full-Text Search**: Enterprise-grade inverted index with multiple analyzers
- **True Hybrid Search**: Single-table combined vector + text search (industry first)
- **Metadata Filtering**: Complex filtering with JSON metadata support

### 💾 Enterprise Storage Solutions
- **ClickZettaStore**: High-performance key-value storage using SQL tables
- **ClickZettaDocumentStore**: Structured document storage with queryable metadata
- **ClickZettaFileStore**: Binary file storage using native ClickZetta Volume
- **ClickZettaVolumeStore**: Direct Volume integration for maximum performance

### 🔄 Production-Grade Operations
- **Atomic UPSERT**: MERGE INTO operations for data consistency
- **Batch Processing**: Efficient bulk operations for large datasets
- **Connection Management**: Pooling and automatic reconnection
- **Type Safety**: Full type annotations and runtime validation

### 🎯 LangChain Compatibility
- **BaseStore Interface**: 100% compatible with LangChain storage standards
- **Async Support**: Full async/await pattern implementation
- **Chain Integration**: Seamless integration with LangChain chains and agents
- **Memory Systems**: Persistent chat history and conversation memory

## Installation

### From PyPI (Recommended)

```bash
pip install langchain-clickzetta
```

### Development Installation

```bash
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install -e ".[dev]"
```

### Local Installation from Source

```bash
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install .
```

## Quick Start

### Basic Setup

```python
from langchain_clickzetta import ClickZettaEngine, ClickZettaSQLChain, ClickZettaVectorStore
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.llms import Tongyi

# Initialize ClickZetta engine
# ClickZetta requires exactly 7 connection parameters
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster"  
)

# Initialize embeddings (DashScope recommended for Chinese/English support)
embeddings = DashScopeEmbeddings(
    dashscope_api_key="your-dashscope-api-key",
    model="text-embedding-v4"
)

# Initialize LLM
llm = Tongyi(dashscope_api_key="your-dashscope-api-key")
```

### SQL Queries

```python
# Create SQL chain
sql_chain = ClickZettaSQLChain.from_engine(
    engine=engine,
    llm=llm,
    return_sql=True
)

# Ask questions in natural language
result = sql_chain.invoke({
    "query": "How many users do we have in the database?"
})

print(result["result"])  # Natural language answer
print(result["sql_query"])  # Generated SQL query
```

### Vector Storage

```python
from langchain_core.documents import Document

# Create vector store
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    table_name="my_vectors",
    vector_element_type="float"  # Options: float, int, tinyint
)

# Add documents
documents = [
    Document(
        page_content="ClickZetta is a high-performance analytics database.",
        metadata={"category": "database", "type": "analytics"}
    ),
    Document(
        page_content="LangChain enables building applications with LLMs.",
        metadata={"category": "framework", "type": "ai"}
    )
]

vector_store.add_documents(documents)

# Search for similar documents
results = vector_store.similarity_search(
    "What is ClickZetta?",
    k=2
)

for doc in results:
    print(doc.page_content)
```

### Full-text Search

```python
from langchain_clickzetta.retrievers import ClickZettaFullTextRetriever

# Create full-text retriever
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="my_documents",
    search_type="phrase",
    k=5
)

# Add documents to search index
retriever.add_documents(documents)

# Search documents
results = retriever.get_relevant_documents("ClickZetta database")
for doc in results:
    print(f"Score: {doc.metadata.get('relevance_score', 'N/A')}")
    print(f"Content: {doc.page_content}")
```

### True Hybrid Search (Single Table)

```python
from langchain_clickzetta import ClickZettaHybridStore, ClickZettaUnifiedRetriever

# Create true hybrid store (single table with both vector + inverted indexes)
hybrid_store = ClickZettaHybridStore(
    engine=engine,
    embeddings=embeddings,
    table_name="hybrid_docs",
    text_analyzer="ik",  # Chinese text analyzer
    distance_metric="cosine"
)

# Add documents to hybrid store
documents = [
    Document(page_content="云器 Lakehouse 是由云器科技完全自主研发的新一代云湖仓。使用增量计算的数据计算引擎，性能可以提升至传统开源架构例如Spark的 10倍，实现了海量数据的全链路-低成本-实时化处理，为AI 创新提供了支持全类型数据整合、存储与计算的平台，帮助企业从传统的开源 Spark 体系升级到 AI 时代的数据基础设施。"),
    Document(page_content="LangChain enables building LLM applications")
]
hybrid_store.add_documents(documents)

# Create unified retriever for hybrid search
retriever = ClickZettaUnifiedRetriever(
    hybrid_store=hybrid_store,
    search_type="hybrid",  # "vector", "fulltext", or "hybrid"
    alpha=0.5,  # Balance between vector and full-text search
    k=5
)

# Search using hybrid approach
results = retriever.invoke("analytics database")
for doc in results:
    print(f"Content: {doc.page_content}")
```

### Chat Message History

```python
from langchain_clickzetta import ClickZettaChatMessageHistory
from langchain_core.messages import HumanMessage, AIMessage

# Create chat history
chat_history = ClickZettaChatMessageHistory(
    engine=engine,
    session_id="user_123",
    table_name="chat_sessions"
)

# Add messages
chat_history.add_message(HumanMessage(content="Hello!"))
chat_history.add_message(AIMessage(content="Hi there! How can I help you?"))

# Retrieve conversation history
messages = chat_history.messages
for message in messages:
    print(f"{message.__class__.__name__}: {message.content}")
```

## Configuration

### Environment Variables

You can configure ClickZetta connection using environment variables:

```bash
export CLICKZETTA_SERVICE="your-service"
export CLICKZETTA_INSTANCE="your-instance"
export CLICKZETTA_WORKSPACE="your-workspace"
export CLICKZETTA_SCHEMA="your-schema"
export CLICKZETTA_USERNAME="your-username"
export CLICKZETTA_PASSWORD="your-password"
export CLICKZETTA_VCLUSTER="your-vcluster"  # Required
```

### Connection Options

```python
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster",  # Required parameter
    connection_timeout=30,      # Connection timeout in seconds
    query_timeout=300,         # Query timeout in seconds
    hints={                    # Custom query hints
        "sdk.job.timeout": 600,
        "query_tag": "My Application"
    }
)
```

## Advanced Usage

### Custom SQL Prompts

```python
from langchain_core.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    input_variables=["input", "table_info", "dialect"],
    template="""
    You are a ClickZetta SQL expert. Given the input question and table information,
    write a syntactically correct {dialect} query.

    Tables: {table_info}
    Question: {input}

    SQL Query:"""
)

sql_chain = ClickZettaSQLChain(
    engine=engine,
    llm=llm,
    sql_prompt=custom_prompt
)
```

### Vector Store with Custom Distance Metrics

```python
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    distance_metric="euclidean",  # or "cosine", "manhattan"
    vector_dimension=1536,
    vector_element_type="float"  # or "int", "tinyint"
)
```

### Metadata Filtering

```python
# Search with metadata filters
results = vector_store.similarity_search(
    "machine learning",
    k=5,
    filter={"category": "tech", "year": 2024}
)

# Full-text search with metadata
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="research_docs"
)
results = retriever.get_relevant_documents(
    "artificial intelligence",
    filter={"type": "research"}
)
```

## Testing

Run the test suite:

```bash
# Navigate to package directory
cd libs/clickzetta

# Install test dependencies
pip install -e ".[dev]"

# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run all tests
make test
```

### Integration Tests

To run integration tests against a real ClickZetta instance:

1. Configure your connection in `~/.clickzetta/connections.json` with a UAT connection
2. Add DashScope API key to the configuration
3. Run integration tests:

```bash
cd libs/clickzetta
make integration
make integration-dashscope
```

## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks (if configured)
pre-commit install
```

### Code Quality

```bash
# Navigate to the package directory
cd libs/clickzetta

# Format code (auto-fixes many issues)
make format

# Linting (significantly improved)
make lint      # ✅ Reduced from 358 to 65 errors - 82% improvement!

# Core functionality testing
# Use project virtual environment for best results:
source .venv/bin/activate
make test-unit        # ✅ Core unit tests (LangChain compatibility verified)
make test-integration # Integration tests

# Type checking (in progress)
make typecheck # Some LangChain compatibility issues being resolved
```

**Recent Improvements ✨**:
- ✅ **Ruff configuration updated** to modern format
- ✅ **155 typing issues auto-fixed** (Dict→dict, Optional→|None)
- ✅ **Method signatures fixed** for LangChain BaseStore compatibility
- ✅ **Bare except clauses improved** with proper exception handling
- ✅ **Code formatting standardized** with black

**Current Status**: Core functionality fully working with significantly improved code quality (82% reduction in lint errors). All LangChain BaseStore compatibility tests pass.

## 📦 Storage Services

LangChain ClickZetta provides comprehensive storage services that implement the LangChain BaseStore interface with enterprise-grade features:

### 🔑 Key Advantages of ClickZetta Storage

**🚀 Performance Benefits**
- **10x Faster**: ClickZetta's optimized lakehouse architecture
- **Atomic Operations**: MERGE INTO for consistent UPSERT operations
- **Batch Processing**: Efficient handling of large datasets
- **Connection Pooling**: Optimized database connections

**🏗️ Architecture Benefits**
- **Native Integration**: Direct ClickZetta Volume support for binary data
- **SQL Queryability**: Full SQL access to stored documents and metadata
- **Unified Storage**: Single platform for all data types
- **Schema Evolution**: Flexible metadata storage with JSON support

**🔒 Enterprise Features**
- **ACID Compliance**: Full transaction support
- **Type Safety**: Runtime validation and type checking
- **Error Handling**: Comprehensive error recovery and logging
- **Monitoring**: Built-in query performance tracking

### Key-Value Store
```python
from langchain_clickzetta import ClickZettaStore

# Basic key-value storage
store = ClickZettaStore(engine=engine, table_name="cache")
store.mset([("key1", b"value1"), ("key2", b"value2")])
values = store.mget(["key1", "key2"])
```

### Document Store
```python
from langchain_clickzetta import ClickZettaDocumentStore

# Document storage with metadata
doc_store = ClickZettaDocumentStore(engine=engine, table_name="documents")
doc_store.store_document("doc1", "content", {"author": "user"})
content, metadata = doc_store.get_document("doc1")
```

### File Store
```python
from langchain_clickzetta import ClickZettaFileStore

# Binary file storage using ClickZetta Volume
file_store = ClickZettaFileStore(
    engine=engine,
    volume_type="user",
    subdirectory="models"
)
file_store.store_file("model.bin", binary_data, "application/octet-stream")
content, mime_type = file_store.get_file("model.bin")
```

### Volume Store (Native ClickZetta Volume)
```python
from langchain_clickzetta import ClickZettaUserVolumeStore

# Native Volume integration
volume_store = ClickZettaUserVolumeStore(engine=engine, subdirectory="data")
volume_store.mset([("config.json", b'{"key": "value"}')])
config = volume_store.mget(["config.json"])[0]
```

## 📊 Comparison with Alternatives

### ClickZetta vs. Traditional Vector Databases

| Feature | ClickZetta + LangChain | Pinecone/Weaviate | Chroma/FAISS |
|---------|------------------------|-------------------|---------------|
| **Hybrid Search** | ✅ Single table | ❌ Multiple systems | ❌ Separate tools |
| **SQL Queryability** | ✅ Full SQL support | ❌ Limited | ❌ No SQL |
| **Lakehouse Integration** | ✅ Native | ❌ External | ❌ External |
| **Chinese Language** | ✅ Optimized | ⚠️ Basic | ⚠️ Basic |
| **Enterprise Features** | ✅ ACID, Transactions | ⚠️ Limited | ❌ Basic |
| **Storage Services** | ✅ Full LangChain API | ❌ Custom | ❌ Limited |
| **Performance** | ✅ 10x improvement | ⚠️ Variable | ⚠️ Memory limited |

### ClickZetta vs. Other LangChain Integrations

| Integration | Vector Search | Full-Text | Hybrid | Storage API | SQL Queries |
|-------------|---------------|-----------|---------|-------------|-------------|
| **ClickZetta** | ✅ | ✅ | ✅ | ✅ | ✅ |
| Elasticsearch | ✅ | ✅ | ⚠️ | ❌ | ❌ |
| PostgreSQL/pgvector | ✅ | ⚠️ | ❌ | ⚠️ | ✅ |
| MongoDB | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| Redis | ✅ | ❌ | ❌ | ✅ | ❌ |

### Key Differentiators

**🎯 Single Platform Solution**
- No need to manage multiple systems (vector DB + full-text + SQL + storage)
- Unified data governance and security model
- Simplified architecture and reduced operational complexity

**🚀 Performance at Scale**
- ClickZetta's incremental computing engine
- Optimized for both analytical and operational workloads
- Native lakehouse storage with separation of compute and storage

**🌏 Chinese Market Focus**
- Deep integration with Chinese AI ecosystem (DashScope, Tongyi)
- Optimized text processing for Chinese language
- Compliance with Chinese data regulations

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for your changes
5. Ensure all tests pass (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Create a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support

- Documentation: [Link to detailed docs]
- Issues: [GitHub Issues](https://github.com/yunqiqiliang/langchain-clickzetta/issues)
- Discussions: [GitHub Discussions](https://github.com/yunqiqiliang/langchain-clickzetta/discussions)

## Acknowledgments

- [LangChain](https://github.com/langchain-ai/langchain) for the foundational framework
- [ClickZetta](https://www.yunqi.tech/) for the powerful analytics lakehouse