Metadata-Version: 2.4
Name: sourcebridgekit
Version: 0.2.0
Summary: Universal Python connector library for databases, files, cloud storage, and APIs with production-grade features
Author-email: sreeyenan <sreeyenan@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/sreeyenan/sourcebridgekit
Project-URL: Documentation, https://github.com/sreeyenan/sourcebridgekit#readme
Project-URL: Repository, https://github.com/sreeyenan/sourcebridgekit
Keywords: connector,database,etl,data,api,cloud,async,incremental,monitoring
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: typing-extensions>=4.0
Requires-Dist: pandas>=1.5
Provides-Extra: api
Requires-Dist: httpx>=0.25; extra == "api"
Requires-Dist: requests>=2.31; extra == "api"
Provides-Extra: mysql
Requires-Dist: pymysql>=1.1; extra == "mysql"
Requires-Dist: mysql-connector-python>=8.0; extra == "mysql"
Provides-Extra: mssql
Requires-Dist: pyodbc>=5.0; extra == "mssql"
Requires-Dist: pymssql>=2.2; extra == "mssql"
Provides-Extra: postgresql
Requires-Dist: psycopg[binary]>=3.1; extra == "postgresql"
Requires-Dist: psycopg2-binary>=2.9; extra == "postgresql"
Provides-Extra: clickhouse
Requires-Dist: clickhouse-connect>=0.6; extra == "clickhouse"
Requires-Dist: clickhouse-driver>=0.2; extra == "clickhouse"
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.0; extra == "mongodb"
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch>=8.0; extra == "elasticsearch"
Provides-Extra: files
Requires-Dist: pandas>=2.0; extra == "files"
Requires-Dist: polars>=0.19; extra == "files"
Requires-Dist: pyarrow>=14.0; extra == "files"
Requires-Dist: openpyxl>=3.1; extra == "files"
Requires-Dist: xlrd>=2.0; extra == "files"
Provides-Extra: azure
Requires-Dist: azure-storage-blob>=12.19; extra == "azure"
Requires-Dist: azure-storage-file-datalake>=12.14; extra == "azure"
Requires-Dist: azure-identity>=1.15; extra == "azure"
Provides-Extra: aws
Requires-Dist: boto3>=1.34; extra == "aws"
Provides-Extra: gcp
Requires-Dist: google-cloud-storage>=2.14; extra == "gcp"
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == "redis"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.7; extra == "dev"
Provides-Extra: all
Requires-Dist: sourcebridgekit[api,aws,azure,clickhouse,elasticsearch,files,gcp,mongodb,mssql,mysql,postgresql,redis]; extra == "all"
Dynamic: license-file

# SourceBridgeKit

**Universal Python Connector Library for Databases, Files, Cloud Storage, and APIs**

Version: 0.2.0

---

## What is SourceBridgeKit?

SourceBridgeKit is a standard, reusable Python connector framework that provides:

- **One Common API** for all data sources (MySQL, Azure Blob, REST APIs, Excel, etc.)
- **Configurable Everything** - drivers, timeouts, connection pools, retry logic
- **Environment Variable Support** - secure credential management via `${VAR:default}` syntax
- **Pandas & Polars Output** - fetch data in your preferred DataFrame format
- **Batch Operations** - memory-efficient reads/writes for large datasets
- **Incremental Loading** - configurable strategies for change detection
- **Production Ready** - retry logic, circuit breakers, connection pooling, SSL verification

**SourceBridgeKit focuses on source access only** - no preprocessing, no transformations, just clean data movement.

---

## Quick Start

### Installation

```bash
# Core library only
pip install sourcebridgekit

# With MySQL support
pip install sourcebridgekit[mysql]

# With Azure Blob Storage
pip install sourcebridgekit[azure]

# Everything
pip install sourcebridgekit[all]
```

### Basic Usage

```python
from sourcebridgekit import connect

# Connect with explicit config
with connect('mysql', config={
    'host': 'localhost',
    'database': 'analytics',
    'username': 'app_user',
    'password': '${MYSQL_PASSWORD}',  # From environment
}) as conn:
    result = conn.read('SELECT * FROM orders LIMIT 1000', output='pandas')
    df = result.data

# Or use environment prefix
with connect('mysql', env_prefix='MYSQL_') as conn:
    result = conn.read('SELECT * FROM orders', output='polars')
    df_pl = result.data
```

---

## Features

### Supported Connectors (V1)

| Category | Connectors |
|----------|------------|
| **Databases** | MySQL, PostgreSQL, MSSQL, ClickHouse, MongoDB, Elasticsearch |
| **Files** | CSV, JSON/JSONL, Excel, Parquet |
| **Cloud** | Azure Blob Storage, Azure Data Lake Gen2 |
| **APIs** | REST API (with pagination and curl parsing) |

### Output Formats

- **pandas** - pandas DataFrame
- **polars** - Polars DataFrame
- **arrow** - PyArrow Table
- **records** - List of dictionaries
- **raw** - Driver-native format

### Core Capabilities

✅ Connection management (connect, disconnect, test)  
✅ Data operations (read, write, batch read/write)  
✅ Metadata discovery (list databases, tables, describe schema)  
✅ Incremental loading (high watermark, timestamp, file modified time)  
✅ Checkpoint management (memory, JSON file, SQLite)  
✅ Retry logic with exponential backoff  
✅ Circuit breaker pattern  
✅ Connection pooling  
✅ SSL/TLS verification  
✅ Secret redaction in logs  

---

## Usage Examples

### MySQL Connector

```python
from sourcebridgekit import connect
from sourcebridgekit.connectors.sql import MySQLConfig

config = MySQLConfig(
    host='${MYSQL_HOST:localhost}',
    port=3306,
    database='analytics',
    username='${MYSQL_USER}',
    password='${MYSQL_PASSWORD}',
    driver='pymysql',  # or 'mysql-connector'
    pool={'enabled': True, 'pool_size': 10},
    retry={'enabled': True, 'max_attempts': 3}
)

with connect('mysql', config=config) as conn:
    # Simple read (SQL string literals use single quotes)
    result = conn.read("SELECT * FROM orders WHERE status = 'active'", output='pandas')
    df = result.data
    
    # Batch read for large tables
    for batch in conn.read_batch('SELECT * FROM large_table', batch_size=10000):
        process(batch.data)  # process() stands in for your own handler
    
    # Write data
    conn.write(df, target='staging.new_orders', mode='append')
    
    # Metadata
    print(conn.list_tables(database='analytics'))
    schema = conn.describe_table('orders')
```
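
To make the memory benefit of batch reads concrete, here is a minimal sketch of the general technique using the standard-library `sqlite3` driver. It is not SourceBridgeKit's implementation: the `read_in_batches` helper and the in-memory table are assumptions made for illustration.

```python
import sqlite3


def read_in_batches(connection, query, batch_size, params=()):
    """Yield lists of row dicts, holding at most batch_size rows at a time.

    Hypothetical helper: it illustrates cursor.fetchmany(), not the library's code.
    """
    cursor = connection.execute(query, params)
    columns = [description[0] for description in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]


# Demo data: 25 rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO large_table (id, amount) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(25)],
)

batch_sizes = [
    len(batch)
    for batch in read_in_batches(
        conn, "SELECT id, amount FROM large_table ORDER BY id", batch_size=10
    )
]
print(batch_sizes)
```

Because the helper is a generator, only one batch is materialized at a time, which is the property `read_batch` is described as providing for large tables.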

### Azure Blob Storage

```python
from sourcebridgekit import connect

config = {
    'account_name': '${AZURE_STORAGE_ACCOUNT}',
    'container_name': 'data',
    'connection_string': '${AZURE_STORAGE_CONNECTION_STRING}',
}

with connect('azure_blob', config=config) as conn:
    # Read file
    result = conn.read('data/sales/2026/sales.csv', output='pandas')
    df = result.data
    
    # Write file
    conn.write(df, target='data/output/processed.parquet', format='parquet')
    
    # List files
    files = conn.list_files(prefix='data/sales/', pattern='*.csv')
```

### REST API with Pagination

```python
from sourcebridgekit import connect

config = {
    'base_url': 'https://api.example.com',
    'auth_type': 'bearer',
    'auth_token': '${API_TOKEN}',
    'pagination': {
        'enabled': True,
        'type': 'page',
        'page_size': 100,
        'max_pages': 50
    }
}

with connect('rest_api', config=config) as conn:
    result = conn.read('/v1/users', params={'status': 'active'}, output='pandas')
    df = result.data
```
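
The `page_size` and `max_pages` settings above can be read as a simple loop. The sketch below shows one plausible interpretation of page-based pagination. The `fetch_all_pages` function and the fake endpoint are hypothetical, and the stop-on-short-page rule is an assumption rather than documented library behavior.

```python
def fetch_all_pages(fetch_page, page_size=100, max_pages=50):
    """Collect records page by page until a short page or max_pages is reached."""
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page=page, page_size=page_size)
        records.extend(batch)
        if len(batch) < page_size:
            # A page smaller than page_size is treated as the last one.
            break
    return records


# Fake endpoint holding 250 users, standing in for GET /v1/users.
USERS = [{"id": i} for i in range(250)]


def fake_fetch_page(page, page_size):
    start = (page - 1) * page_size
    return USERS[start:start + page_size]


users = fetch_all_pages(fake_fetch_page, page_size=100, max_pages=50)
capped = fetch_all_pages(fake_fetch_page, page_size=100, max_pages=2)
print(len(users), len(capped))
```

The `max_pages` cap bounds the total number of requests even when an endpoint never returns a short page.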

### Incremental Loading

```python
from sourcebridgekit import connect

incremental_config = {
    'enabled': True,
    'strategy': 'high_watermark',
    'cursor_column': 'updated_at',
    'checkpoint_key': 'tenant_a.orders',
    'lookback_seconds': 300,
    'checkpoint_store': {'type': 'sqlite', 'path': './checkpoints.db'}
}

# mysql_config: a MySQLConfig or dict like the one in the MySQL Connector example
with connect('mysql', config=mysql_config) as conn:
    result = conn.read_incremental(
        table='orders',
        incremental=incremental_config,
        output='polars'
    )
    
    # Library automatically tracks checkpoint
    print(f"Fetched {result.row_count} new rows")
    print(f"New checkpoint: {result.checkpoint}")
```

### Curl to REST API

```python
from sourcebridgekit import connect
from sourcebridgekit.connectors.api import RestConfig

# Parse curl command into structured config
config = RestConfig.from_curl('''
curl -X POST https://api.example.com/orders \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"status":"active"}'
''')

with connect('rest_api', config=config) as conn:
    result = conn.read(output='records')
```
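
For readers curious what curl parsing involves, the following is a rough sketch built on the standard-library `shlex` module. It is not the code behind `RestConfig.from_curl`: the `parse_curl` function, the returned dictionary shape, and the small set of supported flags are all assumptions.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Split a curl command into method, URL, headers, and body (hypothetical helper)."""
    # shlex does not drop backslash-newline pairs the way a shell does,
    # so join continued lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    parsed = {"method": None, "url": None, "headers": {}, "body": None}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-X", "--request"):
            parsed["method"] = tokens[i + 1].upper()
            i += 2
        elif token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            parsed["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-d", "--data", "--data-raw"):
            parsed["body"] = tokens[i + 1]
            i += 2
        elif token.startswith("-"):
            raise ValueError(f"unsupported curl option: {token}")
        else:
            parsed["url"] = token
            i += 1

    if parsed["method"] is None:
        # curl defaults to POST when a body is given, otherwise GET.
        parsed["method"] = "POST" if parsed["body"] is not None else "GET"
    return parsed


parsed = parse_curl(r'''
curl -X POST https://api.example.com/orders \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"status":"active"}'
''')
print(parsed["method"], parsed["url"])
```

Note that the `${API_TOKEN}` placeholder survives tokenizing untouched, so it can still be resolved from the environment later. The command is only tokenized and never executed.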

---

## Configuration

### Environment Variables

All configs support `${VAR_NAME}` or `${VAR_NAME:default}` syntax:

```python
config = {
    'host': '${DB_HOST:localhost}',  # Fallback to 'localhost'
    'port': '${DB_PORT:5432}',
    'password': '${DB_PASSWORD}',    # Required, no default
}
```
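
The placeholder syntax can be illustrated with a few lines of standard-library Python. This is a sketch of how `${VAR}` and `${VAR:default}` could be resolved, not the library's actual resolver. The `resolve_placeholders` name and the choice to raise `KeyError` for a missing required variable are assumptions.

```python
import os
import re

# Matches ${NAME} or ${NAME:default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def resolve_placeholders(value: str, env=None) -> str:
    """Expand ${VAR} and ${VAR:default} in a string (hypothetical helper)."""
    env = os.environ if env is None else env

    def _substitute(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"Required environment variable {name!r} is not set")

    return _PLACEHOLDER.sub(_substitute, value)


# A plain dict stands in for os.environ so the example is deterministic.
env = {"DB_PASSWORD": "s3cret"}
resolved = {
    "host": resolve_placeholders("${DB_HOST:localhost}", env),
    "port": int(resolve_placeholders("${DB_PORT:5432}", env)),
    "password": resolve_placeholders("${DB_PASSWORD}", env),
}
print(resolved["host"], resolved["port"])
```

Resolved values are strings, so numeric settings such as the port need an explicit conversion, which a typed config model would normally handle.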

### Secrets Management

Sensitive fields use `SecretStr` and are redacted from logs:

```python
from pydantic import SecretStr
from sourcebridgekit.connectors.sql import MySQLConfig

config = MySQLConfig(
    password=SecretStr('secret123')  # Redacted in logs
)
```

### Connection Pooling

```python
config = MySQLConfig(
    pool={
        'enabled': True,
        'pool_size': 10,
        'max_overflow': 20,
        'pool_timeout': 30,
        'pool_recycle': 3600
    }
)
```

### Retry & Circuit Breaker

```python
config = MySQLConfig(
    retry={
        'enabled': True,
        'max_attempts': 3,
        'backoff_factor': 2.0,
        'timeout_seconds': 30
    },
    circuit_breaker={
        'enabled': True,
        'failure_threshold': 5,
        'recovery_timeout': 60
    }
)
```
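
To show how these settings might interact, here is a compact sketch of exponential backoff combined with a circuit breaker. It illustrates the two patterns under stated assumptions and is not SourceBridgeKit's internal code. `backoff_delays`, `CircuitBreaker`, `call_with_retry`, and the one-second `base_delay` are hypothetical, and `timeout_seconds` is not modeled.

```python
import time


def backoff_delays(max_attempts, backoff_factor, base_delay=1.0):
    """Waits between attempts: base_delay * backoff_factor ** n."""
    return [base_delay * backoff_factor ** n for n in range(max_attempts - 1)]


class CircuitBreaker:
    """Opens after failure_threshold consecutive failures, then allows a
    trial call once recovery_timeout seconds have passed."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.recovery_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def call_with_retry(func, breaker, max_attempts=3, backoff_factor=2.0,
                    base_delay=1.0, sleep=time.sleep):
    delays = backoff_delays(max_attempts, backoff_factor, base_delay)
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit breaker is open")
        try:
            result = func()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
        else:
            breaker.record_success()
            return result


# Demo: a call that fails twice, then succeeds. Sleeps are recorded, not waited.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


recorded_sleeps = []
outcome = call_with_retry(flaky, CircuitBreaker(), sleep=recorded_sleeps.append)
print(outcome, recorded_sleeps)
```

Retries handle short, transient failures, while the breaker stops callers from hammering a source that keeps failing until the recovery window has elapsed.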

---

## FetchResult Standard

All read operations return a `FetchResult` object:

```python
result = conn.read('SELECT * FROM orders', output='pandas')

result.data              # pandas DataFrame
result.output_format     # 'pandas'
result.row_count         # Number of rows
result.columns           # List of column names
result.schema            # Column types
result.execution_time_ms # Query execution time
result.checkpoint        # Incremental checkpoint (if applicable)
result.metadata          # Additional metadata
result.warnings          # Any warnings
```
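
When unit-testing code that consumes results, a lightweight stand-in with the same attribute names can be handy. The dataclass below is a hypothetical test double that mirrors the attributes listed above. It is not the library's `FetchResult` class, and the field types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeFetchResult:
    """Test double with the attribute names documented for FetchResult."""
    data: Any
    output_format: str
    row_count: int
    columns: list
    schema: dict = field(default_factory=dict)
    execution_time_ms: float = 0.0
    checkpoint: Optional[Any] = None
    metadata: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)


def summarize(result) -> str:
    """Example consumer that relies only on the documented attributes."""
    line = f"{result.row_count} rows x {len(result.columns)} columns as {result.output_format}"
    if result.warnings:
        line += f" ({len(result.warnings)} warning(s))"
    return line


records = [{"id": 1, "status": "active"}, {"id": 2, "status": "closed"}]
fake = FakeFetchResult(
    data=records,
    output_format="records",
    row_count=len(records),
    columns=["id", "status"],
)
summary = summarize(fake)
print(summary)
```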

---

## Incremental Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| `high_watermark` | Track max value of cursor column | SQL databases, APIs |
| `incrementing_id` | Track max ID value | Append-only tables |
| `timestamp_with_lookback` | Timestamp + safety window | Distributed systems |
| `file_modified_time` | Track file modification time | Local files, object storage |
| `checksum_or_etag` | Detect changes by hash | Files, object storage |
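
The difference between a plain high watermark and a watermark with a lookback window is easiest to see on a small table. The sketch below uses the standard-library `sqlite3` module with integer timestamps. The `read_incremental` function here is a simplified, hypothetical stand-in, not the connector method of the same name.

```python
import sqlite3


def read_incremental(conn, table, cursor_column, checkpoint=None, lookback_seconds=0):
    """Return (rows, new_checkpoint) for rows past the checkpoint.

    table and cursor_column are interpolated as identifiers, so they must
    come from trusted configuration; the cursor value itself is bound.
    """
    query = f"SELECT * FROM {table}"
    params = ()
    if checkpoint is not None:
        query += f" WHERE {cursor_column} > ?"
        params = (checkpoint - lookback_seconds,)
    query += f" ORDER BY {cursor_column}"
    rows = conn.execute(query, params).fetchall()

    # Never move the checkpoint backwards, even if only older rows were re-read.
    values = [row[cursor_column] for row in rows]
    if checkpoint is not None:
        values.append(checkpoint)
    return rows, max(values, default=None)


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1000), (2, 1200), (3, 1500)])

# First run: no checkpoint yet, so everything is read.
first_rows, checkpoint = read_incremental(conn, "orders", "updated_at")

# A late-arriving row (1400) and a new row (1600) land after the first run.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(4, 1400), (5, 1600)])

strict_rows, _ = read_incremental(conn, "orders", "updated_at", checkpoint)
lookback_rows, new_checkpoint = read_incremental(
    conn, "orders", "updated_at", checkpoint, lookback_seconds=300
)
print([r["id"] for r in strict_rows], [r["id"] for r in lookback_rows], new_checkpoint)
```

The strict watermark misses the late row with `updated_at = 1400`, while the lookback run picks it up at the cost of re-reading row 3, so downstream writes need to tolerate duplicates.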

---

## Checkpoint Stores

| Store | Use Case |
|-------|----------|
| `memory` | Testing only (state lost on restart) |
| `json_file` | Simple local jobs |
| `sqlite` | Default persistent checkpoint store |
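
As a rough picture of what a SQLite-backed store has to do, here is a minimal key-value sketch using the standard library. The `SQLiteCheckpointStore` class, its `get`/`set` methods, and the table layout are assumptions for illustration and may differ from the library's own store.

```python
import json
import sqlite3


class SQLiteCheckpointStore:
    """Persist one JSON-encoded checkpoint value per key (hypothetical sketch)."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def get(self, key, default=None):
        row = self._conn.execute(
            "SELECT value FROM checkpoints WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])

    def set(self, key, value):
        # INSERT OR REPLACE overwrites the previous checkpoint for this key.
        self._conn.execute(
            "INSERT OR REPLACE INTO checkpoints (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()


store = SQLiteCheckpointStore()  # pass a file path to persist across restarts
before = store.get("tenant_a.orders")
store.set("tenant_a.orders", {"updated_at": "2026-01-01T00:00:00"})
store.set("tenant_a.orders", {"updated_at": "2026-01-02T00:00:00"})
after = store.get("tenant_a.orders")
print(before, after)
```

With `:memory:` the state disappears when the process exits, which matches the trade-off noted for the `memory` store. A file path gives the persistence expected of the `sqlite` option.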

---

## Security

✅ SSL/TLS verification enabled by default  
✅ Secrets redacted from logs and exceptions  
✅ No raw shell command execution  
✅ Parameterized SQL queries  
✅ Configurable timeouts  
✅ SecretStr for sensitive fields  
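
The value of parameterized queries can be demonstrated with the standard-library `sqlite3` driver. This is a generic illustration of the technique rather than SourceBridgeKit code, and the table and input string are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "active"), (2, "closed")])

# Input crafted to break out of a quoted literal if it were concatenated.
malicious = "active' OR '1'='1"

# The driver binds the value as data, so it is compared literally.
rows_malicious = conn.execute(
    "SELECT id FROM orders WHERE status = ?", (malicious,)
).fetchall()
rows_active = conn.execute(
    "SELECT id FROM orders WHERE status = ?", ("active",)
).fetchall()
print(rows_malicious, rows_active)
```

Because the input is never spliced into the SQL text, the injection attempt matches no rows instead of returning the whole table.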

---

## Roadmap

- **V1 (Current)**: Core connectors, batch operations, incremental loading
- **V2 (Planned)**: Async support, Redis/PostgreSQL checkpoint stores, OAuth2, OpenTelemetry
- **V3 (Future)**: CDC (binlog, logical replication), Kafka/RabbitMQ, distributed execution

---

## Development

```bash
# Clone and install in dev mode
git clone https://github.com/sreeyenan/sourcebridgekit
cd sourcebridgekit
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=sourcebridgekit --cov-report=html

# Format code
black sourcebridgekit/
ruff check sourcebridgekit/
```

---

## License

MIT License

---

## Support

- Documentation: https://github.com/sreeyenan/sourcebridgekit#readme
- Issues: https://github.com/sreeyenan/sourcebridgekit/issues
- Examples: https://github.com/sreeyenan/sourcebridgekit/tree/main/examples
