Metadata-Version: 2.4
Name: helix-connect
Version: 1.3.8
Summary: Official Python SDK for Helix Connect Data Marketplace
Author-email: Helix Tools <contact@helix.tools>
License: MIT
Project-URL: Homepage, https://helix-connect.com
Project-URL: Documentation, https://docs.helix-connect.com
Project-URL: Repository, https://github.com/helix-tools/helix-connect-sdk-python
Project-URL: Issues, https://github.com/helix-tools/helix-connect-sdk-python/issues
Keywords: helix,data-marketplace,s3,aws,datasets
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8.1
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boto3>=1.28.0
Requires-Dist: botocore>=1.31.0
Requires-Dist: requests>=2.31.0
Requires-Dist: cryptography>=41.0.0
Requires-Dist: genson>=1.2.0
Requires-Dist: PyJWT>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.7.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: flake8>=6.1.0; extra == "dev"
Requires-Dist: jsonschema>=4.20.0; extra == "dev"
Dynamic: license-file

# Helix Connect Python SDK

[![PyPI version](https://badge.fury.io/py/helix-connect.svg)](https://badge.fury.io/py/helix-connect)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Official Python SDK for **Helix Connect Data Marketplace** - a secure, scalable platform for exchanging datasets between producers and consumers.

## 🚀 Features

- **Consumer API**: Download and subscribe to datasets
- **Producer API**: Upload and manage datasets (includes all consumer features)
- **Admin API**: Platform management (includes all producer + consumer features)
- **Secure**: AWS SigV4 authentication + AES-256-GCM envelope encryption
- **Efficient**: Compress-then-encrypt pipeline with ~90% space savings
- **Progress Tracking**: Real-time upload/download progress callbacks
- **Notifications**: SQS-based dataset update notifications with long-polling
- **Type-Safe**: Full type hints with mypy support

## 📦 Installation

```bash
pip install helix-connect
```

### Development Installation

```bash
git clone https://github.com/helix-tools/helix-connect-sdk-python.git
cd helix-connect-sdk-python
pip install -e ".[dev]"
```

## 🔧 Prerequisites

- Python 3.8 or higher
- AWS credentials (provided during customer onboarding)
- Helix Connect customer ID (UUID format)

## 📖 Quick Start

### Consumer: Download Datasets

```python
from helix_connect import HelixConsumer

# Initialize consumer
consumer = HelixConsumer(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    customer_id="your-customer-id",
    api_endpoint="https://api.helix-connect.com"  # optional
)

# List available datasets
datasets = consumer.list_datasets()
for ds in datasets:
    print(f"{ds['name']}: {ds['description']}")

# Download a dataset
consumer.download_dataset(
    dataset_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    output_path="./data/my_dataset.csv"
)

# Subscribe to dataset updates
consumer.subscribe_to_dataset(dataset_id="...")

# Poll for notifications (long-polling with auto-download)
notifications = consumer.poll_notifications(
    max_messages=10,
    wait_time=20,  # seconds
    auto_download=True,
    output_dir="./downloads"
)
```

### Producer: Upload Datasets

```python
from helix_connect import HelixProducer

# Initialize producer (inherits all consumer capabilities)
producer = HelixProducer(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    customer_id="your-customer-id"
)

# Upload a dataset with progress tracking
def progress_callback(bytes_transferred, total_bytes):
    percent = (bytes_transferred / total_bytes) * 100
    print(f"Progress: {percent:.1f}%")

producer.upload_dataset(
    file_path="./data/my_dataset.csv",
    dataset_name="my-awesome-dataset",
    description="Q4 2024 sales data",
    data_freshness="daily",
    progress_callback=progress_callback
)

# Update existing dataset
producer.update_dataset(
    dataset_id="...",
    file_path="./data/updated_dataset.csv"
)

# List your uploaded datasets
my_datasets = producer.list_my_datasets()
```

### Admin: Platform Management

```python
from helix_connect import HelixAdmin

# Initialize admin (inherits producer + consumer capabilities)
admin = HelixAdmin(
    aws_access_key_id="admin-access-key",
    aws_secret_access_key="admin-secret-key",
    customer_id="admin-customer-id"
)

# Create new customer
customer = admin.create_customer(
    customer_name="Acme Corp",
    contact_email="data@acme.com"
)

# List all customers
customers = admin.list_customers()

# Get platform statistics
stats = admin.get_platform_stats()
print(f"Total datasets: {stats['total_datasets']}")
print(f"Total customers: {stats['total_customers']}")
```

### Admin: JWT Token Generation

Generate schema-compliant JWT tokens for testing, development, or service-to-service communication:

```python
from helix_connect import HelixAdmin

admin = HelixAdmin(
    aws_access_key_id="admin-access-key",
    aws_secret_access_key="admin-secret-key",
    customer_id="admin-customer-id"
)

# Generate a user token
token = admin.generate_token(
    sub="user@example.com",
    customer_id="company-123",
    email="user@example.com",
    customer_type="consumer",  # "producer", "consumer", or "both"
    tier="starter",
)

# Generate an admin token (convenience method)
admin_token = admin.generate_admin_token(
    sub="admin@helix.tools",
    customer_id="company-admin",
    email="admin@helix.tools",
    customer_type="both",
)

# Token with custom expiry and all claims
token = admin.generate_token(
    sub="user@example.com",
    customer_id="company-123",
    email="user@example.com",
    customer_type="producer",
    role="user",
    tier="enterprise",
    login_method="oauth",
    expiry_minutes=120,
)
```

**JWT Secret Resolution Order:**
1. Explicit `secret` argument
2. `HELIX_JWT_SECRET` environment variable
3. SSM Parameter Store (`/{env}/customers/{customer_id}/jwt_secret`)

**Token Claims:**
- Required: `sub`, `customer_id`, `email`, `customer_type`, `role`, `iss`, `iat`, `exp`
- Optional: `tier`, `authenticated_at`, `login_method`, `nbf`

## 🏗️ Architecture

### Class Hierarchy

```
HelixConsumer (base class)
    ↓
HelixProducer (adds upload capabilities)
    ↓
HelixAdmin (adds platform management)
```

Each class inherits all capabilities from its parent, so:
- **Producers** can also consume data
- **Admins** can produce and consume data

### Security & Encryption

The SDK implements a **compress-then-encrypt pipeline** with envelope encryption:

1. **Compression**: Gzip compression (configurable levels 1-9)
2. **Envelope Encryption**: 
   - Generates random 256-bit AES key
   - Encrypts data with AES-256-GCM
   - Encrypts AES key with AWS KMS
   - Packages as: `[key_len][encrypted_key][iv][tag][encrypted_data]`

This approach:
- ✅ Supports files of **unlimited size** (no KMS 4KB limit)
- ✅ Achieves **~90% space savings** through compression
- ✅ Provides **authenticated encryption** with GCM
- ✅ Uses AWS KMS for **secure key management**

### Network Configuration

- **API Timeouts**: 10s connect, 30s read (configurable)
- **Download Timeouts**: 10s connect, unlimited read (for large files)
- **Credential Validation**: Fail-fast with STS on initialization

## 📚 Examples

See the [`examples/`](examples/) directory for comprehensive usage examples:

- [`consumer_example.py`](examples/consumer_example.py) - Download, subscribe, poll notifications
- [`producer_example.py`](examples/producer_example.py) - Upload, update, manage datasets
- [`admin_example.py`](examples/admin_example.py) - Platform management (internal use)

## 🧪 Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=helix_connect --cov-report=html

# Run specific test suite
pytest tests/test_encryption_compression.py -v

# Run standalone pipeline test
python tests/test_pipeline_standalone.py
```

### Test Results

The SDK includes comprehensive tests for the encryption/compression pipeline:

```
✓ test_compress_data - 90.9% compression on JSON data
✓ test_envelope_encryption_decryption - AES-256-GCM envelope format
✓ test_full_pipeline_compress_then_encrypt - End-to-end verification
✓ test_wrong_order_encrypt_then_compress - Proves old order was broken
✓ 10 tests total, all passing
```

## ⚙️ Configuration

### Environment Variables

```bash
# Required
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export HELIX_CUSTOMER_ID="your-customer-id"

# Optional
export HELIX_API_ENDPOINT="https://api-go.helix.tools"
export HELIX_COMPRESSION_LEVEL="6"  # 1-9, default: 6
```

### Programmatic Configuration

```python
consumer = HelixConsumer(
    aws_access_key_id="...",
    aws_secret_access_key="...",
    customer_id="...",
    api_endpoint="https://api-go.helix.tools",
    region="us-east-1",
    compression_level=6  # 1=fastest, 9=best compression
)
```

## 🔐 Security Best Practices

1. **Never commit credentials** to version control
2. **Use environment variables** or AWS Secrets Manager
3. **Rotate credentials** regularly
4. **Use IAM roles** when running on AWS infrastructure
5. **Validate data integrity** after downloads
6. **Monitor CloudWatch logs** for anomalies

## 🐛 Error Handling

The SDK provides specific exceptions for different error scenarios:

```python
from helix_connect.exceptions import (
    AuthenticationError,
    PermissionDeniedError,
    DatasetNotFoundError,
    RateLimitError,
    UploadError,
    DownloadError,
    HelixError  # Base exception
)

try:
    consumer.download_dataset(dataset_id="...", output_path="...")
except AuthenticationError:
    print("Invalid AWS credentials")
except PermissionDeniedError:
    print("No access to this dataset - subscribe first")
except DatasetNotFoundError:
    print("Dataset doesn't exist")
except RateLimitError as e:
    print(f"Rate limit exceeded - retry after {e.retry_after}s")
except HelixError as e:
    print(f"General error: {e}")
```

## 📊 Performance

### Compression Benchmarks

Based on real-world testing with JSON data:

| Data Type | Original Size | Compressed | Savings |
|-----------|--------------|------------|---------|
| JSON (user data) | 92 KB | 8 KB | **90.9%** |
| CSV (sales data) | 150 KB | 18 KB | **88.0%** |
| XML (config) | 45 KB | 6 KB | **86.7%** |

**Note**: Encrypting first (old broken code) resulted in ~0% compression!

### Network Performance

- **Chunked uploads**: 8MB chunks for large files
- **Parallel downloads**: Multi-threaded for multiple datasets
- **Progress callbacks**: Real-time feedback without performance impact
- **Connection pooling**: Reuses HTTP connections for efficiency

## 🛠️ Development

### Build & Validate

```bash
# Build package
python -m build

# Run build script (includes validation)
./scripts/build.sh

# Lint code
flake8 helix_connect/
black helix_connect/
mypy helix_connect/
```

### Project Structure

```
helix-connect-sdk-python/
├── helix_connect/          # SDK source code
│   ├── __init__.py         # Package exports
│   ├── consumer.py         # Consumer API
│   ├── producer.py         # Producer API
│   ├── admin.py            # Admin API
│   └── exceptions.py       # Custom exceptions
├── tests/                  # Test suite
│   ├── test_encryption_compression.py
│   └── test_pipeline_standalone.py
├── examples/               # Usage examples
│   ├── consumer_example.py
│   ├── producer_example.py
│   └── admin_example.py
├── scripts/                # Build scripts
│   └── build.sh
├── pyproject.toml          # Package configuration
└── README.md               # This file
```

## 🤝 Contributing

We welcome contributions! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`pytest`)
4. Commit changes (`git commit -m 'Add amazing feature'`)
5. Push to branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

### Code Standards

- **Style**: Follow PEP 8 (enforced by `black`)
- **Types**: Include type hints for all functions
- **Tests**: Maintain >80% coverage
- **Docs**: Update docstrings for public APIs

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- **Homepage**: https://helix-connect.com
- **Documentation**: https://docs.helix-connect.com
- **GitHub**: https://github.com/helix-tools/helix-connect-sdk-python
- **PyPI**: https://pypi.org/project/helix-connect/
- **Support**: contact@helix.tools

## 📝 Changelog

### v1.0.0 (2024-10-14)

#### ✨ Features
- Initial release with Consumer, Producer, and Admin APIs
- AES-256-GCM envelope encryption for unlimited file sizes
- Compress-then-encrypt pipeline with ~90% space savings
- Real-time progress tracking for uploads/downloads
- SQS-based dataset update notifications
- Long-polling support with auto-download
- Comprehensive test suite (10 tests, all passing)

#### 🔧 Improvements
- Network timeouts (API: 30s, Downloads: unlimited)
- Credential validation on initialization (fail-fast)
- Proper exception handling throughout
- Type hints for all public APIs

#### 🐛 Bug Fixes
- Fixed KMS 4KB limit with envelope encryption
- Fixed compress-then-encrypt order (was reversed)
- Removed all emojis (encoding issues)
- Fixed bare except clauses

## 💬 Support

For questions, issues, or feature requests:

- **GitHub Issues**: https://github.com/helix-tools/helix-connect-sdk-python/issues
- **Email**: contact@helix.tools
- **Documentation**: https://docs.helix-connect.com

## 🙏 Acknowledgments

Built with:
- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) - AWS SDK for Python
- [cryptography](https://cryptography.io/) - Cryptographic recipes and primitives
- [requests](https://requests.readthedocs.io/) - HTTP library

---

**Made with ❤️ by the Helix Tools team**
