Metadata-Version: 2.4
Name: terrafox-datalake
Version: 0.1.5
Summary: Automated connector wrapper for streaming data securely from a private MinIO Data Lake
Author: Sethu Gopalan
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: fsspec==2025.3.0
Requires-Dist: s3fs==2025.3.0
Requires-Dist: boto3<1.42
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Terrafox Data Lake Connector
A simple, secure wrapper module to stream files out of your private data lake into remote notebook runtimes seamlessly.

A lightweight connector wrapper that streams datasets securely from private MinIO and other S3-compatible data lakes straight into pandas DataFrames.

By replacing file-system-style directory mapping wrappers (`s3fs`/`fsspec`) with direct object streaming via `boto3`, this package avoids network edge bottlenecks, Cloudflare proxy payload limits, and the 403 Forbidden credential collisions caused by background directory scanning.

---

## Key Features

- **Stream-Native Engine:** Reads multi-gigabyte datasets (e.g., 1.3 GiB+ CSVs) sequentially as a byte stream in network chunks, keeping memory consumption low on local machines and in Google Colab.
- **Bypasses Proxy Blocks:** Sidesteps common reverse-proxy constraints (such as the 100 MiB maximum body size enforced behind a Cloudflare Tunnel) during reads.
- **Universal and Reusable:** No hardcoded endpoints. Works out of the box with your configured defaults, or targets any custom local or cloud data lake cluster.
- **No Configuration Conflicts:** Keeps `botocore` configuration arguments, addressing style, and signature parameters out of your notebooks.
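
The streaming approach listed above can be illustrated with plain `boto3` and `pandas`. This is a minimal sketch, not the package's internal code: the helper names `make_client` and `iter_csv_chunks` are hypothetical, and the SigV4 signing and path-style addressing settings are assumptions that are common for MinIO deployments.

```python
import pandas as pd


def make_client(endpoint, access_key, secret_key):
    """Build an S3 client for a MinIO-style endpoint (hypothetical helper)."""
    # Imported here so the chunk reader below can be exercised without boto3.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        # Assumed settings: SigV4 signing and path-style bucket addressing.
        config=Config(signature_version="s3v4", s3={"addressing_style": "path"}),
    )


def iter_csv_chunks(client, bucket, key, chunksize=100_000):
    """Yield DataFrames of at most `chunksize` rows from a single object."""
    # get_object returns a file-like streaming body, so the object is read
    # front to back without listing the bucket first.
    body = client.get_object(Bucket=bucket, Key=key)["Body"]
    yield from pd.read_csv(body, chunksize=chunksize)
```

Processing each chunk as it arrives keeps peak memory bounded by `chunksize`. Concatenating every chunk into one DataFrame still needs enough memory for the full dataset.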

---

## Installation

```bash
pip install terrafox-datalake
```

## Quick Start

### 1. Connecting Natively via Interactive Prompt

If no credentials are found in the environment, calling `connect()` securely prompts you for your data lake credentials.

```python
import terrafox_datalake as dl

# Initialize the data lake client context securely
dl.connect()
```
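
The environment-first, prompt-second behavior can be sketched with the standard library alone. This is an illustration rather than the package's actual implementation: `resolve_credentials` is a hypothetical name, and the sketch assumes the `MINIO_USER` and `MINIO_PASSWORD` variables are the ones consulted.

```python
import getpass
import os


def resolve_credentials(env=None, prompt=input, secret_prompt=getpass.getpass):
    """Return (user, password), prompting only for values missing from env."""
    env = os.environ if env is None else env
    user = env.get("MINIO_USER") or prompt("Data lake user: ")
    # getpass does not echo the typed password back to the screen.
    password = env.get("MINIO_PASSWORD") or secret_prompt("Data lake password: ")
    return user, password
```

Passing the prompt functions as parameters keeps the helper easy to exercise in tests without a terminal.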

---

### 2. Silent Credentials Injection (Automated Workflows)

For automated scripts, CI/CD pipelines, headless environments, or to bypass the interactive login prompt in Google Colab, set your credentials as environment variables before initializing the connection.

```python
import os
import terrafox_datalake as dl

# Pre-populate session credentials
os.environ["MINIO_USER"] = "admin"
os.environ["MINIO_PASSWORD"] = "your_secure_password"
os.environ["MINIO_ENDPOINT"] = "https://minio.terrafoxai.com"

# Initialize the connection
dl.connect()
```
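
In a headless job an interactive prompt would stall, so it can help to fail fast when a variable is missing. The check below is a sketch: the function names are hypothetical, and treating all three variables as required is an assumption.

```python
import os

# Variable names taken from the example above; all three are assumed required.
REQUIRED_VARS = ("MINIO_USER", "MINIO_PASSWORD", "MINIO_ENDPOINT")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def require_credentials(env=None):
    """Raise before connecting if any required variable is missing."""
    missing = missing_credentials(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_credentials()` just before `dl.connect()` turns a stalled pipeline into an immediate, readable error.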

---

### 3. Advanced Usage: Connecting to Different Infrastructures

Terrafox Data Lake has no hardcoded endpoint. Pass `endpoint` to `connect()` to switch between production environments, staging clusters, or local development instances.

```python
import terrafox_datalake as dl

# Connect to an alternate cluster or local MinIO instance
dl.connect(endpoint="https://local-testing-cluster.local:9000")

# Read data from a different environment
df = dl.read_csv(
    bucket="test-bucket",
    key="metrics.csv"
)
```
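
When several clusters are in regular use, a small lookup table keeps endpoint URLs out of individual notebooks. The sketch below is an assumption-laden illustration: the environment names, the `DATALAKE_ENV` variable, and the `endpoint_for` helper are hypothetical, while the URLs are the ones used in this README.

```python
import os

# Hypothetical environment names; replace the URLs with your own clusters.
ENDPOINTS = {
    "production": "https://minio.terrafoxai.com",
    "local": "https://local-testing-cluster.local:9000",
}


def endpoint_for(name=None, env=None):
    """Look up an endpoint by name, defaulting to the DATALAKE_ENV variable."""
    env = os.environ if env is None else env
    name = name or env.get("DATALAKE_ENV", "production")
    try:
        return ENDPOINTS[name]
    except KeyError:
        raise ValueError(f"Unknown environment: {name!r}") from None


# Usage with the connector (requires terrafox-datalake to be installed):
#   import terrafox_datalake as dl
#   dl.connect(endpoint=endpoint_for("local"))
```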

---

## Example: Reading Data from a Data Lake

```python
import terrafox_datalake as dl

dl.connect()

df = dl.read_csv(
    bucket="bigdata",
    key="vehicles.csv"
)

print(df.head())
```

---

## Architecture Requirements

* **Python:** 3.7 or higher
* **Supported Storage:** MinIO (S3-compatible object storage)

### Dependencies

* pandas
* boto3 (<1.42)
* s3fs (==2025.3.0)
* fsspec (==2025.3.0)
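
If an environment already has other versions of these libraries, a quick check against the pins declared in the package metadata can surface conflicts. This is a sketch with hypothetical helper names; it relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward.

```python
from importlib import metadata  # standard library from Python 3.8

# Pins copied from the package metadata.
EXACT_PINS = {"fsspec": "2025.3.0", "s3fs": "2025.3.0"}
BOTO3_UPPER_BOUND = (1, 42)


def version_tuple(version):
    """Turn '1.41.5' into (1, 41, 5); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def pin_problems(installed):
    """Compare a {name: version} mapping against the pins; return messages."""
    problems = []
    for name, wanted in EXACT_PINS.items():
        if installed.get(name) != wanted:
            problems.append(f"{name}: expected {wanted}, found {installed.get(name)}")
    boto3_version = installed.get("boto3")
    if boto3_version is None or version_tuple(boto3_version)[:2] >= BOTO3_UPPER_BOUND:
        problems.append(f"boto3: expected <1.42, found {boto3_version}")
    return problems


def installed_versions(names=("fsspec", "s3fs", "boto3")):
    """Read installed versions, using None for packages that are absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Running `pin_problems(installed_versions())` returns an empty list when the environment matches the pins.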

---

## Features

* Secure interactive authentication
* Environment variable support for automation
* Native MinIO integration
* S3-compatible object storage access
* Simple DataFrame-based data retrieval
* Flexible infrastructure switching between environments

---

## License

MIT License
