Metadata-Version: 2.4
Name: s3explore
Version: 0.1.0
Summary: Query S3 files with SQL — no database, no pipeline, no infrastructure.
Author: Patrick MacCarthy
License: MIT
Project-URL: Homepage, https://github.com/PatrickRoyMac/s3_data_explorer
Project-URL: Repository, https://github.com/PatrickRoyMac/s3_data_explorer
Project-URL: Issues, https://github.com/PatrickRoyMac/s3_data_explorer/issues
Keywords: s3,data,sql,clickhouse,chdb,parquet,exploration
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: chdb>=2.0.2
Requires-Dist: boto3>=1.34.0
Requires-Dist: click>=8.1.0
Requires-Dist: pandas>=2.0.0

# s3explore

> Query S3 files with SQL — no database, no pipeline, no infrastructure.

s3explore wraps [chDB](https://github.com/ClickHouse/chdb) (ClickHouse's embedded Python engine) and boto3 to let you run SQL directly against files sitting in S3. Drop `s3explore.py` next to your notebook or run it from the terminal.

Works as a **CLI** and as a **Jupyter notebook library**, and produces **structured JSON output** for piping into LLMs like Claude Code.

---

## Prerequisites

- Python 3.9+
- AWS CLI with SSO configured (`aws configure sso`) — or any credentials boto3 can resolve

## Installation

```bash
pip install s3explore
```

**Or directly from GitHub:**
```bash
pip install git+https://github.com/PatrickRoyMac/s3_data_explorer.git
```

---

## Try it now — no AWS account needed

Query a public dataset (Amazon product reviews, ~150M rows of Parquet on S3) straight away:

```bash
# Schema — what columns are in these files?
s3explore schema "s3://datasets-documentation/amazon_reviews/*.parquet"

# Count rows per file
s3explore count "s3://datasets-documentation/amazon_reviews/*.parquet"

# Sample 5 rows
s3explore sample --rows 5 "s3://datasets-documentation/amazon_reviews/*.parquet"

# Run your own SQL
s3explore query "s3://datasets-documentation/amazon_reviews/*.parquet" \
  --sql "SELECT product_category, avg(star_rating) AS avg_stars, count() AS reviews
         FROM {table}
         GROUP BY product_category
         ORDER BY reviews DESC
         LIMIT 10"
```

No `--profile` flag needed — public buckets are accessed anonymously.
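
Under the hood, anonymous access corresponds to an unsigned `s3(...)` call in chDB. A minimal sketch of what that looks like at the chDB level, assuming the public bucket's HTTPS endpoint (the exact regional endpoint is an assumption; s3explore normally handles this translation for you):

```python
import chdb

# Illustrative endpoint only; the real region for this public bucket may differ.
url = "https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet"

# NOSIGN tells ClickHouse to skip request signing, i.e. read anonymously.
print(chdb.query(f"SELECT count() FROM s3('{url}', NOSIGN, 'Parquet')", "CSV"))
```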

---

## Quickstart

```bash
# See what's in a bucket
s3explore --profile my-profile ls s3://my-bucket/events/year=2025/

# Understand the schema
s3explore --profile my-profile schema s3://my-bucket/events/year=2025/*.parquet

# Count rows across files
s3explore --profile my-profile count s3://my-bucket/events/year=2025/*.parquet

# Sample 10 rows
s3explore --profile my-profile sample s3://my-bucket/events/year=2025/*.parquet

# Run your own SQL
s3explore --profile my-profile query s3://my-bucket/events/year=2025/*.parquet \
  --sql "SELECT event_type, count() AS n FROM {table} GROUP BY event_type ORDER BY n DESC"
```

---

## Commands

```
s3explore [--profile PROFILE] [--format table|json|csv] COMMAND S3_PATH [OPTIONS]
```

| Command  | What it does                              | Key options            |
|----------|-------------------------------------------|------------------------|
| `ls`     | List files at an S3 prefix (boto3)        |                        |
| `schema` | Show column names and types               | `--fmt`                |
| `sample` | Show N sample rows                        | `--rows N`, `--fmt`    |
| `count`  | Count rows per file                       | `--fmt`                |
| `query`  | Run custom SQL (use `{table}` placeholder)| `--sql`, `--fmt`       |

### Output formats

| Flag            | Output              | Use for                    |
|-----------------|---------------------|----------------------------|
| `--format table`| Pretty table        | Human reading (default)    |
| `--format json` | One JSON object/line| LLMs, pipes, scripts       |
| `--format csv`  | CSV with headers    | Export, downstream tooling |

---

## Notebook usage

Open `notebook_template.ipynb`, fill in the config cell, and run all cells.

```python
import s3explore

creds = s3explore.get_credentials(profile="my-profile")

# Schema
print(s3explore.get_schema("s3://my-bucket/data/*.parquet", creds))

# Sample rows
print(s3explore.sample_rows("s3://my-bucket/data/*.parquet", creds, n=10))

# Custom query
print(s3explore.run_user_query(
    "SELECT event_type, count() AS n FROM {table} GROUP BY event_type",
    "s3://my-bucket/data/*.parquet",
    creds,
))
```

The `{table}` placeholder in your SQL is replaced with the full `s3(...)` call at runtime — you never need to handle credentials in your SQL strings.
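
A minimal sketch of that substitution, assuming a `creds` object with `access_key`/`secret_key` attributes and an HTTPS-style URL (illustrative only, not the package's actual helper):

```python
def expand_table_placeholder(sql: str, s3_url: str, creds, fmt: str = "Parquet") -> str:
    """Replace {table} with a ClickHouse s3() table function call (sketch)."""
    table = f"s3('{s3_url}', '{creds.access_key}', '{creds.secret_key}', '{fmt}')"
    return sql.replace("{table}", table)
```

So `SELECT count() FROM {table}` runs as `SELECT count() FROM s3('<https url>', '<key>', '<secret>', 'Parquet')`, with the credentials injected for you.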

---

## Supported file formats

Auto-detected from the file extension in the S3 path:

| Extension          | Format                  |
|--------------------|-------------------------|
| `.parquet`         | Parquet                 |
| `.json` / `.jsonl` | JSONEachRow             |
| `.json.gz`         | JSONEachRow (auto-decompressed) |
| `.csv`             | CSVWithNames            |
| `.tsv`             | TabSeparatedWithNames   |
| `.gz` (bare)       | JSONEachRow (best-effort) |

Override with `--fmt`:
```bash
s3explore schema s3://bucket/data/*.gz --fmt JSONEachRow
```
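
The detection is roughly a suffix lookup. A sketch of the mapping above (illustrative, not the package's actual code):

```python
def detect_format(path: str) -> str:
    """Map a file extension to a ClickHouse input format name (sketch)."""
    lower = path.lower()
    if lower.endswith(".parquet"):
        return "Parquet"
    if lower.endswith((".json", ".jsonl", ".json.gz")):
        return "JSONEachRow"          # .json.gz is decompressed automatically
    if lower.endswith(".csv"):
        return "CSVWithNames"
    if lower.endswith(".tsv"):
        return "TabSeparatedWithNames"
    if lower.endswith(".gz"):
        return "JSONEachRow"          # bare .gz: best-effort default
    raise ValueError(f"Cannot detect format for {path}; pass --fmt explicitly")
```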

---

## LLM / Claude Code usage

s3explore is designed to be consumed by command-line LLM tools. Use `--format json` to get structured, machine-readable output:

```bash
# Let Claude Code explore your data
s3explore --profile my-profile --format json schema s3://bucket/data/*.parquet
s3explore --profile my-profile --format json sample s3://bucket/data/*.parquet
```
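
Because each result is one JSON object per line, the output is straightforward to parse from any script. A small sketch using only the standard library (the exact keys in each object depend on the command):

```python
import json
import subprocess

# Hypothetical invocation; adjust the profile and S3 path for your setup.
proc = subprocess.run(
    ["s3explore", "--profile", "my-profile", "--format", "json",
     "schema", "s3://bucket/data/*.parquet"],
    capture_output=True, text=True, check=True,
)

rows = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
for row in rows:
    print(row)
```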

See `CLAUDE.md` for a full tool description including the recommended exploration workflow.

---

## Troubleshooting

**Credentials expired (SSO)**
```bash
aws sso login --profile my-profile
```

**Format not detected**

Add `--fmt` with the explicit format name: `Parquet`, `JSONEachRow`, `CSVWithNames`.

**Bare `.gz` files (e.g. Kinesis Firehose output)**

These have no inner extension hint. s3explore defaults to `JSONEachRow` with a warning. Override: `--fmt JSONEachRow`.

---

## How it works

1. **boto3** resolves AWS credentials (SSO or any other method it supports) from your named profile
2. **boto3** lists files via `list_objects_v2` for the `ls` command
3. **chDB** builds and executes a `SELECT ... FROM s3('path', creds, 'Format')` query in-process — no network call to any database, no cluster, no cost beyond S3 GET requests
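
A rough sketch of those three steps using standard boto3 and chDB calls (illustrative, not the package's actual code):

```python
import boto3
import chdb

# 1. Resolve credentials from a named profile (SSO or otherwise).
session = boto3.Session(profile_name="my-profile")
creds = session.get_credentials().get_frozen_credentials()

# 2. List files under a prefix (what the `ls` command does).
s3 = session.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="events/year=2025/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# 3. Run the query in-process via chDB's s3() table function.
#    (Temporary credentials would also need the session token, creds.token.)
url = "https://my-bucket.s3.amazonaws.com/events/year=2025/*.parquet"
sql = f"SELECT count() FROM s3('{url}', '{creds.access_key}', '{creds.secret_key}', 'Parquet')"
print(chdb.query(sql, "CSV"))
```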

---

## Dependencies

```
chdb>=2.0.2      # ClickHouse embedded engine
boto3>=1.34.0    # AWS credential resolution + S3 listing
click>=8.1.0     # CLI
pandas>=2.0.0    # CSV export in the notebook template
```
