Metadata-Version: 2.4
Name: faceberg
Version: 0.1.3
Summary: Bridge HuggingFace datasets with Apache Iceberg
Project-URL: Homepage, https://github.com/kszucs/faceberg
Project-URL: Documentation, https://github.com/kszucs/faceberg
Project-URL: Repository, https://github.com/kszucs/faceberg
Author-email: Krisztian Szucs <kszucs@users.noreply.github.com>
License: Apache-2.0
License-File: LICENSE
Keywords: data-lake,datasets,huggingface,iceberg
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Requires-Dist: click>=8.0.0
Requires-Dist: datasets>=2.0.0
Requires-Dist: fsspec>=2023.1.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: jinja2>=3.1.6
Requires-Dist: litestar>=2.0.0
Requires-Dist: pyarrow>=21.0.0
Requires-Dist: pyiceberg>=0.10.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: uuid-utils>=0.9.0
Requires-Dist: uvicorn[standard]>=0.27.0
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: duckdb>=0.10.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-playwright>=0.7.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: requests>=2.31.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

![Faceberg](https://github.com/kszucs/faceberg/blob/main/faceberg.png?raw=true)

# Faceberg

**Bridge HuggingFace datasets with Apache Iceberg tables — no data copying, just metadata.**

Faceberg maps HuggingFace datasets to Apache Iceberg tables. Your catalog metadata lives on HuggingFace Spaces with an auto-deployed REST API, and any Iceberg-compatible query engine can access the data.

## Installation

```bash
pip install faceberg
```

## Quick Start

```bash
export HF_TOKEN=your_huggingface_token

# Create a catalog on HuggingFace Hub
faceberg user/mycatalog init

# Add datasets
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add openai/gsm8k --config main

# Query with interactive DuckDB shell
faceberg user/mycatalog quack
```

```sql
SELECT label, substr(text, 1, 100) AS preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;
```

## How It Works

```
HuggingFace Hub
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  HF Datasets        │    │  HF Spaces (Catalog)    │ │
│  │  (Original Parquet) │◄───│  • Iceberg metadata     │ │
│  │                     │    │  • REST API endpoint    │ │
│  │  stanfordnlp/imdb/  │    │  • faceberg.yml         │ │
│  │   └── *.parquet     │    │                         │ │
│  └─────────────────────┘    └───────────┬─────────────┘ │
│                                         │               │
└─────────────────────────────────────────┼───────────────┘
                                          │ Iceberg REST API
                                          ▼
                              ┌─────────────────────────┐
                              │     Query Engines       │
                              │  DuckDB, Pandas, Spark  │
                              └─────────────────────────┘
```

**No data is copied** — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.

## Python API

```python
import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")
df = table.scan(limit=100).to_pandas()
```

## Share Your Catalog

Your catalog is accessible to anyone via the REST API:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)")

result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
```
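
Spark can attach to the same endpoint through Iceberg's REST catalog support. A sketch of the relevant `spark-defaults.conf` entries — the catalog name `cat` and the URL are placeholders, and a matching `iceberg-spark-runtime` jar must be on the classpath:

```properties
spark.sql.extensions          org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.cat         org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.cat.type    rest
spark.sql.catalog.cat.uri     https://user-mycatalog.hf.space
```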

## Documentation

**[Read the docs →](https://faceberg.kszucs.dev/)**

- [Getting Started](https://faceberg.kszucs.dev/) — Full quickstart guide
- [Local Catalogs](https://faceberg.kszucs.dev/local.html) — Use local catalogs for development
- [DuckDB Integration](https://faceberg.kszucs.dev/integrations/duckdb.html) — Advanced SQL queries
- [Pandas Integration](https://faceberg.kszucs.dev/integrations/pandas.html) — Load into DataFrames

## Development

```bash
git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e ".[dev]"  # includes dev tools: pytest, ruff, mypy, black, ...
```

## License

Apache 2.0
