Metadata-Version: 2.4
Name: tab-realdata-hub
Version: 0.1.0
Summary: Manifest-backed real-data ingestion and OpenML materialization for tabular workflows
Project-URL: Repository, https://github.com/bensonlee5/tab-realdata-hub
Project-URL: Changelog, https://github.com/bensonlee5/tab-realdata-hub/blob/main/CHANGELOG.md
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: dataset-ingestion,manifest,openml,tabular
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.15,>=3.14
Requires-Dist: numpy>=2.1
Requires-Dist: openml>=0.15
Requires-Dist: pandas>=2.2
Requires-Dist: pyarrow>=23.0
Requires-Dist: scikit-learn>=1.6
Description-Content-Type: text/markdown

# tab-realdata-hub

`tab-realdata-hub` materializes external tabular data sources into the
manifest-backed packed-shard contract consumed by `tab-foundry`.

`tab-realdata-hub` is the sole owner of that manifest contract. The parquet
manifest is the stable index layer; richer, still-evolving dataset and
provenance fields live in `metadata.ndjson`. Downstream consumers are expected
to read through this package rather than reimplement compatibility shims.
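
To make the sidecar format concrete: NDJSON stores one JSON object per line. The
sketch below parses such a file using only the standard library. It is meant for
ad hoc inspection, not as a replacement for this package's readers. The
`read_ndjson` helper and the `example_key` field are hypothetical; the real field
names are defined by `tab-realdata-hub` and are not assumed here.

```python
import json
import tempfile
from pathlib import Path


def read_ndjson(path):
    """Parse a newline-delimited JSON file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical sample content; real records carry fields defined by tab-realdata-hub.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "metadata.ndjson"
    sample.write_text('{"example_key": 1}\n\n{"example_key": 2}\n', encoding="utf-8")
    sample_records = read_ndjson(sample)
    print(len(sample_records))
```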

Install from the upstream git tag with:

```bash
python -m pip install "tab-realdata-hub @ git+https://github.com/bensonlee5/tab-realdata-hub.git@v0.1.0"
```

For repo-local development:

```bash
uv sync
```

The v1 surface is OpenML-first:

- build pinned OpenML bundle JSON from known task pools or live discovery
- materialize bundle tasks into packed shards plus manifest parquet
- inspect manifest-backed datasets through a stable library and CLI surface

Example (build a bundle, materialize it, then inspect the resulting manifest):

```bash
uv sync

tab-realdata-hub bundle build-openml \
  --out-path bundles/many_class_v1.json \
  --bundle-name many_class_v1 \
  --version 1 \
  --task-source tabarena_v0_1 \
  --max-features 10 \
  --max-classes 10 \
  --max-missing-pct 10.0

tab-realdata-hub materialize openml-bundle \
  --bundle-path bundles/many_class_v1.json \
  --out-root outputs/openml/many_class_v1

tab-realdata-hub manifest inspect \
  --manifest outputs/openml/many_class_v1/manifest.parquet
```
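
The same three steps can also be driven from Python, for example in a
provisioning script. The sketch below only assembles the command lines shown
above and, by default, prints them instead of running them. `build_pipeline` and
`run_pipeline` are hypothetical helper names, not part of this package, and the
sketch assumes the `tab-realdata-hub` executable is on `PATH` when
`dry_run=False`.

```python
import shlex
import subprocess

CLI = "tab-realdata-hub"


def build_pipeline(bundle_name, out_root, task_source="tabarena_v0_1"):
    """Return argv lists mirroring the README example (flags copied verbatim)."""
    bundle_path = f"bundles/{bundle_name}.json"
    return [
        [
            CLI, "bundle", "build-openml",
            "--out-path", bundle_path,
            "--bundle-name", bundle_name,
            "--version", "1",
            "--task-source", task_source,
            "--max-features", "10",
            "--max-classes", "10",
            "--max-missing-pct", "10.0",
        ],
        [
            CLI, "materialize", "openml-bundle",
            "--bundle-path", bundle_path,
            "--out-root", out_root,
        ],
        [
            CLI, "manifest", "inspect",
            "--manifest", f"{out_root}/manifest.parquet",
        ],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command, or execute it and stop on the first failure."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


commands = build_pipeline("many_class_v1", "outputs/openml/many_class_v1")
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Passing argv lists to `subprocess.run` rather than a shell string avoids quoting
problems if a bundle name or output path contains spaces.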
