Metadata-Version: 2.4
Name: arraylake-datahub
Version: 0.1.1
Summary: DataHub ingestion source for Arraylake (Earthmover) Icechunk datasets
Project-URL: Homepage, https://earthmover.io
Project-URL: Documentation, https://docs.earthmover.io/integrations/catalogs/datahub
Project-URL: Changelog, https://docs.earthmover.io/changelog
Author-email: Earthmover <support@earthmover.io>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: arraylake,catalog,datahub,earthmover,icechunk,metadata,zarr
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Atmospheric Science
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: acryl-datahub<2,>=0.15.0
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: build; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Provides-Extra: test
Requires-Dist: pytest-cov; extra == 'test'
Requires-Dist: pytest>=8; extra == 'test'
Requires-Dist: respx>=0.21; extra == 'test'
Description-Content-Type: text/markdown

![Arraylake](https://earthmover-web-assets.s3.amazonaws.com/04-Arraylake-Lockup-Midnight-RGB-LARGE.png)

# arraylake-datahub

DataHub ingestion source for [Arraylake](https://earthmover.io) /
[Icechunk](https://icechunk.io) datasets.

The plugin crawls the Arraylake catalog over HTTPS and emits one DataHub
`Dataset` **per xarray-compatible Zarr group** in your repos. Access
control, credential vending, and querying stay in Arraylake — DataHub
becomes the discovery and metadata-search surface, with `externalUrl`
links back into the Arraylake web app for the actual data.

📖 **Full documentation:**
[Arraylake docs › Integrations › Catalogs › DataHub](https://docs.earthmover.io/integrations/catalogs/datahub)

## What you get in DataHub

For every xarray-compatible group, a `Dataset` named `<org>/<repo>/<group_path>` with:

- **Schema** — one field per Zarr array. Coordinates and data variables
  are distinguished via a `classification` flag in each field's `jsonProps`,
  alongside `shape`, `chunk_shape`, `dimension_names`, codecs, and the full
  CF attribute bag (`GRIB_*` keys are filtered as noise).
- **Description** — the group's CF `title` + `summary` when present,
  otherwise the repo description.
- **`externalUrl`** — direct link into the Arraylake page for that group.
- **`customProperties`** spread:
  - Arraylake metadata: provider, product_type, spatial/temporal coverage,
    spatial_resolution, update_freq, etc.
  - Storage: bucket platform/name/region, computed `storage_uri`.
  - CF group attributes: license, institution, creator/publisher, time
    and geospatial coverage, references, history.
  - Marketplace subscription details if the repo is from a listing.

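As an illustration of how the `classification` flag might be consumed downstream, the sketch below parses each field's `jsonProps` string and groups field paths by that flag. The sample payloads, the variable names, and the flag values `coordinate` and `data_variable` are assumptions made for this example; inspect a real ingested field for the exact keys and values the plugin emits.

```python
import json

# Hypothetical (fieldPath, jsonProps) pairs. The keys mirror those named above;
# the classification values and array shapes are assumed, not confirmed.
SAMPLE_FIELDS = [
    ("time", json.dumps({
        "classification": "coordinate",
        "shape": [8760],
        "chunk_shape": [720],
        "dimension_names": ["time"],
    })),
    ("t2m", json.dumps({
        "classification": "data_variable",
        "shape": [8760, 721, 1440],
        "chunk_shape": [24, 721, 1440],
        "dimension_names": ["time", "latitude", "longitude"],
    })),
]


def split_by_classification(fields):
    """Group field paths by the `classification` value in their jsonProps."""
    groups = {}
    for field_path, json_props in fields:
        props = json.loads(json_props) if json_props else {}
        key = props.get("classification", "unknown")
        groups.setdefault(key, []).append(field_path)
    return groups


groups = split_by_classification(SAMPLE_FIELDS)
print(groups)
```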
Repos with no xarray-compatible groups still emit one `Dataset` (repo
landing only) so every catalog entry is discoverable.

Orphan repos — catalog entries whose underlying Icechunk storage no
longer exists — are tagged with `arraylake_storage_status=orphan`.

## Install

```bash
pip install arraylake-datahub 'acryl-datahub[datahub-rest]'
```

## One-time platform registration

Register the `earthmover` custom data platform in your DataHub instance
once:

```bash
datahub put platform \
  --name earthmover \
  --display_name "Earthmover" \
  --logo https://app.earthmover.io/icon.svg
```

## Run

Save the following as `recipe.yml`:

```yaml
source:
  type: earthmover
  config:
    # token: ${ARRAYLAKE_TOKEN}      # default: read from env
    # api_url: https://api.earthmover.io
    orgs:                            # omit to crawl every org the token sees
      - earthmover-public
    repo_pattern:
      allow: [".*"]
      # deny: [".*-archive$"]
    env: PROD                        # DataHub fabric

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_GMS_TOKEN}
```

Then run:

```bash
export ARRAYLAKE_TOKEN=ema_xxxxxxxxxxxx
export DATAHUB_GMS_TOKEN=...

datahub ingest -c recipe.yml --preview   # dry run
datahub ingest -c recipe.yml             # for real
```

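The same recipe can also be driven from Python rather than the CLI. The following is a minimal sketch, assuming `acryl-datahub` and this plugin are installed in the same environment; `build_recipe` and `run_ingestion` are hypothetical helper names, not part of the package. Because `${...}` expansion applies to recipe files, the sketch reads `DATAHUB_GMS_TOKEN` from the environment directly and leaves the source `token` unset so the default (read from `ARRAYLAKE_TOKEN`) applies.

```python
import os


def build_recipe(orgs=None, server="http://localhost:8080"):
    """Return the recipe shown above as a plain dict."""
    source_config = {
        "repo_pattern": {"allow": [".*"]},
        "env": "PROD",
    }
    if orgs:
        source_config["orgs"] = list(orgs)

    sink_config = {"server": server}
    gms_token = os.environ.get("DATAHUB_GMS_TOKEN")
    if gms_token:
        sink_config["token"] = gms_token

    return {
        "source": {"type": "earthmover", "config": source_config},
        "sink": {"type": "datahub-rest", "config": sink_config},
    }


def run_ingestion(recipe):
    """Run the recipe through DataHub's pipeline API and fail on errors."""
    # Imported lazily so build_recipe works without acryl-datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe(orgs=["earthmover-public"])
# run_ingestion(recipe)  # needs ARRAYLAKE_TOKEN set and a reachable DataHub
```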
The most useful knobs are `orgs` (allowlist) and `repo_pattern` (regex
allow/deny). Full config below.

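To preview which repos a given `repo_pattern` would keep, the sketch below approximates allow/deny matching in plain Python: a deny match always wins, otherwise at least one allow pattern must match from the start of `<org>/<repo>`. This is an approximation for experimenting with patterns, not DataHub's own `AllowDenyPattern` implementation; the case-insensitive flag is an assumption, and the repo names are made up.

```python
import re


def is_allowed(name, allow=(".*",), deny=()):
    """Approximate allow/deny: deny wins, then any allow pattern must match."""
    flags = re.IGNORECASE  # assumption: matching ignores case
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# Hypothetical <org>/<repo> names, for illustration only.
repos = [
    "earthmover-public/era5",
    "earthmover-public/era5-archive",
    "acme/gfs",
]

kept = [
    repo
    for repo in repos
    if is_allowed(repo, allow=["earthmover-public/.*"], deny=[".*-archive$"])
]
print(kept)  # ['earthmover-public/era5']
```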
## Config

| Field               | Default                       | Notes                                                                |
| ------------------- | ----------------------------- | -------------------------------------------------------------------- |
| `token`             | `$ARRAYLAKE_TOKEN`            | Arraylake API token (`ema_*`). Read-only is sufficient.              |
| `api_url`           | `https://api.earthmover.io`   | Arraylake catalog API base URL.                                      |
| `web_url`           | `https://app.earthmover.io`   | Fallback base URL for `externalUrl` when a repo lacks `web_url`.     |
| `orgs`              | _all visible_                 | Allowlist of org slugs. Omit to crawl every org the token sees.      |
| `repo_pattern`      | allow `.*`                    | `AllowDenyPattern` matched against `<org>/<repo>`.                   |
| `env`               | `PROD`                        | DataHub fabric segment of the Dataset URN.                           |
| `platform`          | `earthmover`                  | Must match the platform registered above.                            |
| `walk_max_workers`  | `8`                           | Parallel HTTP fetches per repo when walking groups.                  |
| `request_timeout_s` | `30`                          | Timeout for each Arraylake API request, in seconds.                  |
| `max_retries`       | `3`                           | Retry attempts for a failed API request.                             |

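The `platform` and `env` settings both end up in each Dataset URN. As a rough sketch, assuming the plugin uses the displayed `<org>/<repo>/<group_path>` name as the URN's name component (not confirmed here), the result follows DataHub's standard dataset URN form. `dataset_urn` is a hypothetical helper, and the org, repo, and group values are placeholders.

```python
def dataset_urn(org, repo, group_path, platform="earthmover", env="PROD"):
    """Build a DataHub dataset URN from the parts of the dataset name."""
    # Assumption: the URN name equals the displayed <org>/<repo>/<group_path>.
    name = "/".join(part for part in (org, repo, group_path.strip("/")) if part)
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


urn = dataset_urn("earthmover-public", "era5", "surface")
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:earthmover,earthmover-public/era5/surface,PROD)
```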
## Required Arraylake API access

The token needs read access to:

- `GET /user/orgs`
- `GET /orgs/{org}/repos/paginated`
- `GET /repos/{org}/{repo}`
- `GET /repos/icechunk/{org}/{repo}/dataset-node`

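Before a first ingest it can help to confirm that the token reaches each of these endpoints. The sketch below builds the four URLs for one repo and defines a small probe that returns the HTTP status of a `GET`. The helper names are hypothetical, the `Authorization: Bearer` header is an assumption about how the API expects the token, and the `dataset-node` endpoint may require query parameters that are not shown here.

```python
import urllib.error
import urllib.request

API_URL = "https://api.earthmover.io"


def required_endpoints(org, repo, api_url=API_URL):
    """Return the full URLs for the endpoints listed above."""
    base = api_url.rstrip("/")
    return [
        f"{base}/user/orgs",
        f"{base}/orgs/{org}/repos/paginated",
        f"{base}/repos/{org}/{repo}",
        f"{base}/repos/icechunk/{org}/{repo}/dataset-node",
    ]


def probe(url, token, timeout_s=30):
    """Return the HTTP status code of a GET request to `url`."""
    # Assumption: the API accepts the token as a Bearer credential.
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Placeholder org and repo names; nothing is sent until probe() is called.
urls = required_endpoints("earthmover-public", "era5")
# for url in urls:
#     print(url, probe(url, os.environ["ARRAYLAKE_TOKEN"]))  # needs `import os`
```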
## Verification

After a successful ingest, the DataHub UI should show:

- The Earthmover platform with logo.
- One `Dataset` per xarray-compatible Zarr group, named `<org>/<repo>/<group_path>`.
- A Schema panel listing every coordinate and data variable with units
  and CF descriptions where available.
- A clickable "View in Source" link that lands on the Arraylake page for
  that group, where authentication and querying happen.

Tested against `acryl-datahub` 0.15.x on Python 3.10–3.13.

## Arraylake Support

**Email** — [support@earthmover.io](mailto:support@earthmover.io)

Email us with any questions, bug reports, or feature requests.

## License

Apache-2.0
