Metadata-Version: 2.4
Name: zagg
Version: 0.1.0
Summary: Multi-resolution aggregation for ICESat-2 ATL06 data using morton/healpix indexing
Project-URL: Homepage, https://github.com/englacial/zagg
Project-URL: Repository, https://github.com/englacial/zagg
Project-URL: Issues, https://github.com/englacial/zagg/issues
Author-email: Shane Grigsby <refuge@rocktalus.com>
License: MIT License
        
        Copyright (c) 2025 englacial
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: GIS
Requires-Python: >=3.12
Requires-Dist: boto3
Requires-Dist: earthaccess
Requires-Dist: fastparquet
Requires-Dist: h5coro>=0.0.8
Requires-Dist: healpy
Requires-Dist: mortie>=0.6.3
Requires-Dist: numpy>=2.0
Requires-Dist: obstore>=0.8.2
Requires-Dist: pandas>=2.2
Requires-Dist: pyarrow
Requires-Dist: pydantic-zarr>=0.9.1
Requires-Dist: pyproj
Requires-Dist: pyyaml
Requires-Dist: shapely
Requires-Dist: zarr>=3.1.5
Provides-Extra: analysis
Requires-Dist: cartopy>=0.25.0; extra == 'analysis'
Requires-Dist: cubed-xarray>=0.0.9; extra == 'analysis'
Requires-Dist: cubed>=0.24.0; extra == 'analysis'
Requires-Dist: geopandas; extra == 'analysis'
Requires-Dist: matplotlib>=3.10.8; extra == 'analysis'
Requires-Dist: notebook; extra == 'analysis'
Requires-Dist: xarray[io]; extra == 'analysis'
Requires-Dist: xdggs; extra == 'analysis'
Provides-Extra: lambda
Requires-Dist: astropy; extra == 'lambda'
Requires-Dist: cramjam; extra == 'lambda'
Requires-Dist: h5coro==0.0.8; extra == 'lambda'
Requires-Dist: numpy==2.2.6; extra == 'lambda'
Requires-Dist: pandas==2.2.3; extra == 'lambda'
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == 'test'
Description-Content-Type: text/markdown

# zagg - Multi-resolution Aggregation

Aggregate point observations to multi-resolution grids using HEALPix spatial indexing and serverless compute.

## Overview

zagg aggregates sparse point data (e.g., ICESat-2 ATL06 elevation measurements) to gridded products using HEALPix/Morton spatial indexing. Processing runs in parallel on AWS Lambda — each worker handles one spatial cell independently, writing to a shared [Zarr v3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) store following the [DGGS convention](https://github.com/zarr-conventions/dggs).
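The hierarchical-indexing idea can be illustrated with a toy 2-D Morton (Z-order) sketch. zagg itself uses the `mortie`/`healpy` libraries on the HEALPix nested scheme, so this pure-Python version is only an analogy for how parent cells fall out of child indices:

```python
# Toy 2-D Morton (Z-order) index: interleaving coordinate bits gives a
# single integer whose prefix identifies every coarser parent cell.
# (Illustrative only; zagg's real indexing is HEALPix-nested via mortie.)

def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def parent_cell(code: int, levels_up: int = 1) -> int:
    """Coarsen an index: each level drops one bit per dimension."""
    return code >> (2 * levels_up)

# Two nearby points share the same parent one level up:
a = morton_encode(5, 9)   # 147
b = morton_encode(4, 8)   # 144
assert parent_cell(a) == parent_cell(b)  # both coarsen to 36
```

This prefix property is what lets each Lambda worker own one parent cell and derive all of its children locally.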

## Features

- **Pre-computed granule catalogs** — query CMR once, process many times
- **Morton-based spatial indexing** — HEALPix nested scheme for hierarchical grids
- **Massive parallelism** — tested with up to 1,700 concurrent Lambda workers
- **Direct S3 access** — h5coro reads HDF5 via byte-range requests, no downloads
- **Cost-effective** — ~$0.006/cell (~$2 per full Antarctica run on ARM64)
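The byte-range pattern behind the direct S3 access can be sketched with the standard library. The URL and byte range below are placeholders, and h5coro's real reads go through authenticated S3 requests rather than plain HTTPS:

```python
# Sketch of the access pattern h5coro relies on: request only the bytes
# you need from a remote object instead of downloading the whole file.
# The URL is a placeholder, not a real granule location.
from urllib.request import Request

req = Request("https://example-bucket.s3.amazonaws.com/ATL06/granule.h5")
req.add_header("Range", "bytes=0-8191")  # e.g., the region holding HDF5 metadata
```

Issuing many such ranged reads lets a worker pull just the beams and variables it needs from a multi-gigabyte granule.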

## End-to-End Workflow

### Step 1: Build a Granule Catalog

Query NASA's CMR to build a mapping of spatial cells to granule S3 URLs.

```bash
# ICESat-2 convenience — dates are derived from the cycle number:
uv run python -m zagg.catalog --cycle 22 --parent-order 6

# General — explicit date range and spatial polygon:
uv run python -m zagg.catalog \
    --start-date 2024-01-06 --end-date 2024-04-07 \
    --short-name ATL06 \
    --polygon my_region.geojson \
    --parent-order 6
```

When `--polygon` is provided, the bounding box for the CMR query is computed automatically from the polygon's extent, and `morton_coverage` uses the polygon for cell discovery. When no polygon is given, Antarctic drainage basins are used as the default.

Output: `catalog_ATL06_2024-01-06_2024-04-07_order6.json`

See [Catalog API](docs/api/catalog.md) for full options.
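The bounding-box derivation described above (CMR query bbox from the polygon's extent) can be sketched as follows; `polygon_bbox` is an illustrative helper, not part of the zagg API:

```python
# Compute a (min_lon, min_lat, max_lon, max_lat) bounding box from a
# GeoJSON Polygon, as a sketch of how a CMR query bbox can be derived.
import json

def polygon_bbox(geojson_str: str) -> tuple:
    """Return (min_lon, min_lat, max_lon, max_lat) for a GeoJSON Polygon."""
    geom = json.loads(geojson_str)
    ring = geom["coordinates"][0]  # exterior ring; holes can't extend the bbox
    lons = [pt[0] for pt in ring]
    lats = [pt[1] for pt in ring]
    return (min(lons), min(lats), max(lons), max(lats))

region = json.dumps({
    "type": "Polygon",
    "coordinates": [[[-70.0, -75.0], [-60.0, -75.0],
                     [-60.0, -70.0], [-70.0, -70.0], [-70.0, -75.0]]],
})
print(polygon_bbox(region))  # (-70.0, -75.0, -60.0, -70.0)
```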

### Step 2: Deploy the Lambda Function

Build and deploy the Lambda function and its dependency layer.

```bash
# Build the function package
bash deployment/aws/build_function.sh

# Build the dependency layer (ARM64)
bash deployment/aws/build_arm64_layer.sh

# Deploy
bash deployment/aws/deploy.sh
```

See [Lambda Deployment](docs/deployment/lambda.md) and [ARM64 Build Guide](docs/deployment/arm64.md).

### Step 3: Run Processing

Processing reads a pipeline config YAML (data source, aggregation, output store) and a granule catalog. Run locally or dispatch to Lambda.

```bash
# Local processing (write to local Zarr):
uv run python -m zagg --config atl06.yaml --catalog catalog.json --store ./output.zarr

# Local processing (write to S3):
uv run python -m zagg --config atl06.yaml --catalog catalog.json --store s3://bucket/output.zarr

# Lambda dispatch (requires deployed Lambda function):
uv run python deployment/aws/invoke_lambda.py \
    --config atl06.yaml --catalog catalog.json

# Test with a few cells:
uv run python -m zagg --config atl06.yaml --catalog catalog.json --max-cells 5

# Dry run:
uv run python -m zagg --config atl06.yaml --catalog catalog.json --dry-run
```

The store path and output grid parameters are defined in the YAML config (`output.store`, `output.grid.child_order`); the store path can be overridden with `--store` on the command line.
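A hypothetical sketch of the config shape — only `output.store` and `output.grid.child_order` are documented here, and the values are illustrative placeholders (see the built-in `atl06.yaml` for the real layout):

```yaml
# Illustrative fragment only; values are placeholders.
output:
  store: s3://bucket/output.zarr   # overridable with --store
  grid:
    child_order: 11                # HEALPix order of the output cells
```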

### Step 4: Visualize Results

The output Zarr is a public DGGS dataset. The included notebook rasterizes HEALPix cells to a polar stereographic grid for fast rendering with `imshow`.

```bash
uv run jupyter notebook notebooks/rasterized_zarr.ipynb
```

Adjust `GRID_SPACING` in the notebook to control output resolution (default 2 km).
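The rasterization idea (paint each output pixel with the value of the HEALPix cell it falls in) can be sketched without dependencies. `cell_of` below is a stand-in bucketing function, not a real HEALPix lookup, and the grid is a plain nested list rather than the notebook's arrays:

```python
# Toy nearest-cell rasterization: every pixel of a regular grid gets the
# value of the (coarse) cell containing it. The notebook does the same
# with healpy lookups on a polar stereographic grid.

GRID_SPACING = 2_000  # metres, matching the notebook default (2 km)

def cell_of(x: float, y: float, spacing: float = 50_000) -> tuple:
    """Stand-in for a HEALPix lookup: bucket coordinates into coarse cells."""
    return (int(x // spacing), int(y // spacing))

def rasterize(cell_values: dict, x0, x1, y0, y1, spacing=GRID_SPACING):
    """Row-major 2-D list of cell values; pixels without data stay None."""
    nx = int((x1 - x0) // spacing)
    ny = int((y1 - y0) // spacing)
    return [[cell_values.get(cell_of(x0 + i * spacing, y0 + j * spacing))
             for i in range(nx)] for j in range(ny)]

values = {(0, 0): 1.5, (1, 0): 2.0}
img = rasterize(values, 0, 100_000, 0, 50_000)  # 25 rows x 50 cols
```

Because each pixel is an independent lookup, the output array can be handed straight to `imshow` for rendering.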

## Project Structure

```
zagg/
├── src/zagg/              # Main package (cloud-agnostic)
│   ├── __main__.py        # Local processing runner (python -m zagg)
│   ├── config.py          # YAML pipeline configuration
│   ├── processing.py      # Core aggregation pipeline
│   ├── catalog.py         # CMR query + catalog building
│   ├── schema.py          # Output schema + Zarr template
│   ├── store.py           # Store factory (local or S3)
│   ├── auth.py            # NASA Earthdata authentication
│   └── configs/           # Built-in pipeline configs (atl06.yaml)
├── deployment/            # Cloud-specific deployment
│   └── aws/               # Lambda handler, orchestrator, build scripts
├── notebooks/             # Visualization
├── docs/                  # Documentation
└── tests/                 # Test suite
```

## Documentation

- **[Architecture](docs/design/architecture.md)** — design philosophy, end-to-end flow diagram, key decisions
- **[Schema](docs/design/schema.md)** — aggregation dispatch, extending with new statistics
- **[API Reference](docs/api/catalog.md)** — catalog, processing, schema, auth modules
- **[Lambda Deployment](docs/deployment/lambda.md)** — AWS setup and production use
- **[ARM64 Build Guide](docs/deployment/arm64.md)** — building Lambda layers for ARM64

## Development

```bash
# Install
uv sync --all-groups

# Run tests
uv run pytest

# Lint
uv run ruff check src/
```

Requires Python >= 3.12, [uv](https://docs.astral.sh/uv/), AWS credentials (for Lambda), and a [NASA Earthdata](https://urs.earthdata.nasa.gov/) account (for data access).

## Performance

| Metric | Value |
|--------|-------|
| Execution time | 2–3 min average per cell |
| Memory | 2 GB configured, 1–1.5 GB typical |
| Throughput | Tested with up to 1,700 concurrent workers |
| Cost | ~$0.006/cell (~$2 per full Antarctica run on ARM64) |
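As a back-of-envelope check, the per-cell and per-run figures in the table imply roughly 300–350 cells per full Antarctica pass:

```python
# Consistency check on the cost figures above (approximate by design).
cost_per_cell = 0.006   # USD, ~$0.006/cell
full_run = 2.00         # USD, ~$2 per full Antarctica run
cells = full_run / cost_per_cell
print(round(cells))     # ~333 cells
```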

## License

MIT — see LICENSE file.
