Metadata-Version: 2.4
Name: pyacid
Version: 0.1.0a2
Summary: Astronomical Catalog Inference Driver: XMATCH SQL over HATS-partitioned Parquet via DuckDB
Author-email: Mario Juric <mjuric@uw.edu>
License-Expression: BSD-3-Clause
Project-URL: Homepage, https://github.com/mjuric/acid
Project-URL: Repository, https://github.com/mjuric/acid
Project-URL: Issues, https://github.com/mjuric/acid/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Astronomy
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: duckdb<2,>=1.0
Requires-Dist: sqlglot<31,>=27
Requires-Dist: pyarrow>=14
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: pyyaml>=6
Requires-Dist: tqdm>=4
Requires-Dist: polars>=1.40.1
Requires-Dist: cdshealpix>=0.7
Provides-Extra: hats
Requires-Dist: hats>=0.4; extra == "hats"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: astropy>=5; extra == "dev"
Requires-Dist: mocpy>=0.15; extra == "dev"
Dynamic: license-file

# acid — Astronomical Catalog Inference Driver

SQL-driven crossmatching and analysis of HEALPix-partitioned
astronomical catalogs, executed by DuckDB. `acid` extends SQL with an
`XMATCH(...)` predicate for spherical-distance joins, runs each anchor
partition independently against a boundary-safe margin cache, and
aggregates the results.

Reads and writes the [HATS](https://hats.readthedocs.io) format used by
LINCC Frameworks (LSDB, hats-import) and by published catalogs such as
Gaia DR3 and Rubin DP1.

---

## Quick start

### One-shot query

```python
import acid

r = acid.sql(
    """
    SELECT a.id, b.id AS b_id, XMATCH_DISTANCE(b) AS d_arcsec
    FROM dia_source AS a
    JOIN object AS b ON XMATCH(radius_arcsec => 1.0)
    JOIN lightcurve AS lc ON a.id = lc.object_id
    """,
    catalogs="catalogs.yaml",
    workers=8,
)

r.df()              # -> pandas.DataFrame
r.arrow()           # -> pyarrow.Table
r.to_polars()       # -> polars.DataFrame
for batch in r.batches(): ...      # -> RecordBatch iterator
r.write_parquet("results.parquet", layout="single")
```

### Exploratory session (recommended for repeated queries)

```python
with acid.connect("catalogs.yaml", workers=8) as s:
    r1 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 1.0)")
    r2 = s.sql("SELECT a.id FROM a JOIN b ON XMATCH(r => 0.5)")

    # Promote an intermediate to a real catalog usable in later queries.
    s.materialize(
        "nearby",
        "SELECT a.id, a.ra, a.dec FROM a JOIN b ON XMATCH(r => 1.0)",
    )
    r3 = s.sql("SELECT * FROM nearby JOIN c ON XMATCH(r => 5.0, mode => 'all')")
```

A Session keeps the worker pool, the DuckDB connection in each worker,
and the parsed registry alive across queries. On a 64-core box it's
typically 5–10× faster than calling `acid.sql()` in a loop.

### CLI

```bash
# Query execution (--db accepts a directory of HATS catalogs or a YAML file)
acid query "SELECT COUNT(*) FROM object" --db datasets/ --out /tmp/result
acid query -f query.sql --db catalogs.yaml --out results/ --workers 32
echo "SELECT ..." | acid query --db datasets/ --out results/
acid validate "SELECT ..." --db datasets/

# Download catalogs (HTTP, SSH, or local; full or spatial subset)
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass
acid download https://data.lsdb.io/hats/two_mass/two_mass /data/two_mass --cone 50,-50,10
acid download user@server:/hats/gaia /data/gaia --columns ra,dec,mag --cone 180,0,5

# Inspect catalogs (local or remote)
acid inspect /data/two_mass                          # summary
acid inspect schema /data/two_mass                   # column schema
acid inspect https://data.lsdb.io/hats/two_mass/two_mass  # remote

# Build margin caches locally
acid hats build-margin /data/two_mass --margin-arcsec 5.0 --workers 16
```

`results/` is itself a valid HATS catalog (`lsdb.open_catalog(...)` and
`hats.read_hats(...)` will read it). Downloaded subsets are also valid
HATS catalogs with rebuilt `_metadata`.

### Catalog registry

The simplest way: point `--db` at a directory of HATS catalogs. Each
subdirectory with a `properties` file becomes a table named after the
directory. Margin caches (`dataproduct_type=margin`) are auto-skipped.

For more control, use a YAML file:

```yaml
catalogs:
  dia_source:
    path: /data/dia_source      # HATS root, or CatalogCollection root
    # Auto-detected from <path>/properties when present:
    #   ra_col            (from hats_col_ra)
    #   dec_col           (from hats_col_dec)
    #   hpix_order        (from <path>/partition_info.csv)
    #   neighbor_path     (from collection.properties or sibling '_margin' dir)
    #   neighbor_margin_arcsec  (from hats_margin_threshold)
    #   npix_suffix       (from hats_npix_suffix; default '.parquet')
    # Any auto-detected value can be overridden here.

  object:
    path: /data/object_collection    # a CatalogCollection root works too

  lightcurve:
    path: /data/lightcurve
    hpix_order: 5                    # explicit when partition_info.csv is absent

# Named MOC footprints for IN_MOC() filtering.
# Each entry is a path to a FITS file (HEALPix image or MOC FITS).
mocs:
  des_dr2: /data/mocs/des_dr2.fits
  known_artifacts: /data/mocs/artifacts.fits
  # If a catalog has a point_map.fits at its root, IN_MOC(<alias>, '<catalog_name>')
  # auto-loads it — no explicit entry needed.
```

---

## What `XMATCH` does

```sql
JOIN  b ON XMATCH(radius_arcsec => 1.0)                   -- nearest, inner
JOIN  b ON XMATCH(r => 1.0)                               -- 'r' is an alias
JOIN  b ON XMATCH(r => 1.0, mode => 'all')                -- every match within r
LEFT JOIN b ON XMATCH(r => 1.0)                           -- keep unmatched anchors

-- Distance is exposed as a SELECT-level function over the right alias.
SELECT a.id, XMATCH_DISTANCE(b) AS d FROM a JOIN b ON XMATCH(r => 1.0)
WHERE  XMATCH_DISTANCE(b) < 0.5

-- Ordinary joins, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT/OFFSET,
-- DISTINCT all work; cross-partition reduction is handled internally.
SELECT a.id, COUNT(*) AS n, AVG(XMATCH_DISTANCE(b)) AS avg_d
FROM a
JOIN  b ON XMATCH(r => 1.0)
JOIN  lightcurve AS lc ON a.id = lc.object_id
GROUP BY a.id
ORDER BY n DESC LIMIT 100

-- Footprint filtering via MOC (Multi-Order Coverage maps):
-- Restrict rows to a survey footprint or sky region.
SELECT a.id, a.ra, a.dec
FROM a JOIN b ON XMATCH(r => 1.0)
WHERE IN_MOC(a, 'des_dr2')              -- anchor inside DES footprint
  AND NOT IN_MOC(b, 'known_artifacts')  -- exclude artifact regions

-- IN_MOC also works in SELECT projections (per-row boolean):
SELECT a.id, IN_MOC(a, 'des_dr2') AS in_des FROM a
```

Semantics, in short:

- All XMATCHes in a query use the **anchor** (first FROM) table's
  coordinates, even after a `mode => 'all'` expansion.
- A right-table radius **must be ≤** that catalog's
  `neighbor_margin_arcsec`. Otherwise we'd silently miss boundary
  pairs; the analyzer rejects the query.
- `ORDER BY ... LIMIT K` pushes the top-K to each partition first;
  the reducer re-sorts the union and applies the global LIMIT/OFFSET.
- Aggregates / GROUP BY / DISTINCT / HAVING run in a phase-2 reducer
  over the per-partition Parquet output.

---

## Python API surface

```python
# Sessions (preferred for EDA)
s = acid.connect(catalogs, *, workers=1, duckdb_threads=None, cache_dir=None)
s.sql(query, *, output=None, inmem_row_limit=50_000_000) -> acid.Result
s.run(query, *, output=None)                              -> ExecutionResult
s.validate(query)                                          -> QueryPlan
s.explain(query)                                           -> str (rewritten SQL)
s.list_catalogs()                                          -> list[str]
s.add_catalog(name, **kwargs)                              -> TableSpec
s.materialize(name, query, *, ra_col=None, dec_col=None, overwrite=False) -> TableSpec
s.register_moc(name, source)                           # FITS path, mocpy.MOC, or (N,2) ranges
s.close()    # or use as a context manager

# One-shot convenience (transient Session under the hood)
acid.sql(query, *, catalogs, output=None, workers=1, ...)  -> acid.Result
acid.run(query, *, catalogs, output, workers=1)            -> ExecutionResult

# Result
r.num_rows, r.column_names, r.schema
r.column(name)         -> pa.ChunkedArray
r.arrow()              -> pa.Table
r.df() / r.to_pandas() -> pandas.DataFrame
r.to_polars()          -> polars.DataFrame
r.to_pylist()          -> list[dict]
r.batches(batch_size=None) -> Iterator[pa.RecordBatch]
r.head(n=10)           -> Result
r.write_parquet(path, layout="hats"|"single") -> Path
len(r), for batch in r: ...

# Errors (all inherit from acid.AcidError)
acid.RegistryError       # catalog config (missing path, mixed Norder, ...)
acid.ParseError          # SQL parse failures
acid.ValidationError     # unsupported XMATCH constructs
acid.ExecutionError      # per-partition execution failures
acid.SessionClosedError  # method called on a closed Session
```

---

## Layout assumptions

- Catalogs follow the **HATS** layout:
  `<root>/dataset/Norder=N/Dir=D/Npix=P.parquet` (or
  `Npix=P/*.parquet` when `hats_npix_suffix='/'`).
- Margin caches live as **sibling catalogs** (HATS canonical), at
  `<root>/margin_cache/...`, or any sibling dir matching
  `<name>_margin*`. `collection.properties` is consumed if present.
- Adaptive (per-pixel) Norder is supported: a catalog's
  `partition_info.csv` may list pixels at any orders, and XMATCH/ordinary
  joins across mixed-Norder catalogs are run via a refinement-tree
  enumeration that emits one work unit per coarsest cursor pixel where
  every joined catalog has ≤ 1 partition. Output is itself a valid
  HATS catalog whose `partition_info.csv` reflects the refinement.

---

## What's the speed story?

- DuckDB does the heavy SQL work and Parquet I/O. acid is roughly a
  Python orchestrator + a vectorized SciPy `cKDTree` crossmatcher.
- Each partition is independent → embarrassingly parallel across
  HEALPix pixels.
- The per-worker DuckDB connection is opened once and reused for every
  partition the worker processes; temp namespace is dropped between
  partitions. With a `Session`, the connection survives across queries.
- Top-K queries push the LIMIT to each partition. Aggregates write
  partial data to disk and reduce centrally.
- Column pruning: anchor and right tables are TEMP **views** over
  `read_parquet()`, so the final SELECT only pulls referenced columns
  from disk. Wide catalogs (150+ columns) don't slow down narrow SELECTs.
- Auto-spill: when `output_dir` is unset and the running result exceeds
  `inmem_row_limit` (default 50M rows), `acid` spills to a tempdir
  rather than OOM-ing the parent.

See `bench/match_all.py` and `bench/session_vs_oneshot.py` for
microbenchmarks.

---

## Install

### With uv (recommended for development)

```bash
uv sync --dev          # creates .venv, installs all deps + test + hats
uv run pytest          # run tests
```

### With pip

```bash
pip install -e .
# extras:  pip install -e .[hats,dev]
```

Requires Python 3.10+, DuckDB 1.x, SQLGlot ≥ 27, PyArrow ≥ 14,
NumPy ≥ 1.24, SciPy ≥ 1.10, PyYAML ≥ 6.

---

## Status

- **v0 (correctness):** XMATCH inner/left, mode 'nearest'/'all', chains,
  ordinary joins, distance via `XMATCH_DISTANCE(alias)`.
- **v1 (scale):** views + narrow side-tables, vectorized matcher,
  worker initializer, auto-spill, top-K pushdown, manifest.
- **v1.1 (HATS spec):** writes valid HATS catalogs, reads canonical
  property keys, supports `hats_npix_suffix='/'`, auto-discovers
  margin siblings via `collection.properties`.
- **v2 (EDA):** persistent `Session`, per-worker DuckDB connection,
  `Result` wrapper, `materialize()`.
- **v3 (adaptive Norder):** per-catalog `PartitionIndex`, refinement-tree
  tuple enumeration, vectorized pure-numpy `ang2pix_nested` exposed as a
  DuckDB UDF for per-pixel row filtering, LEFT-XMATCH/JOIN over partitions
  without coverage.
- **v4 (MOC footprint filtering):** `IN_MOC(<alias>, '<name>')` in WHERE
  restricts rows to a named sky region (Multi-Order Coverage map). Supports
  `NOT IN_MOC`, multiple predicates (AND-combined via mocpy set ops),
  catalog auto-resolution from `point_map.fits`, and use in SELECT/ORDER BY/
  CASE expressions. Three-level optimization: catalog-footprint scoping,
  cursor-pixel intersection, and partition-level pruning — all via the
  existing `_healpix_29` row-group pushdown fast path.

- **v5 (catalog ops):** `acid hats build-margin` builds HATS margin caches
  locally (validated against hats-import). `acid download` generates
  `point_map.fits`, auto-includes HATS RA/Dec/healpix columns. `acid query`
  accepts `--db <directory>` for zero-config usage, fails fast on errors,
  shows tqdm progress, shuffles work for load balancing. Bare column
  resolution via schema introspection. `LocalFetcher` for local I/O.

Tests: 176 passing (~30s parallel via pytest-xdist). Fixtures cached
across runs.

### Known limitations

- **XMATCH must be the entire `ON` predicate.** Compound predicates
  like `XMATCH(...) AND b.mag < 20` are rejected.
- **No CTEs / subqueries in the anchor position.**
- **RIGHT / FULL / CROSS JOIN XMATCH** not supported.
- **`IN_MOC` must be in conjunctive WHERE position** (top-level AND-chain,
  optionally negated). Disjunctive use (`IN_MOC(...) OR ...`) and
  `IN_MOC` in `JOIN ON` are rejected.
