Metadata-Version: 2.4
Name: anndata-metadata
Version: 0.1.3
Summary: Extract metadata from AnnData .h5ad files, locally or on S3
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: h5py>=3.13.0
Requires-Dist: pandas>=2.2.3
Requires-Dist: pyarrow>=20.0.0
Requires-Dist: s3fs>=2025.3.2

# anndata-metadata

**anndata-metadata** is a Python library and CLI tool for extracting metadata from [AnnData](https://anndata.readthedocs.io/) `.h5ad` files, both locally and on S3. When extracting metadata from S3, it uses partial downloads, fetching only the parts of each file it needs rather than the whole object, which makes extraction much faster.

It provides utilities to summarize cell, gene, and matrix information, and supports batch processing of directories.

It can create a `.parquet` index of the metadata for all of the files in a directory (S3 or local).

## Library Overview

The core library is in `src/anndata_metadata/` and provides:

- **Metadata extraction**: Functions to extract key metadata (cell count, gene count, matrix format, group contents, etc.) from AnnData `.h5ad` files.
- **S3 and local support**: Utilities to process files both on local disk and in S3 buckets.
- **JSON-serializable output**: All metadata is returned as Python dictionaries with native types.

## Installing

```sh
pip install anndata-metadata
```

## CLI Usage

**Usage:**
```sh
usage: anndata-metadata [-h] [-o OBS] [-c COUNT] [-f FILE_LIST] [-p S3_PREFIX]
                        [-m OBS_MAX_CARDINALITY] [-w WORKERS] [-b S3_BLOCK_SIZE]
                        [-e {thread,process}]
                        [input_paths ...] output

Extract AnnData metadata from file(s) or S3 object(s).

positional arguments:
  input_paths           Input file(s), directory, or S3 URI(s)/directory (may be
                        combined with --file-list)
  output                Output filename (JSON for a single file, Parquet for
                        multiple/--file-list, '-' for stdout)

options:
  -h, --help            show this help message and exit
  -o OBS, --obs OBS     Observation column to count (can be specified multiple times)
  -c COUNT, --count COUNT
                        Maximum number of files to process
  -f FILE_LIST, --file-list FILE_LIST
                        Path to a TSV (with a 'file' column) or newline-delimited list
                        of files to index; entries are prefixed with --s3-prefix
  -p S3_PREFIX, --s3-prefix S3_PREFIX
                        Prefix prepended to each --file-list entry (e.g. s3://bucket/prefix/)
  -m OBS_MAX_CARDINALITY, --obs-max-cardinality OBS_MAX_CARDINALITY
                        Auto-count value distributions for every obs column with at most
                        this many distinct values (the obs index column is always skipped)
  -w WORKERS, --workers WORKERS
                        Number of concurrent workers for multi-file/--file-list mode
                        (default 1)
  -b S3_BLOCK_SIZE, --s3-block-size S3_BLOCK_SIZE
                        s3fs read-ahead block size in bytes. Small (e.g. 262144) is best
                        over a high-latency link; larger (e.g. 1048576) is best in-region.
  -e {thread,process}, --executor {thread,process}
                        Concurrency model for multi-file mode. Reading H5AD metadata is
                        GIL-bound, so 'process' scales near-linearly with cores.
```
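
The `--file-list` and `--s3-prefix` options described above can be pictured with a short sketch. This is not the tool's actual implementation: the helper name `read_file_list` and its header-detection rule are assumptions made for illustration. It accepts either a TSV with a `file` column or a newline-delimited list, and prepends the prefix to every entry.

```python
import csv
import io


def read_file_list(text: str, s3_prefix: str = "") -> list[str]:
    """Parse a file list and prepend a prefix to every entry.

    Hypothetical helper: accepts a TSV with a 'file' column or a plain
    newline-delimited list of paths.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    if "file" in lines[0].split("\t"):
        # TSV mode: read the 'file' column by name and ignore other columns.
        reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
        entries = [row["file"].strip() for row in reader if row["file"]]
    else:
        # Plain mode: one path per line.
        entries = [line.strip() for line in lines]
    return [s3_prefix + entry for entry in entries]


tsv = "file\tsize\na.h5ad\t10\nb.h5ad\t20\n"
plain = "a.h5ad\nb.h5ad\n"
print(read_file_list(tsv, "s3://my-bucket/prefix/"))
print(read_file_list(plain, "s3://my-bucket/prefix/"))
```

Both calls yield the same two S3 URIs, so a curated TSV and a bare list of paths can be used interchangeably in this sketch.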

A single input file produces JSON; multiple inputs (several paths, a directory, or
`--file-list`) produce a resumable Parquet index. Extracted metadata includes the
detected `organism` and `ensembl_prefix` (inferred from Ensembl gene-ID prefixes).
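
As a rough illustration of inferring an organism from Ensembl gene-ID prefixes, the sketch below takes the most common prefix among a list of gene IDs and looks it up in a small table. The function name, the regular expression, and the three-entry mapping are assumptions for this example; the tool's own detection logic and supported organisms may differ.

```python
import re
from collections import Counter

# A few well-known Ensembl gene-ID prefixes (illustrative, not exhaustive).
ENSEMBL_PREFIX_TO_ORGANISM = {
    "ENSG": "Homo sapiens",
    "ENSMUSG": "Mus musculus",
    "ENSDARG": "Danio rerio",
}

# Ensembl gene IDs: a prefix ending in 'G', 11 digits, optional '.version'.
_GENE_ID = re.compile(r"^(ENS[A-Z]*G)\d{11}(\.\d+)?$")


def infer_ensembl_prefix(gene_ids):
    """Return the most common Ensembl gene prefix, or None if no ID matches."""
    counts = Counter()
    for gene_id in gene_ids:
        match = _GENE_ID.match(gene_id)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


gene_ids = ["ENSG00000139618", "ENSG00000141510.12", "TP53"]
prefix = infer_ensembl_prefix(gene_ids)
print(prefix, ENSEMBL_PREFIX_TO_ORGANISM.get(prefix))
```

Gene symbols such as `TP53` do not match the pattern and are ignored, so a file that mixes symbols and Ensembl IDs still resolves to a prefix.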

**Examples:**
```sh
anndata-metadata data/myfile.h5ad metadata.json
anndata-metadata data/ metadata.parquet
anndata-metadata s3://my-bucket/ metadata.parquet

# Multiple mixed inputs in one run
anndata-metadata a.h5ad b.h5ad s3://bucket/dir/ metadata.parquet

# Index a curated list of files from S3, counting low-cardinality obs columns.
# Resumable: re-running skips files already present in the output Parquet.
# In-region (e.g. on EC2) use process workers + a larger block size:
anndata-metadata --file-list files.tsv \
  --s3-prefix s3://my-bucket/prefix/ \
  --obs-max-cardinality 1000 \
  --workers 12 --executor process --s3-block-size 1048576 \
  metadata.parquet
```
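
To make the effect of `--obs-max-cardinality` concrete, here is a minimal sketch of counting value distributions only for columns with a bounded number of distinct values. It models `obs` as a plain dictionary of column name to list of values; the function name `count_low_cardinality` and the `index_column` parameter are hypothetical and stand in for whatever the tool does internally.

```python
from collections import Counter


def count_low_cardinality(obs, max_cardinality, index_column=None):
    """Count values for each column with at most max_cardinality distinct values.

    Hypothetical helper: 'obs' maps column names to lists of values, and
    'index_column' names the obs index, which is always skipped.
    """
    counts = {}
    for name, values in obs.items():
        if name == index_column:
            continue
        distribution = Counter(values)
        if len(distribution) <= max_cardinality:
            counts[name] = dict(distribution)
    return counts


obs = {
    "cell_id": ["c1", "c2", "c3", "c4"],
    "cell_type": ["T cell", "B cell", "T cell", "T cell"],
    "donor": ["d1", "d2", "d3", "d4"],
}
result = count_low_cardinality(obs, max_cardinality=2, index_column="cell_id")
print(result)  # {'cell_type': {'T cell': 3, 'B cell': 1}}
```

Here `donor` has four distinct values, which exceeds the limit of 2, so only `cell_type` is counted. A high limit such as 1000 keeps categorical columns while skipping per-cell identifiers.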


## Development

### Setup

This project uses [uv](https://github.com/astral-sh/uv) for fast Python environment management.

1. **Install dependencies:**
   ```sh
   uv sync              # installs the dependencies needed to run the command
   uv sync --group dev  # installs the dev dependencies for testing and formatting
   ```

2. **Run tests:**
   ```sh
   uv run pytest
   ```

3. **Format code:**
   ```sh
   uv run yapf --recursive . --in-place
   ```

4. **Type check (mypy):**
   ```sh
   uv run mypy
   ```

5. **Run the CLI:**
   ```sh
   PYTHONPATH=src uv run python -m anndata_metadata
   ```

6. **Build and test the wheel:**
   ```sh
   uv run python -m build
   ```
   Then install the wheel into a fresh virtual environment:
   ```sh
   python -m venv testenv
   source testenv/bin/activate
   pip install dist/anndata_metadata-*.whl --force-reinstall
   ```
   You can now run the CLI command:
   ```sh
   anndata-metadata
   ```


### Project Structure
```
.
├── src/
│ └── anndata_metadata/
│   ├── extract.py # Core metadata extraction logic
│   └── main.py # CLI entry point
├── test/ # Unit tests for extraction functions and CLI
├── README.md # Project documentation
└── pyproject.toml # Project metadata and dependencies
```

## TODO

- [x] Add mypy support
- [x] Add a wheel and submit it to PyPI
- [ ] CI/CD pipeline for updating PyPI
- [ ] Write partial results and skip previously written values
- [ ] Add module-level documentation
