Metadata-Version: 2.4
Name: juniper-data
Version: 0.6.0
Summary: Dataset generation and management service for the Juniper ecosystem
Author: Paul Calnon
License: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: arc-agi
Requires-Dist: arc-agi>=0.9.0; extra == "arc-agi"
Provides-Extra: api
Requires-Dist: fastapi>=0.100.0; extra == "api"
Requires-Dist: uvicorn[standard]>=0.23.0; extra == "api"
Requires-Dist: pydantic-settings>=2.0.0; extra == "api"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-timeout>=2.2.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "test"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "test"
Requires-Dist: httpx>=0.24.0; extra == "test"
Requires-Dist: coverage[toml]>=7.0.0; extra == "test"
Requires-Dist: juniper-data-client>=0.3.0; extra == "test"
Provides-Extra: observability
Requires-Dist: prometheus-client>=0.20.0; extra == "observability"
Requires-Dist: sentry-sdk[fastapi]>=2.0.0; extra == "observability"
Provides-Extra: dev
Requires-Dist: ruff>=0.9.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: bandit[sarif]>=1.9.4; extra == "dev"
Requires-Dist: pip-audit>=2.7.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: all
Requires-Dist: juniper-data[api,arc-agi,dev,observability,test]; extra == "all"
Dynamic: license-file

# Juniper Data

Dataset generation and management service for the Juniper ecosystem.

## Overview

Juniper Data is a centralized service that generates, stores, and serves datasets for the Juniper neural network projects. It ships eight dataset generators, including the classic two-spiral classification problem.

## Ecosystem Compatibility

This service is part of the [Juniper](https://github.com/pcalnon/juniper-ml) ecosystem.
Verified compatible versions:

| juniper-data | juniper-cascor | juniper-canopy | data-client | cascor-client | cascor-worker |
|---|---|---|---|---|---|
| 0.4.x | 0.3.x | 0.2.x | >=0.3.1 | >=0.1.0 | >=0.1.0 |

For full-stack Docker deployment and integration tests, see `juniper-deploy`.

## Architecture

JuniperData is the **foundational data layer** of the Juniper ecosystem. JuniperCascor and juniper-canopy both call JuniperData to generate and retrieve datasets.

```
┌─────────────────────┐     REST+WS      ┌──────────────────────┐
│   juniper-canopy     │ ◄──────────────► │    JuniperCascor     │
│   Dashboard         │                  │    Training Svc      │
│   Port 8050         │                  │    Port 8200         │
└──────────┬──────────┘                  └──────────┬───────────┘
           │ REST                                    │ REST
           ▼                                         ▼
┌──────────────────────────────────────────────────────────────┐
│                      JuniperData  ◄── (this service)          │
│                   Dataset Service  ·  Port 8100               │
└──────────────────────────────────────────────────────────────┘
```

**Data contract**: datasets are served as NPZ archives with keys `X_train`, `y_train`, `X_test`, `y_test`, `X_full`, `y_full` (all `float32`).
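
A consumer may want to check a downloaded artifact against this contract before training on it. The following is a minimal sketch, not code from this repository: `validate_npz` is a hypothetical helper, and the archive it checks is a stand-in built locally with illustrative shapes rather than a real service response.

```python
import io

import numpy as np

# Keys named in the data contract above.
EXPECTED_KEYS = ("X_train", "y_train", "X_test", "y_test", "X_full", "y_full")


def validate_npz(payload: bytes) -> dict[str, np.ndarray]:
    """Load an NPZ payload and check its keys and dtypes (hypothetical helper)."""
    with np.load(io.BytesIO(payload)) as archive:
        arrays = {key: archive[key] for key in archive.files}
    missing = [key for key in EXPECTED_KEYS if key not in arrays]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    wrong_dtype = [key for key in EXPECTED_KEYS if arrays[key].dtype != np.float32]
    if wrong_dtype:
        raise ValueError(f"non-float32 arrays: {wrong_dtype}")
    return arrays


# Stand-in archive built in memory; the (4, 2) shape is illustrative only.
buffer = io.BytesIO()
np.savez(buffer, **{key: np.zeros((4, 2), dtype=np.float32) for key in EXPECTED_KEYS})
arrays = validate_npz(buffer.getvalue())
print(sorted(arrays))
```

In practice the `payload` bytes would come from `GET /v1/datasets/{id}/artifact`.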

## Related Services

| Service | Relationship | Environment Variable |
|---------|-------------|---------------------|
| [juniper-cascor](https://github.com/pcalnon/juniper-cascor) | Consumes JuniperData for training datasets | `JUNIPER_DATA_URL=http://localhost:8100` |
| [juniper-canopy](https://github.com/pcalnon/juniper-canopy) | Consumes JuniperData for visualization data | `JUNIPER_DATA_URL=http://localhost:8100` |
| [juniper-data-client](https://github.com/pcalnon/juniper-data-client) | PyPI client library for this service | `pip install juniper-data-client` |

### Service Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `JUNIPER_DATA_HOST` | `0.0.0.0` | Listen address |
| `JUNIPER_DATA_PORT` | `8100` | Service port |
| `JUNIPER_DATA_LOG_LEVEL` | `INFO` | Log verbosity |

### Docker Deployment

```bash
# Full stack with all three services:
git clone https://github.com/pcalnon/juniper-deploy.git  # (private repository)
cd juniper-deploy && docker compose up --build
```

## Dependency Lockfile

The `requirements.lock` file pins exact dependency versions for reproducible Docker builds. The `pyproject.toml` retains flexible `>=` ranges for local development.

**Regenerate after changing dependencies in `pyproject.toml`:**

```bash
uv pip compile pyproject.toml --extra api --extra observability -o requirements.lock
```
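
After regenerating, a quick sanity check can confirm that every entry really is pinned. The sketch below is one possible check, assuming the lockfile uses the usual `name==version` lines with `#` comments; `unpinned_requirements` is a hypothetical helper and the sample text (including its version numbers) is made up for illustration.

```python
def unpinned_requirements(lock_text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in lock_text.splitlines():
        # Drop trailing comments such as "# via juniper-data".
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options (for example --hash=...).
        if not line or line.startswith("-"):
            continue
        requirement = line.rstrip("\\").strip()
        if "==" not in requirement:
            unpinned.append(requirement)
    return unpinned


# Illustrative lockfile content, not the project's real lockfile.
sample = """\
# This file was autogenerated
numpy==1.26.4
    # via juniper-data
pydantic>=2.0.0
"""
print(unpinned_requirements(sample))
```

For the real file, pass the text of `requirements.lock` (for example via `pathlib.Path("requirements.lock").read_text()`); an empty list means every entry is pinned.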

## Installation

### Basic Installation

```bash
pip install -e .
```

### With API Support

```bash
pip install -e ".[api]"
```

### Development Installation

```bash
pip install -e ".[dev]"
```

### Full Installation

```bash
pip install -e ".[all]"
```

## Quick Start

### Generate a Spiral Dataset

```python
from juniper_data.generators.spiral import SpiralGenerator

generator = SpiralGenerator()
dataset = generator.generate(n_points=100, n_spirals=2, noise=0.1)
```

### Start the API Server

```bash
uvicorn juniper_data.api.app:app --reload
```

## API Endpoints

| Endpoint                              | Method | Description                                          |
| ------------------------------------- | ------ | ---------------------------------------------------- |
| `/v1/health`                          | GET    | Health check                                         |
| `/v1/health/live`                     | GET    | Liveness probe                                       |
| `/v1/health/ready`                    | GET    | Readiness probe (checks storage)                     |
| `/v1/generators`                      | GET    | List all generators with schemas                     |
| `/v1/generators/{name}/schema`        | GET    | Get parameter schema for a generator                 |
| `/v1/datasets`                        | POST   | Create dataset (or return cached dataset)            |
| `/v1/datasets`                        | GET    | List dataset IDs                                     |
| `/v1/datasets/filter`                 | GET    | Filter metadata by generator/tags/date/name/version |
| `/v1/datasets/stats`                  | GET    | Aggregate dataset statistics                         |
| `/v1/datasets/versions`               | GET    | List all versions for a logical dataset name         |
| `/v1/datasets/latest`                 | GET    | Get latest version for a logical dataset name        |
| `/v1/datasets/batch-create`           | POST   | Create multiple datasets                             |
| `/v1/datasets/batch-delete`           | POST   | Delete multiple datasets                             |
| `/v1/datasets/batch-tags`             | PATCH  | Update tags on multiple datasets                    |
| `/v1/datasets/batch-export`           | POST   | Export multiple datasets as ZIP                     |
| `/v1/datasets/cleanup-expired`        | POST   | Delete expired datasets                             |
| `/v1/datasets/{id}`                   | GET    | Get dataset metadata                                 |
| `/v1/datasets/{id}`                   | DELETE | Delete a dataset                                     |
| `/v1/datasets/{id}/artifact`          | GET    | Download NPZ artifact                                |
| `/v1/datasets/{id}/preview`           | GET    | Preview first N samples as JSON                      |
| `/v1/datasets/{id}/tags`              | PATCH  | Add/remove tags on one dataset                       |

See [docs/api/JUNIPER_DATA_API.md](docs/api/JUNIPER_DATA_API.md) for full endpoint documentation including filtering, batch operations, and tagging.

### Named Dataset Versioning

`POST /v1/datasets` supports logical names for versioned datasets:

- Set `name` to group related datasets into a version series.
- Each newly persisted dataset with the same `name` auto-increments `meta.dataset_version` (`1`, `2`, `3`, ...).
- Repeating an identical request returns the cached dataset and keeps its existing version.
- Use `GET /v1/datasets/versions?name=<dataset_name>` to view history and `GET /v1/datasets/latest?name=<dataset_name>` to resolve the latest.
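
The rules above can be modeled with a small in-memory toy. This is a sketch of the described semantics only, not the service's implementation: `VersionRegistry` is a hypothetical class, and fingerprinting a request by its sorted JSON form is an assumption about how "identical request" might be decided.

```python
import json


class VersionRegistry:
    """Toy in-memory model of named dataset versioning (not the service's code)."""

    def __init__(self) -> None:
        self._cache: dict[str, int] = {}   # request fingerprint -> assigned version
        self._latest: dict[str, int] = {}  # logical name -> highest version so far

    def create(self, name: str, request: dict) -> int:
        """Return the cached version for a repeated request, else the next version."""
        fingerprint = json.dumps({"name": name, "request": request}, sort_keys=True)
        if fingerprint in self._cache:
            return self._cache[fingerprint]
        version = self._latest.get(name, 0) + 1
        self._latest[name] = version
        self._cache[fingerprint] = version
        return version

    def versions(self, name: str) -> list[int]:
        """All versions recorded for a logical name, oldest first."""
        return list(range(1, self._latest.get(name, 0) + 1))

    def latest(self, name: str) -> int:
        """Highest version recorded for a logical name."""
        return self._latest[name]


registry = VersionRegistry()
first = registry.create("spirals", {"n_points": 100})
second = registry.create("spirals", {"n_points": 200})
repeat = registry.create("spirals", {"n_points": 100})  # identical to the first request
print(first, second, repeat, registry.latest("spirals"))  # prints: 1 2 1 2
```

Here `versions` and `latest` play the roles of `GET /v1/datasets/versions` and `GET /v1/datasets/latest`.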

## Project Structure

```
juniper-data/
├── juniper_data/
│   ├── core/           # Core functionality and base classes
│   ├── generators/     # Dataset generators (8 types)
│   │   ├── spiral/     # Multi-spiral classification
│   │   ├── xor/        # XOR classification
│   │   ├── gaussian/   # Mixture of Gaussians
│   │   ├── circles/    # Concentric circles
│   │   ├── checkerboard/ # 2D checkerboard pattern
│   │   ├── csv_import/ # CSV/JSON file import
│   │   ├── mnist/      # MNIST / Fashion-MNIST
│   │   └── arc_agi/    # ARC-AGI visual reasoning
│   ├── storage/        # Dataset persistence layer
│   ├── api/            # FastAPI application
│   │   └── routes/     # API route handlers
│   └── tests/          # Test suite
│       ├── unit/       # Unit tests
│       └── integration/ # Integration tests
├── pyproject.toml      # Project configuration
└── README.md           # This file
```

## Development

### Running Tests

```bash
pytest
```

### Running Tests with Coverage

```bash
pytest --cov=juniper_data --cov-report=html
```

### Code Formatting

```bash
# Tests live under juniper_data/tests (see Project Structure), so one path covers both.
ruff format juniper_data
ruff check --fix juniper_data
```

### Type Checking

```bash
mypy juniper_data
```

## Juniper Ecosystem

| Repository | Description |
|-----------|-------------|
| [juniper-data](https://github.com/pcalnon/juniper-data) | Dataset generation service (this repo) |
| [juniper-cascor](https://github.com/pcalnon/juniper-cascor) | CasCor neural network training service |
| [juniper-canopy](https://github.com/pcalnon/juniper-canopy) | Real-time monitoring dashboard |
| [juniper-data-client](https://github.com/pcalnon/juniper-data-client) | PyPI: `juniper-data-client` |
| [juniper-cascor-client](https://github.com/pcalnon/juniper-cascor-client) | PyPI: `juniper-cascor-client` |
| [juniper-cascor-worker](https://github.com/pcalnon/juniper-cascor-worker) | PyPI: `juniper-cascor-worker` |

## License

MIT License - Copyright (c) 2024-2026 Paul Calnon

## Gitleaks

![gitleaks badge](https://img.shields.io/badge/protected%20by-gitleaks-blue)
