Metadata-Version: 2.4
Name: gulag-chief
Version: 0.1.1
Summary: YAML-driven script orchestrator and scheduler with monitor telemetry helper
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: PyYAML<7,>=6.0
Requires-Dist: croniter<3,>=2.0
Provides-Extra: dev
Requires-Dist: pytest<9,>=8.3; extra == "dev"

# Gulag Chief Guide

This guide is the complete, user-focused reference for running `gulag-chief` and configuring `chief.yaml`.

## 1. What Chief Does

`gulag-chief` is a YAML-driven orchestrator for Python worker scripts.

It provides:

- ordered script execution per job
- a human-readable scheduling DSL
- configuration validation with explicit errors
- schedule preview with next run times
- one-shot runs and daemon scheduling
- cron export for cron-compatible schedules
- optional monitor telemetry and worker instrumentation

## 2. Prerequisites

- Python 3.9+
- `pip`
- a local `chief.yaml`

Install package locally (from `/Users/joshbeaver/Documents/Summit/gulag/chief`):

```bash
python -m pip install .
```

For development and tests:

```bash
python -m pip install -e ".[dev]"
```

## 3. Quick Start (5 Minutes)

Run from your Chief project directory.

1. Validate config

```bash
gulag-chief validate --config chief.yaml
```

2. Preview schedules

```bash
gulag-chief preview --config chief.yaml
```

3. Run all enabled jobs once

```bash
gulag-chief run --config chief.yaml
```

4. Run one job

```bash
gulag-chief run --config chief.yaml --job sample-etl-pipeline
```

5. Run daemon scheduler

```bash
gulag-chief daemon --config chief.yaml --poll-seconds 10
```

6. Export cron-compatible schedules

```bash
gulag-chief export-cron --config chief.yaml
```

## 4. CLI Commands and Arguments

Chief supports a global config flag plus command-specific flags.

## Global Flag

- `--config PATH`: path to config file (default `chief.yaml`)

You can place it before or after the subcommand:

```bash
gulag-chief --config chief.yaml validate
gulag-chief validate --config chief.yaml
```

## `validate`

Purpose: validate YAML structure, script paths, schedule rules, and compilation mode.

```bash
gulag-chief validate [--config PATH]
```

## `preview`

Purpose: print schedule description, compilation mode, next runs, and cron equivalent when available.

```bash
gulag-chief preview [--config PATH] [--job NAME] [--count N]
```

Flags:

- `--job NAME`: preview one job
- `--count N`: number of future runs (default `5`, must be `>= 1`)

Examples:

```bash
gulag-chief preview
gulag-chief preview --job sample-etl-pipeline
gulag-chief preview --job sample-etl-pipeline --count 10
```

## `run`

Purpose: execute selected jobs immediately.

```bash
gulag-chief run [--config PATH] [--job NAME] [--respect-schedule]
```

Behavior:

- default: run all enabled jobs once in YAML order
- `--job NAME`: run one enabled job
- `--respect-schedule`: run only if the job is due now

Examples:

```bash
gulag-chief run
gulag-chief run --job sample-etl-pipeline
gulag-chief run --job sample-etl-pipeline --respect-schedule
```

## `daemon`

Purpose: continuous scheduler loop.

```bash
gulag-chief daemon [--config PATH] [--poll-seconds N]
```

Flags:

- `--poll-seconds N`: polling interval in seconds (default `10`, must be `>= 1`)

## `export-cron`

Purpose: print cron lines for cron-compatible schedules.

```bash
gulag-chief export-cron [--config PATH] [--job NAME]
```

Common CLI behavior:

- unknown `--job` name returns an explicit error
- `run` and `daemon` operate only on enabled jobs
- `preview` can still inspect disabled jobs
- `run --respect-schedule` is intended for cron-invoked runs with runtime guards

## 5. `chief.yaml` Configuration

Top-level keys:

- `version`
- `defaults`
- `monitor`
- `jobs`

Full shape example:

```yaml
version: 1

defaults:
  working_dir: .
  stop_on_failure: true
  overlap: skip
  timezone: UTC


jobs:
  - name: sample-etl-pipeline
    enabled: true
    working_dir: .
    stop_on_failure: true
    overlap: skip
    schedule:
      frequency: daily
      time: "06:00"
      timezone: America/New_York
    scripts:
      - path: workers/sample/extract_demo.py
        args: ["--source", "orders"]
        timeout: 120
      - path: workers/sample/transform_demo.py
      - path: workers/sample/load_demo.py
```

## `defaults` Keys

- `working_dir`: default cwd for scripts
- `stop_on_failure`: default per-job behavior
- `overlap`: default overlap policy (`skip | queue | parallel`)
- `timezone`: default schedule timezone

## Job Keys

- `name` (required, unique)
- `enabled` (default `true`)
- `working_dir` (inherits from defaults)
- `stop_on_failure` (inherits from defaults)
- `overlap` (inherits from defaults)
- `schedule` (required)
- `scripts` (required non-empty list)
- `monitor` (optional job-level override)

## Script Keys

- `path` (required)
- `args` (optional)
- `timeout` (optional, seconds)

`args` supports both list and shell-style string forms.

```yaml
scripts:
  - path: workers/sample/load_demo.py
    args:
      - --input
      - workers/sample/state/transformed_orders.json
      - --table
      - fact_orders
```

```yaml
scripts:
  - path: workers/sample/transform_demo.py
    args: --input in.json --output out.json --label "nightly run"
```

Path behavior:

- relative script paths resolve from job `working_dir`
- script files must exist at validation/load time

## 6. Scheduling DSL (Friendly but Strict)

Every job must define exactly one schedule mode:

- `daily`
- `weekly`
- `monthly`
- `yearly`
- `interval`
- `custom`

## Global Schedule Modifiers

Allowed on all frequencies:

- `timezone`
- `start` (ISO datetime)
- `end` (ISO datetime)
- `exclude` (list of `YYYY-MM-DD`)

Behavior:

- if `start/end` are naive datetimes, Chief interprets them in schedule timezone
- Chief will not run outside `[start, end]`
- `exclude` dates are enforced using local date in schedule timezone
- named holiday shortcuts are not supported in v1

## `daily`

Required:

- `time` (`HH:MM`, 24-hour)

Optional:

- `weekdays_only` (`true|false`)

Example:

```yaml
schedule:
  frequency: daily
  time: "14:30"
  weekdays_only: true
```

## `weekly`

Required:

- `day`
- `time`

`day` supports:

- single (`monday`)
- comma list (`monday,wednesday,friday`)
- named range (`monday-friday`)
- YAML list (`[monday, wednesday, friday]`)

Example:

```yaml
schedule:
  frequency: weekly
  day: monday-friday
  time: "09:00"
```

## `monthly`

Choose one style.

Style A: day of month.

```yaml
schedule:
  frequency: monthly
  day_of_month: 15
  time: "08:00"
```

Style B: ordinal weekday.

```yaml
schedule:
  frequency: monthly
  ordinal: last
  day: friday
  time: "18:00"
```

Valid ordinals:

- `first`
- `second`
- `third`
- `fourth`
- `last`

Rule:

- `day_of_month` and `ordinal/day` cannot be mixed

## `yearly`

Required:

- `month`
- `day_of_month`
- `time`

Example:

```yaml
schedule:
  frequency: yearly
  month: january
  day_of_month: 1
  time: "00:00"
```

## `interval`

Required:

- `every` in `<number><unit>` form

Supported units:

- `m` (minutes)
- `h` (hours)
- `d` (days)

Examples:

```yaml
schedule:
  frequency: interval
  every: 5m
```

```yaml
schedule:
  frequency: interval
  every: 2h
```

Rules:

- `time` is forbidden in interval mode
- seconds intervals (`s`) are intentionally unsupported in v1

## `custom`

Labeled cron-like fields:

- `minute`
- `hour`
- `day_of_month`
- `month`
- `day_of_week`

At least one field is required.

Example:

```yaml
schedule:
  frequency: custom
  minute: 0
  hour: 9
  day_of_week: monday-friday
```

## 7. Compilation Modes and Runtime Semantics

Chief compiles schedules into one of three kinds:

- `pure_cron`: fully representable by cron
- `hybrid`: cron trigger + runtime guard
- `runtime_only`: runtime scheduler logic required

Examples:

- `weekly friday 17:30` -> pure cron (`30 17 * * 5`)
- `monthly ordinal + day` -> hybrid
- `interval every: 90m` -> runtime only

## Job Execution Semantics

- scripts run sequentially in each job
- args are passed as parsed from YAML
- each script can have its own timeout
- if a script fails and `stop_on_failure: true`, remaining scripts are skipped
- if `stop_on_failure: false`, job continues to remaining scripts

## Daemon Semantics

- no startup catch-up (cron-like)
- deterministic ordering by YAML job order
- overlap behavior per job:
  - `skip`: drop trigger while running
  - `queue`: keep one pending trigger while running
  - `parallel`: allow same-job concurrent runs; global scheduling stays deterministic

## Timezone and DST Behavior

- schedule matching is timezone-aware
- wall-clock semantics are used
- spring-forward nonexistent times are skipped
- fall-back ambiguous times run once

## 8. Using Chief with Monitor

Chief Monitor is a companion service that receives telemetry events from Chief and workers, stores them, and exposes status/alerts in API and UI.

Chief emits lifecycle telemetry automatically when monitor is enabled. Worker scripts can also send custom messages via `gulag_chief.monitor_client`.

Telemetry delivery is best-effort and non-blocking, so job execution continues if monitor is down.

## Start Monitor Service (example)

```bash
cd monitor
npm install
npm run db:migrate
npm run dev
```

## Monitor Config in `chief.yaml`

```yaml
monitor:
  enabled: true
  endpoint: http://127.0.0.1:7410
  api_key: "" # optional; use env/.env in most setups
  timeout_ms: 400
  heartbeat_seconds: 15
  buffer:
    max_events: 5000
    flush_interval_ms: 1000
    spool_file: .chief/telemetry_spool.jsonl
```

Top-level monitor fields:

- `enabled`: global telemetry on/off
- `endpoint`: monitor base URL (`http://` or `https://`)
- `api_key`: optional auth key sent as `x-api-key`
- `timeout_ms`: HTTP send timeout
- `heartbeat_seconds`: Chief heartbeat interval (`chief.heartbeat`)
- `buffer.max_events`: in-memory queue cap
- `buffer.flush_interval_ms`: flush cadence
- `buffer.spool_file`: local JSONL fallback if endpoint is unavailable

## API Key Without YAML Secrets

Chief resolves API key in this order:

1. `monitor.api_key` in YAML if non-empty
2. `CHIEF_MONITOR_API_KEY` environment variable
3. `MONITOR_API_KEY` environment variable
4. `.env` in same directory as `chief.yaml`

Examples:

```bash
export CHIEF_MONITOR_API_KEY=your-secret
# or
export MONITOR_API_KEY=your-secret
```

`.env` example:

```dotenv
CHIEF_MONITOR_API_KEY=your-secret
```

## Per-Job Monitor Override

Job monitor settings inherit from top-level monitor, then can be overridden per job.

```yaml
monitor:
  enabled: true
  endpoint: http://127.0.0.1:7410

jobs:
  - name: critical-pipeline
    monitor:
      enabled: true
      check:
        enabled: true
        grace_seconds: 120
        alert_on_failure: true
        alert_on_miss: true
    schedule:
      frequency: interval
      every: 5m
    scripts:
      - path: workers/critical.py

  - name: noisy-ad-hoc-job
    monitor:
      enabled: false
    schedule:
      frequency: daily
      time: "06:00"
    scripts:
      - path: workers/noisy.py
```

Override behavior:

- `jobs[].monitor.enabled` defaults to top-level `monitor.enabled`
- `jobs[].monitor.check.*` customizes alert/check behavior per job
- you can disable noisy jobs while keeping telemetry enabled globally

Per-job key reference:

- `jobs[].monitor.enabled`
  Controls whether Chief emits telemetry for this job and injects monitor env vars into worker scripts for this job. Default: inherits top-level `monitor.enabled`.
- `jobs[].monitor.check.enabled`
  Controls whether the monitor check state for this job is evaluated (`UP` / `LATE` / `DOWN`) against expected next run timing. Default: inherits `jobs[].monitor.enabled`.
- `jobs[].monitor.check.grace_seconds`
  Additional allowed delay after `expected_next_at` before the job is marked `DOWN` and considered missed. Default: `120` (minimum `0`).
- `jobs[].monitor.check.alert_on_failure`
  If `true`, opens `FAILURE` alerts on failed job runs and closes them on recovery success (with `RECOVERY` alert). Default: `true`.
- `jobs[].monitor.check.alert_on_miss`
  If `true`, opens `MISSED` alerts when heartbeat is overdue beyond `grace_seconds`, and closes with `RECOVERY` when heartbeat resumes. Default: `true`.

## Worker Monitor Helper (`gulag_chief.monitor_client`)

Use inside workers:

```python
from gulag_chief.monitor_client import monitor

monitor.info("worker started", step="extract")
monitor.warn("slow upstream response", latency_ms=1450)
monitor.error("load failed", table="fact_orders")
```

Available methods:

- `monitor.debug(message, **meta)`
- `monitor.info(message, **meta)`
- `monitor.warn(message, **meta)`
- `monitor.error(message, **meta)`
- `monitor.critical(message, **meta)`

Chief injects context env vars into worker subprocesses:

- `CHIEF_MONITOR_ENDPOINT`
- `CHIEF_MONITOR_API_KEY`
- `CHIEF_RUN_ID`
- `CHIEF_JOB_NAME`
- `CHIEF_SCRIPT_PATH`
- `CHIEF_SCHEDULED_FOR`

## Telemetry Event Types

Chief lifecycle events:

- `job.started`
- `script.started`
- `script.completed`
- `job.completed`
- `job.failed`
- `job.next_scheduled`
- `chief.heartbeat`
- `daemon.dispatch`
- `daemon.overlap_skipped`
- `daemon.queued_pending`

Worker custom event type:

- `worker.message`

Levels:

- `DEBUG`
- `INFO`
- `WARN`
- `ERROR`
- `CRITICAL`

## 9. `export-cron` Workflow

Generate cron lines:

```bash
gulag-chief export-cron --config chief.yaml
```

Output includes:

- `CRON_TZ=<timezone>` lines
- cron entries for pure/hybrid schedules
- comments for runtime-only schedules

Important:

- hybrid schedules still require runtime guard enforcement
- use `gulag-chief run --respect-schedule` in exported cron commands

## 10. Tutorial: Add a New Job

1. Create your script under `workers/`.
2. Add a job block in `chief.yaml` with schedule and scripts.
3. If needed, add helper telemetry in worker code.
4. Validate and preview:

```bash
gulag-chief validate --config chief.yaml
gulag-chief preview --config chief.yaml --job your-job-name
```

5. Execute once:

```bash
gulag-chief run --config chief.yaml --job your-job-name
```

6. Move to daemon mode once stable:

```bash
gulag-chief daemon --config chief.yaml
```

## 11. Practical YAML Examples

## Example 1: Daily ETL with args

```yaml
jobs:
  - name: ga-daily
    enabled: true
    schedule:
      frequency: daily
      time: "06:00"
      timezone: America/New_York
    scripts:
      - path: scripts/google-analytics/google_analytics_to_supabase.py
        args:
          - --function
          - sessions_by_channel
          - --start-date
          - 2026-01-01
          - --end-date
          - 2026-01-31
        timeout: 1800
```

## Example 2: Monthly last Friday with exclusion

```yaml
jobs:
  - name: monthly-report
    enabled: true
    overlap: queue
    schedule:
      frequency: monthly
      ordinal: last
      day: friday
      time: "18:00"
      timezone: UTC
      exclude:
        - 2026-12-25
    scripts:
      - path: scripts/other/offer_report_to_supabase.py
        timeout: 1200
```

## Example 3: Runtime-only interval

```yaml
jobs:
  - name: rolling-check
    enabled: true
    schedule:
      frequency: interval
      every: 90m
    scripts:
      - path: scripts/weather/weather_to_supabase.py
        timeout: 600
```

## 12. Validation Rules and Common Errors

Chief enforces strict validation with explicit errors.

Key rules:

- schedule `frequency` is required and must be valid
- required fields must exist for selected frequency
- conflicting fields are rejected
- time must be valid `HH:MM`
- timezone must be valid IANA timezone
- interval mode cannot include `time`
- monthly must be either `day_of_month` or `ordinal + day`
- script files must exist

Representative error:

```text
Error: "monthly" requires either "day_of_month" or "ordinal + day".
```

## 13. Troubleshooting

## `ModuleNotFoundError: No module named 'pytest'`

Use the same interpreter for install and run:

```bash
python -m pip install pytest
python -m pytest -q tests/test_chief.py
```

Do not run test files directly with `python tests/test_chief.py`.

## Missing `PyYAML` or `croniter`

```bash
python -m pip install gulag-chief
# or from source
python -m pip install .
```

## Config validates but nothing runs with `run --respect-schedule`

This is expected if current time is not due. Inspect next runs:

```bash
gulag-chief preview --job <job-name>
```

## Monitor warnings in `chief.log`

If telemetry send warnings appear, jobs still execute. It indicates temporary monitor connectivity/auth issues.

Quick checks:

1. Ensure monitor is running.
2. Verify `monitor.endpoint` and API key source.
3. Confirm `/v1/health` is reachable.

