Metadata-Version: 2.4
Name: forest-cli
Version: 0.1.0
Summary: git-like data management for arbitrary data trees: workspaces, checkouts, remotes, and sync
Author-email: Troy Sincomb <troysincomb@gmail.com>
License: MIT
Keywords: data-management,sync,cli
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: polars>=1.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: observability
Requires-Dist: sentry-sdk>=2.0; extra == "observability"
Provides-Extra: dev
Requires-Dist: deptry>=0.23; extra == "dev"
Requires-Dist: mypy>=1.14; extra == "dev"
Requires-Dist: pre-commit>=4.0; extra == "dev"
Requires-Dist: pylint>=3.3; extra == "dev"
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0; extra == "dev"
Requires-Dist: pytest-randomly>=3.15; extra == "dev"
Requires-Dist: pytest-rerunfailures>=15.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.6; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Requires-Dist: vulture>=2.14; extra == "dev"
Dynamic: license-file

# 🌲 forest <a href="#dogfood-this-repo-runs-forest"><img align="right" width="42%" src="docs/assets/hero.svg" alt="Animated isometric forest: data trees on a workspace platform, packets syncing up a git branch to a remote cloud"></a>

**git-like data management for arbitrary data trees.**

[![CI](https://github.com/tmsincomb/forest/actions/workflows/ci.yml/badge.svg)](https://github.com/tmsincomb/forest/actions/workflows/ci.yml)
[![Docs](https://github.com/tmsincomb/forest/actions/workflows/docs.yml/badge.svg)](https://github.com/tmsincomb/forest/actions/workflows/docs.yml)
[![Release](https://github.com/tmsincomb/forest/actions/workflows/release.yml/badge.svg)](https://github.com/tmsincomb/forest/actions/workflows/release.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-3776AB?logo=python&logoColor=white)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-2ea44f)](LICENSE)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![mypy: strict](https://img.shields.io/badge/mypy-strict-blue)](https://mypy-lang.org/)

Forest is the data-side parallel to git's version control. Git tracks code in
`.git/`; forest tracks large data — local layout plus remote sync — in
`.forest/`. It borrows git's mental model and verbs (`checkout`, `status`,
`push`, `pull`, `remote`, a HEAD-style pointer) so your git intuition carries
over, but the two domains never overlap and neither requires the other.

Forest is **domain-agnostic and self-contained**: it manages *any* data trees,
knows nothing about what the data means, and depends on no other project. It
moves files and tracks their sync state; it does not validate or interpret their
contents.

## Model

Fixed-depth, no arbitrary nesting:

```
workspace → checkout → stage → unit → files
```

- **Workspace** — a per-repo `.forest/` control area (registry + active pointer).
- **Checkout** — a named data view/focus registered in the workspace (like a git
  branch you stay rooted in). Switching is an O(1) pointer rewrite; data never
  moves.
- **Stage** — a named data category inside a checkout, with a remote layout.
- **Unit** — one addressable item within a stage (a subdirectory, a directory,
  or a file, per the stage's `sync_by`).
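
The three `sync_by` modes can be pictured as three ways of listing a stage directory. The sketch below is illustrative only: it assumes `subdirectory` means each immediate child directory is a unit, `directory` means the stage directory itself is the single unit, and `file` means each immediate file is a unit. `discover_units` is a hypothetical helper, not part of forest's API.

```python
from pathlib import Path


def discover_units(stage_dir: Path, sync_by: str) -> list[str]:
    """Return unit names for a stage under an ASSUMED reading of sync_by."""
    if sync_by == "subdirectory":
        # Assumption: every immediate child directory is one unit.
        return sorted(p.name for p in stage_dir.iterdir() if p.is_dir())
    if sync_by == "directory":
        # Assumption: the stage directory as a whole is the only unit.
        return [stage_dir.name]
    if sync_by == "file":
        # Assumption: every immediate file is one unit.
        return sorted(p.name for p in stage_dir.iterdir() if p.is_file())
    raise ValueError(f"unknown sync_by mode: {sync_by!r}")
```

Forest's real discovery rules may differ (for example in how nested files are handled); the sketch only shows the shape of the three modes.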

## Install

Transfers require [`rclone`](https://rclone.org/) on `$PATH`. Install forest
itself in editable mode from the `./forest` directory:

```bash
pip install -e ./forest
```

## Quick start

No config files are hand-written. Onboarding is a few commands:

```bash
forest init                      # create the nameless .forest/ workspace container
forest checkout demo             # create if absent, register + activate 'demo'
forest remote add origin s3://my-bucket/prefix   # allowed before any local data exists
forest add raw ./data/raw        # register stage 'raw' and bind it to a local path
forest push                      # sync every bound stage to the active remote
```

- `forest init` creates **only** `.forest/config.yaml` (`version: 1`,
  `checkouts: {}`) and a managed `.gitignore`. No root config, no checkout, no
  active pointer.
- `forest checkout <name>` switches to the named checkout. If the name is not
  yet registered, forest first creates and registers it (writing
  `.forest/checkouts/<name>/forest.yaml`) and then activates it.
  `forest checkout create <name>` is the explicit form.
- Remotes can be added before any local binding — useful when your data is
  remote-only at first.
- `forest add STAGE PATH` registers a new stage and binds it to a local path in
  one step (use `forest bind` to rebind an existing stage).

## Metadata layout

Everything forest owns lives under `.forest/`; your data does not.

```
.forest/
  config.yaml                     # workspace registry: version, checkouts{}
  HEAD                            # active checkout name (gitignored)
  checkouts/
    demo/
      forest.yaml                 # shared: stages, remotes, manifest
      local.yaml                  # user-local: active_remote, stage_paths (gitignored)
      sync_state.json             # user-local push/pull state (gitignored)
```

Shared metadata (`config.yaml`, each `forest.yaml`) is committed so a fresh
clone bootstraps with `bind` + `remote use` + `pull`. User-local files
(`HEAD`, `local.yaml`, `sync_state.json`) are gitignored.
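
The layout above is simple enough to navigate with a few path joins. A minimal sketch, assuming `HEAD` is a plain-text file holding only the active checkout name (possibly with a trailing newline); `active_checkout_files` and the two constants are hypothetical names used for illustration:

```python
from pathlib import Path

SHARED = ("forest.yaml",)                       # committed to git
USER_LOCAL = ("local.yaml", "sync_state.json")  # gitignored, per machine


def active_checkout_files(repo_root: Path) -> dict[str, Path]:
    """Map each per-checkout metadata file name to its path for the active checkout."""
    forest_dir = repo_root / ".forest"
    # Assumption: HEAD contains just the checkout name.
    name = (forest_dir / "HEAD").read_text(encoding="utf-8").strip()
    checkout_dir = forest_dir / "checkouts" / name
    return {fname: checkout_dir / fname for fname in SHARED + USER_LOCAL}
```

The function only computes paths; on a fresh clone `HEAD` and the user-local files will not exist yet.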

## Commands

| Command | Purpose |
|---|---|
| `forest init` | Create the workspace container, or report setup status if it exists. |
| `forest checkout create/adopt/list/current/remove <name>` | Manage checkouts; bare `forest checkout <name>` switches, creating first if needed. `remove --yes` skips the prompt for scripts. |
| `forest add STAGE PATH [--sync-by MODE]` | Register a new stage and bind it to a local path; `--sync-by` picks unit discovery (`subdirectory`/`directory`/`file`). |
| `forest bind [STAGE PATH]` / `forest unbind STAGE` | Manage local stage↔path bindings. |
| `forest remote add/remove/list/use/show` | Manage remotes; `use` selects the active remote (optional while only one remote exists). |
| `forest push / pull / status / diff / ls` | Sync and inspect against the active remote. Bare `push`/`pull`/`status`/`diff` cover every bound stage (unbound stages warn and skip); `--all` requires all stages bound. |
| `forest flow` | Emit a Mermaid data-flow diagram of the active checkout. |
| `forest migrate` | Migrate a legacy `biostore` layout in place (see below). |

Run any command with `-C <path>` to operate on another repo without `cd`.

Forest syncs **all** files in a data unit, skipping OS junk (`.DS_Store`,
AppleDouble `._*`, `*.tmp`). It applies no content-based include/exclude rules.
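
The junk filter described above amounts to a name-pattern check. A sketch, assuming the three listed patterns are matched against the file's base name and that matching is case-sensitive; `is_os_junk` is a hypothetical helper, and forest's actual list may be longer:

```python
from fnmatch import fnmatchcase

# Patterns taken from the paragraph above; treated here as the full list.
JUNK_PATTERNS = (".DS_Store", "._*", "*.tmp")


def is_os_junk(filename: str) -> bool:
    """Return True if a base file name matches one of the junk patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in JUNK_PATTERNS)
```

Anything that does not match, including other dotfiles such as `.gitkeep`, would be synced under this reading.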

## Config reference

Checkout `forest.yaml` (shared, committed):

```yaml
project: demo
remotes:
  origin:
    url: s3://my-bucket/prefix
    region: us-east-2          # optional; also endpoint, profile, key_file, known_hosts
stages:
  raw:
    remote_path: demo/raw      # optional; defaults to <checkout>/<stage>
    sync_by: subdirectory      # subdirectory | directory | file
```

Checkout `local.yaml` (per-machine, gitignored):

```yaml
active_remote: origin
stage_paths:
  raw: ../data/raw             # relative paths resolve from the workspace root
```
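
The two defaulting rules in these files (relative `stage_paths` resolve from the workspace root; `remote_path` defaults to `<checkout>/<stage>`) can be sketched as follows. Both function names are hypothetical, and the sketch assumes the workspace root is the directory that contains `.forest/`:

```python
import os
from pathlib import Path
from typing import Optional


def resolve_stage_path(workspace_root: Path, configured: str) -> Path:
    """Resolve a stage_paths entry; relative entries are joined to the workspace root."""
    path = Path(configured)
    if path.is_absolute():
        return path
    # normpath collapses ".." segments without touching the filesystem.
    return Path(os.path.normpath(workspace_root / path))


def effective_remote_path(checkout: str, stage: str, remote_path: Optional[str] = None) -> str:
    """Return the stage's remote_path, defaulting to <checkout>/<stage>."""
    return remote_path if remote_path is not None else f"{checkout}/{stage}"
```

With the example configs above, `raw` would resolve to a `data/raw` directory beside the workspace root and sync under `demo/raw` on the remote.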

## Environment variables

All are optional. Nothing is logged, recorded, or sent anywhere unless you
configure it; only the retry and circuit-breaker settings carry non-zero
defaults. Copy `.env.example` for a commented template; operational guides live
in `docs/runbooks/`.

| Variable | Default | Effect |
|---|---|---|
| `FOREST_LOG_FILE` | unset | Append structured logs (JSON lines) to this file. |
| `FOREST_LOG_FORMAT` | `json` | `json` or `text`; set without `FOREST_LOG_FILE` to log to stderr. |
| `FOREST_LOG_LEVEL` | `INFO` | Standard logging level name. |
| `FOREST_METRICS_FILE` | unset | Append metric samples as JSON lines for external collectors. |
| `FOREST_ANALYTICS_FILE` | unset | Opt-in local usage analytics (JSON lines); nothing leaves the machine. |
| `FOREST_SENTRY_DSN` | unset | Sentry error tracking; needs `pip install "forest-cli[observability]"`. |
| `FOREST_ALERT_WEBHOOK` | unset | POST failure alerts to this HTTPS endpoint (Slack/Mattermost compatible). |
| `FOREST_TRANSFER_RETRIES` | `2` | Extra attempts for transient rclone failures; `0` disables. |
| `FOREST_RETRY_BASE_DELAY` | `0.5` | Initial retry backoff in seconds; doubles per attempt. |
| `FOREST_BREAKER_THRESHOLD` | `5` | Consecutive transfer failures before the circuit opens; `0` disables. |
| `FOREST_BREAKER_RESET_SECONDS` | `60` | Cool-down before an open circuit allows a probe operation. |
| `FOREST_FLAGS` | unset | Comma-separated feature flags; `raw-logs` disables log secret-scrubbing. |
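
The retry and breaker rows describe two standard resilience patterns. The sketch below shows one way they could behave given the documented defaults; it is not forest's implementation. It assumes plain doubling with no jitter or cap, and that a success closes the circuit. `retry_delays` and `CircuitBreaker` are hypothetical names.

```python
import os
import time
from typing import Callable, Mapping, Optional


def retry_delays(env: Optional[Mapping[str, str]] = None) -> list[float]:
    """Backoff delays in seconds, one per extra attempt, doubling each time."""
    env = os.environ if env is None else env
    retries = int(env.get("FOREST_TRANSFER_RETRIES", "2"))
    base = float(env.get("FOREST_RETRY_BASE_DELAY", "0.5"))
    return [base * 2**attempt for attempt in range(retries)]


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after the cool-down."""

    def __init__(
        self,
        threshold: int = 5,
        reset_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if an operation may run now."""
        if self.threshold == 0 or self.opened_at is None:
            return True  # breaker disabled, or circuit closed
        return self.clock() - self.opened_at >= self.reset_seconds

    def record(self, ok: bool) -> None:
        """Record the outcome of one transfer."""
        if ok:
            self.failures = 0
            self.opened_at = None  # assumption: any success closes the circuit
            return
        self.failures += 1
        if self.threshold and self.failures >= self.threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe
```

With the defaults, a failing transfer would be retried after 0.5 s and then 1.0 s, and five consecutive failures would pause transfers for 60 s before a single probe is allowed.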

## Dogfood: this repo runs forest

This repository manages its own `examples/` tree with forest — a live
demonstration that `.forest/` and `.git/` coexist without overlapping. It was
set up with exactly the quick-start commands:

```bash
forest init
forest checkout demo
forest remote add origin s3://forest-test-542222635421-us-east-2-an --region us-east-2
forest add examples ./examples
forest push
```

Inspect the result:

```bash
git ls-files .forest        # what a clone gets: config.yaml + checkouts/demo/forest.yaml
cat .gitignore              # forest-managed: HEAD, local.yaml, sync_state.json stay local
forest status               # sync state of the examples stage
```

A fresh clone bootstraps the local half with `forest bind examples ./examples`
followed by `forest pull` (the single configured remote is used automatically).
Pulling needs AWS credentials for the bucket. Without them, the committed
`.forest/` layout is still the demonstration.

## Migrating from biostore

Existing biostore repos use `.biostore/` and `biostore.yaml`. Migrate in place:

```bash
forest migrate
```

This renames `.biostore/` → `.forest/`, each `biostore.yaml` → `forest.yaml`,
rewrites the managed `.gitignore` patterns, and verifies the registry parses. It
refuses to run if a `.forest/` already exists.
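
The rename steps can be previewed as a dry-run plan. This is a sketch of the renames only: it assumes every `biostore.yaml` lives somewhere under `.biostore/`, and it does not cover the `.gitignore` rewrite or the registry check. `plan_migration` is a hypothetical helper, not a forest command.

```python
from pathlib import Path


def plan_migration(repo_root: Path) -> list[tuple[Path, Path]]:
    """List (source, destination) renames without touching the filesystem."""
    old_dir = repo_root / ".biostore"
    new_dir = repo_root / ".forest"
    if new_dir.exists():
        # Mirrors the documented refusal when .forest/ is already present.
        raise FileExistsError(f"{new_dir} already exists; refusing to migrate")
    if not old_dir.is_dir():
        raise FileNotFoundError(f"no legacy layout found at {old_dir}")
    # Rename the per-checkout files first, then the control directory itself.
    renames = [
        (path, path.with_name("forest.yaml"))
        for path in sorted(old_dir.rglob("biostore.yaml"))
    ]
    renames.append((old_dir, new_dir))
    return renames
```

Printing the returned pairs before running `forest migrate` gives a rough idea of what will move.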

## Notes

- **Single active machine (v1).** `HEAD`/`local.yaml`/`sync_state.json` are
  git-invisible but may still be copied between machines by a file-syncing
  tool. Forest assumes one active machine; atomic writes plus a per-checkout
  `flock` protect only against concurrent writes on that machine.
- **Real filenames.** Forest stores data under real paths, not a
  content-addressed blob store.
- See `docs/adr/` for the design decisions behind the workspace/checkout model.
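
The atomic-write-plus-`flock` combination mentioned in the notes is a common POSIX pattern. A minimal sketch of that pattern, not forest's actual code: it is POSIX-only (it uses `fcntl`), and the lock file name `.lock` and both function names are assumptions made for illustration.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def checkout_lock(checkout_dir: Path) -> Iterator[None]:
    """Hold an exclusive advisory lock for one checkout while the block runs."""
    lock_path = checkout_dir / ".lock"  # hypothetical lock file name
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_name)  # the temp file still exists if we got here
        raise
```

Under these assumptions a pointer rewrite such as switching `HEAD` would be `with checkout_lock(d): atomic_write_text(head_path, "demo\n")`: readers see either the old name or the new one, never a partial file.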
