Metadata-Version: 2.4
Name: edgescaleai-cube-mcp
Version: 0.3.30.dev20260601020019
Summary: MCP server for EdgescaleAI Cube operations
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: websockets>=12.0
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=1.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

# cube-mcp

MCP server for EdgescaleAI Cube management and Apollo deployments.

> **New owner / inheriting this repo?** Jump to [Operations & Ownership](#operations--ownership). The two runbooks you need are [docs/admin.md](docs/admin.md) (full RBAC + infra reference) and [docs/adding-new-user.md](docs/adding-new-user.md) (onboard a teammate end-to-end). Both are written so you can point Claude Code at them and have it execute the steps for you.

## Architecture

**cube-agent** runs locally as an MCP server (via `npx @edgescaleai/cube-mcp`). It handles authentication, Docker builds, Helm chart packaging, and app proxy tunnels. Server-side operations (Kubernetes, Teleport, Apollo) are proxied to **cube-cloud**, a FastAPI backend hosted on AWS ECS.

```
┌──────────────────────────────────────────────────────────────────────────────┐
│  LOCAL (developer machine)                                                   │
│                                                                              │
│  ┌─────────────┐    MCP JSON-RPC     ┌──────────────────────────────────┐   │
│  │ Claude Code  │◄──────────────────►│  cube-agent (MCP server)         │   │
│  └─────────────┘                     │                                  │   │
│                                      │  Local tools:                    │   │
│                                      │   • agent_login_browser          │   │
│                                      │   • agent_login / logout         │   │
│                                      │   • agent_status                 │   │
│                                      │   • cube_cluster_login           │   │
│                                      │   • build_and_publish_to_apollo  │   │
│                                      │   • app_proxy / stop / status    │   │
│                                      │                                  │   │
│                                      │  Remote tools:                   │   │
│                                      │   (proxied to cube-cloud ──────) │   │
│                                      └────────┬──────────┬──────────────┘   │
│                                               │          │                   │
│   ~/.kube/config ◄── kubeconfig merge         │          │ WebSocket         │
│   ~/.cube-agent/  ◄── API key storage         │          │ /tunnel           │
│   localhost:PORT  ◄── app proxy listener      │          │                   │
└───────────────────────────────────────────────┼──────────┼───────────────────┘
                                                │          │
                               HTTPS + Bearer   │          │  TCP-over-WS
                               POST /mcp/        │          │  relay
                                                │          │
┌───────────────────────────────────────────────┼──────────┼───────────────────┐
│  CLOUD (AWS ECS)                              │          │                   │
│                                               ▼          ▼                   │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │  cube-cloud (FastAPI)                                                  │  │
│  │                                                                        │  │
│  │  ┌─────────────────────┐     ┌──────────────────────────────────────┐ │  │
│  │  │  Auth Middleware     │     │  RBAC (Cognito + Profiles)          │ │  │
│  │  │  • Validate API key  │────►│  • Cognito group → profile          │ │  │
│  │  │    (DynamoDB lookup)│     │  • Profile → Apollo credentials     │ │  │
│  │  │  • Attach profile   │     │    (AWS Secrets Manager)            │ │  │
│  │  └─────────────────────┘     └──────────────────────────────────────┘ │  │
│  │                                                                        │  │
│  │  Cloud tools (RBAC-gated):                                            │  │
│  │   Kubernetes        Apollo Environments    Apollo Products            │  │
│  │   • cube_list       • list_environments    • list_products            │  │
│  │   • cube_status     • create_environment   • compare_product_versions │  │
│  │   • kubectl_exec    • replicate_environment• list_release_channels    │  │
│  │   • app_list        • delete_environment   • get_product_releases     │  │
│  │                     • install/uninstall_    • add/remove_product_     │  │
│  │   Modules             entity                 to_release_channel       │  │
│  │   • list_modules    • entity_health        • set/remove_label         │  │
│  │   • install_module  • plan_details                                    │  │
│  │   • uninstall_module• update/enforce_      Change Requests            │  │
│  │   • update_module_    entity_config        • list_change_requests     │  │
│  │     variables       • reset_env_agents     • review_change_request    │  │
│  │                                                                        │  │
│  │   Secrets           Registry                                          │  │
│  │   • create_secret   • acr_get_token                                   │  │
│  │   • update_secret   • apollo_publish_manifest                         │  │
│  └────────────────────────┬───────────────────────────┬──────────────────┘  │
│                           │                           │                      │
│  ┌────────────────────────┴────────────────┐          │                      │
│  │  tbot sidecar (Teleport credentials)    │          │                      │
│  │  • IAM join method (Fargate identity)   │          │                      │
│  │  • identity/  → tsh commands            │          │                      │
│  │  • kube/{cluster}/ → kubeconfigs        │          │                      │
│  │  • app-{name}/ → app TLS certs          │          │                      │
│  └──────────┬─────────────────────────────┘          │                      │
│             │                                         │                      │
└─────────────┼─────────────────────────────────────────┼──────────────────────┘
              │                                         │
              │ tsh/tctl (short-lived certs)            │ GraphQL (OAuth2)
              ▼                                         ▼
┌──────────────────────────┐            ┌──────────────────────────────────┐
│  Teleport                │            │  Apollo                          │
│  edgescaleai.teleport.sh │            │  edgescaleai.palantirapollo.com  │
│                          │            │                                  │
│  • Kube clusters         │            │  • Environments & Modules        │
│  • App proxies           │            │  • Entities (Helm charts)        │
│  • SSH access            │            │  • Products & Release Channels   │
│  • Identity & TLS certs  │            │  • Change Requests               │
└──────────────────────────┘            │  • ACR (Docker + Helm registry)  │
                                        └──────────────────────────────────┘
```

### How it works

1. **Local tools** run directly on the developer's machine inside `cube-agent`. These handle authentication (Cognito browser login, API key storage), Docker builds, Helm chart packaging, kubeconfig merging, and app proxy tunnels.

2. **Cloud tools** are proxied from `cube-agent` to `cube-cloud` (FastAPI on AWS ECS) via MCP Streamable HTTP. Every request is authenticated with a Bearer API key validated against DynamoDB.

3. **RBAC** maps Cognito user groups to profiles. Each profile resolves to Apollo OAuth2 credentials stored in AWS Secrets Manager, scoping what the user can access.

4. **Teleport** access is provided by a `tbot` sidecar running alongside `cube-cloud` in the same ECS task. It uses IAM join to obtain short-lived certificates for Kubernetes clusters and app proxies — no `tsh` is needed on the developer's machine.

5. **Apollo** operations (environments, modules, entities, products, releases, change requests) go through a GraphQL API authenticated with per-profile OAuth2 client credentials.

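As a rough illustration of step 3, the group → profile → secret resolution might look like the sketch below. It assumes the first Cognito group selects the profile (the `groups[0]` behaviour mentioned under Known tech debt) and that a profile is named after its group. `resolve_profile` and `secret_id_for` are hypothetical names for this example, not functions from this repo.

```python
from __future__ import annotations

# Cognito groups listed in this README. Treating each group name as the
# profile name is an assumption made for this sketch.
KNOWN_GROUPS = {"admin", "lear-dev", "conagra-dev", "pltr-dev", "edgescaleai-dev"}


def resolve_profile(groups: list[str]) -> str:
    """Pick the profile from a user's Cognito groups (first group wins)."""
    if not groups:
        raise PermissionError("user is not in any Cognito group")
    profile = groups[0]
    if profile not in KNOWN_GROUPS:
        raise PermissionError(f"unknown group: {profile}")
    return profile


def secret_id_for(profile: str) -> str:
    """Secrets Manager name holding the profile's Apollo OAuth2 credentials."""
    return f"cube-mcp/profiles/{profile}"
```

For a user in `["lear-dev", "admin"]`, `secret_id_for(resolve_profile(...))` yields `cube-mcp/profiles/lear-dev`. Rejecting unknown or missing groups up front keeps a misconfigured user from silently falling through to another profile's credentials.
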
## Install

Requires [uv](https://docs.astral.sh/uv/) (or [pipx](https://pipx.pypa.io/)). The `npx` commands below also need Node.js.

```bash
# Add to Claude Code
claude mcp add cube -- npx @edgescaleai/cube-mcp
```

Or run directly:

```bash
npx @edgescaleai/cube-mcp    # via Node (calls uvx under the hood)
uvx edgescaleai-cube-mcp     # via uv directly
```

## Getting Started

```
You: "Log me in"             → agent_login_browser (opens browser for Cognito login)
You: "Connect to staging"    → cube_cluster_login (merges kubeconfig)
You: "Show Cube status"      → cube_status
```

## Tools

### Auth

| Tool | Description |
|------|-------------|
| `agent_login_browser` | Log in via browser (Cognito) |
| `agent_login` | Log in with an API key |
| `agent_logout` | Remove stored API key |
| `agent_status` | Check auth and connectivity |

### Kubernetes

| Tool | Description |
|------|-------------|
| `cube_list` | List available Cube clusters |
| `cube_status` | Get node status for a cluster |
| `cube_cluster_login` | Get kubeconfig for a cluster (merges into ~/.kube/config) |
| `kubectl_exec` | Run kubectl commands (server-side) |

### Apps

| Tool | Description |
|------|-------------|
| `app_list` | List Teleport apps |
| `app_proxy` | Start local proxy tunnel to an app |
| `app_proxy_stop` | Stop running proxies |
| `app_proxy_status` | Show proxy status |

### Build & Registry

| Tool | Description |
|------|-------------|
| `build_and_publish_to_apollo` | Build Docker image, package chart, push to ACR, publish manifest |
| `acr_get_token` | Get Apollo Container Registry token |
| `apollo_publish_manifest` | Publish a manifest YAML |

### Apollo Environments

| Tool | Description |
|------|-------------|
| `list_environments` | List/search Apollo environments |
| `create_environment` | Create a new environment with control plane |
| `replicate_environment` | Clone modules, entities, configs, and secrets |
| `delete_environment` | Delete an environment |
| `install_entity` | Install a Helm chart entity |
| `uninstall_entity` | Uninstall an entity |
| `entity_health` | Get entity health and activity status |
| `plan_details` | Get plan tasks, events, and error logs |
| `update_entity_config` | Update entity config overrides |
| `enforce_entity_config` | Force re-apply entity configuration |
| `reset_environment_agents` | Reset environment agents |

### Apollo Modules

| Tool | Description |
|------|-------------|
| `install_module` | Install a module on an environment |
| `uninstall_module` | Uninstall a module |
| `list_modules` | List modules in an environment |
| `update_module_variables` | Update module variables |

### Apollo Secrets

| Tool | Description |
|------|-------------|
| `create_secret` | Create a secret on an environment |
| `update_secret` | Update a secret value |

### Apollo Products & Release Channels

| Tool | Description |
|------|-------------|
| `list_products` | List available products |
| `compare_product_versions` | Compare versions of a product |
| `list_release_channels` | List release channels |
| `get_product_releases` | Get releases for a product |
| `add_product_to_release_channel` | Add a product to a release channel |
| `remove_product_from_release_channel` | Remove a product from a release channel |
| `set_label_on_product` | Set a label on a product |
| `remove_label_from_product` | Remove a label from a product |

### Apollo Change Requests

| Tool | Description |
|------|-------------|
| `list_change_requests` | List pending change requests |
| `review_change_request` | Approve or reject a change request |

## Local Development

```bash
# Install dependencies
uv sync --extra dev
uv pip install -e packages/cube-common -e packages/cube-cloud -e packages/cube-agent

# Run tests
uv run pytest packages/ -v

# Run cube-agent locally (for debugging)
uv run cube-agent
```

To test with Claude Code, point the MCP server at your local code:

```bash
claude mcp add cube-local -- uv run --directory /path/to/cube-mcp cube-agent
```

Reload after changes with `/mcp` in Claude Code.

## Contributing

1. Create a branch
2. Make changes
3. Run `uv run pytest packages/ -v`
4. Push and open a PR — tests run automatically
5. Merge to main: CI auto-publishes to PyPI and npm and auto-deploys to ECS when the changed paths match the workflow filters (see [CI/CD pipelines](#cicd-pipelines))

## Operations & Ownership

This section exists so anyone inheriting cube-mcp can run it end-to-end without tribal knowledge. If something here is wrong or missing, fix it in this README — don't keep the truth in your head.

### Runbooks (point Claude Code at these)

| Doc | Use it when |
|-----|-------------|
| [docs/admin.md](docs/admin.md) | Day-to-day RBAC, profiles, API keys, Teleport, infra layout, troubleshooting. The single most important file in this repo. |
| [docs/adding-new-user.md](docs/adding-new-user.md) | Onboarding a new user (Cognito create + group assignment + verification). |
| [docs/rbac-architecture.md](docs/rbac-architecture.md) | Deeper architectural background on the RBAC model. |
| [resources/teleport.md](resources/teleport.md) | User-facing Teleport quickstart (`tsh login`, `tsh kube ls`, etc.). |

These docs are written in runbook style — open Claude Code in this repo and ask it to "follow `docs/adding-new-user.md` to add `alice@example.com` to the `lear-dev` group" and it will execute the AWS CLI calls itself.

### Where AWS resources live

Everything is in **AWS account `992382448282`**, region **`us-west-2`**. All resources are managed by Terraform in `infra/terraform/` — change them there, not in the console.

| Resource | Name / ARN suffix | Terraform file |
|----------|-------------------|----------------|
| Cognito user pool | `cube-mcp-prod-*` (look up with `aws cognito-idp list-user-pools`) | `infra/terraform/cognito.tf` |
| Cognito groups | `admin`, `lear-dev`, `conagra-dev`, `pltr-dev`, `edgescaleai-dev` | `infra/terraform/cognito.tf` |
| API keys table (DynamoDB) | `cube-mcp-prod-api-keys` | `infra/terraform/dynamodb.tf` |
| Profile credentials (Secrets Manager) | `cube-mcp/profiles/<profile>` | `infra/terraform/secrets.tf` |
| tbot config (Secrets Manager) | `cube-mcp/tbot-config` | `infra/terraform/tbot.tf` |
| ECS task role | `cube-mcp-prod-ecs-task` | `infra/terraform/iam.tf` |
| ECS cluster | `cube-mcp-prod-cluster` (service: `cube-mcp-prod-service`) | `infra/terraform/ecs.tf` |
| CloudWatch Logs | `/ecs/cube-mcp-prod` (cube-cloud + tbot containers) | `infra/terraform/ecs.tf` |
| ALB + HTTPS listener | `cube.edgescaleai-cube.com` | `infra/terraform/alb.tf` |
| Route53 records | `edgescaleai-cube.com` zone (`Z0571857327BEMX2EHNZU`) | `infra/terraform/route53.tf` |
| ECR (cube-cloud + tbot images) | `cube-mcp/cube-cloud`, `cube-mcp/tbot` | `infra/terraform/ecr.tf`, `tbot.tf` |
| SES sender | `noreply@edgescaleai-cube.com` (identity `edgescaleai-cube.com`) | `infra/terraform/cognito.tf` |

**Outside AWS:**

- **Teleport:** `edgescaleai.teleport.sh` — bot is `cube-mcp-bot`, joins via IAM. See [docs/admin.md §4](docs/admin.md#4-teleport-access--rbac).
- **Apollo:** `edgescaleai.palantirapollo.com` — OAuth2 credentials per profile, stored in Secrets Manager.
- **PyPI / npm:** auto-published from `main` via CI (GitHub Actions). The npm shim under `npm/` is a thin wrapper that shells to `uvx edgescaleai-cube-mcp`.

### CI/CD pipelines

All deploys go through GitHub Actions in `.github/workflows/`. AWS auth uses OIDC — no long-lived AWS keys are stored in GitHub.

| Workflow | Trigger | What it does |
|----------|---------|--------------|
| `test.yml` | Reusable, called by others | Runs `pytest packages/ -v`. |
| `deploy.yml` | Push to `main` (paths: `packages/cube-cloud`, `cube-common`, `src/`, `infra/`) | Runs tests → `terraform apply` (only if `infra/terraform/` changed) → builds & pushes `cube-mcp/cube-cloud:latest` and `cube-mcp/tbot:latest` to ECR → seeds new profile secrets with admin creds → pushes `infra/tbot/tbot-config-prod.yaml` to Secrets Manager → `aws ecs update-service --force-new-deployment` and waits for `services-stable`. Also posts `terraform plan` as a PR comment for terraform PRs. |
| `deploy-dev.yml` | Push to `dev-main` branch | Same shape as `deploy.yml`, but targets `cube-mcp-dev-cluster` / `cube-mcp-dev-service` and uses TF state key `cube-cloud-dev/terraform.tfstate`. Images tagged `:dev`. Use this branch to test infra changes before merging to `main`. |
| `publish.yml` | Push to `main` (paths: `src/`, `npm/`, `pyproject.toml`) | Runs tests → bumps the patch version in `pyproject.toml` and `npm/package.json` → commits as `Release vX.Y.Z [skip ci]` and tags `vX.Y.Z` → publishes to PyPI (`twine`) and npm (`npm publish --access public`). |
| `sync-knowledge.yml` | Scheduled / manual | Syncs internal Claude Code knowledge from the `disco-projects` repo. Not on the critical deploy path — safe to ignore unless it breaks. |

**OIDC role** for all AWS-touching jobs: `arn:aws:iam::992382448282:role/github-actions`.

### Required CI secrets

Stored as GitHub repo secrets. To rotate any of these, generate a new value at the source and update via `gh secret set <NAME>` (or the repo Settings → Secrets UI).

| Secret | Where used | Source / how to rotate |
|--------|-----------|------------------------|
| `PYPI_TOKEN` | `publish.yml` | [pypi.org/manage/account/token/](https://pypi.org/manage/account/token/). Scope the token to the `edgescaleai-cube-mcp` project. |
| `NPM_TOKEN` | `publish.yml` | npm account → Access Tokens → Automation token (must allow publish). |
| `PUBLISH_DEPLOY_KEY` | `publish.yml` | SSH deploy key on this repo with **write** access (used to push the auto-bumped version commit + tag back to `main`). Regenerate: create a new SSH keypair, add the public key as a repo Deploy Key with write access, paste the private key into the secret. |
| `DISCO_PROJECTS_TOKEN` | `sync-knowledge.yml` | Fine-grained PAT with read access to `EdgescaleAI/disco-projects`. |
| `ANTHROPIC_API_KEY` | `sync-knowledge.yml` | console.anthropic.com — used by the disco knowledge sync job. |

AWS credentials are **not** stored as secrets — `deploy.yml`/`deploy-dev.yml` assume `arn:aws:iam::992382448282:role/github-actions` via OIDC. To grant a new repo or change permissions, update that role's trust policy and inline policies in AWS IAM.

### Terraform state

State is stored in S3. No DynamoDB lock table is currently configured; this relies on a single-writer assumption (CI is the only applier).

| Env | Bucket | Key |
|-----|--------|-----|
| Prod | `cube-mcp-terraform-state` (`us-west-2`) | `cube-cloud/terraform.tfstate` |
| Dev | `cube-mcp-terraform-state` (`us-west-2`) | `cube-cloud-dev/terraform.tfstate` |

**Local apply** (only if you really need to bypass CI):

```bash
cd infra/terraform
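# Pick ONE backend. Add -reconfigure when switching an already-initialised directory.
terraform init                                                         # prod (default state key)
terraform init -backend-config="key=cube-cloud-dev/terraform.tfstate"  # dev
terraform plan                                                         # for dev, add -var-file=environments/dev.tfvars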
```

The S3 bucket has versioning enabled, so a corrupt state file can be rolled back to a previous version via the AWS console.

### Dev environment

A parallel stack exists in the same AWS account for testing infra changes before merging to `main`.

| | Prod | Dev |
|---|-----|-----|
| Branch | `main` | `dev-main` |
| ECS cluster | `cube-mcp-prod-cluster` | `cube-mcp-dev-cluster` |
| ECS service | `cube-mcp-prod-service` | `cube-mcp-dev-service` |
| Image tag | `:latest` | `:dev` |
| TF state key | `cube-cloud/terraform.tfstate` | `cube-cloud-dev/terraform.tfstate` |
| TF var file | (defaults) | `infra/terraform/environments/dev.tfvars` |

Push to `dev-main` to validate, then PR `dev-main` → `main` (or cherry-pick) for prod.

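For scripting against both stacks, the table above can be captured as data. This is only a sketch: the `Stack` dataclass and `stack_for_branch` helper are hypothetical names, and the values are copied from the table rather than read from Terraform or the workflows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    branch: str
    cluster: str
    service: str
    image_tag: str
    state_key: str


# Values copied from the Prod/Dev table in this README.
STACKS = {
    "prod": Stack("main", "cube-mcp-prod-cluster", "cube-mcp-prod-service",
                  "latest", "cube-cloud/terraform.tfstate"),
    "dev": Stack("dev-main", "cube-mcp-dev-cluster", "cube-mcp-dev-service",
                 "dev", "cube-cloud-dev/terraform.tfstate"),
}


def stack_for_branch(branch: str) -> Stack:
    """Return the stack a pushed branch deploys to, or raise for other branches."""
    for stack in STACKS.values():
        if stack.branch == branch:
            return stack
    raise LookupError(f"branch {branch!r} does not deploy anywhere")
```

Calling `stack_for_branch("dev-main")` returns the dev stack; any feature branch raises, which mirrors the fact that only `main` and `dev-main` trigger deploys.
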
### Versioning & releases

- **Scheme:** patch-bump only, automated. `publish.yml` reads `pyproject.toml`, increments the patch, writes it back to both `pyproject.toml` and `npm/package.json`, commits as `github-actions[bot]`, and tags `vX.Y.Z`.
- **Major/minor bumps:** edit `pyproject.toml` manually in a PR. The next merge to `main` that touches `src/` / `npm/` / `pyproject.toml` will publish from there.
- **PyPI ↔ npm sync:** the same version number is used for both packages. The `npm/` shim is a thin wrapper that shells to `uvx edgescaleai-cube-mcp`.
- **What "main" means for users:** every merge that touches the publish paths ships to PyPI + npm within minutes. There is no staging release.

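The patch-bump rule can be sketched as follows. This is an illustration of the scheme described above, not the code `publish.yml` actually runs; `bump_patch` and `release_names` are hypothetical helpers, and the sketch assumes a plain `X.Y.Z` version with no pre-release or dev suffix.

```python
import re

# Plain X.Y.Z only; suffixes such as ".dev..." are rejected in this sketch.
_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump_patch(version: str) -> str:
    """Return the version with its patch component incremented by one."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a plain X.Y.Z version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"{major}.{minor}.{patch + 1}"


def release_names(version: str) -> tuple[str, str]:
    """Commit message and git tag for a release, using the README's naming."""
    return f"Release v{version} [skip ci]", f"v{version}"
```

For example, `bump_patch("0.3.29")` returns `"0.3.30"`, and `release_names("0.3.30")` gives the commit message `Release v0.3.30 [skip ci]` and the tag `v0.3.30`. A manual major or minor bump works with this rule because the next automated run simply increments whatever patch number it finds.
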
### Rollback playbook

| Surface | How to roll back |
|---------|------------------|
| **ECS deploy (cube-cloud or tbot)** | Images are tagged `:latest` only, so there is no previous-image tag to roll back to. Rollback path: `git revert` the offending commit on `main`; the next `deploy.yml` run will rebuild and redeploy. For an emergency, manually re-tag a known-good ECR image as `:latest` (`aws ecr batch-get-image` → `put-image`) and run `aws ecs update-service --force-new-deployment`. **Improvement worth making:** tag images with the git SHA so rollback is one CLI call. |
| **Terraform** | `git revert` the offending commit; `deploy.yml` re-runs `terraform apply` on next push. For state corruption, restore from S3 versioning on `cube-mcp-terraform-state`. |
| **PyPI / npm** | Published versions are effectively permanent: a PyPI version number can never be re-uploaded once used (yanking only hides a release from default resolution and can be undone), and npm allows unpublish within 72h. The standard path is to push a new patch version with the fix. Users on `npx @edgescaleai/cube-mcp` and `uvx edgescaleai-cube-mcp` pick up the new version on next invocation. |
| **Cognito user / group change** | Cognito has no native rollback. Reverse the change manually: re-add the user, restore group membership. Audit trail is in CloudTrail (`cube-mcp-prod-trail` if enabled — verify). |
| **Secrets Manager profile credentials** | Each secret has versioning. Roll back with `aws secretsmanager update-secret-version-stage --secret-id cube-mcp/profiles/<profile> --version-stage AWSCURRENT --move-to-version-id <prev_version_id> --remove-from-version-id <current_version_id>`. |

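The Secrets Manager rollback in the last row can be planned in code before it is applied. The sketch below is a hypothetical helper (`plan_rollback` is not part of this repo) that takes a mapping shaped like the `VersionIdsToStages` field of a `DescribeSecret` response and builds the arguments for moving `AWSCURRENT` back to the version labelled `AWSPREVIOUS`. It assumes the previous version still carries that label.

```python
from __future__ import annotations


def plan_rollback(version_ids_to_stages: dict[str, list[str]]) -> dict[str, str]:
    """Build the arguments that move AWSCURRENT back to the AWSPREVIOUS version."""
    current = previous = None
    for version_id, stages in version_ids_to_stages.items():
        if "AWSCURRENT" in stages:
            current = version_id
        if "AWSPREVIOUS" in stages:
            previous = version_id
    if current is None or previous is None:
        raise LookupError("secret needs both AWSCURRENT and AWSPREVIOUS versions")
    return {
        "VersionStage": "AWSCURRENT",
        "MoveToVersionId": previous,
        "RemoveFromVersionId": current,
    }
```

The returned keys match the parameter names of boto3's `update_secret_version_stage`, so the plan can be reviewed first and then passed along with `SecretId="cube-mcp/profiles/<profile>"`.
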
### Monitoring & alerting

> **Status: not yet configured.** No CloudWatch alarms, SNS topics, Sentry, or paging are set up at the time of this handoff. The only observability is CloudWatch Logs (`/ecs/cube-mcp-prod`) and the ECS service's own task health.

Day-to-day debugging:

```bash
# Live tail prod logs (cube-cloud + tbot)
aws logs tail /ecs/cube-mcp-prod --follow --region us-west-2

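# Show only log events whose text matches "tbot" (to select by stream instead, use --log-stream-name-prefix)
aws logs tail /ecs/cube-mcp-prod --filter-pattern "tbot" --follow --region us-west-2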

# Service health
aws ecs describe-services \
  --cluster cube-mcp-prod-cluster \
  --services cube-mcp-prod-service \
  --query 'services[0].{running:runningCount,desired:desiredCount,events:events[0:5]}'
```
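Until alarms exist, the output of the service-health query can be checked by a small script. The sketch below assumes the JSON shape produced by the `--query` expression above (`running`, `desired`, `events`). `service_is_healthy` is a hypothetical helper, and the health rule it applies is an assumption, not an existing check in this repo.

```python
import json


def service_is_healthy(summary_json: str) -> bool:
    """Interpret the JSON printed by the describe-services query.

    Healthy here means at least one task is desired and the running
    count has reached the desired count.
    """
    summary = json.loads(summary_json)
    running, desired = summary["running"], summary["desired"]
    return desired > 0 and running >= desired
```

A cron job or CI step could pipe the CLI output into this function and fail when it returns `False`, which gives a crude stand-in for the running-task alarm.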

**Recommended additions** for the next owner (none of these exist yet):
- CloudWatch alarm on ECS service `RunningTaskCount < 1` → SNS → email/PagerDuty.
- ALB 5xx alarm on the listener.
- DynamoDB throttling alarm on `cube-mcp-prod-api-keys`.
- Synthetic check that runs `agent_login_browser` end-to-end weekly.

### Stakeholders & contacts

> `TODO(owner)`: Vinayak (departing) is the only person with full context on these. Fill these in before he leaves so the next owner has someone to call.

| Role | Person | Contact |
|------|--------|---------|
| Internal product owner | `TODO` | |
| Eng escalation / on-call | `TODO` | |
| AWS account admin | `TODO` | |
| Teleport admin (can grant `tctl` access) | `TODO` | |
| Apollo admin (can issue per-tenant OAuth2 creds) | `TODO` | |
| Lear tenant POC | `TODO` | |
| Conagra tenant POC | `TODO` | |
| Palantir tenant POC | `TODO` | |
| EdgescaleAI internal dev (`edgescaleai-dev` profile) | `TODO` | |

### Cost & billing

> `TODO(owner)`: Add rough monthly run-rate and the account that gets billed.

- **AWS** (account `992382448282`): `TODO` — typical monthly spend, biggest line items, who pays the invoice.
- **Teleport** (`edgescaleai.teleport.sh`): `TODO` — plan tier, seat count, billing contact.
- **Apollo** (`edgescaleai.palantirapollo.com`): `TODO` — Palantir contract reference, who renews.
- **Domain** (`edgescaleai-cube.com`): `TODO` — registrar, expiry date, who has the login.
- **PyPI / npm**: free tiers, no recurring cost.

If costs need to be cut quickly: the dev ECS service (`cube-mcp-dev-service`) can be scaled to zero with no user impact.

### Access & disaster recovery

| Surface | Status / what the next owner should verify |
|---------|--------------------------------------------|
| GitHub repo admins | `TODO(owner)`: list everyone with admin/maintain on `EdgescaleAI/cube-mcp` and confirm branch protection on `main` (require PR + passing tests). |
| AWS root account access | `TODO(owner)`: confirm who holds the root credentials for account `992382448282` and that MFA is enforced. |
| Domain registrar login | `TODO(owner)`: who has the registrar account for `edgescaleai-cube.com`. |
| DynamoDB `cube-mcp-prod-api-keys` | Point-in-time recovery enabled (35-day window). |
| Secrets Manager | Per-secret versioning enabled by default — see Rollback playbook. |
| Terraform state | S3 bucket versioning enabled on `cube-mcp-terraform-state`. |
| ECR images | Lifecycle policy `TODO`: verify untagged-image cleanup so cost doesn't grow unbounded. |

### Known tech debt / planned work

> Update this list as you ship or de-scope items so the next owner sees current state, not folklore.

- **Phase 2 RBAC:** per-role tbot identity outputs are wired up, but only the `admin` profile actually has scoped Apollo creds. The Lear, Conagra, and Palantir profiles still resolve to admin Apollo credentials at runtime; see the note in `docs/admin.md` §3 ("Currently defined groups"). Scoping these is the biggest open item.
- **ECR `:latest`-only tagging** makes ECS rollback awkward (see Rollback playbook). Tag images with the git SHA in `deploy.yml`.
- **No alerting.** See Monitoring section.
- **Teleport role management is manual `tctl`.** The future plan is the Teleport Terraform provider — see [docs/admin.md → Future](docs/admin.md#future-teleport-terraform-provider).
- **No DynamoDB lock on Terraform state.** Acceptable while CI is the only applier; add a lock table if local applies become routine.
- **Cognito drift: `esaiadmin` group.** Surfaced in [#93](https://github.com/EdgescaleAI/cube-mcp/pull/93). The prod Cognito pool has an `esaiadmin` group that isn't defined in `infra/terraform/cognito.tf` — it was created in the console. Either `terraform import` it or delete it; don't leave it as drift.
- **Stale claim in `docs/rbac-architecture.md` line ~134** that "Cognito user → profile assignment is not yet implemented." This contradicts `docs/admin.md` and `login_flow.py` (the `groups[0]` resolution has shipped). Fix in a docs-only PR — also flagged in [#93](https://github.com/EdgescaleAI/cube-mcp/pull/93).
- **`TODO(owner)`:** anything else in flight at handoff time.

### First-day checklist for a new owner

1. **Read [docs/admin.md](docs/admin.md) end-to-end.** Then run `agent_login_browser` yourself so you've experienced the user flow.
2. **Get console access** to AWS account `992382448282` (`us-west-2`) with permissions for Cognito, DynamoDB, Secrets Manager, ECS, IAM, and CloudWatch Logs.
3. **Get Teleport admin** (`tctl`) on `edgescaleai.teleport.sh` — needed to manage the `cube-mcp-bot` and add roles. Ask the person listed in [Stakeholders & contacts](#stakeholders--contacts).
4. **Get Apollo admin** on `edgescaleai.palantirapollo.com` — needed to issue per-profile OAuth2 credentials when adding new tenants.
5. **Install [`cube-admin`](packages/cube-admin)** locally for API key management: `uv pip install -e packages/cube-admin`. This package is **internal-only and must never be published**.
6. **Verify CI is healthy** and you can rotate every secret in [Required CI secrets](#required-ci-secrets). Push a no-op commit to `dev-main` to confirm `deploy-dev.yml` succeeds end-to-end against the dev stack.
7. **Replace every `TODO(owner)` in this README** — Stakeholders, Cost, Access & DR, Tech debt. If you can't fill one in, write down who you asked and what they said. Don't leave them blank.

### Common ops tasks

- **Add a user:** [docs/adding-new-user.md](docs/adding-new-user.md).
- **Add a new tenant / role:** [docs/admin.md → Creating a new role / profile](docs/admin.md#creating-a-new-role--profile). Six steps — Cognito group, SM secret, Teleport role, bot role grant, tbot output, merge.
- **Revoke access:** [docs/admin.md → Revoking access](docs/admin.md#revoking-access).
- **Manage API keys:** `cube-admin keys {create,list,revoke}` — see [docs/admin.md → Managing API keys with cube-admin](docs/admin.md#managing-api-keys-with-cube-admin).
- **Debug a failing tool call:** check `aws logs tail /ecs/cube-mcp-prod --follow` for the cube-cloud and tbot containers.

### Repo layout cheatsheet

```
packages/
  cube-agent/    # Local MCP server (npx @edgescaleai/cube-mcp entrypoint)
  cube-cloud/    # FastAPI service on ECS — auth, RBAC, Apollo, Teleport relay
  cube-common/   # Shared types and helpers
  cube-admin/    # Internal-only API key CLI — DO NOT PUBLISH
infra/
  terraform/     # All AWS resources
  tbot/          # tbot Dockerfile, entrypoint, prod config
docs/            # Runbooks (start here)
resources/       # User-facing reference docs
npm/             # npm shim that calls uvx edgescaleai-cube-mcp
scripts/         # Release / publish helpers
```
