Metadata-Version: 2.4
Name: opsfabric-discovery
Version: 0.2.0
Summary: Read-only AWS reliability audit. Alarm coverage assessment for ECS, Lambda, RDS, Aurora, and SQS.
Author: Vaishal Shah
License: MIT
License-File: LICENSE
Keywords: alarms,audit,aws,cloudwatch,observability,ops,reliability,sre
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.11
Requires-Dist: boto3>=1.34.0
Requires-Dist: jinja2>=3.1.4
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: weasyprint>=62
Description-Content-Type: text/markdown

# opsfabric-discovery

A read-only AWS reliability audit you run on your own laptop. It produces an executive PDF that assesses CloudWatch alarm coverage across ECS, Lambda, RDS, Aurora, and SQS workloads against the OpsFabric reliability baseline.

## See what your audit would look like (no AWS needed)

[**Download a sample report (PDF, ~68 KB)**](docs/sample-audit.pdf)

Or run it yourself in 30 seconds without any AWS credentials:

```bash
pip install opsfabric-discovery        # see the Install section below
opsfabric-discovery audit --demo
# → out/audit-demo.pdf
```

`--demo` runs against a baked-in synthetic dataset that exercises every feature of the audit (DEGRADED alarm detection, ALB→ECS bridge, critical-gap cards, coverage breakdown). No AWS calls, no credentials needed. Same matching engine, same PDF — only the input is fake.

## What it does

- Discovers AWS resources via Resource Explorer 2 across one or all enabled regions.
- Maps CloudWatch alarms to those resources using a five-strategy matcher (exact dimensions, ALB target-group bridge for ECS, namespace + partial dimensions, log-group → metric-filter linkage, naming heuristic).
- Detects alarms that exist but won't notify (actions disabled, no SNS target, or state `INSUFFICIENT_DATA`) and surfaces them as DEGRADED. DEGRADED alarms don't count toward coverage.
- Scores required-check coverage against a baseline pack (`discovery_fabric/data/alarm_pack.yaml`).
- Renders an executive PDF (3 pages, McKinsey-style) plus JSON appendices for every artifact.
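
The DEGRADED rule and its effect on coverage can be illustrated with a minimal sketch. This is not the package's implementation: the function names `is_degraded` and `coverage_ratio`, the check names, and the scoring formula (covered checks divided by required checks) are assumptions for illustration. The alarm fields `ActionsEnabled`, `AlarmActions`, and `StateValue` are the ones CloudWatch returns from `DescribeAlarms`.

```python
from typing import Any


def is_degraded(alarm: dict[str, Any]) -> bool:
    """Return True when an alarm exists but would not notify anyone."""
    actions_disabled = not alarm.get("ActionsEnabled", False)
    # An SNS topic ARN has "sns" as its third colon-separated field.
    has_sns_target = any(
        arn.split(":")[2:3] == ["sns"] for arn in alarm.get("AlarmActions", [])
    )
    insufficient_data = alarm.get("StateValue") == "INSUFFICIENT_DATA"
    return actions_disabled or not has_sns_target or insufficient_data


def coverage_ratio(
    required_checks: list[str],
    matched: dict[str, list[dict[str, Any]]],
) -> float:
    """Fraction of required checks backed by at least one non-DEGRADED alarm.

    Assumed formula for illustration; the real scoring may differ.
    """
    if not required_checks:
        return 0.0
    covered = sum(
        1
        for check in required_checks
        if any(not is_degraded(alarm) for alarm in matched.get(check, []))
    )
    return covered / len(required_checks)


if __name__ == "__main__":
    healthy = {
        "ActionsEnabled": True,
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
        "StateValue": "OK",
    }
    silent = {**healthy, "AlarmActions": []}  # exists, but notifies nobody
    # Hypothetical check names: one covered, one backed only by a DEGRADED alarm.
    print(coverage_ratio(["cpu", "memory"], {"cpu": [healthy], "memory": [silent]}))
```

In this sketch a check backed only by DEGRADED alarms scores the same as a check with no alarm at all, which mirrors the statement that DEGRADED alarms don't count toward coverage.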

## Trust statement

- **Read-only.** Calls only AWS describe / list APIs. Never creates, modifies, or deletes any resource.
- **Runs on your laptop.** No telemetry, no phone-home. Your data never leaves your machine.
- **Source is auditable.** Open this directory's Python files — every AWS call is visible.
- **Customer IAM policy** for cross-account audits lives at `discovery-fabric/docs/customer-iam-policy.json` (in the parent monorepo for stage 1; bundled inside the package in stage 2).

## Install

```bash
pip install opsfabric-discovery
opsfabric-discovery --help
```

## Quickstart

Once installed, from any directory:

```bash
# Audit a profile from ~/.aws/credentials
opsfabric-discovery audit --profile prod --regions all --account-alias acme-prod

# Or via STS assume-role (cross-account)
opsfabric-discovery audit \
  --assume-role-arn arn:aws:iam::CUSTOMER_ACCOUNT:role/OpsFabricAuditor \
  --external-id agreed-secret \
  --regions all \
  --account-alias acme-prod

# Outputs land in ./out/ by default; override with --output-dir
ls out/
# audit-<account-id>-<YYYYMMDD>.pdf
# alarm-coverage-score.json
# alarm-coverage-missing.json
# resource-mapping.json
# all-resources.json
# audit-meta.json
```
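
For readers who want to see what the cross-account path involves, here is a hedged sketch of an STS assume-role exchange with an external ID. It is not taken from the package source: the helper name `assume_role_session_kwargs` and the default session name are hypothetical. The `assume_role` parameters (`RoleArn`, `RoleSessionName`, `ExternalId`) and the `Credentials` response keys are the standard STS ones.

```python
from typing import Any


def assume_role_session_kwargs(
    sts_client: Any,
    role_arn: str,
    external_id: str,
    session_name: str = "opsfabric-audit",  # hypothetical default
) -> dict[str, str]:
    """Exchange a role ARN and external ID for temporary credentials.

    `sts_client` is expected to behave like boto3.client("sts").
    The returned dict can be passed as keyword arguments to boto3.Session.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        ExternalId=external_id,
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

With boto3 installed, usage would look like `boto3.Session(**assume_role_session_kwargs(boto3.client("sts"), role_arn, external_id))`. Passing the client in as an argument keeps the helper easy to exercise with a stub, without touching AWS.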

## How it works

For every required check in the alarm pack and every discovered resource, the matcher tries five strategies in priority order. First hit wins:

1. **Exact dimension match** — alarm dimensions equal the resource's canonical dimensions (e.g. `ClusterName + ServiceName` for ECS). HIGH confidence.
2. **ALB target-group bridge** — alarm uses `TargetGroup` dimension; we cross-reference back to the ECS service via its registered load-balancer attachments. Exact-ARN equality. HIGH.
3. **Namespace + partial dimension match** — alarm is in the resource's expected namespace and at least one dimension matches. MEDIUM.
4. **Metric-filter → log-group linkage** — alarm metric was published by a metric filter on one of the resource's log groups. HIGH.
5. **Naming heuristic** — resource name appears as substring in alarm name. LOW (last-resort).
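
The "first hit wins" ordering can be sketched as a prioritized list of predicates. This is an illustrative sketch, not the package's matcher: the dict shapes for alarms and resources, the `Match` type, and all function names are assumptions. Only strategies 1, 3, and 5 are implemented here, because the ALB bridge and metric-filter linkage need extra lookups (load-balancer attachments, log groups); in a full version they would slot in at positions 2 and 4.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Match:
    strategy: str
    confidence: str


def exact_dimensions(alarm: dict, resource: dict) -> bool:
    """Strategy 1: alarm dimensions equal the resource's canonical dimensions."""
    dims = alarm.get("dimensions") or {}
    return bool(dims) and dims == resource.get("dimensions")


def namespace_partial(alarm: dict, resource: dict) -> bool:
    """Strategy 3: same namespace and at least one matching dimension."""
    if alarm.get("namespace") != resource.get("namespace"):
        return False
    resource_dims = resource.get("dimensions") or {}
    alarm_dims = alarm.get("dimensions") or {}
    return any(resource_dims.get(key) == value for key, value in alarm_dims.items())


def naming_heuristic(alarm: dict, resource: dict) -> bool:
    """Strategy 5: resource name appears as a substring of the alarm name."""
    name = resource.get("name", "")
    return bool(name) and name in alarm.get("name", "")


# Priority order matters: the first predicate that returns True decides the result.
STRATEGIES: list[tuple[str, str, Callable[[dict, dict], bool]]] = [
    ("exact_dimensions", "HIGH", exact_dimensions),
    ("namespace_partial", "MEDIUM", namespace_partial),
    ("naming_heuristic", "LOW", naming_heuristic),
]


def match_alarm(alarm: dict, resource: dict) -> Optional[Match]:
    """Try each strategy in priority order and return the first hit, if any."""
    for strategy, confidence, predicate in STRATEGIES:
        if predicate(alarm, resource):
            return Match(strategy, confidence)
    return None
```

Ordering the strategies from most to least precise means an alarm that would satisfy both the exact match and the naming heuristic is always reported at the higher confidence.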

Per-region scoping prevents cross-region false positives. Per-region failures (Resource Explorer 2 not enabled, IAM gap, throttling beyond retry) are skipped and logged rather than aborting the audit.
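
The skip-and-log behavior for failing regions can be sketched as follows. This is a simplified illustration under assumptions, not the package's code: `discover_all_regions`, the `discover_region` callback, and the logger name are hypothetical, and a real implementation would likely catch narrower exception types.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("audit_sketch")  # hypothetical logger name


def discover_all_regions(
    regions: Iterable[str],
    discover_region: Callable[[str], list[dict]],
) -> tuple[list[dict], dict[str, str]]:
    """Run discovery per region; record failures instead of aborting.

    Returns the combined resources (each tagged with its region, so later
    matching can stay region-scoped) and a map of skipped region -> reason.
    """
    resources: list[dict] = []
    skipped: dict[str, str] = {}
    for region in regions:
        try:
            found = discover_region(region)
        except Exception as exc:  # broad on purpose in this sketch
            logger.warning("Skipping region %s: %s", region, exc)
            skipped[region] = str(exc)
            continue
        resources.extend({**resource, "region": region} for resource in found)
    return resources, skipped
```

Returning the skipped regions alongside the results lets a report state which regions were not audited, rather than silently presenting partial coverage as complete.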

## Re-syncing from the monorepo source

When the monorepo at `discovery-fabric/` changes, run:

```bash
./bin/sync-from-monorepo.sh
```

This re-copies the package files, re-applies the two hand-edits (alarm_pack path; PDF CTA copy), and re-creates `cli.py` from the latest `main.py`. The script fails loudly if the source files have drifted in ways that break the patches — fix-up is then manual.

## What's NOT in this build

- The Streamlit UI (lives in the monorepo only — internal tool).
- The Next.js dashboard (lives in `discovery-fabric-ui/` — internal tool).
- AlarmFabric integration (separate product, closed-source).

The customer-facing surface for this stage is the CLI + the PDF.
