OpsFabric · Reliability Audit
{{ scanned_at_utc }} UTC Talk to us →
{% if demo_mode %}
Sample audit — synthetic data. This report was rendered from a baked-in fixture; no AWS resources were inspected. Run opsfabric-discovery audit --profile <your-profile> --regions all against your AWS account to see real coverage.
{% endif %}

Reliability audit

Reliability monitoring coverage

AWS account {{ account_id }}{% if account_alias %} ({{ account_alias }}){% endif %} · scanned {{ scanned_at_utc }} UTC

How to read this report

1

What was checked

Every service this AWS account is running — applications, background jobs, databases, queues — and the alarms configured to detect their failures.

2

What was found

Where alarms are missing or broken. A missing alarm means a failure will go undetected until customers report it. A broken alarm means you've already paid for monitoring that isn't actually working.

3

What to do

See "Where the business is exposed" for the categories of damage, then "Recommended next steps" for the actions to take this week.

{% if not scoreable %}
No scoreable resources found. We scanned {{ regions_scanned|length }} region{{ "" if regions_scanned|length == 1 else "s" }} but found no ECS services, Lambda functions, RDS / Aurora resources, or SQS queues indexed in AWS Resource Explorer. Either this account isn't running any workloads in those services, or Resource Explorer indexing is incomplete.
{% else %}

The risk

{{ threat_lead|safe }}

{{ threat_sub|safe }}

{% if remediation_callout %}

{{ remediation_callout|safe }}

{% endif %}
{% if first_audit_context %}

{{ first_audit_context|safe }}

{% endif %}
{{ required_missing }}
Monitoring gaps
checks that should exist but don't
{{ critical_resources }}
Services at risk
have at least one missing alarm
{{ p1_gap_count }}
Critical-impact gaps
cause customer-visible outages
{{ degraded_alarms_count }}
Broken alarms
exist but won't actually notify
{% if required_missing > 0 %} {% endif %}

Where the business is exposed

{% if total_gaps_count > 0 %} {{ total_gaps_count }} technical gaps grouped into the categories of damage they create. Switch tabs to drill into the per-check engineering detail. {% else %} No required-check gaps detected — coverage is strong. {% endif %}

{% if urgent_fixes %} {% endif %} {% if business_risks %}
{% for risk in business_risks %}
{{ risk.count }}
resource{{ 's' if risk.count != 1 else '' }} affected

{{ risk.title }}

{{ risk.consequence }}

{% for chip in risk.impact_chips %} {{ chip.label }}: {{ chip.value }} {% endfor %}
{% if risk.examples %}

Including

{% for ex in risk.examples %}
  • {{ ex.name }} {{ ex.kind }}
{% endfor %}
{% if risk.examples_more > 0 %}
  • + {{ risk.examples_more }} more
{% endif %}
{% endif %}
{% endfor %}
{% if technical_detail %}

Every gap, grouped first by category of damage, then by the service it affects, then by the specific check. Use this view to assign work to engineers — the same data lives in alarm-coverage-missing.json.

{% for cat in technical_detail %}

{{ cat.title }}

{{ cat.subtitle }}

{{ cat.resource_count }} service{{ 's' if cat.resource_count != 1 else '' }} · {{ cat.check_count }} check{{ 's' if cat.check_count != 1 else '' }}
{% for resource in cat.resources %}
{{ resource.name }} {{ resource.kind }} {{ resource.region }} {% if resource.urgency %} {{ resource.urgency }} {% endif %}
{% if resource.urgency_reason %}

{{ resource.urgency_reason }}

{% elif resource.usage_summary %}

{{ resource.usage_summary }}

{% elif not resource.data_available and resource.no_data_note %}

{{ resource.no_data_note }}

{% endif %}
{% for check in resource.checks %}
  • {{ check.title }} {{ check.severity }} {% if check.is_degraded and check.degraded_reason_text %} {{ check.degraded_reason_text }} {% endif %}
    {% if check.business_impact %}

    {{ check.business_impact }}

    {% endif %}
{% endfor %}
{% endfor %}
{% endfor %}
{% else %}
No technical gaps detected.
{% endif %}
{% else %}
No required-check gaps detected — strong baseline.
{% endif %}

Coverage by service type

Each row is a category of your infrastructure. Coverage = the percentage of standard monitoring checks (service up/down, error rates, capacity) that have working alarms today. Gap = the percentage of those checks that are missing or broken. Any row under 80% leaves a known class of incident undetected.

Resource type | Resources | Required | Met | Coverage | Gap
{% for row in coverage_by_kind %}
{{ row.label }} | {{ row.resource_count }} | {{ row.required_total }} | {{ row.required_met }} | {{ row.coverage_pct }}% | {{ row.gap_pct }}%
{% endfor %}
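
As a worked example of the arithmetic behind each row, the sketch below derives Coverage and Gap from the Required and Met counts. The field names mirror the table columns; the rounding rule, the handling of a row with zero required checks, and the sample numbers are assumptions for illustration, not a description of how the audit tool computes them.

```python
from dataclasses import dataclass


@dataclass
class CoverageRow:
    label: str
    required_total: int  # standard checks expected for this resource type
    required_met: int    # checks that have a working alarm today

    @property
    def coverage_pct(self) -> int:
        # Assumption: a row with no required checks reports 0 instead of dividing by zero.
        if self.required_total == 0:
            return 0
        # Assumption: percentages are rounded to the nearest whole number.
        return round(100 * self.required_met / self.required_total)

    @property
    def gap_pct(self) -> int:
        # Gap is the share of required checks without a working alarm.
        if self.required_total == 0:
            return 0
        return 100 - self.coverage_pct


# Hypothetical row: 20 required checks, 13 with working alarms.
row = CoverageRow(label="Lambda functions", required_total=20, required_met=13)
below_bar = row.coverage_pct < 80  # the 80% bar described above
print(row.label, row.coverage_pct, row.gap_pct, below_bar)
```

With these sample numbers the row shows 65% coverage and a 35% gap, so it falls under the 80% bar.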

How you compare

Industry baseline = the typical coverage we see across mid-market cloud engineering teams. Your account = the percentage of standard checks that have working alarms today. A 90+ score means most failures will be detected automatically; below 60 means most failures will be reported by customers first.

Industry baseline
{{ industry_baseline }}%
Your account
{{ score_int }}%
{% endif %}

Recommended next steps

In priority order. The first action is the highest-leverage thing to do this week; the last is the engagement that closes the loop.

{% for step in next_steps %}
  1. {{ step|safe }}
{% endfor %}

What happens next

Two paid fabrics close this loop.

DiscoveryFabric is the free audit you just read. AlarmFabric and OpsFabric are the paid multi-agent fabrics that act on what it found.

AlarmFabric · paid

Close these {{ required_missing }} monitoring gaps in your account. One deploy.

  • We deploy the missing alarms using a narrowly scoped role you grant, limited to creating alarms, with no broader access required
  • Each alarm is connected to your team's on-call tool (Slack, PagerDuty, Opsgenie, email) so it actually notifies someone
  • Every alarm is reversible — your team can remove the entire set with one command if needed
  • We re-run the audit on a schedule so newly-deployed services don't quietly drift below the bar
Book a demo →

OpsFabric · paid

When the alarms fire, we run the incident — end to end.

  • Your team gets the incident in Slack with the failing service, recent logs, and the most likely cause already analyzed
  • Suggested fix arrives at the same time — your team approves it, or you let it run automatically once you've built trust
  • The ticket is created and tracked in Jira; the post-mortem is drafted in Confluence — no after-hours documentation work
  • Three trust levels: AI suggests / your team approves / fully autonomous — your call, per category of incident
Book a demo →

Pilot pricing is available during the first customer cohort. Contact: vaishal2611@gmail.com