Metadata-Version: 2.4
Name: om-health-check
Version: 0.4.0
Summary: Automated MongoDB cluster health checks via the Ops Manager API — for incident triage and proactive review.
Author: Frank Snow
Maintainer: Frank Snow
License: 
                                         Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to the Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by the Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding any notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           Copyright 2026 Frank Snow
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: Homepage, https://github.com/fsnow/om-health-check
Project-URL: Documentation, https://github.com/fsnow/om-health-check#readme
Project-URL: Repository, https://github.com/fsnow/om-health-check.git
Project-URL: Issues, https://github.com/fsnow/om-health-check/issues
Keywords: mongodb,ops-manager,health-check,monitoring,incident-response,database
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Database
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opsmanager>=0.5.3
Requires-Dist: jinja2>=3.0
Requires-Dist: packaging>=21.0
Requires-Dist: pyyaml>=6.0
Dynamic: license-file

# om-health-check

A CLI tool that queries the MongoDB Ops Manager API to produce a structured health assessment of MongoDB clusters. Designed for use during incidents and proactive health checks.

## What it does

Runs 9 categories of checks against one or more clusters via the Ops Manager API:

1. **Connectivity & Infrastructure** — API reachability, node status, agent status, active alerts, network throughput
2. **Compute Resources** — CPU (user, iowait, process), memory, swap; deeper CPU breakdown when issues detected
3. **Disk Resources** — read/write latency, IOPS, partition space, iowait correlation
4. **Cache Resources** — WiredTiger cache used bytes, dirty bytes, cache read/write rates
5. **Database Activity & Workload** — query targeting, scan and order, opcounters, document metrics, execution times, global lock queues, Performance Advisor
6. **Replication** — replication lag, oplog window, oplog rate
7. **Connections** — connection count, zero-connection detection, connection storm correlation
8. **Backup** — backup config status, snapshot schedule adherence, capture lag
9. **Version Information** — version consistency across nodes, known-bad version detection (CVEs)

Each metric is compared against both an absolute threshold and a 1-week baseline (same day-of-week, same hour) to reduce false positives from normal workload variance.

## Installation

```bash
pip install om-health-check
```

Requires Python 3.9+.

## API key permissions

The API key must have the **Project Read Only** role on each project being checked. This provides read access to deployments, measurements, alerts, agents, and backup status — covering 8 of the 9 check sections.

No write permissions are required. The tool never modifies any Ops Manager configuration.

### Performance Advisor section — additional role required

The **Performance Advisor** section calls endpoints that require the **Project Data Access Read Only** role (or higher). Per the [Ops Manager docs](https://www.mongodb.com/docs/ops-manager/current/reference/api/performance-advisor/), the allowed roles are: Project Owner, Project Data Access Admin, Project Data Access Read/Write, or Project Data Access Read Only.

The minimum role granting this access (Project Data Access Read Only) also grants the holder read access to database contents. There is no narrower read-only-observability role for Performance Advisor in Ops Manager.

For security-conscious deployments where most personnel should not have database read access, run the tool with a **Project Read Only** key. The Performance Advisor section will report an INFO message — *"Performance Advisor access denied — requires Project Data Access Read Only role or higher"* — and the other 8 sections work normally. To minimize API load when access is denied, the script makes only one Performance Advisor call per cluster and reuses the message for the remaining hosts.

If the API key lacks sufficient permissions, affected checks report a clear message indicating which permission is missing rather than failing the whole report.
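
The single-call behavior for denied Performance Advisor access could be sketched roughly as below. This is an illustration, not the tool's actual code: `AccessDenied`, `advisor_results`, and the `fetch` callable are hypothetical names, and `DENIED_MESSAGE` is an abbreviated stand-in for the real INFO message.

```python
from typing import Callable, Dict, List

# Abbreviated stand-in for the tool's INFO message.
DENIED_MESSAGE = "Performance Advisor access denied"


class AccessDenied(Exception):
    """Hypothetical error for a request rejected for lack of role."""


def advisor_results(hosts: List[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Query each host, but stop calling the API after the first denial."""
    results: Dict[str, str] = {}
    denied = False
    for host in hosts:
        if not denied:
            try:
                results[host] = fetch(host)
                continue
            except AccessDenied:
                denied = True  # remember the denial for the remaining hosts
        # Reached on the denied call and for every host after it.
        results[host] = DENIED_MESSAGE
    return results
```

With a key that is always denied, `fetch` is called once and every host in the cluster receives the same message.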

## Usage

```bash
export OPS_MANAGER_USER=your-public-key
export OPS_MANAGER_API_KEY=your-private-key

om-health-check --om-url https://ops-manager.example.com --project "My Project"
```

### Options

| Flag | Required | Description |
|---|---|---|
| `--om-url` | Yes | Ops Manager base URL |
| `--project` | Yes | Project name (repeatable for multiple projects) |
| `--cluster` | No | Cluster name filter; omit to check all clusters in the project(s) |
| `--format` | No | `txt` (default), `json`, `html`, or a comma-separated list (e.g. `txt,html`) |
| `--config` | No | Path to YAML config file for threshold overrides |

### Output formats

- **txt** — plain text, suitable for pasting into incident tickets
- **json** — machine-readable, for downstream tooling or dashboards
- **html** — self-contained HTML with color-coded status and collapsible sections

### Examples

Check all clusters in a project:
```bash
om-health-check --om-url https://om.example.com --project "Production"
```

Check a specific cluster across two projects, output as text and HTML:
```bash
om-health-check --om-url https://om.example.com \
  --project "Production" --project "Staging" \
  --cluster "rs0" \
  --format txt,html
```

## Threshold configuration

Every metric has a default threshold. To override defaults, create a YAML config file.

The tool looks for config in this order:
1. Path passed via `--config`
2. `OM_HEALTH_CHECK_CONFIG` environment variable
3. `~/.om-health-check.yaml`

Only metrics you want to change need to be specified. Unspecified fields retain their defaults.

```yaml
thresholds:
  CONNECTIONS:
    red: 30000
    warn: 25000
  SYSTEM_NORMALIZED_CPU_USER:
    red: 90.0
    mode: "or"
  DISK_PARTITION_LATENCY_READ:
    red: 15.0
    warn: 8.0
```
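
As a rough sketch of the lookup order and the "unspecified fields retain their defaults" rule, the logic might look like the following. The function names are hypothetical, the existence check on the home-directory file is an assumption, and the default values in the usage note are invented for illustration (the real defaults are in `examples/all-thresholds.yaml`).

```python
from pathlib import Path
from typing import Any, Dict, Mapping, Optional

Thresholds = Dict[str, Dict[str, Any]]


def resolve_config_path(
    cli_path: Optional[str], env: Mapping[str, str], home: Path
) -> Optional[Path]:
    """Pick the config file: --config, then the env var, then the home file."""
    if cli_path:
        return Path(cli_path)
    env_path = env.get("OM_HEALTH_CHECK_CONFIG")
    if env_path:
        return Path(env_path)
    default = home / ".om-health-check.yaml"
    # Assumption: the home-directory file is optional and skipped if absent.
    return default if default.exists() else None


def merge_thresholds(defaults: Thresholds, overrides: Thresholds) -> Thresholds:
    """Overlay user overrides field by field, leaving other fields untouched."""
    merged = {metric: dict(fields) for metric, fields in defaults.items()}
    for metric, fields in overrides.items():
        merged.setdefault(metric, {}).update(fields)
    return merged
```

For example, merging `{"CONNECTIONS": {"red": 30000}}` over an illustrative default of `{"CONNECTIONS": {"red": 20000, "warn": 15000}}` changes only `red` and keeps `warn` at 15000.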

See [`examples/all-thresholds.yaml`](examples/all-thresholds.yaml) for a reference file listing every metric with its built-in defaults — copy, subset, and edit to produce a custom config.

See [`examples/low-thresholds.yaml`](examples/low-thresholds.yaml) for a smoke-test config with aggressively low thresholds designed to trigger RED on a healthy cluster — useful for verifying the tool runs end-to-end.

### Threshold fields

| Field | Type | Description |
|---|---|---|
| `red` | float | Value that triggers RED status |
| `warn` | float | Value that triggers WARN status |
| `direction` | string | `"above"` (RED when value >= red) or `"below"` (RED when value <= red) |
| `deviation` | float | Baseline multiplier (e.g. `3.0` = RED if current >= 3x baseline) |
| `mode` | string | How threshold and baseline interact (see below) |

### Evaluation modes

| Mode | Behavior |
|---|---|
| `absolute` | RED if value crosses threshold. Baseline is informational. |
| `baseline` | RED only if value deviates from baseline by the configured multiplier. No absolute threshold. |
| `and` | RED only if value crosses threshold AND deviates from baseline. Suppresses false positives from stable elevated values. |
| `or` | RED if value crosses threshold OR deviates from baseline. Catches both absolute danger and unusual spikes. |
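
The four modes can be read as boolean combinations of two tests: "crossed the threshold" and "deviated from baseline". The sketch below is one plausible reading of the tables above, not the tool's implementation; the function names are hypothetical, and it assumes the deviation test only applies upward and needs a positive baseline.

```python
def crosses(value: float, limit: float, direction: str = "above") -> bool:
    """True when value is at or past limit in the configured direction."""
    return value >= limit if direction == "above" else value <= limit


def is_red(
    value: float,
    baseline: float,
    red: float,
    deviation: float,
    mode: str,
    direction: str = "above",
) -> bool:
    """Decide RED for one metric under the given evaluation mode."""
    over_threshold = crosses(value, red, direction)
    # Assumption: deviation is checked upward only, against a positive baseline.
    deviates = baseline > 0 and value >= deviation * baseline
    if mode == "absolute":
        return over_threshold
    if mode == "baseline":
        return deviates
    if mode == "and":
        return over_threshold and deviates
    if mode == "or":
        return over_threshold or deviates
    raise ValueError(f"unknown mode: {mode!r}")
```

A CPU value of 95 against a baseline of 90, with `red: 90.0` and `deviation: 3.0`, is RED under `or` but not under `and`, which is the "stable elevated value" case the `and` mode is meant to suppress.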

## Baseline comparison

Current metric values are compared against the same hour, same day of week, one week prior. This accounts for recurring workload patterns (business hours vs nights vs weekends) and avoids flagging normal variance as anomalous.

**Current values** are fetched at PT1M granularity over the past hour and averaged, producing a 1-hour rolling average. This sidesteps a gap in Ops Manager's PT1H rollup: for rate-based metrics (CPU %, network bytes/sec), the current hour's value is not populated until the hour boundary.

**Baseline values** are fetched at PT1H granularity from the 1-hour window one week ago. Ops Manager retains hourly data for 2 months by default.

Comparing two hourly averages keeps the check apples-to-apples and resistant to single-minute spikes.
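
To make the two windows concrete, here is a minimal sketch of how they could be computed. The helper names are hypothetical; it assumes timezone-aware UTC timestamps, a baseline window that is the current window shifted by exactly seven days, and that missing minute samples arrive as `None`.

```python
from datetime import datetime, timedelta, timezone
from statistics import fmean
from typing import Iterable, Optional, Tuple

Window = Tuple[datetime, datetime]


def current_window(now: datetime) -> Window:
    """The past hour, to be fetched at PT1M granularity."""
    return now - timedelta(hours=1), now


def baseline_window(now: datetime) -> Window:
    """The same one-hour window exactly one week earlier (PT1H granularity)."""
    start, end = current_window(now)
    week = timedelta(weeks=1)
    return start - week, end - week


def rolling_average(samples: Iterable[Optional[float]]) -> Optional[float]:
    """Average the minute samples, ignoring gaps; None if nothing was reported."""
    values = [v for v in samples if v is not None]
    return fmean(values) if values else None


if __name__ == "__main__":
    now = datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc)
    print(current_window(now))
    print(baseline_window(now))
```

Shifting by a whole week keeps both the weekday and the hour of day aligned in UTC, which is what makes the comparison like-for-like.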

### Graceful degradation when data is missing

The tool is resilient to gaps in OM data:

- **No current data available** → reported as INFO (e.g., no read activity means no `DISK_PARTITION_LATENCY_READ` sample)
- **No baseline data available** (cluster is less than 1 week old) → behavior depends on evaluation mode:
  - `absolute` — works unchanged (baseline is informational)
  - `baseline` — reports INFO with the current value and "no baseline yet (cluster < 1 week old)"
  - `and` / `or` — degrades to threshold-only evaluation, with a "no baseline yet" note appended to the message
- **Metric not exposed by the OM API version** → batched fetch falls back to per-metric calls; unavailable metrics are summarized once on stderr
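
The missing-data rules above amount to a small decision table. One way to express it, purely as an illustration with a hypothetical function name and return shape, is to map the configured mode to the mode actually applied, where `None` means "report INFO instead of evaluating":

```python
from typing import Optional, Tuple


def effective_mode(
    mode: str, has_current: bool, has_baseline: bool
) -> Tuple[Optional[str], str]:
    """Return (mode to apply, note); a mode of None means report INFO."""
    if not has_current:
        return None, "no current data"
    if has_baseline or mode == "absolute":
        # Nothing is missing that this mode depends on.
        return mode, ""
    if mode == "baseline":
        return None, "no baseline yet (cluster < 1 week old)"
    # "and" / "or" fall back to threshold-only evaluation.
    return "absolute", "no baseline yet"
```

For a cluster younger than one week, `effective_mode("or", True, False)` yields `("absolute", "no baseline yet")`, while `effective_mode("baseline", True, False)` yields INFO.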

## Status rollup

Each check produces one of four statuses:

- `GREEN` — healthy
- `WARN` — approaching threshold
- `RED` — threshold crossed or baseline significantly deviated
- `INFO` — informational only (missing data, advisory alerts, degraded evaluation)

Section, cluster, and overall statuses each take the worst status among their children, **with one important rule: `INFO` never bubbles up**. A cluster with only INFO items still reports overall GREEN. This keeps the headline color honest about operational health without hiding informational details.

Certain advisory alerts (e.g., `HOST_SECURITY_CHECKUP_NOT_MET`, which commonly fires as a false positive for deployments using external auth like LDAP) are classified as INFO so they are visible but do not color the overall report.
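
The rollup rule is small enough to sketch in a few lines. The names below are hypothetical, and treating an empty list of children as GREEN is an assumption rather than documented behavior.

```python
from typing import Iterable

# Severity ranking for statuses that participate in the rollup.
SEVERITY = {"GREEN": 0, "WARN": 1, "RED": 2}


def roll_up(statuses: Iterable[str]) -> str:
    """Return the worst child status, skipping INFO entirely."""
    worst = "GREEN"  # assumption: no children (or only INFO) means GREEN
    for status in statuses:
        if status == "INFO":
            continue  # INFO never bubbles up
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst
```

The same function can be applied at each level: checks roll up into a section, sections into a cluster, and clusters into the overall status.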

## Monitoring agents

Ops Manager uses leader election for monitoring agents: exactly one agent per project is `ACTIVE`, the rest are `STANDBY` (ready to take over if the active agent fails). The tool reports a single GREEN "Agent status" check when at least one agent is ACTIVE, and RED only if no ACTIVE agent exists (which means monitoring data is not being collected).
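
Expressed as code, the agent rule reduces to a single membership test. This is a sketch with a hypothetical function name; it takes the agent state names as plain strings.

```python
from typing import Iterable


def agent_status(states: Iterable[str]) -> str:
    """GREEN when any agent is ACTIVE; RED when none is (no data collected)."""
    return "GREEN" if any(state == "ACTIVE" for state in states) else "RED"
```

A project with one `ACTIVE` and two `STANDBY` agents is GREEN; a project with only `STANDBY` agents, or none at all, is RED.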

## Dependencies

- [opsmanager](https://pypi.org/project/opsmanager/) — Ops Manager API client
- [Jinja2](https://pypi.org/project/Jinja2/) — HTML report templating
- [packaging](https://pypi.org/project/packaging/) — version comparison
- [PyYAML](https://pypi.org/project/PyYAML/) — config file parsing

## License

Apache 2.0
