Metadata-Version: 2.4
Name: query-doctor
Version: 0.1.1
Summary: Local-first Big Data query diagnostics for Apache Impala.
Home-page: https://github.com/alexandrefimov/Query-Doctor
Author: Aleksandr Efimov
Maintainer: Aleksandr Efimov
License-Expression: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/alexandrefimov/Query-Doctor
Project-URL: Repository, https://github.com/alexandrefimov/Query-Doctor
Project-URL: Issues, https://github.com/alexandrefimov/Query-Doctor/issues
Project-URL: Documentation, https://github.com/alexandrefimov/Query-Doctor/blob/main/docs/README.md
Keywords: apache-impala,big-data,cloudera-manager,diagnostics,lakehouse,query-analysis,sql
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Database
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pre-commit>=3.5; extra == "dev"
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: ruff>=0.8; extra == "dev"
Provides-Extra: e2e
Requires-Dist: playwright>=1.48; extra == "e2e"
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Query Doctor

Last reviewed: 2026-05-13

Language: English | [Russian](docs/i18n/ru/README.md)

[![Safety CI](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/ci.yml)
[![Package CI](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/package.yml/badge.svg?branch=main)](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/package.yml)
[![Docs CI](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/docs.yml/badge.svg?branch=main)](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/docs.yml)
[![CodeQL](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/codeql.yml/badge.svg?branch=main)](https://github.com/alexandrefimov/Query-Doctor/actions/workflows/codeql.yml)

Query Doctor is a local-first Apache Impala query diagnostic tool. It helps data
engineers explain slow, suspicious, or resource-heavy queries without pasting
raw operational data into a chat tool.

It runs near the operator's own credentials, collects bounded read-only context
from Cloudera Manager or direct Impala daemon endpoints, extracts deterministic
facts in Python, and can generate validated reports without treating an LLM as a
source of truth. Trusted reports default to English; Russian output goes through
the same kind of language-specific prompt, normalizer, and validator boundary.

Core rule:

```text
Python owns facts. LLM owns wording only.
```

Query Doctor is not a free-form chat wrapper over raw profiles, and it is not a
SQL execution tool.

## What It Does

- Scans completed Recent queries, Running queries, or one explicit Known Query
  ID for Apache Impala.
- Works with Cloudera Manager when available, or with direct Impala daemon
  profile/query-list endpoints for vanilla, Ambari-style, or otherwise
  non-Cloudera-Manager clusters.
- Optionally collects bounded Prometheus runtime metric summaries for direct
  Impala workflows and bounded read-only Impala metadata through `impala-shell`.
- Ranks suspicious cases and action candidates from deterministic analyzer
  facts, not LLM scoring.
- Generates trusted reports only after deterministic normalization,
  sanitization, and validation.
- Provides a read-only Query Optimizer workflow for pasted SQL review, plus an
  explicit details-page optimizer action for server-owned analyzed cases.
- Keeps raw SQL, raw profiles, raw metadata, local paths, secrets, subprocess
  output, model/runtime internals, and raw artifact filenames out of browser and
  trusted report surfaces.

## Supported Scope

| Area | Supported today | Not currently supported |
| --- | --- | --- |
| Query engine | Apache Impala | Other engines are roadmap seams only. |
| Cloudera Manager | Full Recent discovery/profile/metrics/events context for Impala workflows | Generic cluster diagnosis beyond the Query Doctor flow. |
| Direct Impala | Bounded Recent scans, Running scans, and one Known Query ID through impalad daemon endpoints | Cloudera Manager events, broad log scraping, or SQL execution. |
| Runtime metrics | Optional bounded Prometheus summaries for configured direct Impala workflows | Raw time-series output or arbitrary PromQL from users. |
| Metadata | Read-only allowlisted metadata statements through `impala-shell` | User SQL execution or unbounded metadata crawling. |
| Reports and optimizer | Python-owned facts, validation, and explicit selected-case actions | LLM output as trusted evidence or automatic batch LLM jobs. |

Future Big Data SQL/lakehouse engines, broader providers, prepared event/log
sources, and Cluster Doctor workflows remain roadmap seams and are not currently
supported.

## Install

Use an editable install for local development:

```bash
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

For contributor tooling, install the development extra:

```bash
python -m pip install -e ".[dev]"
pre-commit install
```

In a network-restricted environment, install from a prebuilt wheel or make sure
the build dependencies are already present locally, then install the checkout:

```bash
python -m pip install .
```

Local JSON configuration is documented in [docs/configuration.md](docs/configuration.md).
The preferred workstation path is `~/.qdcreds/query-doctor-config.json`;
secrets still stay in environment variables or local env files.
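
As a rough illustration of that split between JSON settings and environment-held secrets, the sketch below overlays a local JSON file on a small set of defaults. It is not the project's loader. The function names are hypothetical, and only the `privacy_mode` default of `true` comes from this README; the `no_llm` default is an assumption.

```python
import json
import os
from pathlib import Path

# Path taken from this README; the loader itself is a hypothetical sketch.
DEFAULT_CONFIG_PATH = Path.home() / ".qdcreds" / "query-doctor-config.json"

# Only `privacy_mode` defaulting to true is stated in this README;
# the `no_llm` default is an assumption made for illustration.
DEFAULTS = {"privacy_mode": True, "no_llm": False}


def load_local_config(path=DEFAULT_CONFIG_PATH):
    """Return DEFAULTS overlaid with values from a JSON file, if it exists."""
    config = dict(DEFAULTS)
    path = Path(path)
    if path.is_file():
        with path.open(encoding="utf-8") as handle:
            loaded = json.load(handle)
        if not isinstance(loaded, dict):
            raise ValueError("config file must contain a JSON object")
        config.update(loaded)
    return config


def read_secret(name):
    """Read a secret from the environment instead of the JSON file.

    `name` is a hypothetical variable name chosen by the operator.
    """
    return os.environ.get(name)
```

Keeping secrets out of the JSON file means the config can be inspected or copied without exposing credentials. See [docs/configuration.md](docs/configuration.md) for the real keys and precedence rules.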

## Quickstart Smoke

Run the deterministic local checks first. They do not call Cloudera Manager,
Impala, Ollama, or the network:

```bash
query-doctor-demo-preflight
query-doctor-demo --out /tmp/query-doctor-demo-pack --overwrite
query-doctor-web --batch-summary /tmp/query-doctor-demo-pack/batch_summary.json
```

Open the localhost URL printed by `query-doctor-web`. The synthetic demo pack is
local-only and contains no real SQL, profiles, metadata, hostnames, users, or
credentials.

![Synthetic Query Doctor web demo](docs/assets/query-doctor-synthetic-demo.png)

The synthetic demo follows the same safety shape as real local workflows:

```mermaid
flowchart LR
    DemoPack[Synthetic demo pack] --> Web[Local web UI]
    Web --> Ranked[Ranked cases]
    Ranked --> Details[Details page]
    Details --> Facts[Analyzer-owned facts]
    Details --> Report[Explicit trusted report action]
    Details --> Optimizer[Explicit optimizer action]
```

## Console Scripts

After installation, use the packaged entry points:

```bash
query-doctor-analyze --help
query-doctor-batch-recent --help
query-doctor-cleanup-generated --help
query-doctor-cm-events --help
query-doctor-cm-sample-smoke --help
query-doctor-collect-cm-profiles --help
query-doctor-collect-impala-context --help
query-doctor-corpus-smoke --help
query-doctor-demo --help
query-doctor-demo-preflight --help
query-doctor-optimize-query --help
query-doctor-pipeline --help
query-doctor-report --help
query-doctor-web --help
```

Root-level compatibility launchers have been removed. Use the `query-doctor-*`
commands, or `python -m query_doctor.cli.<command_module>` when running directly
from a checkout without installing console scripts.

## Main Workflows

### Web UI

```bash
query-doctor-web --help
```

The local web UI exposes:

- `Diagnose`: the primary screen for Recent query triage. `Finished queries` is
  the default target; `Running now` is available as lower-confidence live
  context.
- `Known Query ID`: a secondary mode inside `Diagnose` for one explicit Impala
  query ID. It uses Cloudera Manager by default or direct Impala daemon profile
  endpoints when `query_profile_source=impala` is configured.
- Details pages with deterministic findings, evidence context, and explicit
  LLM Report / Query LLM optimizer actions.
- `Help`: curated in-product workflow, safety, and documentation guidance.

The pasted-SQL `Query Optimizer` remains a read-only compatibility route and
test surface. It does not execute SQL and does not echo the SQL back after
submission, but it is not promoted as a primary navigation item while
profile-backed diagnosis is the main product workflow.

Validated reports and details-page optimizer drafts are generated only by
explicit user action for selected cases.

### CLI And Headless Use

The packaged CLI entry points cover analyzer runs, batch Recent scans, profile
collection, metadata collection, reports, optimizer review, demo generation, and
cleanup. They are intended for local diagnosis, automation in a controlled
environment, and CI-style smoke checks.
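
For a CI-style smoke check, one possible wrapper runs the offline commands from Quickstart Smoke in order and stops at the first failure. This is a minimal sketch: the helper names are hypothetical, and it assumes each command signals failure with a non-zero exit code.

```python
import subprocess


def smoke_commands(out_dir):
    """Build the offline smoke commands listed under Quickstart Smoke."""
    return [
        ["query-doctor-demo-preflight"],
        ["query-doctor-demo", "--out", out_dir, "--overwrite"],
    ]


def run_smoke(out_dir, runner=subprocess.run):
    """Run each command in order and return the first non-zero exit code.

    Assumes the commands report failure through their exit code.
    `runner` is injectable so the wrapper can be tested without the tool.
    """
    for command in smoke_commands(out_dir):
        result = runner(command, check=False)
        if result.returncode != 0:
            return result.returncode
    return 0
```

A CI job could call `run_smoke("/tmp/query-doctor-demo-pack")` and pass the result to `sys.exit`. Passing arguments as a list avoids shell interpolation of the output path.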

For team workflows, prefer a pinned project version and shared conventions such
as a reports repository, scheduled headless scans under a controlled service
account, a team jumpbox, or a shared local LLM endpoint. Query Doctor itself
remains local-first and single-user unless a future shared-deploy design adds
authentication, authorization, tenant/job isolation, audit logging, TLS trust,
and resource limits.

### Analyzer

```bash
query-doctor-analyze CASE_DIR
```

The analyzer reads collected local case files and writes deterministic facts.
It does not call Cloudera Manager, Impala, Ollama, or the report writer.

### Pipeline

```bash
query-doctor-pipeline CASE_DIR --stop-after-analysis
```

Pipeline mode runs analyzer-first, can optionally collect bounded metadata when
configured, and generates reports only when requested.

### Query Optimizer

```bash
query-doctor-optimize-query --help
```

The Query Optimizer accepts one safe read-only `SELECT` or `WITH` statement for
analysis. It never executes SQL, never echoes pasted SQL back after submission,
and trusts SQL drafts only when Python-owned recipes and validation prove the
supported transform.
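
To make the "one read-only statement" rule concrete, here is a deliberately conservative pre-check. It is a sketch and not Query Doctor's actual validation; the function name is hypothetical. A prefix test alone cannot prove a statement is read-only, since some SQL dialects allow a `WITH` clause in front of a write statement, so a real validator needs more than this.

```python
import re

# Matches text that begins with the SELECT or WITH keyword.
_ALLOWED_START = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)


def looks_like_single_read_only_statement(sql):
    """Conservative pre-check: one statement that starts with SELECT or WITH.

    Any semicolon other than a single trailing one is rejected. This is
    stricter than a SQL parser and can reject valid queries, for example
    one with a semicolon inside a string literal.
    """
    text = sql.strip()
    if text.endswith(";"):
        text = text[:-1].rstrip()
    if not text or ";" in text:
        return False
    return bool(_ALLOWED_START.match(text))
```

Rejecting on doubt is the safer failure mode here: a false rejection costs the user a retry, while a false acceptance would weaken the read-only guarantee.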

### Cloudera Manager (CM) Events And Cluster Context

```bash
query-doctor-cm-events --help
```

The CM Events CLI is a read-only Cluster Doctor seam for Cloudera Manager event
summaries. It can write normalized event summaries plus schema-versioned
raw-free `cluster_event_context.json` and `cluster_context.json` artifacts.
A Recent scan can also collect one bounded Cluster Event Context from Cloudera
Manager Events per scan window and show only raw-free cluster context status in
the web UI. These artifacts are not yet a Cluster Doctor web workflow or report
path.

### Demo Preflight

```bash
query-doctor-demo-preflight
```

The demo preflight is deterministic and local. It checks git hygiene,
safety-sensitive changed areas, browser/trusted-output denylist patterns, and
focused test suggestions without LLM, network, Cloudera Manager, or Impala
access.

## Supported Deployment

Query Doctor is supported as a single-user, local-first tool run by an operator
with their own local Cloudera Manager, Kerberos, Impala, Prometheus, and LLM
credentials. Use localhost or a tightly controlled local bind for the web UI.

Do not deploy the current web UI as a shared service for a team or company.
Shared deployments need a separate design for authentication, authorization,
tenant/job isolation, audit logging, TLS/reverse-proxy trust, and resource
limits before they are supported.

## Why Not A Chat Wrapper?

Query Doctor is built for operational diagnostics, where unsupported certainty
is worse than saying "unknown." A chat wrapper over raw profiles would make it
too easy for model wording to become accidental evidence.

Instead:

- collectors gather bounded, read-only, redacted inputs;
- analyzers extract deterministic facts;
- reports use LLMs only to phrase those facts;
- validators reject unsupported claims and unsafe output;
- browser surfaces show trusted summaries, not raw operational artifacts.
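
The validator step in the list above can be pictured with a toy example. This sketch is not the project's validator. It assumes a hypothetical `facts` mapping produced by Python and treats every draft sentence as untrusted wording, rejecting any sentence that mentions a number the analyzer did not produce.

```python
import re

# Matches integers and simple decimals such as 64 or 3.5.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def validate_draft(facts, sentences):
    """Return the sentences whose numbers all come from analyzer facts.

    `facts` maps hypothetical fact names to values computed in Python.
    `sentences` is untrusted wording, for example from an LLM.
    A sentence is rejected if it mentions a number that is not a fact value.
    Sentences without numbers pass this particular check.
    """
    allowed = {str(value) for value in facts.values()}
    accepted = []
    for sentence in sentences:
        numbers = _NUMBER.findall(sentence)
        if all(number in allowed for number in numbers):
            accepted.append(sentence)
    return accepted
```

With facts such as `{"scan_rows": 1200000}`, a sentence citing `1200000` rows is kept and a sentence claiming `64` GB of memory is dropped. Real validation would need to cover far more than numbers, but the direction of trust is the same: wording is checked against facts, never the reverse.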

## Safety Model

- Python/analyzer-owned facts are the only trusted diagnostic evidence.
- Raw LLM output is untrusted unless normalized, sanitized, and validated.
- Browser-visible UI and trusted reports must not expose raw SQL, raw profiles,
  raw metadata, local paths, secrets, subprocess output, model/runtime internals,
  or raw artifact filenames.
- External collection must be explicit, bounded, read-only, redacted, and safe
  by default.
- Local config `privacy_mode` defaults to `true`; disabling it can relax local
  artifact identifier/host masking, but browser-visible UI and trusted reports
  still do not show raw SQL, profiles, or metadata. Local config `no_llm=true`
  keeps report and optimizer actions on deterministic Python-owned output.
- Impala metadata collection is allowlisted and read-only.
- Query Optimizer accepts only a single safe read-only statement and never
  executes pasted SQL.

See [docs/safety-contract.md](docs/safety-contract.md) for the full contract.
For a public, reviewer-oriented overview, see
[docs/security-model.md](docs/security-model.md).
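
As an illustration of the "must not expose" rule for browser-visible output, the sketch below scans text for denylisted patterns before it would be shown. The two patterns and the function name are hypothetical examples, not the project's real denylist, which is defined by the safety contract.

```python
import re

# Hypothetical patterns for illustration only; not the project's denylist.
DENYLIST = {
    "local_path": re.compile(r"(?:/home/|/Users/|/tmp/)\S+"),
    "secret_assignment": re.compile(
        r"(?i)\b(?:password|token|secret)\s*[=:]\s*\S+"
    ),
}


def find_denylisted(text):
    """Return the sorted names of denylist patterns found in `text`."""
    return sorted(
        name for name, pattern in DENYLIST.items() if pattern.search(text)
    )
```

A caller would refuse to render any text for which `find_denylisted` returns a non-empty list. Pattern matching like this is a backstop only; the stronger control is never passing raw artifacts to the rendering layer at all.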

## Licensing

Query Doctor is licensed under the GNU Affero General Public License version 3
or later (`AGPL-3.0-or-later`). See [LICENSE](LICENSE).

Commercial licensing is available for proprietary, hosted, embedded, or
enterprise use cases where AGPL obligations are not a fit. See
[COMMERCIAL-LICENSE.md](COMMERCIAL-LICENSE.md).

## Documentation

Start with [docs/README.md](docs/README.md). It separates current user docs,
operations guides, architecture contracts, internal audits, and historical
planning notes.

The canonical documentation language is English. Russian localized companion
pages live under [docs/i18n/ru/](docs/i18n/ru/) when they are useful for long
operator-facing explanations. If English and Russian pages diverge, the English
page is the source of truth until the localized companion is updated.

High-value references:

- [docs/local-smoke.md](docs/local-smoke.md): local validation and smoke checks.
- [docs/credentials.md](docs/credentials.md): local credentials layout.
- [docs/public-release-readiness.md](docs/public-release-readiness.md): public
  release readiness checklist.
- [docs/release-checklist.md](docs/release-checklist.md): maintainer release
  and visibility-change checklist.
- [docs/repository-hardening.md](docs/repository-hardening.md): repository
  security, CI hardening, release automation, and strong-test backlog.
- [docs/architecture.md](docs/architecture.md): current and future component
  boundary diagrams.
- [docs/contributor-architecture.md](docs/contributor-architecture.md):
  contributor-oriented architecture map.
- [docs/roadmap.md](docs/roadmap.md): implemented scope and planned seams.
- [docs/query-optimizer-contract.md](docs/query-optimizer-contract.md):
  optimizer trust boundary.
- [docs/cluster-doctor-contract.md](docs/cluster-doctor-contract.md): future
  Cluster Doctor contract.

## Development Checks

Before committing:

```bash
pre-commit run --all-files
scripts/local_gate.sh
python -m ruff check query_doctor tests
python -m ruff format --check query_doctor tests scripts
python -m pytest -q
git diff --check
query-doctor-demo-preflight
git status --short
```

Stage only explicit files. Do not commit generated cases, reports, local
configs, credentials, raw profiles, raw metadata, or temporary outputs.

## Public Status

This repository is public. `v0.1.0` is the initial public GitHub release
baseline. The public license is AGPL-3.0-or-later, with commercial licensing
available separately.

PyPI publishing is prepared through GitHub OIDC Trusted Publishing. The
repository-side `testpypi` and `pypi` environments require maintainer approval;
the first package-index upload still requires matching TestPyPI and PyPI
pending publishers before release workflows are run.

Before cutting a new tag, publishing to a package index, or announcing a public
release, run the public-release guard from a clean working tree:

```bash
query-doctor-demo-preflight --public-release
```

Use [docs/release-checklist.md](docs/release-checklist.md) for the full release
and visibility-change checklist.
