Metadata-Version: 2.4
Name: sfild-repo-auditor
Version: 1.0.0
Summary: Deep GitHub repository audit CLI with context, documentation, and security heuristics
Author: SfilD
License: MIT License
        
        Copyright (c) 2026 SfilD
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/SfilD/repo-auditor
Project-URL: Source, https://github.com/SfilD/repo-auditor
Project-URL: Issues, https://github.com/SfilD/repo-auditor/issues
Keywords: audit,security,github,repository,cli,sast,sca
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Russian
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=9.0.3; extra == "dev"
Requires-Dist: ruff<1.0,>=0.15.12; extra == "dev"
Requires-Dist: mypy<3.0,>=1.10; extra == "dev"
Requires-Dist: jsonschema<5.0,>=4.23; extra == "dev"
Provides-Extra: mcp
Requires-Dist: mcp<2.0,>=1.27.0; extra == "mcp"
Dynamic: license-file

<p align="center">
  <a href="README.md">🇺🇸 English</a> | <a href="README.ru.md">🇷🇺 Русский</a>
</p>

<p align="center">
  <a href="https://pypi.org/project/sfild-repo-auditor/"><img alt="PyPI" src="https://img.shields.io/pypi/v/sfild-repo-auditor"></a>
  <img alt="Python" src="https://img.shields.io/pypi/pyversions/sfild-repo-auditor">
  <img alt="License" src="https://img.shields.io/pypi/l/sfild-repo-auditor">
</p>

# repo-auditor

Objective GitHub repository audit: a compact terminal card and a full Markdown report. Four balanced categories (purpose · security · quality · maturity), transparent methodology, optionally powered by `gitleaks`, `osv-scanner`, `tokei`, `semgrep`.

Pure Python 3.12+, zero third-party dependencies. Networking via `urllib` with `gh`/`curl` fallback; TOML via `tomllib`.

> 👋 **Not a programmer but want to check a project?** Open [**GETTING_STARTED.ru.md**](GETTING_STARTED.ru.md) (Russian) — a step-by-step guide in plain language, from installing Python to reading your first report.

---

## What you'll see

After running, a card is printed to the terminal:

```
┌─ vercel/turbo ──────────────────────────────── Rust · MIT · ★30258
│ Purpose          ██████████  10/10
│ Security         ██████░░░░  6/10
│ Code Quality     ██████████  10/10
│ Maturity         ██████████  10/10
├─ TOTAL: 90/100 · ~ reliable
│ Key: 0 high + 5 medium security findings (heuristic)
│ Suggestion: review .devcontainer/Dockerfile, .github/workflows/turborepo-library-release.yml
└─ full report: data/github.com/vercel__turbo/runs/2026-04-25_071849/report.ru.md
```

At the same time, `data/.../latest/` receives:

- `report.ru.md` — full Markdown report (8 sections, methodology, delta with previous run);
- `report.json` — versioned JSON (`schema_version`) for automation;
- `card.txt` — copy of the terminal card;
- `raw/` — raw tool outputs for manual drill-down.

> **Note:** Reports are generated in Russian by default. Use `--language en` for English output.

---

## Quick start

```bash
# One-off audit — no installation needed
python3 -m repo_auditor https://github.com/pallets/flask

# View the full report
cat data/github.com/pallets__flask/latest/report.ru.md
```

That's enough to get your first result. External scanners (`gitleaks`, `osv-scanner`, `tokei`, `semgrep`) are optional — built-in heuristics work without them.

---

## Installation

Base requirements: **Python 3.12+** and **git**.

```bash
# Install from PyPI (recommended)
pip install sfild-repo-auditor

# Verify
repo-audit --version
```

Or install from source:

```bash
git clone https://github.com/SfilD/repo-auditor.git
cd repo-auditor
pip install -e .
```

The tool also works without `pip install`: invoke it via `python3 -m repo_auditor`.

### Docker (alternative to local install)

```bash
# Build image
docker build -t repo-auditor .

# Run an audit
docker run --rm -v "$(pwd)/data:/data" repo-auditor \
  https://github.com/pallets/flask --output-root /data
```

The image is based on `python:3.13-slim` (Debian 13 "trixie") and includes
all external tools (`gitleaks`, `osv-scanner`, `tokei`, `semgrep`, `gh`).
It is single-stage by design, prioritising simplicity and fast rebuilds
for audit environments over minimal runtime size.

### GitHub Actions

```yaml
- uses: SfilD/repo-auditor@v1.0.0
  with:
    target: 'owner/repo'
    fail-below: '60'
    format: 'json'
    exclude: 'vendor/**,*.min.js'
    clone-depth: '1'
```

Action outputs: `total`, `verdict`, `output-path`.

### External tools (optional but recommended)

The more you install, the more accurate the assessment.

| Tool | What it gives | Linux (apt/snap) | macOS (brew) | Cargo |
|------|---------------|------------------|--------------|-------|
| `gh` | Authenticated GitHub API access (higher rate limits) | `sudo apt install gh` | `brew install gh` | — |
| `gitleaks` | Leaked secret detection | `sudo apt install gitleaks` | `brew install gitleaks` | — |
| `osv-scanner` | CVE dependency scanning | — (binary from GitHub Releases) | `brew install osv-scanner` | — |
| `tokei` | Accurate LOC count | `sudo apt install tokei` | `brew install tokei` | `cargo install tokei` |
| `semgrep` | Static analysis | `pip install semgrep` | `brew install semgrep` | — |

Authentication for private repositories and higher API rate limits:

```bash
gh auth login                       # interactive via browser
# or
export GITHUB_TOKEN=ghp_xxxxxxxx    # personal access token
```

### Adapter prerequisites

v0.8.0 adds optional adapters that normalize external-tool signals into the report. They do **not** affect the four-category score in this release.

| Adapter | What it needs | Install |
|---------|---------------|---------|
| **OpenSSF Scorecard** | `scorecard` binary on PATH + `GITHUB_AUTH_TOKEN` env var | `go install github.com/ossf/scorecard/v5/cmd/scorecard@latest` |
| **GitHub Community Profile** | Existing GitHub client (no extra install) | — |

Token for Scorecard: any GitHub personal access token with `public_repo` scope (`repo` scope for private repositories). Export it as `GITHUB_AUTH_TOKEN`.

Both adapters are optional. If prerequisites are missing, the adapter returns `UNAVAILABLE` and the audit continues. Use `--no-external-tools` to disable all adapters.

---

## MCP server

repo-auditor can run as an [MCP (Model Context Protocol)](https://modelcontextprotocol.io) server, exposing the audit pipeline to AI agents such as Claude Desktop, Claude Code, Cline, and Continue.

This is **opt-in** — the default CLI install has zero extra dependencies.

```bash
pip install 'sfild-repo-auditor[mcp]'
```

Launch the server:

```bash
repo-audit-mcp
```

Then wire it into your MCP client. See [`docs/mcp.md`](docs/mcp.md) for full setup instructions, tool reference, and trust-model notes.

---

## Usage

```bash
repo-audit <target> [flags]
```

**Target:**
- `owner/repo` — short form
- `https://github.com/owner/repo` — repository
- `https://github.com/owner` — organization or user

**Flags:**

| Flag | Purpose |
|------|---------|
| `--output-root PATH` | Storage root (default `./data`) |
| `--quick` | `clone_depth=50`, skip `osv-scanner` and `semgrep` (2–3× faster) |
| `--no-external-tools` | Built-in heuristics only (if nothing is installed) |
| `--format {card,json,sarif}` | Output format (`card` by default) |
| `--json-only` | Alias for `--format json` (legacy, still works) |
| `--language {ru,en}` | Report language (`ru` by default) |
| `--plugin PATH` | External plugin executable (repeatable). Trusted: honoured even with `--ignore-repo-config` |
| `--allow-repo-plugins` | Allow repo-local executable plugins from `.repo-auditor.toml`. Disabled by default for security |
| `--exclude PATTERN` | Skip files/dirs matching glob (repeatable) |
| `--clone-depth N` | Git clone depth (overrides `--quick`) |
| `--max-workers N` | Max parallel collector workers (`auto` by default) |
| `--fail-below N` | CI gate: exit 4 if total < N (0..100) |
| `--no-raw-tool-output` | Don't save raw outputs to `raw/` |
| `--config-file PATH` | Path to `.repo-auditor.toml` (overrides auto-discovery) |
| `--ignore-repo-config` | Ignore `.repo-auditor.toml` in the audited repository |
| `--org-mode {primary,multiple,off}` | Behavior for organization URLs |
| `--org-limit N` | How many repos in `multiple` mode (default 5) |
| `--keep-last N` | Keep N latest runs after audit |
| `--gc --keep-last N` | Standalone GC: apply retention to all slugs in `index.json` |
| `--doctor` | Environment diagnostics (tool versions) |
| `--version` | Version |

**Exit codes:**

- `0` — success
- `1` — clone/collection error
- `2` — invalid target / `--gc` without `--keep-last`
- `3` — organization URL with `--org-mode off`
- `4` — total below `--fail-below` (CI gate)
- `5` — no suitable repos in organization (all forks/archived)
- `6` — regression detected (`--since`)
- `8` — policy threshold breach (`--fail-on`); if both `4` and `8` trigger, `4` wins
- `16` — regression detected in compare mode (`--regression-fail`)
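
For wrapper scripts, the exit codes listed above could be mapped to messages along these lines. This is a sketch: `EXIT_CODES`, `describe_exit`, and `gate_exit` are hypothetical names, and `gate_exit` only models the documented rule that `4` wins over `8`.

```python
# Exit codes as documented in this README.
EXIT_CODES: dict[int, str] = {
    0: "success",
    1: "clone/collection error",
    2: "invalid target, or --gc without --keep-last",
    3: "organization URL with --org-mode off",
    4: "total below --fail-below (CI gate)",
    5: "no suitable repos in organization",
    6: "regression detected (--since)",
    8: "policy threshold breach (--fail-on)",
    16: "regression detected in compare mode (--regression-fail)",
}


def describe_exit(code: int) -> str:
    """Human-readable meaning of a repo-audit exit code."""
    return EXIT_CODES.get(code, f"unknown exit code {code}")


def gate_exit(below_fail_below: bool, policy_breach: bool) -> int:
    """Exit code when CI gates trigger; 4 takes precedence over 8."""
    if below_fail_below:
        return 4
    if policy_breach:
        return 8
    return 0
```

A caller might pass `subprocess.run(["repo-audit", "owner/repo", "--fail-below", "60"]).returncode` to `describe_exit` to log why a CI step failed.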

---

## Repository configuration

repo-auditor looks for `.repo-auditor.toml` in the root of the cloned repository.
It can contain scoring overrides and security suppression rules.

### Security suppressions

You can suppress known-false-positive security findings with `[[security.suppress]]` tables:

```toml
[[security.suppress]]
tool = "gitleaks"
path_pattern = "tests/fixtures/*"
check_id_prefix = "generic-api-key"
```

Each rule matches when **all** specified fields match a finding. Fields act as wildcards when omitted:

- `tool` — scanner name (`gitleaks`, `osv-scanner`, `semgrep`)
- `path_pattern` — glob pattern against the file path
- `check_id_prefix` — prefix of the check/rule identifier
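
The matching rule above (all specified fields must match, omitted fields are wildcards) can be sketched as below. This is not the project's matcher: the finding keys `path` and `check_id` are hypothetical, and `fnmatchcase` is a stand-in for whatever glob semantics repo-auditor actually uses (with `fnmatch`, `*` also matches `/`).

```python
from fnmatch import fnmatchcase


def rule_matches(rule: dict[str, str], finding: dict[str, str]) -> bool:
    """True when every field present in the rule matches the finding."""
    if "tool" in rule and rule["tool"] != finding["tool"]:
        return False
    if "path_pattern" in rule and not fnmatchcase(finding["path"], rule["path_pattern"]):
        return False
    if "check_id_prefix" in rule and not finding["check_id"].startswith(rule["check_id_prefix"]):
        return False
    return True  # fields omitted from the rule act as wildcards


rule = {
    "tool": "gitleaks",
    "path_pattern": "tests/fixtures/*",
    "check_id_prefix": "generic-api-key",
}
fixture_finding = {"tool": "gitleaks", "path": "tests/fixtures/key.txt", "check_id": "generic-api-key"}
source_finding = {"tool": "gitleaks", "path": "src/app.py", "check_id": "generic-api-key"}
```

Under these assumptions, `rule` suppresses `fixture_finding` but leaves `source_finding` in the report.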

### Trust boundary

`.repo-auditor.toml` inside the audited repository is **repo-local config**. When you audit a third-party repository you do not control, that config is untrusted — the repository could hide findings via local suppressions or override scoring.

For high-trust audits of third-party repositories, use `--ignore-repo-config` to skip loading `.repo-auditor.toml` from the cloned repository. This disables **all** repo-local config, including:

- scoring overrides;
- `[[security.suppress]]` rules;
- repo-local executable `[[plugins]]`;
- `[[scoring.plugins]]` rules.

Repo-local executable plugins execute code and remain **disabled by default** unless `--allow-repo-plugins` is provided. `--ignore-repo-config` still wins even when `--allow-repo-plugins` is passed.

Trusted suppressions passed programmatically via `AuditConfig.security_suppressions` are always honored and remain separate from repo-local suppressions. Trusted CLI plugins from `--plugin PATH` are also always honored.
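
The precedence rules in this section can be summarised in a small decision function. It is a sketch of the documented behaviour only: `TrustDecision` and `resolve_trust` are hypothetical names, and `[[scoring.plugins]]` rules are left out because the README does not state their default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustDecision:
    repo_scoring_overrides: bool      # scoring overrides from .repo-auditor.toml
    repo_suppressions: bool           # [[security.suppress]] from the audited repo
    repo_plugins: bool                # repo-local executable [[plugins]]
    cli_plugins: bool                 # --plugin PATH
    programmatic_suppressions: bool   # AuditConfig.security_suppressions


def resolve_trust(ignore_repo_config: bool = False, allow_repo_plugins: bool = False) -> TrustDecision:
    """Decide which config sources are honoured for a run."""
    use_repo_config = not ignore_repo_config
    return TrustDecision(
        repo_scoring_overrides=use_repo_config,
        repo_suppressions=use_repo_config,
        # Off by default; --ignore-repo-config wins over --allow-repo-plugins.
        repo_plugins=use_repo_config and allow_repo_plugins,
        cli_plugins=True,                # trusted sources are always honoured
        programmatic_suppressions=True,
    )
```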

Plugin scoring documentation is in [`docs/plugins.md`](docs/plugins.md).

---

## Where to find results

After a run, the report lives here:

```bash
data/github.com/<owner>__<repo>/latest/report.ru.md
```

Handy commands:

```bash
# Latest report
cat data/github.com/pallets__flask/latest/report.ru.md

# Just the JSON score section (for scripts)
jq '.score' data/github.com/pallets__flask/latest/report.json

# List all audited repos
jq -r '.entries[] | "\(.slug)\t\(.total)/100\t\(.verdict)"' data/index.json

# Open in editor
$EDITOR data/github.com/pallets__flask/latest/report.ru.md
```

---

## Policy scoring (v0.9.0+)

v0.9.0 introduces optional policy profiles that map adapter evidence (OpenSSF Scorecard, GitHub Community Profile) into bounded per-category contributions. The default profile preserves backward compatibility; use `--profile` to opt in.

See [`docs/scoring.md`](docs/scoring.md) for profiles, trust matrix, CI gating with `--fail-on`, and migration notes.

---

## Cross-repo compare (v1.0+)

Compare two repositories side-by-side before adoption or track your own repository over time.

```bash
# Pairwise library evaluation
repo-audit pallets/click --compare-with pallets/flask

# Regression mode against latest stored run
repo-audit owner/repo --diff-previous --regression-fail
```

Compare produces a terminal card, Markdown report, JSON document, and SARIF export. Repo-local config is always ignored for trust; use `--plugin <path>` for explicit scoring rules.

See [`docs/compare.md`](docs/compare.md) for the full reference: trust boundary, storage layout, SARIF structure, schema evolution, and exit code 16.

---

## Scoring explained

### Categories (each 0..10)

| Category | What counts |
|----------|-------------|
| **Purpose** | README, Installation/Usage sections, ARCHITECTURE, CONTRIBUTING |
| **Security** | High/medium findings from `gitleaks` + `osv-scanner`, SECURITY.md, committed `.env` |
| **Code Quality** | CI, tests, linter, type checking, TODO density |
| **Maturity** | License, releases, contributors, stars, activity ≤ 90 days |

Total = `sum(categories) × 100 / 40`. Red flag = any category ≤ 3.

### Verdicts

| Condition | Verdict | Meaning |
|-----------|---------|---------|
| Any category ≤ 3 **and** total < 40 | **not-recommended** | Critical gap and overall weak |
| Any category ≤ 3 **and** total ≥ 40 | **caution** | Mostly OK, but one block is weak |
| No red flags, total ≥ 80 | **reliable** | Strong across all categories |
| No red flags, total ≥ 60 | **working** | Viable, no obvious gaps |
| No red flags, total ≥ 40 | **caution** | Medium level, needs attention |
| No red flags, total < 40 | **not-recommended** | Weak overall |
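
The total formula and the verdict table can be expressed directly in code. The sketch below follows the rules as written in this section. The function names and dictionary keys are hypothetical, and the real implementation may round the total differently.

```python
def total_score(categories: dict[str, int]) -> float:
    """Total = sum(categories) * 100 / 40 for the four 0..10 categories."""
    return sum(categories.values()) * 100 / 40


def verdict(categories: dict[str, int]) -> str:
    """Apply the verdict table: red flags first, then total thresholds."""
    total = total_score(categories)
    red_flag = any(score <= 3 for score in categories.values())
    if red_flag:
        return "caution" if total >= 40 else "not-recommended"
    if total >= 80:
        return "reliable"
    if total >= 60:
        return "working"
    if total >= 40:
        return "caution"
    return "not-recommended"


# Scores from the vercel/turbo card shown earlier in this README.
card = {"purpose": 10, "security": 6, "quality": 10, "maturity": 10}
```

With the card's scores, `total_score(card)` gives 90.0 and `verdict(card)` gives `reliable`, matching the `TOTAL: 90/100` line in the example card.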

---

## Storage layout

```
data/
  github.com/
    <owner>__<repo>/
      repo/                       ← git clone, reused via git fetch
      runs/
        YYYY-MM-DD_HHMMSS/        ← UTC timestamp
          report.ru.md
          report.json
          card.txt
          raw/
            gitleaks.json
            osv-scanner.json
            tokei.json
      latest -> runs/<ts>
    <org>__org/
      runs/...
  index.json                      ← aggregate entry per slug
```
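
To locate or predict a run directory from a script, the layout above can be reproduced as follows. This is a sketch based only on the tree shown here: `run_dir` is a hypothetical helper, and the timestamp format is inferred from `YYYY-MM-DD_HHMMSS` and the example path in the card.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def run_dir(owner: str, repo: str, when: datetime, root: str = "data") -> PurePosixPath:
    """Build <root>/github.com/<owner>__<repo>/runs/<UTC timestamp>."""
    # Pass a timezone-aware datetime; naive values are treated as local time.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    return PurePosixPath(root) / "github.com" / f"{owner}__{repo}" / "runs" / stamp


example = run_dir("vercel", "turbo", datetime(2026, 4, 25, 7, 18, 49, tzinfo=timezone.utc))
```

Here `example` corresponds to the run directory in the card at the top of this README.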

---

## Retention

`runs/` grows linearly: each run writes a new UTC-timestamped directory. Cleanup options:

- `--keep-last N` — after a successful audit, keep N latest runs, delete older ones.
- `--gc --keep-last N` — standalone pass over `index.json`: apply retention to all slugs at once (without a new audit).

Invariant: the target of the `latest` symlink is never deleted.

```bash
# Clean entire storage, keeping 3 latest runs per repo
repo-audit --gc --keep-last 3
```

---

## Examples

```bash
# Basic audit
python3 -m repo_auditor https://github.com/pallets/flask

# Fast run without heavy external tools
python3 -m repo_auditor pallets/flask --quick

# Built-in heuristics only (no gitleaks/osv/tokei)
python3 -m repo_auditor pallets/flask --no-external-tools

# Primary repo of an organization
python3 -m repo_auditor https://github.com/amnezia-vpn

# Top-3 by stars in an organization
python3 -m repo_auditor https://github.com/amnezia-vpn \
    --org-mode multiple --org-limit 3

# JSON to stdout for pipelines
python3 -m repo_auditor pallets/flask --format json | jq '.score.total'

# SARIF for GitHub Advanced Security
python3 -m repo_auditor pallets/flask --format sarif --output-root ./sarif

# Exclude vendor and minified assets
python3 -m repo_auditor pallets/flask --exclude 'vendor/**' --exclude '*.min.js'

# CI gate: fail if total < 60
python3 -m repo_auditor pallets/flask --fail-below 60 || echo "Audit failed"

# With retention — keep only 5 latest runs
python3 -m repo_auditor pallets/flask --keep-last 5
```

---

## Auditing large repositories

Repositories like `torvalds/linux`, `chromium/chromium`, or `semgrep/semgrep` can take a long time to clone and consume significant disk space.

| Approach | Command | When to use |
|----------|---------|-------------|
| **Fast audit** (recommended) | `--quick` | Shallow clone (`depth=50`) and skip `osv-scanner` + `semgrep` (2–3× faster) |
| **Shallow clone** | `--clone-depth 1` | CI pipelines where you only need the latest snapshot |
| **No clone** | `--no-external-tools` | Skip clone entirely; relies on GitHub API only (poorer card, but instant) |

```bash
# Example: audit a very large repo without downloading full history
python3 -m repo_auditor torvalds/linux --quick

# CI pipeline: minimal clone, fail gate
python3 -m repo_auditor owner/repo --clone-depth 1 --fail-below 60
```

> **Tip:** `semgrep/semgrep` clones at ~170 MB. `--quick` brings it down to ~50 MB and cuts scan time by half because `semgrep` and `osv-scanner` are skipped.

---

## Troubleshooting

**`gh: command not found` or `403 rate-limit`**
GitHub API without auth = 60 requests/hour. Install `gh` (`gh auth login`) or export `GITHUB_TOKEN`.

**`fatal: could not read Username for 'https://github.com'` (private repo)**
Auth required. `gh auth login` or `git config --global credential.helper store` + `GITHUB_TOKEN`.

**Score seems low for model-card / dataset repos**
The rubric is geared toward software projects. Repos without `tests/`, CI, or `pyproject.toml` will score low on quality. This is by design.

**`tool unavailable` for `osv-scanner`/`tokei`/`gitleaks` in JSON**
Scanner not installed — pipeline falls back to built-in heuristics. For full accuracy install the tools (see table above) or run with `--no-external-tools` to suppress warnings.

**Clone hangs / consumes a lot of disk**
Use `--quick` (clone_depth=50). Large repos (100+ MB) can be audited without cloning via `--no-external-tools` + GitHub API only — but the card will be poorer (no file-walk findings).

**Card breaks on CJK/emoji in `owner/name`**
Known limitation: alignment is character-based, not cell-width. Use `--format json` for machine-readable output without the visual card.
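
To see why character-based alignment breaks, compare `len()` with an approximate terminal cell width. The sketch below uses the standard `unicodedata` module. It is an illustration of the limitation, not code from repo-auditor, and it ignores cases such as zero-width joiners in emoji sequences.

```python
import unicodedata


def cell_width(text: str) -> int:
    """Approximate terminal cell width: wide/fullwidth chars count as 2."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no cell of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


name = "日本語"
```

For `name`, `len(name)` is 3 while `cell_width(name)` is 6, so padding computed from `len()` leaves the card border three cells out of line.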

**Want just the header without the full report**
`--format json | jq '.score'` or `cat data/.../latest/card.txt`.

---

## Limitations

- Does not replace manual security review or CodeQL.
- Heuristic mode (without `gitleaks`) can occasionally produce false positives, for example on 12-character PowerShell cmdlets (`'Get-Location'` after `pwd:`).
- Without `tokei`, `todo_density` is approximated via source LOC from `code_analysis` — lower accuracy.
- Monorepos with nested manifests: each manifest is counted separately; dependencies are not aggregated into a single total.
- HTML fallback for organization listing is not implemented — requires working `gh` or GitHub REST API.

---

## See also

- [`README.ru.md`](README.ru.md) — Russian version of this README.
- [`GETTING_STARTED.ru.md`](GETTING_STARTED.ru.md) — step-by-step guide for non-programmers (Russian).
- [`AUDIT_GUIDE.md`](AUDIT_GUIDE.md) — for external auditors / AI tools before review (load-bearing constraints, hot zones, regression baseline).
- [`ARCHITECTURE.md`](ARCHITECTURE.md) — 30-second overview of layers and data flow.
- [`ROADMAP.md`](ROADMAP.md) — public direction of development.
- [`SECURITY.md`](SECURITY.md) — disclosure policy.
- [`CHANGELOG.md`](CHANGELOG.md) — release history.
- [`CONTRIBUTING.md`](CONTRIBUTING.md) — dev cycle, conventional commits, stdlib-only rule.
- [`docs/compare.md`](docs/compare.md) — cross-repo compare reference.
- [`CLAUDE.md`](CLAUDE.md) — architecture, data flow, scoring rules, false-positive filters (for contributors and Claude Code).
- [`LICENSE`](LICENSE) — MIT.
