Metadata-Version: 2.4
Name: agentdiscover
Version: 2.7.4
Summary: Discover every AI agent in your infrastructure. 5-layer detection: static analysis, network monitoring, eBPF/Kubernetes runtime, endpoint, and cloud audit (CloudTrail). Company-level correlation. AIBOM export. MCP server detection.
Project-URL: Homepage, https://defendai.ai
Project-URL: Documentation, https://github.com/Defend-AI-Tech-Inc/agent-discover-scanner/blob/main/README.md
Project-URL: Repository, https://github.com/Defend-AI-Tech-Inc/agent-discover-scanner
Project-URL: Issues, https://github.com/Defend-AI-Tech-Inc/agent-discover-scanner/issues
Project-URL: Changelog, https://github.com/Defend-AI-Tech-Inc/agent-discover-scanner/blob/main/CHANGELOG.md
Author-email: Mohamed Waseem <mwaseem@defendai.tech>
Maintainer-email: DefendAI <support@defendai.ai>
License: MIT
License-File: LICENSE
Keywords: agent-discovery,agent-scanner,agents,ai,ai-agent-security,ai-governance,ai-inventory,ai-security,aibom,cyclonedx,ebpf,ghost-detection,kubernetes,llm,llm-security,mcp,mcp-security,security,shadow-ai,shadow-ai-detection,static-analysis
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.10
Requires-Dist: certifi>=2024.2.2
Requires-Dist: esprima>=4.0.1
Requires-Dist: httpx>=0.27.0
Requires-Dist: kubernetes>=28.1.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sarif-om>=1.0.4
Requires-Dist: typer>=0.9.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# AgentDiscover Scanner

**Open-Source AI Agent Discovery for the Enterprise**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/pypi/v/agentdiscover.svg)](https://pypi.org/project/agentdiscover/)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)

*Part of the [DefendAI](https://defendai.ai) platform for autonomous AI governance*

> **Formerly known as `agent-discover-scanner`** — the PyPI package has been renamed to `agentdiscover`.
> `pip install agent-discover-scanner` continues to work and will install `agentdiscover` automatically.
> The legacy entry points `agent-discover-scanner` and `agent-discover` remain as aliases.

---

## The finding that matters

```
$ agentdiscover scan-all ./your-repo --duration 10

🔍 Scanning for autonomous AI agents...

📂 Analyzing source code at ./your-repo
🌐 Monitoring live network connections...
   Observing runtime behavior (10s)...
🔗 Correlating findings...
✓ Correlation complete

🤖 Autonomous Agent Inventory

┏━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Classification ┃ Count ┃ Description                                                    ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ CONFIRMED      │ 2     │ Active — detected in code and observed at runtime              │
│ UNKNOWN        │ 3     │ Code found — not yet observed at runtime                       │
│ SHADOW AI      │ 3     │ Known app using AI — review for governance                     │
│ ZOMBIE         │ 0     │ Inactive — code exists but no recent runtime activity          │
│ GHOST          │ 1     │ ⚠ Critical — runtime activity with no source code (ungoverned) │
└────────────────┴───────┴────────────────────────────────────────────────────────────────┘
```

A GHOST agent is an AI system making real API calls — consuming tokens, potentially accessing sensitive data — with no corresponding source code, deployment record, or owner. No static analysis tool finds this. No SIEM alerts on it. AgentDiscover Scanner finds it in under 60 seconds by watching the runtime and cross-referencing it against your codebase simultaneously.

**Your engineering team thinks they know what AI is running. The GHOST classification is what they don't know.**

---

## What makes this different

Most security tools tell you what's in your code. AgentDiscover Scanner tells you what's **actually running** — and crucially, what's running that has **no business being there**.

```
👻 GHOST AGENT DETECTED
   Workload:    trading-bot (Deployment/default)
   Connected:   api.openai.com — LIVE
   SaaS:        openai — confirmed active connection
   Source code: None found in scanned repositories
   Owner:       Unknown — no deployment record, no code review

👻 GHOST AGENT DETECTED
   Workload:    shadow-agent (Pod/kube-system)
   Connected:   api.anthropic.com — LIVE
   SaaS:        anthropic — confirmed  |  gcp — active socket
   Blast radius: HIGH (cloud provider access confirmed)
   Source code: None found in scanned repositories
   Owner:       Unknown — no deployment record, no code review
```

Every detected agent also carries a **SaaS blast radius** — a live-observed map of which services it's actively connected to, derived from network traffic, not just configuration files:

```
crewai-agent (CONFIRMED)
  saas_connections:
    anthropic: confirmed  ← active_connection observed
    github:    medium     ← open socket
  risk_flags: [cloud_credentials_present]
  blast_radius: 70/100
```

`confirmed` means the connection was **live-observed** during the scan — not inferred from a config file.

---

## Agent classifications

| Classification    | What it means                              | Risk         |
| ----------------- | ------------------------------------------ | ------------ |
| 👻 **GHOST**      | Runtime AI activity — no source code found | **Critical** |
| ✅ **CONFIRMED**   | Detected in code AND observed running      | High         |
| ⚠️ **UNKNOWN**    | Found in code, not yet observed at runtime | Medium       |
| 🖥️ **SHADOW AI** | Known app using AI without governance      | Medium       |
| ☠️ **ZOMBIE**     | Was active, no longer observed             | Low          |

---

## What counts as an "agent"

DefendAI classifies **AI-capable components**, not just top-level orchestrators. Any component that invokes a model, holds a memory buffer, binds a tool, or queries a vector store is an independently governable unit — it can exfiltrate data, consume budget, or behave unexpectedly on its own.

This matters because the gap between "we have one AI agent" (what the team believes) and the actual component count is routinely 5–15×.

**Example — a single LangGraph application with 3 workers:**

| # | Component | Why it's tracked |
|---|---|---|
| 1 | `StateGraph` | Graph entrypoint; controls execution flow |
| 2–4 | Worker agent nodes ×3 | Each is an independent LangChain agent |
| 5–7 | LLM bindings ×3 (one per worker) | Direct model invocations; each has its own token budget |
| 8 | Supervisor node | Routes tasks between workers; has its own LLM call |
| 9 | LLM binding for supervisor | Additional model invocation with separate prompt |
| 10 | Tool node | Executes tool calls on behalf of workers |
| 11 | Vector store retriever | RAG component; queries an external embedding store |
| 12 | Memory checkpointer | Persists conversation state across turns |
| 13 | Prompt templates | Carry system-level instructions that can be injected or drifted |
| 14 | Output parser | Transforms model output; can silently drop or alter content |
| 15 | Human-in-the-loop interrupt | Pause point that can be bypassed in non-interactive runs |

One application. One developer who says "it's just an AI assistant." Fifteen components that each independently touch a model, a store, or a tool — any of which could be ungoverned, GHOST-classified, or carrying a stale permission scope.

**Why component-level visibility matters:**

- A worker's LLM binding can be swapped (model drift) without changing the agent node that wraps it.
- A retriever can be pointed at a new vector store index without redeploying the application.
- A prompt template lives in a config file, not code — static analysis misses it; only runtime observation catches the change.
- GHOST detection fires at the component level: if worker 2's LLM binding starts calling a different endpoint, the graph-level agent still looks CONFIRMED while that specific binding is GHOST.

agentdiscover reports each component as a separate inventory item so your governance controls can target the right granularity.

---

## Quick start

```bash
# macOS (recommended)
brew install python@3.12 osquery pipx
pipx install agentdiscover
pipx ensurepath && source ~/.zshrc   # add ~/.local/bin to PATH

# Linux (Debian/Ubuntu)
sudo apt-get install -y python3 osquery
pip3 install agentdiscover

# Linux (RHEL/Fedora)
sudo dnf install -y python3 osquery
pip3 install agentdiscover

# Windows (PowerShell — elevated)
winget install Python.Python.3.12
winget install osquery.osquery
pip install agentdiscover
```

> **macOS:** never use `sudo` with the installer — Homebrew refuses root and osquery silently fails.
> Use `pipx` to avoid Python environment conflicts. If `agentdiscover` is not found after install, run `pipx ensurepath` and restart your terminal.

Then run your first scan:

```bash
agentdiscover scan-all ~/projects --duration 30
```

To verify all layers are working before your first real scan:

```bash
agentdiscover --version
osquery --version
which agentdiscover   # macOS: should show ~/.local/bin/agentdiscover

# Or use --dry-run to get a complete layer readiness report:
agentdiscover scan-all ~/projects --dry-run
```

To upload results to the DefendAI platform:

```bash
agentdiscover scan-all ~/projects \
  --platform \
  --api-key YOUR_API_KEY
```

---

## What you'll see on your first scan

Running `scan-all` on a real developer machine (macOS, ~30s observation window):

```
$ agentdiscover scan-all ~/projects --duration 30

🔍 Scanning for autonomous AI agents...

📂 Analyzing source code at /Users/alice/projects
🌐 Monitoring live network connections...
   Observing runtime behavior (30s)...
💻 Scanning endpoints...

[DETECT] Anthropic connection from Cursor Helper (PID: 61436) → api.anthropic.com:443
[DETECT] OpenAI connection from Microsoft Edge Helper (PID: 4172) → api.openai.com:443

🔗 Correlating findings...
✓ Correlation complete

⚠ Unverified MCP server: filesystem (Community/Unknown) — not from a verified publisher

🤖 Autonomous Agent Inventory

┏━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Classification ┃ Count ┃ Description                                                    ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ CONFIRMED      │ 1     │ Active — detected in code and observed at runtime              │
│ UNKNOWN        │ 2     │ Code found — not yet observed at runtime                       │
│ SHADOW AI      │ 4     │ Known app using AI — review for governance                     │
│ ZOMBIE         │ 0     │ Inactive — code exists but no recent runtime activity          │
│ GHOST          │ 0     │ ⚠ Critical — runtime activity with no source code (ungoverned) │
└────────────────┴───────┴────────────────────────────────────────────────────────────────┘

Risk Breakdown:
  ● Critical: 0
  ● High: 1
  ● Medium: 2
  ● Low: 4

✅ Scan complete — results saved to defendai-results
```

All output files land in `./defendai-results/`:

| File | Contents |
|---|---|
| `layer1_code.sarif` | Code findings in SARIF format (GitHub Security tab ready) |
| `layer2_network.json` | Live network connections observed during scan |
| `layer3_k8s.jsonl` | Kubernetes workload events (if cluster available) |
| `layer4_endpoint.json` | Installed packages, desktop apps, browser AI usage |
| `agent_inventory.json` | Final correlated agent inventory |

For an executive-ready audit bundle (AIBOM + markdown reports):

```bash
agentdiscover audit ~/projects --output ./audit-report
# Writes: audit-report/aibom.json, ghost-agents.md, mcp-report.md, summary.md
```

---

## Common issues

**`agentdiscover: command not found` after pipx install**

```bash
pipx ensurepath
source ~/.zshrc   # or ~/.bashrc on Linux
```

If still missing: `which agentdiscover` should show `~/.local/bin/agentdiscover`. If `~/.local/bin` is not in `$PATH`, add it manually.

**Layer 2 network monitoring fails on Linux**

Layer 2 requires elevated privileges on Linux. Either run with `sudo` (avoid on macOS) or skip the layer:

```bash
sudo agentdiscover scan-all ~/projects --duration 30
# or skip Layer 2:
agentdiscover scan-all ~/projects --skip-layers 2
```

**osquery not installed — Layer 4 skipped**

Layer 4 is optional. If osquery is not installed, the scan continues with Layers 1–3. To install:

```bash
# macOS
brew install osquery
# Linux
sudo apt-get install osquery   # or see https://osquery.io/downloads
```

**Large repo warning — scan is slow**

If you see `⚠ Large scan path detected: N Python files`, point the scanner at a specific project directory rather than your entire home folder:

```bash
agentdiscover scan-all ~/projects/my-agent-project --duration 30
```

**Layer 3 Kubernetes not available**

If no cluster is reachable, Layer 3 logs a warning and continues. GHOST detection still works via Layer 2 network correlation. To skip Layer 3 explicitly:

```bash
agentdiscover scan-all ~/projects --skip-layers 3
```

**Check what layers are ready before scanning**

```bash
agentdiscover scan-all ~/projects --dry-run
```

---

## How it works

![GHOST detection mechanism](./docs/ghost-detection.svg)

AgentDiscover Scanner runs **five detection layers** simultaneously and correlates them into a single agent inventory. Each layer sees something the others can't.

| Layer | Name | Technology | Requirements |
|---|---|---|---|
| 1 | Source code | Python AST + esprima (JS/TS) | None |
| 2 | Live network | psutil connection observation | Linux: root/sudo |
| 3 | Kubernetes runtime | Tetragon/eBPF events; K8s API fallback | Linux (eBPF); kubectl (K8s API) |
| 4 | Endpoint discovery | osquery — packages, apps, browser history | osquery (optional) |
| 5 | Cloud Audit | AWS CloudTrail, Azure Monitor (stub), GCP Audit Logs (stub) | AWS credentials (boto3) |

![AgentDiscover detection pipeline](./docs/architecture.svg)

### Layer 1 — Source code analysis

Static analysis of Python and JavaScript/TypeScript. Detects LangChain, LangGraph, CrewAI, AutoGen, direct OpenAI/Anthropic/Gemini API usage, and any HTTP client targeting LLM endpoints. Handles import aliasing and indirect usage patterns. Generates SARIF output for CI/CD integration.

### Layer 2 — Live network monitoring

Passive observation of outbound connections to AI providers — OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Azure OpenAI, AWS Bedrock, and vector stores. No packet capture. Identifies which process is making each connection, enabling per-agent SaaS attribution.

Real scan output:
```
[DETECT] Google AI connection from Mail (PID: 776) → generativelanguage.googleapis.com:993
[DETECT] OpenAI connection from Microsoft Edge Helper (PID: 4172) → api.openai.com:443
[DETECT] Anthropic connection from Cursor Helper (PID: 61436) → api.anthropic.com:443
[DETECT] OpenAI connection from OneDrive (PID: 96089) → api.openai.com:443
```

### Layer 5 — Cloud Audit (v2.7.0+)

**Why Layer 2 misses AWS Bedrock.** AWS Bedrock Runtime endpoints rotate across hundreds of generic EC2 IPs with no published CIDR ranges and no stable reverse-DNS pattern. Passive socket monitoring (psutil) can observe the TCP connection but cannot reliably identify it as Bedrock without a complete, continuously-updated IP allowlist — which does not exist publicly. On VPC endpoints, traffic stays inside the AWS network and never appears on the host's socket table at all.

**Layer 5 — Cloud Audit is the enterprise-grade alternative.** It queries cloud provider audit logs directly, giving you every AI API call with the caller identity, source IP, model ID, and the HTTP User-Agent the SDK set at call time — honest framework attribution (`langchain-aws`, `boto3`, `amazon-bedrock-agent`) that static code analysis can miss.

**Provider support matrix:**

| Provider | Service | Status | CLI flag |
|---|---|---|---|
| AWS | Bedrock (CloudTrail) | **GA** | `--cloud-audit` |
| Azure | Azure OpenAI (Monitor) | Preview stub | `--azure-monitor` |
| GCP | Vertex AI (Cloud Audit Logs) | Preview stub | `--gcp-audit` |

**Required IAM permission (AWS):**

```json
{
  "Effect": "Allow",
  "Action": ["cloudtrail:LookupEvents"],
  "Resource": "*"
}
```

For CloudTrail Lake (near-real-time, ~60s delay instead of 5-15 min):

```json
{
  "Effect": "Allow",
  "Action": [
    "cloudtrail:StartQuery",
    "cloudtrail:GetQueryResults"
  ],
  "Resource": "*"
}
```

**CLI usage:**

```bash
# Enable Cloud Audit detection (1-hour lookback, us-east-1)
agentdiscover scan-all ~/projects --cloud-audit

# Specify region and longer lookback window
agentdiscover scan-all ~/projects \
  --cloud-audit \
  --cloud-audit-region eu-west-1 \
  --cloud-audit-hours 4

# CloudTrail Lake — near-real-time (~60s delay)
agentdiscover scan-all ~/projects \
  --cloud-audit \
  --cloud-audit-lake-arn arn:aws:cloudtrail:us-east-1:123456789012:eventdatastore/YOUR-ARN \
  --cloud-audit-region us-east-1

# Works with audit mode too
agentdiscover audit ~/projects \
  --cloud-audit \
  --cloud-audit-region us-east-1
```

When Layer 5 findings are merged with Layer 2 network findings, the correlator can promote an agent from UNKNOWN to CONFIRMED even on VPC endpoints where psutil sees nothing. Layer 5 findings are written to `layer5_cloud_audit.json` in the output directory.

> **Credential configuration.** The scanner uses standard boto3 credential resolution: `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables, `~/.aws/credentials`, or an EC2/ECS/EKS instance role. If no credentials are found, a clear warning is printed and the scan continues without Cloud Audit.

### Layer 3 — Kubernetes runtime

Kernel-level visibility into pod behavior via Tetragon. Identifies which workloads are actively making AI calls — including workloads with no corresponding source code. Works with any CNI. Falls back to Kubernetes API discovery if Tetragon is unavailable.

When Layer 1 (code) and Layer 3 (K8s runtime) both detect the same agent, it becomes **CONFIRMED**:

```
Detection Coverage:
┏━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Layers        ┃ Agents ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ layer1,layer3 │ 2      │  ← CONFIRMED: seen in code AND running in K8s
│ layer1        │ 3      │  ← UNKNOWN: code found, not yet observed at runtime
└───────────────┴────────┘
```

### Layer 3 limitations

Layer 3 (eBPF via Tetragon) is **Linux-only**. On macOS and Windows developer machines, Layer 3 is skipped automatically — the scan continues with Layers 1, 2, and 4. The K8s API monitor path works on all platforms and requires only `kubectl` with cluster read access.

### Layer 4 — Endpoint discovery

Scans developer machines, CI/CD runners, and workstations via osquery. Finds installed AI packages, desktop AI applications (ChatGPT Desktop, Claude Desktop, Cursor, GitHub Copilot), active connections, browser-based AI usage, and VSCode extensions.

### Cross-layer correlation

After all layers run, the correlator builds a unified agent identity. An agent seen in code (Layer 1), confirmed running in K8s (Layer 3), and observed making network calls (Layer 2) is a single correlated identity — not three separate findings.

Agents present at runtime with **no Layer 1 match** become GHOST agents.

### SaaS blast radius detection (v2.3.0+)

After correlation, each agent receives a `saas_connections` profile built from all four layers:

```json
{
  "detected":  ["anthropic", "gcp", "github"],
  "confirmed": ["anthropic"],
  "evidence": {
    "anthropic": ["active_connection", "open_socket"],
    "gcp":       ["open_socket"],
    "github":    ["vscode_extension_detected"]
  },
  "confidence": {
    "anthropic": "confirmed",
    "gcp":       "medium",
    "github":    "medium"
  },
  "has_cloud_provider": true,
  "has_llm_provider":   true
}
```

---

## High-risk agent detection (v2.4.0+)

The scanner detects autonomous agent platforms that carry systemic security risk by design — not misconfigurations, but architecture.

**OpenClaw** (formerly Clawdbot/Moltbot) is the primary target. It has full filesystem access, terminal execution, email and messaging integration, and runs as a persistent background daemon. CVE-2026-25253 CVSS 8.8. Gartner: "insecure by default." Microsoft: "treat as untrusted code execution."

Detection uses corroborated signals — never a single port number:

```
🚨 HIGH-RISK AGENT CONFIRMED: OpenClaw
   Autonomous agent with system-level access — filesystem,
   terminal, email, and messaging integration.
   Capabilities: filesystem, terminal, email, browser, messaging
```

---

## MCP server detection (v2.4.0+)

MCP (Model Context Protocol) is the integration layer between AI agents and enterprise SaaS. Supported by Claude, ChatGPT, Gemini, Copilot, Cursor, and VS Code.

The scanner detects MCP servers across all AI clients and classifies each by publisher verification:

```
⚠ Local MCP script detected — unknown code with tool access
⚠ Unverified MCP server: filesystem (Community/Unknown) — not from a verified publisher
⚠ Unverified MCP server: mcpfw (Unknown) — not from a verified publisher

✓ Verified: @salesforce/mcp-server (Salesforce official)
```

Supported clients: Claude Desktop, Cursor, Windsurf, VS Code, Gemini CLI, OpenAI Codex, Continue.dev, Zed, and project-level MCP configs.

**Non-developer detection:** Financial analysts connecting ChatGPT Teams to Salesforce via UI leave no local config file. The scanner detects this via Layer 2 network traffic — the only tool that catches this pattern.

Risk prioritization in reporting (guidance):
- Unverified MCP server → HIGH
- MCP filesystem access → HIGH
- MCP + production environment → CRITICAL
- OpenClaw + GHOST → CRITICAL

---

## Daemon mode

Run continuously as a background service, updating the agent inventory every 30 seconds:

```bash
agentdiscover scan-all ~/projects \
  --daemon \
  --output ~/defendai-results \
  --platform \
  --platform-interval 5    # upload to platform every ~2.5 minutes
```

> **Note:** `--daemon` runs until you press Ctrl+C. Use `--output ~/defendai-results` (or any user-writable path) — avoid `/var/log/` which requires root.
> If running as root, `~/projects` resolves to root's home directory, not yours. Always run without `sudo`.

With `--platform`, the daemon syncs to the DefendAI platform every N correlation cycles (default: every 5 cycles ≈ 2.5 minutes) and always uploads a final snapshot on shutdown.

**Linux — install as a systemd service:**

```bash
sudo bash deployment/systemd/install-service.sh ~/projects
systemctl status defendai-scanner
```

---

## Scanning an additional source repository

The `--src-repo` flag adds a second codebase to every Layer 1 scan. Findings are merged into `layer1_code.sarif` alongside the primary scan, so the correlator sees code from both locations in the same run — useful when the runtime you're monitoring is served by a separate repo (microservices, shared ML libraries, a vendor repo you don't own locally).

```bash
# One-shot: include a remote team's repo in the scan
agentdiscover scan-all ~/projects \
  --src-repo https://github.com/acme/ml-services \
  --duration 30

# Local path — no clone step
agentdiscover scan-all ~/projects \
  --src-repo ~/shared/ml-services
```

In one-shot mode the remote repo is shallow-cloned, scanned, and deleted before the correlator runs.

In daemon mode, pass `--src-repo-ttl` to control how frequently the additional repo is re-fetched:

```bash
agentdiscover scan-all ~/projects \
  --daemon \
  --src-repo https://github.com/acme/ml-services \
  --src-repo-ttl 7200    # re-clone at most once every 2 hours
```

Auth failures (HTTP 401/403, SSH key rejection) back off exponentially up to 5 minutes and retry automatically — the primary scan continues uninterrupted.

---

## Customizing known applications

By default, the scanner classifies common desktop applications (browsers, Office 365, Cursor, Slack, Claude Desktop, etc.) as **Shadow AI** rather than GHOST when they make AI API calls.

Browser-based AI usage (claude.ai, chatgpt.com, copilot.microsoft.com) is detected via Layer 4 browser history — these are classified as Shadow AI automatically. Note that Layer 4 reads the browser's committed history database, not the current active session, so a tab open right now may not appear until the browser flushes its history.

To add your own internal tools:

```bash
mkdir -p ~/.defendai
echo "my-internal-ai-tool" >> ~/.defendai/known_apps.txt
echo "company-llm-client" >> ~/.defendai/known_apps.txt
```

See `docs/known-apps-example.txt` for the full format.

When connected to the DefendAI platform (`--platform` flag), the tenant-managed list is downloaded automatically on startup and merged with your local overrides.

---

## DefendAI platform integration

The scanner is the **discovery layer**. The platform is where discovered agents become governed agents.

```bash
agentdiscover scan-all ~/projects \
  --platform \
  --api-key YOUR_KEY \
  --duration 30
```

When connected to the platform, each scan triggers the **correlation engine** which builds a living identity map across every machine, every environment, and every scan:

- **Agent identity resolution** — the same CrewAI agent on a laptop, in staging K8s, and in prod K8s is recognized as one agent at different lifecycle stages
- **Behavioral drift detection** — agent added `has_code_execution=true` since last week? That's a signal. Platform tracks it.
- **Cross-machine intelligence** — agent seen on 3 machines and crossed from dev into prod? Automatic risk escalation.
- **SaaS blast radius** — platform aggregates confirmed SaaS connections across all scans and computes blast radius score.

After a few scans, the DefendAI platform report shows:

```
Agent Inventory Report — acme-corp
─────────────────────────────────────────────────────────────────────
 shadow-agent    GHOST     CRITICAL   anthropic, github   blast: 85   machines: 3
                           ↑ GHOST seen in production — immediate action required

 crewai-agent    SHADOW    MEDIUM     openai              blast: 25   machines: 1
                           ↑ Unreviewed — no governance record

 langchain-agent KNOWN     LOW        openai              blast: 15   machines: 1
                           ↑ Approved — monitoring active
─────────────────────────────────────────────────────────────────────
 3 agents total · 1 critical · 1 unreviewed · 1 governed
```

---

## CI/CD integration

### GitHub Action (recommended)

The repo ships a reusable composite action. Add it to any workflow with one step — no `pip install` required:

```yaml
# .github/workflows/agent-scan.yml
name: AI Agent Scan

on: [push, pull_request]

permissions:
  security-events: write   # required to upload SARIF to GitHub Security tab

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: Defend-AI-Tech-Inc/agent-discover-scanner@v2.7.1
        with:
          path: '.'               # directory to scan (default: .)
          upload-sarif: 'true'    # post findings to GitHub Security tab (default: true)
```

Findings appear in **Security → Code scanning alerts** as soon as the workflow runs.

**Inputs**

| Input | Default | Description |
|---|---|---|
| `path` | `.` | Directory to scan |
| `output` | `agent-scan-results.sarif` | SARIF output file path |
| `upload-sarif` | `true` | Upload to GitHub Security tab |
| `python-version` | `3.12` | Python version to use |

**Output**

| Output | Description |
|---|---|
| `sarif-file` | Path to the generated SARIF file |

> **Note:** `permissions: security-events: write` is required at the job or workflow level for `upload-sarif: 'true'` to work. If your repo is private and you don't have GitHub Advanced Security, set `upload-sarif: 'false'` and consume the SARIF artifact directly.

### Manual install in CI

```yaml
- name: Scan for AI agents
  run: |
    pip install agentdiscover
    agentdiscover scan . --format sarif --output results.sarif

- name: Upload SARIF to GitHub Security tab
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif
```

For a full-stack scan (all layers, structured output):

```yaml
- name: Full agent scan
  run: |
    agentdiscover scan-all . \
      --duration 30 \
      --output ./defendai-results \
      --skip-layers 3    # no K8s cluster in CI
```

---

## Commands

```bash
# Full scan (recommended) — all 4 layers + correlation
agentdiscover scan-all PATH [OPTIONS]
  --duration/-d SECONDS      Network and K8s monitor observation window [default: 60]
  --output/-o PATH           Output directory for scan results [default: defendai-results]
  --format/-f TEXT           Output format: text|json [default: text]
                               (SARIF output is written to disk by Layer 1 as layer1_code.sarif)
  --layer3-file PATH         Use existing Tetragon JSONL output (skip live Layer 3)
  --skip-layers TEXT         Comma-separated layers to skip, e.g. '3' or '2,3'
  --verbose/-v               Include Layer 3 raw event output
  --daemon                   Run continuously, re-scanning every 30 seconds
  --platform                 Upload results to DefendAI platform after scan
  --api-key TEXT             DefendAI platform API key
  --tenant-token TEXT        DefendAI platform tenant token
  --wawsdb-url TEXT          DefendAI platform base URL [default: https://wauzeway.defendai.ai]
  --platform-interval INT    Upload every N correlation cycles in daemon mode [default: 5]
  --max-log-size INT         Rotate output files at this size in MB [default: 50]
  --max-log-backups INT      Rotated backup files to keep [default: 5]
  --src-repo TEXT            Additional source repo to scan through Layer 1 (local path or URL)
  --src-repo-ttl INT         Daemon: minimum seconds between re-scans of --src-repo [default: 3600]
  --dry-run                  Check layer availability without running a scan

# Individual layers
agentdiscover scan PATH              # Layer 1: source code only
agentdiscover deps PATH              # Dependency scanning
agentdiscover monitor                # Layer 2: network monitor only
agentdiscover monitor-k8s            # Layer 3: Kubernetes runtime only
agentdiscover endpoint               # Layer 4: endpoint scan only
agentdiscover correlate              # Correlate existing scan outputs

# Audit mode (v2.5.0+) — full report: aibom.json, ghost-agents.md, mcp-report.md
agentdiscover audit PATH [OPTIONS]
  --duration/-d SECONDS      Observation window [default: 60]
  --output/-o PATH           Report output directory [default: defendai-audit]
  --layer3-file PATH         Use existing Tetragon JSONL (skip live Layer 3)
  --platform                 Upload to DefendAI platform
  --api-key TEXT             DefendAI platform API key

# Legacy aliases — all three still work
agent-discover-scanner [COMMAND] [OPTIONS]
agent-discover [COMMAND] [OPTIONS]
```

---

## Detected frameworks and providers

**AI frameworks:** LangChain, LangGraph, CrewAI, AutoGen, direct HTTP LLM clients

**LLM providers:** OpenAI, Anthropic, Google Gemini / Google AI, Mistral, Cohere, Azure OpenAI, AWS Bedrock, Groq, DeepSeek

**Vector stores:** Pinecone, Weaviate, Qdrant, Chroma

**SaaS blast radius detection (v2.3.0+):** Salesforce, Slack, GitHub, GitLab, Jira, HubSpot, Notion, Airtable, Stripe, Twilio, Snowflake, Databricks, AWS, GCP, Azure, PostgreSQL, Redis, MongoDB

---

## Try the demo

```bash
git clone https://github.com/Defend-AI-Tech-Inc/agent-discover-scanner
cd agent-discover-scanner/demo
./setup.sh    # deploys LangChain, CrewAI, and a shadow agent to local Kubernetes
agentdiscover scan-all ./sample-repo --duration 60
```

Expected output: 2 CONFIRMED agents (crewai-agent, langchain-agent), 1 GHOST agent (shadow-agent — runtime activity, no source code).

---

## Requirements

| Capability         | Requirement                                                                                        |
| ------------------ | -------------------------------------------------------------------------------------------------- |
| Code scanning      | Python 3.10+, all dependencies included                                                            |
| Network monitoring | Linux: root/sudo required · macOS: no sudo (use pipx) · Windows: elevated PowerShell              |
| Kubernetes runtime | kubectl + read access (K8s API path) · Helm 3+ + root/sudo for Tetragon/eBPF (Linux only)         |
| Endpoint discovery | osquery (optional — graceful degradation if not installed)                                         |
| Layer 3 (eBPF)     | Linux only — unavailable on macOS and Windows. K8s API path works on all platforms.               |
| Cloud Audit (Layer 5) | AWS credentials — boto3 credential chain (`AWS_ACCESS_KEY_ID`, `~/.aws/credentials`, or instance role). If credentials are absent, the scan continues without Cloud Audit. |
| Platform upload    | DefendAI API key ([defendai.ai](https://defendai.ai))                                              |

Full Kubernetes setup: `install.sh` handles Helm, runtime monitoring setup, and permissions automatically.

---

## DefendAI platform

AgentDiscover Scanner is the **discovery layer** of the DefendAI platform.

| Component                 | Status         | Description                                                           |
| ------------------------- | -------------- | --------------------------------------------------------------------- |
| **AgentDiscover Scanner** | ✅ Open Source (v2.7.1) | Discover and classify AI agents across your environment  |
| **defendai-agent**        | 🧪 Beta        | MITM proxy for real-time AI traffic inspection and policy enforcement |
| **Correlation Engine**    | ✅ Available    | Cross-machine identity resolution and behavioral drift detection      |
| **Policy Engine**         | 🚧 Coming Soon | Define and enforce agent behavior rules                               |
| **DefendAI Platform**     | 💼 Enterprise  | Full lifecycle governance for autonomous AI                           |

[defendai.ai](https://defendai.ai) · [playground.defendai.ai](https://playground.defendai.ai) · [support@defendai.ai](mailto:support@defendai.ai)

---

## Contributing

```bash
git clone https://github.com/Defend-AI-Tech-Inc/agent-discover-scanner.git
cd agent-discover-scanner
uv sync
uv run pytest tests/ -v
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. Issues and PRs welcome.

---

## License

MIT — free to use, deploy, and modify.

---

*Built by [DefendAI](https://defendai.ai) · Securing the future of autonomous AI*
