Metadata-Version: 2.4
Name: cybersec-scanner
Version: 1.0.5
Summary: A comprehensive security scanner and RAG-based vulnerability analyzer
Home-page: https://github.com/AnubhavChoudhery/cybersec-scanner
Author: CyberSec Team
Author-email: JBAC EdTech <team@jbac.dev>
Maintainer-email: JBAC EdTech <team@jbac.dev>
License: MIT
Project-URL: Homepage, https://github.com/AnubhavChoudhery/cybersec-scanner
Project-URL: Documentation, https://github.com/AnubhavChoudhery/cybersec-scanner/blob/main/README.md
Project-URL: Repository, https://github.com/AnubhavChoudhery/cybersec-scanner
Project-URL: Issues, https://github.com/AnubhavChoudhery/cybersec-scanner/issues
Keywords: security,scanner,vulnerability,audit,rag,llm,cybersecurity
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: networkx>=3.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: requests>=2.28.0
Requires-Dist: ollama>=0.1.0
Requires-Dist: colorama>=0.4.0
Requires-Dist: mitmproxy>=10.0.0
Requires-Dist: playwright>=1.40.0
Requires-Dist: sentence-transformers>=2.0.0
Provides-Extra: mitm
Requires-Dist: mitmproxy>=10.0.0; extra == "mitm"
Provides-Extra: browser
Requires-Dist: playwright>=1.40.0; extra == "browser"
Provides-Extra: vector
Requires-Dist: sentence-transformers>=2.0.0; extra == "vector"
Requires-Dist: hnswlib>=0.7.0; extra == "vector"
Provides-Extra: all
Requires-Dist: mitmproxy>=10.0.0; extra == "all"
Requires-Dist: playwright>=1.40.0; extra == "all"
Requires-Dist: sentence-transformers>=2.0.0; extra == "all"
Requires-Dist: hnswlib>=0.7.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# CyberSec Scanner

[![Tests](https://github.com/AnubhavChoudhery/cybersec-scanner/workflows/Tests/badge.svg)](https://github.com/AnubhavChoudhery/cybersec-scanner/actions)
[![Python](https://img.shields.io/pypi/pyversions/cybersec-scanner.svg)](https://pypi.org/project/cybersec-scanner/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A comprehensive, modular security scanning toolkit for detecting secrets, vulnerabilities, and misconfigurations in Git repositories, web applications, and browser extensions. Features multi-scanner architecture, RAG-powered analysis, and both SDK and CLI interfaces.

> **Use Responsibly**: This tool is for authorized security testing only. Always obtain proper permission before scanning applications you don't own.

## Table of Contents

- [Features](#features)
- [Architecture](#architecture)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [MITM Proxy Setup](#mitm-proxy-setup)
- [Configuration](#configuration)
- [Usage](#usage)
- [Scanner Modules](#scanner-modules)
- [Output Format](#output-format)
- [Advanced Usage](#advanced-usage)
- [Troubleshooting](#troubleshooting)

## Features

### Multi-Scanner Architecture
- **Git Scanner**: Detect secrets in commit history using efficient pickaxe search
- **Web Crawler**: Discover exposed endpoints, analyze JavaScript files and source maps
- **Browser Scanner**: Inspect localStorage, sessionStorage, cookies via Playwright
- **Network Scanner**: Real-time HTTPS traffic inspection with MITM proxy

### RAG-Powered Analysis
- **Knowledge Graph**: NetworkX-based relationship mapping between findings, files, and vulnerabilities
- **Semantic Search**: Vector-based retrieval for similar security patterns
- **LLM Integration**: Natural language queries powered by Ollama (Gemma, Llama, etc.)
- **CWE Enrichment**: Automatic mapping to Common Weakness Enumeration

### Detection Coverage
- **58+ Built-in Patterns**: AWS, OpenAI, Stripe, GitHub, Azure, Google Cloud, databases, and more
- **Entropy Analysis**: High-entropy string detection for unknown secrets
- **Custom Patterns**: Extensible regex-based pattern system via `patterns.env`
- **Contextual Severity**: Smart severity assignment based on exposure context

### Flexible Usage
- **CLI Application**: Full-featured command-line interface with 10 commands
- **Python SDK**: Use scanners independently or together in your own code
- **YAML Configuration**: Simple config files replace long CLI arguments
- **Modular Design**: Import only what you need, lazy loading for optional dependencies

## Installation

### From PyPI (Recommended)

```bash
pip install cybersec-scanner
```

### From Source

```bash
git clone https://github.com/AnubhavChoudhery/cybersec-scanner.git
cd cybersec-scanner
pip install -e .
```

### Optional Dependencies

```bash
# For MITM proxy (HTTPS traffic inspection)
pip install cybersec-scanner[mitm]

# For browser runtime inspection (Playwright)
pip install cybersec-scanner[browser]

# For vector search (RAG features)
pip install cybersec-scanner[vector]

# Install everything
pip install cybersec-scanner[all]

# For development
pip install cybersec-scanner[dev]
```

**Note:** The base installation includes Git scanning, web crawling, and RAG analysis. MITM and browser features require additional dependencies.

### System Requirements

- Python 3.11 or higher
- Git (for git history scanning)
- mitmproxy 10.0+ (for HTTPS inspection - optional)
- Playwright (for browser inspection - optional)

## Quick Start

### Prerequisites for RAG Queries

To use the `query` command with LLM-powered analysis, you need Ollama:

```bash
# Install Ollama (https://ollama.com)
# Linux/Mac:
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from https://ollama.com

# Pull the default model
ollama pull gemma3:1b

# Start Ollama (keep running in background)
ollama serve
```

### Complete Scan-to-Query Workflow

```bash
# 1. Install the scanner
pip install cybersec-scanner

# 2. Download patterns file (required for detection)
curl -o patterns.env https://raw.githubusercontent.com/AnubhavChoudhery/cybersec-scanner/main/patterns.env

# 3. Scan your project (Git history)
cybersec-scanner scan --git --root . --output audit_report.json --enable-rag

# 4. Query the findings
cybersec-scanner query "What secrets were found?" --audit audit_report.json

# 5. Save response to file
cybersec-scanner query "Summarize critical findings" --output summary.txt
```

### CLI Usage

```bash
# Show all available commands
cybersec-scanner --help

# Initialize configuration file
cybersec-scanner init-config

# Scan a Git repository
cybersec-scanner scan-git /path/to/repo --max-commits 50

# Scan a web application
cybersec-scanner scan-web http://localhost:8000 --max-pages 100

# Scan MITM traffic logs
cybersec-scanner scan-mitm mitm_traffic.ndjson

# Full multi-scanner workflow
cybersec-scanner scan \
  --git \
  --web \
  --mitm \
  --runtime \
  --root . \
  --target http://localhost:8000 \
  --max-commits 50 \
  --mitm-traffic mitm_traffic.ndjson \
  --output audit_report.json \
  --enable-rag

# Query findings with RAG
cybersec-scanner query "What API keys were found?" --audit audit_report.json

# Build knowledge graph from existing report
cybersec-scanner build-graph audit_report.json

# Check version
cybersec-scanner version
```

### MITM Proxy Workflow

The scanner provides interactive MITM inspection that captures real HTTP/HTTPS traffic. **Traffic files are automatically shared** between scanner and backend via a temp directory.

**1. Add MITM injection to your backend (one-time setup):**

```python
# backend/app/main.py - MUST BE FIRST IMPORT
from cybersec_scanner.scanners.inject_mitm_proxy import inject_mitm_proxy_advanced

# No path needed - uses shared temp location automatically
inject_mitm_proxy_advanced()

# Now import your framework
from fastapi import FastAPI
# ... rest of your code
```

**2. Run the scanner with MITM enabled:**

```bash
# No --mitm-traffic flag needed - auto-discovers shared file
cybersec-scanner scan --mitm --output audit_report.json
```

**3. Start your backend when prompted:**

The scanner will start the MITM proxy and wait for you to start your backend and exercise the app.

```bash
# In another terminal
uvicorn backend.app.main:app --reload
```

**4. Test your application** - make requests, test endpoints

**5. Press Ctrl+C** in the scanner terminal when done

The scanner will parse all captured traffic and generate the audit report.

**Traffic File Location:**
- Windows: `C:\Users\<user>\AppData\Local\Temp\cybersec_scanner\mitm_traffic.ndjson`
- Linux/Mac: `/tmp/cybersec_scanner/mitm_traffic.ndjson`

**Advanced:** You can override with `--mitm-traffic /custom/path.ndjson` if needed.

### Python SDK Usage

```python
from cybersec_scanner import scan_git, scan_web, scan_all

# Scan a Git repository
findings = scan_git("/path/to/repo", max_commits=100)
print(f"Found {len(findings)} secrets in Git history")

# Scan a web application
web_findings = scan_web("http://localhost:8000", max_pages=300)

# Full scan with custom config
config = {
    "git": {
        "enabled": True,
        "repositories": ["/path/to/repo"],
        "max_commits": 100
    },
    "web": {
        "enabled": True,
        "target": "http://localhost:8000",
        "max_pages": 300
    },
    "output": {
        "file": "security_report.json"
    }
}

results = scan_all(config)
```

## CLI Command Reference

### Available Commands

| Command | Description |
|---------|-------------|
| `scan` | Run comprehensive scan with multiple scanners |
| `scan-git` | Scan Git repository for committed secrets |
| `scan-web` | Scan web application endpoints |
| `scan-mitm` | Parse MITM traffic logs |
| `query` | Query findings using RAG/LLM |
| `build-graph` | Build knowledge graph from audit report |
| `init-config` | Create default YAML configuration |
| `version` | Show version information |
| `install-cert` | Install mitmproxy CA certificate |
| `start-proxy` | Start MITM proxy daemon |

### Scan Command Options

```bash
cybersec-scanner scan [OPTIONS]
```

**Scanner Flags:**
- `--git` - Enable Git history scanner
- `--web` - Enable web application scanner
- `--mitm` - Enable MITM traffic analysis
- `--runtime` - Enable browser runtime inspector (Playwright)

**Scanner Configuration:**
- `--root PATH` - Root directory for Git scan (default: `.`)
- `--target URL` - Target URL for web scan
- `--max-commits N` - Maximum Git commits to scan (default: 50)
- `--mitm-traffic PATH` - Path to MITM traffic NDJSON file

**Output Options:**
- `--output PATH`, `-o PATH` - Output audit report file (default: `audit_report.json`)
- `--config PATH`, `-c PATH` - Load settings from YAML config file
- `--enable-rag` - Build knowledge graph after scan for RAG queries

**Example - Full Scan:**
```bash
cybersec-scanner scan \
  --git \
  --web \
  --mitm \
  --root ~/myproject \
  --target http://localhost:8000 \
  --max-commits 100 \
  --mitm-traffic mitm_traffic.ndjson \
  --output security_audit.json \
  --enable-rag
```

### Query Command

```bash
cybersec-scanner query "your question" [OPTIONS]
```

**Options:**
- `--audit PATH` - Audit report to build graph from (if graph doesn't exist)
- `--graph PATH` - Existing knowledge graph file (default: `rag/graph.gpickle`)
- `--model NAME` - Ollama model to use (default: `gemma3:1b`)
- `--top-k N` - Number of findings to retrieve (default: 5)
- `--output PATH`, `-o PATH` - Save LLM response to file

**Example:**
```bash
# Query with existing graph
cybersec-scanner query "What AWS credentials were found?"

# Build graph and query
cybersec-scanner query "List all high severity findings" --audit audit_report.json

# Use different model
cybersec-scanner query "Explain the security risks" --model llama3:8b

# Save response to file
cybersec-scanner query "Summarize critical findings" --output security_summary.txt
```

### Individual Scanner Commands

**Git Scanner:**
```bash
cybersec-scanner scan-git [REPO_PATH] [OPTIONS]

# Options:
#   --max-commits N     Max commits to scan (default: 50)
#   --output PATH       Output JSON file

# Example:
cybersec-scanner scan-git . --max-commits 100 --output git_findings.json
```

**Web Scanner:**
```bash
cybersec-scanner scan-web URL [OPTIONS]

# Options:
#   --max-pages N       Max pages to crawl (default: 50)
#   --output PATH       Output JSON file

# Example:
cybersec-scanner scan-web http://localhost:3000 --max-pages 200
```

**MITM Scanner:**
```bash
cybersec-scanner scan-mitm TRAFFIC_FILE [OPTIONS]

# Options:
#   --output PATH       Output JSON file

# Example:
cybersec-scanner scan-mitm mitm_traffic.ndjson --output mitm_findings.json
```

### Configuration File

Create a YAML config to avoid long command lines:

```bash
cybersec-scanner init-config --output my-config.yaml
```

**Example `my-config.yaml`:**
```yaml
scanner:
  git:
    enabled: true
    root: "."
    max_commits: 100
  
  web:
    enabled: true
    target: "http://localhost:8000"
    max_pages: 300
  
  mitm:
    enabled: true
    traffic_file: "mitm_traffic.ndjson"
  
  runtime:
    enabled: false

output:
  file: "audit_report.json"

rag:
  enabled: true
  model: "gemma3:1b"
```

**Usage:**
```bash
cybersec-scanner scan --config my-config.yaml
```

### Utility Commands

**Build Knowledge Graph:**
```bash
cybersec-scanner build-graph AUDIT_FILE [OPTIONS]

# Options:
#   --output PATH, -o    Output graph file (default: rag/graph.gpickle)

# Example:
cybersec-scanner build-graph audit_report.json --output my_graph.gpickle
```

**Initialize Config File:**
```bash
cybersec-scanner init-config [OPTIONS]

# Options:
#   --output PATH, -o    Output config file path (default: cybersec-config.yaml)

# Example:
cybersec-scanner init-config --output my-config.yaml
```

**Show Version:**
```bash
cybersec-scanner version
```

**Install MITM Certificate:**
```bash
cybersec-scanner install-cert [OPTIONS]

# Options:
#   --port PORT          MITM proxy port (informational, default: 8082)
#   --no-download        Skip HTTP download, use local cert only

# Example:
cybersec-scanner install-cert --port 8082
```

**Start MITM Proxy:**
```bash
cybersec-scanner start-proxy [OPTIONS]

# Options:
#   --port PORT          Proxy listen port (default: 8082)
#   --traffic-file PATH  Traffic log file path (default: temp dir auto-shared)

# Example:
cybersec-scanner start-proxy --port 9000 --traffic-file ./my_traffic.ndjson
```

## Required Files

**IMPORTANT**: Before running scans, make sure the following file is present in your working directory (the directory you run the scanner from):

### 1. patterns.env (REQUIRED)

This file contains regex patterns for detecting secrets. **Copy it from the repository root**:

```bash
# If you installed from source
cp patterns.env /your/project/directory/

# If you installed from PyPI, download from GitHub
curl -o patterns.env https://raw.githubusercontent.com/AnubhavChoudhery/cybersec-scanner/main/patterns.env
```

The file includes 58+ detection patterns for major providers:
```env
AWS_ACCESS_KEY_ID=AKIA[0-9A-Z]{16}
OPENAI_API_KEY=sk-[a-zA-Z0-9]{20,}
STRIPE_SECRET_KEY=sk_live_[0-9a-zA-Z]{24,}
GITHUB_TOKEN=ghp_[0-9a-zA-Z]{36}
# ... and 54 more patterns
```

**Security Note**: This file is excluded from git by default to avoid triggering security scanners. Never commit actual secrets to this file!

## Architecture

```
Chrome_Ext/
├── local_check.py              # Main orchestrator
├── config.py                   # Configuration and patterns
├── utils.py                    # Utility functions
├── patterns.env                # Secret detection patterns (user-configured)
├── inject_mitm_proxy.py        # MITM proxy injection module
├── install_mitm_cert.py        # Certificate installation helper
├── scanners/
│   ├── git_scanner.py         # Git history analysis
│   ├── web_crawler.py         # HTTP endpoint scanning
│   ├── browser_scanner.py     # Playwright runtime inspection
│   └── network_scanner.py     # MITM proxy traffic analysis
└── audit_report.json          # Output report (generated)
```

## Installation

### System Requirements

- Python 3.11 or higher
- Git (for git history scanning)
- mitmproxy 10.0+ (for HTTPS inspection)
- Modern web browser (for Playwright scanner)

### Required Dependencies

```bash
pip install -r requirements.txt
```

If `requirements.txt` is not available, install manually:

```bash
pip install requests colorama
```

### Optional Dependencies

#### For HTTPS Traffic Inspection
```bash
# Install mitmproxy
pip install mitmproxy

# Verify installation
mitmdump --version
```

#### For Browser Runtime Inspection
```bash
pip install playwright
python -m playwright install
```

#### For Network Packet Capture (Advanced)
```bash
pip install scapy

# Windows: Install Npcap from https://npcap.com/
# Linux/Mac: May require libpcap
```

## Quick Start

### Initial Setup

1. **Clone or download the repository**

2. **Set up pattern file** (REQUIRED before first run)

```bash
# Copy the patterns file template
cp patterns.env.example patterns.env

# The file includes 58+ detection patterns for major providers
# Edit patterns.env to customize or add patterns (optional)
```

3. **Verify setup**

```bash
python -c "from config import KNOWN_PATTERNS; print(f'Loaded {len(KNOWN_PATTERNS)} patterns')"
```

Expected output: `Loaded 58 patterns` (or similar)

### Basic Usage

```bash
# Scan with default settings
python local_check.py --target http://localhost:8000 --root /path/to/project

# Generate audit report
cat audit_report.json
```

## MITM Proxy Setup

The MITM (Man-in-the-Middle) proxy feature allows inspection of HTTPS traffic in real time, including request/response headers and bodies.

### Prerequisites

1. **Install mitmproxy**

```bash
pip install mitmproxy

# Verify installation
mitmdump --version
```

2. **Copy required files to your backend**

```bash
# From the Chrome_Ext directory
cp inject_mitm_proxy.py /path/to/your/backend/app/
cp patterns.env /path/to/your/backend/app/
```

### Backend Integration

**CRITICAL**: Add the import statement as the **VERY FIRST LINE** of your main application file. This is not optional - the import MUST come before any other imports (Flask, FastAPI, Django, etc.) for the MITM proxy to properly intercept HTTP libraries.

The `inject_mitm_proxy` module automatically:
1. Starts a proxy server on port 8082 (configurable via `MITM_PROXY_PORT`)
2. Patches HTTP libraries (requests, httpx, urllib, urllib3, aiohttp) to route through proxy
3. Inspects all outbound HTTP/HTTPS traffic for security issues
4. Logs traffic to `mitm_traffic.ndjson` in the same directory
5. Bypasses specific domains (AWS, OAuth, AI providers) to prevent authentication issues

**For FastAPI:**
```python
# backend/app/main.py
import inject_mitm_proxy  # MUST BE FIRST IMPORT (before FastAPI, before everything!)

from fastapi import FastAPI  # This comes AFTER inject_mitm_proxy
from fastapi.middleware.cors import CORSMiddleware
# ... rest of your imports

app = FastAPI()
# ... rest of your code
```

**For Flask:**
```python
# backend/app.py
import inject_mitm_proxy  # MUST BE FIRST IMPORT (before Flask, before everything!)

from flask import Flask  # This comes AFTER inject_mitm_proxy
from flask_cors import CORS
# ... rest of your imports

app = Flask(__name__)
# ... rest of your code
```

**For Django:**
```python
# backend/manage.py or wsgi.py
import inject_mitm_proxy  # MUST BE FIRST IMPORT (before Django, before everything!)

import os  # This comes AFTER inject_mitm_proxy
from django.core.wsgi import get_wsgi_application
# ... rest of Django setup
```

**Why FIRST import matters**: The module patches HTTP libraries at import time. If Flask/FastAPI/Django import first, their HTTP clients won't be patched, and traffic won't be intercepted.

### Running with MITM Proxy

1. **Start your backend application**

```bash
# No environment variables needed - proxy is always enabled
# Just start your backend normally
uvicorn app.main:app --reload  # FastAPI example
```

You should see:
```
[MITM] Proxy active on http://127.0.0.1:8082
[MITM] Bypass mode: AWS, OAuth, AI providers, payments, CDNs
[MITM] Patched libraries: requests, httpx, urllib, urllib3, aiohttp
```

2. **Run the security scanner**

```bash
# In a new terminal, run the scanner with MITM enabled
python local_check.py \
  --target http://localhost:8000 \
  --enable-mitm \
  --mitm-port 8082
```

3. **Interact with your application** (make HTTP requests, use API endpoints, etc.)

4. **Stop the scanner** (Ctrl+C) to generate the audit report

5. **Review results**

```bash
# View audit report
cat audit_report.json

# View traffic log (raw NDJSON)
cat mitm_traffic.ndjson
```

### MITM Proxy Detection Capabilities

The MITM proxy inspects both requests and responses for security issues:

**Request-Side Detection:**
- Credentials embedded in URLs (`user:pass@domain`)
- API keys in query parameters (`?api_key=xxx`)
- Basic Authentication headers (base64 credentials)
- API keys in Authorization headers (with context awareness)
- Plaintext passwords in request bodies (excludes bcrypt/argon2 hashes)
- Secrets matching any of the 58+ patterns
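
The URL-level checks in the list above can be illustrated with the standard library alone. This is a minimal sketch, not the scanner's actual implementation: the function name `inspect_url`, the `SENSITIVE_PARAMS` set, and the issue labels are hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical list of query parameter names treated as sensitive.
SENSITIVE_PARAMS = {"api_key", "apikey", "token", "access_token", "secret"}


def inspect_url(url: str) -> list:
    """Return a list of issue labels found in the URL itself."""
    parts = urlsplit(url)
    issues = []

    # Credentials embedded as user:pass@domain
    if parts.username or parts.password:
        issues.append("credentials_in_url")

    # API keys passed as query parameters (?api_key=xxx)
    for name in parse_qs(parts.query, keep_blank_values=True):
        if name.lower() in SENSITIVE_PARAMS:
            issues.append(f"api_key_in_query:{name}")

    # Anything sensitive sent without TLS is worse than the same thing over HTTPS.
    if issues and parts.scheme == "http":
        issues.append("sent_over_plain_http")
    return issues


if __name__ == "__main__":
    for url in (
        "https://user:pass@example.com/login",
        "http://example.com/data?api_key=xxx",
        "https://example.com/list?page=2",
    ):
        print(url, inspect_url(url))
```

Header and body checks would follow the same shape, with the loaded patterns applied to each value.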

**Response-Side Detection:**
- Secrets leaked in response headers
- API keys in response bodies (JSON, HTML, JavaScript)
- Credentials in error messages
- Database connection strings in stack traces
- Debug information containing sensitive data

**Severity Levels:**
- `CRITICAL`: API keys in URLs, credentials over HTTP, plaintext passwords
- `HIGH`: API keys in headers over HTTPS (reported with a note that this is expected for server-side API calls)
- `INFO`: Normal traffic logging (not a security issue)

### MITM Proxy Configuration

The `inject_mitm_proxy.py` module works automatically when imported. The only optional configuration is:

```bash
# Set custom MITM proxy port (default: 8082)
export MITM_PROXY_PORT=9000
```

**No other environment variables needed** - the proxy runs in full mode by default with intelligent domain bypass.

### Domain Bypass Configuration

By default, the following domains bypass the MITM proxy to prevent authentication and SSL issues:

**OAuth Providers:**
- `accounts.google.com`, `oauth2.googleapis.com`, `login.microsoftonline.com`

**AI Providers:**
- `api.openai.com`, `openai.com`
- `api.anthropic.com`, `anthropic.com`
- `api.groq.com`, `groq.com`
- `api.mistral.ai`, `mistral.ai`
- `api-inference.huggingface.co`, `huggingface.co`
- `api.cohere.ai`, `replicate.com`, `together.xyz`, `anyscale.com`, `perplexity.ai`

**AWS Services:**
- All `*.amazonaws.com` domains
- API Gateway, Lambda, S3, CloudFront

**Payment Providers:**
- `stripe.com`, `paypal.com`

**CDNs:**
- `cloudflare.com`, `cloudfront.net`

**Localhost:**
- `127.0.0.1`, `localhost`

To modify bypass rules, edit the `BYPASS_DOMAINS` and `AWS_SUFFIXES` sets in `inject_mitm_proxy.py`.
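
As a rough illustration of how such a bypass check can work, the sketch below matches a hostname against a set of domains and a set of suffixes. The set names mirror `BYPASS_DOMAINS` and `AWS_SUFFIXES`, but the contents are abbreviated, and both the function `should_bypass` and the subdomain-matching rule are assumptions rather than the module's actual logic.

```python
# Abbreviated copies of the documented defaults (not the full lists).
BYPASS_DOMAINS = {
    "accounts.google.com",
    "api.openai.com",
    "stripe.com",
    "localhost",
    "127.0.0.1",
}
AWS_SUFFIXES = {".amazonaws.com"}


def should_bypass(host: str) -> bool:
    """Return True if traffic to `host` should skip the proxy."""
    host = host.lower().rstrip(".")

    # Exact match on a listed domain.
    if host in BYPASS_DOMAINS:
        return True

    # Assumption: subdomains of a listed domain are bypassed as well.
    if any(host.endswith("." + domain) for domain in BYPASS_DOMAINS):
        return True

    # Suffix match covers every *.amazonaws.com host.
    return any(host.endswith(suffix) for suffix in AWS_SUFFIXES)


if __name__ == "__main__":
    for name in ("s3.us-east-1.amazonaws.com", "api.stripe.com", "example.com"):
        print(name, should_bypass(name))
```

Requiring a leading dot for subdomain matches keeps lookalike hosts such as `notstripe.com` from being bypassed by accident.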

### Uninstalling MITM Proxy

To remove MITM proxy from your backend:

1. Remove or comment out the import:
```python
# import inject_mitm_proxy  # Disabled
```

2. Restart your backend application

The proxy is only active when the module is imported.

## Configuration

### Pattern File (patterns.env)

The `patterns.env` file contains regular expressions for detecting secrets. This file is excluded from version control to prevent triggering GitHub security alerts.

**Format:**
```
PATTERN_NAME=regex_pattern
```
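
To make the format concrete, here is a minimal sketch of how `NAME=regex` lines could be parsed and applied. The helper names `parse_patterns` and `find_matches` are hypothetical, and comment and blank-line handling is an assumption; the scanner's own loader in `config.py` may differ.

```python
import re
from typing import Dict, Iterable, List, Pattern, Tuple


def parse_patterns(lines: Iterable[str]) -> Dict[str, Pattern[str]]:
    """Compile NAME=regex lines, skipping blanks and '#' comments."""
    patterns: Dict[str, Pattern[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, since the regex itself may contain '='.
        name, sep, regex = line.partition("=")
        if not sep or not regex.strip():
            continue
        patterns[name.strip()] = re.compile(regex.strip())
    return patterns


def find_matches(text: str, patterns: Dict[str, Pattern[str]]) -> List[Tuple[str, str]]:
    """Return (pattern name, matched text) pairs for every hit in `text`."""
    return [
        (name, match.group(0))
        for name, regex in patterns.items()
        for match in regex.finditer(text)
    ]


if __name__ == "__main__":
    sample_file = ["# custom patterns", "", "MY_CUSTOM_KEY=mykey_[0-9a-f]{32}"]
    loaded = parse_patterns(sample_file)
    print(find_matches("config: mykey_" + "0f" * 16, loaded))
```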

**Adding custom patterns:**
```bash
# Append your pattern to patterns.env (or edit the file in an editor such as nano)
echo 'MY_CUSTOM_KEY=mykey_[0-9a-f]{32}' >> patterns.env

# Re-run the scanner to load the new pattern
python local_check.py --target http://localhost:8000
```

### Configuration File (config.py)

**Entropy Threshold:**
```python
ENTROPY_THRESHOLD = 3.5  # Shannon entropy for randomness detection
```
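
For readers unfamiliar with the metric, the following sketch shows how a Shannon entropy check of this kind can be computed. It assumes entropy is measured in bits per character over the whole token; the function names `shannon_entropy` and `looks_random` are illustrative and not taken from the scanner's code.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 3.5  # same value as the default shown above


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag tokens whose per-character entropy exceeds the threshold."""
    return shannon_entropy(token) > threshold


if __name__ == "__main__":
    for candidate in ("aaaaaaaaaaaaaaaa", "hello_world_test", "q7Zp2LxV9mKd4RtW8sYb"):
        print(candidate, round(shannon_entropy(candidate), 2), looks_random(candidate))
```

A repeated character scores 0 bits, while a string of 20 distinct characters scores about 4.3 bits, which is why raising the threshold trades recall for fewer false positives.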

**File Exclusions:**
```python
EXCLUDE_SUFFIXES = {
    '.png', '.jpg', '.jpeg', '.gif', '.bmp', '.ico',
    '.zip', '.tar', '.gz', '.pdf', '.exe', '.dll'
}
```

**Probe Paths (for web crawler):**
```python
PROBE_PATHS = [
    '/.env', '/.env.local', '/.env.production',
    '/.git/config', '/.git/HEAD',
    '/config.php.bak', '/backup.sql'
]
```

## Usage

### Command-Line Options

```bash
python local_check.py [OPTIONS]
```

**Core Options:**

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--target`, `-t` | URL | `http://localhost:8000` | Target application URL |
| `--root`, `-r` | Path | `.` | Repository root for static analysis |
| `--out`, `-o` | Path | `audit_report.json` | Output report filename |

**Scanner Options:**

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--depth` | Integer | `300` | Maximum pages to crawl |
| `--enable-playwright` | Flag | `False` | Enable browser runtime inspection |
| `--enable-pcap` | Flag | `False` | Enable packet capture (requires root) |
| `--pcap-timeout` | Integer | `12` | Packet capture duration (seconds) |

**MITM Proxy Options:**

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--enable-mitm` | Flag | `False` | Enable MITM proxy for HTTPS inspection |
| `--mitm-port` | Integer | `8082` | MITM proxy port |
| `--mitm-duration` | Integer | `0` | Auto-stop after N seconds (0 = manual) |
| `--mitm-traffic` | Path | Auto-detect | Custom path to traffic NDJSON file |

### Usage Examples

**Basic scan:**
```bash
python local_check.py --target http://localhost:8000 --root /path/to/project
```

**Full scan with all features:**
```bash
python local_check.py \
  --target http://localhost:3000 \
  --root ~/myapp \
  --enable-playwright \
  --enable-mitm \
  --depth 500 \
  --out security_report.json
```

**MITM-only scan (skip static/git):**
```bash
python local_check.py \
  --target http://localhost:8000 \
  --enable-mitm \
  --mitm-duration 30
```

**Custom traffic log location:**
```bash
python local_check.py \
  --target http://localhost:8000 \
  --enable-mitm \
  --mitm-traffic /custom/path/to/traffic.ndjson
```

## Scanner Modules

### 1. Git Scanner (`scanners/git_scanner.py`)

Analyzes git commit history for leaked secrets using efficient pickaxe search.

**Features:**
- Searches git history for known secret patterns
- Uses `git log -S<term>` (pickaxe search), which is much faster than inspecting every commit's diff individually
- Examines up to 100 commits by default (configurable)
- Scans added lines in diffs for pattern matches

**Configuration:**
```python
scan_git_history(root, max_commits=100)
```
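
The pickaxe search itself is a plain Git feature and can be tried outside the scanner. The sketch below wraps `git log -S<term>` with `subprocess`; the helper names `pickaxe_command` and `commits_touching` are hypothetical. Note that `--max-count` here limits the number of matching commits returned, which may not be how the scanner interprets `max_commits`.

```python
import subprocess
from typing import List


def pickaxe_command(term: str, max_commits: int = 100) -> List[str]:
    """Build a `git log` command listing commits that add or remove `term`."""
    return [
        "git",
        "log",
        f"--max-count={max_commits}",
        "--format=%H",  # print only the commit hash
        f"-S{term}",    # pickaxe: commits that change the number of occurrences
    ]


def commits_touching(repo: str, term: str, max_commits: int = 100) -> List[str]:
    """Run the pickaxe search inside `repo` and return matching commit hashes."""
    result = subprocess.run(
        pickaxe_command(term, max_commits),
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    # Show the command that would be executed for a sample term.
    print(" ".join(pickaxe_command("sk_live_", max_commits=50)))
```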

### 2. Web Crawler (`scanners/web_crawler.py`)

Crawls web application endpoints to discover exposed sensitive paths and analyze client-side code.

**Features:**
- Discovers exposed `.env`, `.git/config`, backup files
- Analyzes JavaScript files for hardcoded secrets
- Extracts and scans source maps
- Checks HTTP headers and cookies for leaked secrets
- Detects catch-all responses (false positives)
- Multi-threaded crawling with process pool for regex scanning

**Configuration:**
```python
crawler = LocalCrawler(
    base="http://localhost:8000",
    timeout=6,
    max_pages=300,
    workers=8,
    max_js_size=500_000  # Skip large JS bundles
)
```

### 3. Browser Scanner (`scanners/browser_scanner.py`)

Uses Playwright to inspect browser runtime state and client-side storage.

**Features:**
- Extracts localStorage contents
- Extracts sessionStorage contents
- Retrieves all cookies
- Checks global variables (`window.__ENV`, `window.config`, `window.API_KEY`)

**Requirements:**
```bash
pip install playwright
python -m playwright install
```

**Usage:**
```python
playwright_inspect("http://localhost:8000")
```

### 4. Network Scanner (`scanners/network_scanner.py`)

Runs a mitmproxy addon that inspects traffic at the proxy level (Layer 2 of the MITM setup; `inject_mitm_proxy.py` is Layer 1).

**Features:**
- Intercepts HTTP/HTTPS traffic at the proxy level
- Pattern matching on request/response bodies
- Security header validation
- Works alongside `inject_mitm_proxy.py` (Layer 1)

**Note:** Most users will use `inject_mitm_proxy.py` for MITM inspection. This module provides additional addon-based analysis.

## Output Format

### Audit Report (audit_report.json)

```json
{
  "timestamp": "2025-11-18T13:34:34.106644",
  "target": "http://localhost:8000",
  "stats": {
    "git_secrets": 0,
    "crawler_issues": 2,
    "browser_issues": 0,
    "mitm_proxied": 15,
    "mitm_bypassed": 3,
    "mitm_security_findings": 1
  },
  "severities": {
    "CRITICAL": 0,
    "HIGH": 1,
    "MEDIUM": 0,
    "LOW": 0,
    "INFO": 15
  },
  "findings": [
    {
      "type": "api_key_in_header",
      "severity": "HIGH",
      "timestamp": 1763494461,
      "timestamp_human": "2025-11-18 13:34:21",
      "description": "GROQ_API_KEY in Authorization header over HTTPS (expected for server-side API calls, review if unexpected)",
      "url": "https://api.groq.com/openai/v1/chat/completions",
      "client": "requests",
      "method": "post",
      "pattern": "GROQ_API_KEY",
      "header": "Authorization"
    }
  ]
}
```
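
Because the report is plain JSON, it is easy to post-process. The sketch below filters findings by severity. The helper `findings_at_or_above` and the ordering in `SEVERITY_ORDER` are assumptions based on the severity names in the report, not part of the scanner's API.

```python
import json

# Assumed ordering, lowest to highest, using the severity names from the report.
SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def findings_at_or_above(report: dict, minimum: str = "HIGH") -> list:
    """Return findings whose severity is `minimum` or higher."""
    floor = SEVERITY_ORDER.index(minimum)
    return [
        finding
        for finding in report.get("findings", [])
        if finding.get("severity") in SEVERITY_ORDER
        and SEVERITY_ORDER.index(finding["severity"]) >= floor
    ]


if __name__ == "__main__":
    # Trimmed-down report with the same shape as the example above.
    sample = json.loads(
        '{"severities": {"CRITICAL": 0, "HIGH": 1},'
        ' "findings": [{"type": "api_key_in_header", "severity": "HIGH"}]}'
    )
    for finding in findings_at_or_above(sample):
        print(finding["severity"], finding["type"])
```

To process a real report, load it with `json.load` from `audit_report.json` and pass the resulting dictionary to the same function.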

### Traffic Log (mitm_traffic.ndjson)

NDJSON (newline-delimited JSON) format for append-only logging:

```json
{"ts": 1763494398, "timestamp": "2025-11-18 13:33:18", "stage": "mitm_outbound", "client": "requests", "method": "post", "url": "https://api.example.com/endpoint"}
{"ts": 1763494461, "timestamp": "2025-11-18 13:34:21", "stage": "security_finding", "severity": "HIGH", "type": "api_key_in_header", "pattern": "GROQ_API_KEY", "description": "...", "url": "...", "client": "requests", "method": "post", "header": "Authorization"}
```

**Stages:**
- `mitm_outbound`: Request sent through proxy
- `mitm_bypass`: Request bypassed proxy (OAuth, AWS, etc.)
- `security_finding`: Security issue detected
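
Since each line is an independent JSON object, the log can be summarized with a few lines of Python. This is an illustrative sketch; the function name `summarize_traffic` is hypothetical, and skipping unparseable lines is an assumption made to tolerate a partially written final line.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_traffic(lines: Iterable[str]) -> Tuple[Counter, List[dict]]:
    """Count records per stage and collect the security findings."""
    stages: Counter = Counter()
    findings: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupted line
        if not isinstance(record, dict):
            continue
        stage = record.get("stage", "unknown")
        stages[stage] += 1
        if stage == "security_finding":
            findings.append(record)
    return stages, findings


if __name__ == "__main__":
    sample = [
        '{"ts": 1763494398, "stage": "mitm_outbound", "client": "requests"}',
        '{"ts": 1763494461, "stage": "security_finding", "severity": "HIGH"}',
    ]
    counts, found = summarize_traffic(sample)
    print(dict(counts), len(found))
```

For a real log, pass an open file handle for `mitm_traffic.ndjson` instead of the sample list, since a file object also iterates line by line.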

## Advanced Usage

### Custom Pattern Detection

Create a custom pattern file:

```bash
# Create custom-patterns.env
cat > custom-patterns.env << 'EOF'
CUSTOM_API_KEY=custom_[0-9a-f]{32}
INTERNAL_TOKEN=int_tok_[A-Za-z0-9]{24}
EOF

# Edit config.py to load from custom file
# (Modify PATTERNS_FILE path in config.py)
```

### Integrating with CI/CD

```yaml
# .github/workflows/security-scan.yml
name: Security Audit
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: cp patterns.env.example patterns.env
      - run: python local_check.py --target http://localhost:8000 --root .
      - run: |
          if jq -e '.severities.CRITICAL > 0' audit_report.json; then
            echo "CRITICAL issues found!"
            exit 1
          fi
```

### Programmatic Usage

```python
from scanners import scan_git_history, LocalCrawler, playwright_inspect

# Git scanning
git_findings = scan_git_history("/path/to/repo", max_commits=100)

# Web crawling
crawler = LocalCrawler("http://localhost:8000", max_pages=200)
crawler.probe_common_paths()
crawler.crawl()
web_findings = crawler.findings

# Browser inspection
browser_data = playwright_inspect("http://localhost:8000")

# Combine results
all_findings = git_findings + web_findings
```

## Troubleshooting

### "No module named 'requests'"

```bash
pip install requests
```

### "patterns.env not found"

```bash
cp patterns.env.example patterns.env
```

### "playwright-not-installed"

```bash
pip install playwright
python -m playwright install
```

### "MITM proxy not loading patterns"

**Issue:** Backend shows `WARNING: patterns.env not found`

**Solution:**
```bash
# Verify patterns.env is in the same directory as inject_mitm_proxy.py
ls -la /path/to/backend/app/patterns.env

# If missing, copy it
cp patterns.env /path/to/backend/app/
```

### "MITM proxy not intercepting traffic"

**Issue:** No traffic logged in `mitm_traffic.ndjson`

**Solutions:**

1. Verify the import is present and comes FIRST in the backend's entry point:
```python
import inject_mitm_proxy  # MUST BE FIRST
# ... other imports

# Then start the backend from a shell with: python app.py
# Expected startup message: "[MITM] Proxy active on http://127.0.0.1:8082"
```

2. Check proxy port matches:
```bash
# Scanner
python local_check.py --enable-mitm --mitm-port 8082

# Backend
export MITM_PROXY_PORT=8082
```
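
If traffic is still missing, it can help to confirm that something is listening on the chosen port at all. The following is a small sketch using only the standard library. `is_port_open` is a hypothetical helper, and a successful connection only shows that a process accepts TCP connections on that port, not that it is the MITM proxy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the port used above, the check would be `is_port_open("127.0.0.1", 8082)`.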

### "Permission denied during packet capture"

```bash
# Linux/Mac
sudo python local_check.py --enable-pcap

# Windows
# Run terminal as Administrator
```

### "Git scan is very slow"

This is expected for large repositories (100k+ commits). By default the tool scans at most 100 commits. To scan fewer, lower `max_commits`:

```python
# Modify scanners/git_scanner.py
scan_git_history(root, max_commits=50)  # Reduce commit limit
```

### "Too many false positives"

1. Adjust entropy threshold in `config.py`:
```python
ENTROPY_THRESHOLD = 4.0  # Higher = fewer false positives
```

2. Add exclusions for known patterns:
```python
# In config.py
EXCLUDE_PATTERNS = [
    r'test_api_key_123',  # Test keys
    r'example\.com',      # Example domains
]
```

3. Filter by severity in the audit report:
```bash
# Only show CRITICAL issues
jq '.findings[] | select(.severity == "CRITICAL")' audit_report.json
```
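
For intuition on the first two steps: Shannon entropy measures how evenly characters are distributed in a string, so random-looking tokens score higher than ordinary words. The sketch below is an assumption about how such a filter could work, not the scanner's actual code. `shannon_entropy` and `is_candidate` are hypothetical, and `ENTROPY_THRESHOLD` and `EXCLUDE_PATTERNS` simply reuse the names from the `config.py` snippets above.

```python
import math
import re
from collections import Counter

ENTROPY_THRESHOLD = 4.0
EXCLUDE_PATTERNS = [r"test_api_key_123", r"example\.com"]


def shannon_entropy(text):
    """Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def is_candidate(text, threshold=ENTROPY_THRESHOLD, excludes=EXCLUDE_PATTERNS):
    """Keep a string only if it is not excluded and its entropy reaches the threshold."""
    if any(re.search(pattern, text) for pattern in excludes):
        return False
    return shannon_entropy(text) >= threshold
```

Here `shannon_entropy("aaaa")` is 0.0 and `shannon_entropy("abcd")` is 2.0, which shows why raising the threshold filters out short or repetitive strings.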

## Security Considerations

### Testing Your Own Applications Only

This tool is designed for security testing of applications you own or have explicit permission to test. Unauthorized scanning may violate laws and terms of service.

### MITM Proxy Security

The MITM proxy **disables SSL verification** for testing purposes. This should only be used in development/testing environments, never in production.

**Do NOT:**
- Use MITM proxy in production environments
- Commit the `import inject_mitm_proxy` line to production code
- Share MITM proxy logs (may contain sensitive data)

**Best Practices:**
- Use environment variables to control MITM activation
- Keep `mitm_traffic.ndjson` and `audit_report.json` out of version control (add to `.gitignore`)
- Review and sanitize audit reports before sharing
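
One way to follow the first practice is to guard the import behind an environment variable, so the proxy hook is loaded only when explicitly requested. This is a sketch under assumptions: the `ENABLE_MITM` variable name and the `mitm_enabled` helper are hypothetical and are not read by the tool itself.

```python
import importlib
import os

TRUTHY = {"1", "true", "yes", "on"}


def mitm_enabled(environ=None):
    """Return True when the (hypothetical) ENABLE_MITM variable is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("ENABLE_MITM", "").strip().lower() in TRUTHY


# Load the proxy hook only on request; keep this ahead of the other imports.
if mitm_enabled():
    importlib.import_module("inject_mitm_proxy")
```

With this guard at the top of the entry point, leaving `ENABLE_MITM` unset in production keeps the proxy code from being imported.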

### Pattern File Security

The `patterns.env` file is excluded from version control by default (`.gitignore`) to avoid triggering GitHub security alerts on pattern signatures.

**Do NOT:**
- Commit `patterns.env` to public repositories
- Include actual secret values in pattern files
- Share pattern files with untrusted parties

## Version History

| Version | Changes |
|---------|---------|
| **1.0.5** | Enhanced scan output with phase headers, fixed query output to be human-readable (extracts text from response) |
| **1.0.4** | Colored CLI output with colorama, `--output` flag for query command, simplified MITM traffic auto-sharing |
| **1.0.3** | Unified MITM traffic file location via temp directory, added colorama dependency |
| **1.0.2** | Fixed MITM traffic file path resolution bug |
| **1.0.1** | Initial PyPI release with full scanner suite |

## License

MIT License - See LICENSE file for details.

## Contributing

Contributions are welcome! Please follow these guidelines:

1. Test your changes with multiple target applications
2. Update documentation for new features
3. Follow existing code style and structure
4. Add tests for new scanner modules
5. Ensure no secrets are committed in test files

## Disclaimer

This tool is provided for lawful security testing only. Users are responsible for ensuring they have proper authorization before scanning any application. The authors assume no liability for misuse or unauthorized access.

## Testing

### Quick Test Commands

```bash
# Run all tests (auto-detects Ollama)
python run_tests.py

# Run all tests including LLM (requires Ollama)
python run_tests.py --all

# Fast tests only (no LLM)
python run_tests.py --fast

# With coverage report
python run_tests.py --coverage

# Specific test file
python run_tests.py --file retriever
```

### Test Prerequisites

**Core tests** (no additional setup):
```bash
pip install pytest pytest-cov
pytest tests/ -v -k "not llm_client"
```

**LLM tests** (requires Ollama):
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh  # Linux/Mac
# Or download from https://ollama.com for Windows

# Pull model
ollama pull gemma3:1b

# Run all tests
pytest tests/ -v
```

### Test Coverage

| Component | Tests | Coverage |
|-----------|-------|----------|
| Knowledge Graph | PASS 1 test | 100% |
| CWE Enrichment | PASS 1 test | 100% |
| Database Normalizer | PASS 5 tests | 95% |
| Graph Retriever | PASS 8 tests | 100% |
| LLM Client | PASS 8 tests | 85% |
| End-to-End Pipeline | PASS 2 tests | Full flow |
| **Total** | **24 tests** | **~90%** |

See `tests/README.md` for detailed testing documentation.

## Support

For issues, questions, or contributions:
- Open an issue on GitHub
- Review existing issues before creating new ones
- Provide detailed information (OS, Python version, error messages, steps to reproduce)

Made by the JBAC EdTech Team (Jai Ansh Bindra and Anubhav Choudhery)
