Metadata-Version: 2.4
Name: sparkguardian
Version: 0.1.0
Summary: Production-grade PySpark auditing, optimization, and governance framework — static + dynamic analysis with local AI assistance
Project-URL: Homepage, https://github.com/sparkguardian/sparkguardian
Project-URL: Documentation, https://github.com/sparkguardian/sparkguardian#readme
Project-URL: Repository, https://github.com/sparkguardian/sparkguardian
Project-URL: Issues, https://github.com/sparkguardian/sparkguardian/issues
Project-URL: Changelog, https://github.com/sparkguardian/sparkguardian/blob/main/CHANGELOG.md
Author: SparkGuardian Contributors
Maintainer: SparkGuardian Contributors
License-Expression: MIT
License-File: LICENSE
Keywords: anti-pattern,big-data,ci-cd,data-engineering,governance,linter,optimizer,performance,pyspark,sarif,spark,static-analysis
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: libcst>=1.1.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.7.0
Requires-Dist: typer>=0.12.0
Provides-Extra: ai
Requires-Dist: ollama>=0.1.7; extra == 'ai'
Provides-Extra: all
Requires-Dist: build>=1.2.0; extra == 'all'
Requires-Dist: mypy>=1.9.0; extra == 'all'
Requires-Dist: ollama>=0.1.7; extra == 'all'
Requires-Dist: pytest-cov>=5.0.0; extra == 'all'
Requires-Dist: pytest>=8.0.0; extra == 'all'
Requires-Dist: ruff>=0.4.0; extra == 'all'
Requires-Dist: twine>=5.0.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == 'dev'
Requires-Dist: mypy>=1.9.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: twine>=5.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# SparkGuardian

> Production-grade PySpark auditing, optimization, and governance framework.

[![Tests](https://img.shields.io/badge/tests-373%20passed-brightgreen)](tests/)
[![Rules](https://img.shields.io/badge/rules-21%20active-blue)](sparkguardian/rules/)
[![Lint](https://img.shields.io/badge/ruff-clean-brightgreen)](pyproject.toml)
[![SARIF](https://img.shields.io/badge/output-SARIF%20v2.1.0-orange)](sparkguardian/reporting/sarif_reporter.py)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue)](pyproject.toml)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

SparkGuardian detects PySpark anti-patterns, analyzes execution plans, generates prioritized remediation plans, and integrates into CI/CD pipelines. It runs **100% offline**: no cloud APIs are called and no source code leaves your machine.

---

## Features

| Feature | Description |
|---------|-------------|
| **Static AST Analysis** | Deterministic analysis via `libcst` — never executes user code |
| **Execution Plan Analysis** | Parses `df.explain()` output, detects shuffles / CartesianProduct / AQE |
| **Safe Auto-Refactoring** | Rewrites code with backup, dry-run preview, interactive mode |
| **Local AI Assistant** | Ollama-powered explanations, structured analysis, remediation plans |
| **AI Chat Session** | Interactive multi-turn chat about findings with history management |
| **RAG Knowledge Base** | Keyword-based Spark knowledge retrieval — no external vector DB |
| **Performance Scoring** | 0–100 score across reliability, cost, and clarity dimensions |
| **CI/CD Integration** | Exit codes 0/1/2 — GitHub Actions, GitLab CI, Jenkins |
| **Rule Governance** | Enable/disable rules, override severity, per-project config (TOML/YAML) |
| **Ignore System** | `# cy:ignore CY008` inline and block suppression with audit trail |
| **Rich Reporting** | Terminal (colored), JSON, Markdown, HTML (dark theme), **SARIF v2.1.0** |
| **Cloud Cost Estimator** | Translates anti-patterns into **monthly/yearly $$$** waste on AWS EMR / Databricks / GCP / Azure |
| **Audit History** | Longitudinal tracking with regression detection + CSV export for academic analysis |

---

## Installation

```bash
# Core (PyPI)
pip install sparkguardian

# With local AI (Ollama integration)
pip install "sparkguardian[ai]"

# Development install (from source)
git clone https://github.com/sparkguardian/sparkguardian
cd sparkguardian
pip install -e ".[dev]"
```

Verify the installation:

```bash
sparkguardian --version          # SparkGuardian v0.1.0
sparkguardian --help             # list 9 commands
sparkguardian analyze app.py
```

**Requirements:** Python 3.11 or higher. Tested on Python 3.11, 3.12, 3.13.

### Build from source

```bash
pip install build twine          # twine is needed for the check step below
python -m build                  # creates dist/*.whl + *.tar.gz
twine check dist/*               # validate before upload
pip install dist/sparkguardian-0.1.0-py3-none-any.whl
```

---

## Quick Start

```bash
# Analyze a file
sparkguardian analyze app.py

# Analyze a directory
sparkguardian analyze src/

# Get JSON output (for CI integration)
sparkguardian analyze app.py --json

# Generate HTML report
sparkguardian analyze app.py --html report.html

# Preview auto-fixes (no file changes)
sparkguardian fix app.py --dry-run

# Apply safe auto-fixes (with backup)
sparkguardian fix app.py

# Analyze Spark execution plan
sparkguardian explain-plan plan.txt

# List all available rules
sparkguardian list-rules

# AI-powered remediation plan
sparkguardian remediation app.py

# Interactive AI chat about findings
sparkguardian chat app.py

# Check AI (Ollama) availability
sparkguardian ai-status

# Cloud cost estimation — translate issues to $/month
sparkguardian cost app.py
sparkguardian cost app.py --data-tb 5 --nodes 30 --cloud databricks

# SARIF v2.1.0 output for GitHub Code Scanning
sparkguardian analyze app.py --sarif report.sarif

# Audit history — track score evolution over time
sparkguardian analyze app.py --history          # record snapshot
sparkguardian history                            # view evolution
sparkguardian history --csv evolution.csv       # academic analysis export
```

---

## Exit Codes

| Code | Meaning | CI behavior |
|------|---------|-------------|
| `0` | No issues | Build passes |
| `1` | CRITICAL issue found | Build **fails** |
| `2` | Warnings only | Build passes (configurable) |
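
Most CI runners treat any non-zero exit status as a failure, so letting code `2` pass usually needs a small wrapper around the command. The following is a minimal sketch that assumes only the exit codes in the table above; the names `ci_should_fail`, `run_audit`, and `fail_on_warnings` are hypothetical and not part of SparkGuardian.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_OK, EXIT_CRITICAL, EXIT_WARNINGS = 0, 1, 2


def ci_should_fail(exit_code: int, fail_on_warnings: bool = False) -> bool:
    """Decide whether a SparkGuardian exit code should fail the build."""
    if exit_code == EXIT_OK:
        return False
    if exit_code == EXIT_WARNINGS:
        return fail_on_warnings
    # EXIT_CRITICAL and any unexpected code (crash, usage error) fail the build.
    return True


def run_audit(target: str, fail_on_warnings: bool = False) -> int:
    """Run `sparkguardian analyze` and return the status the CI step should exit with."""
    result = subprocess.run(["sparkguardian", "analyze", target])
    return 1 if ci_should_fail(result.returncode, fail_on_warnings) else 0


# Typical use from a CI script:
#   import sys; sys.exit(run_audit("."))
```

Unknown exit codes are treated as failures on purpose, so a crashed audit cannot silently pass the build.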

---

## Rules

### Static Rules — AST Analysis (16 rules)

| Rule | Title | Severity | Impact | Autofix |
|------|-------|----------|--------|:-------:|
| CY001 | `.collect()` — driver memory risk | HIGH | RELIABILITY | |
| CY002 | `.toPandas()` — large local materialization | HIGH | RELIABILITY | |
| CY003 | `.count()` abuse — full scan triggered | MEDIUM | COST | |
| CY004 | Spark action inside loop | **CRITICAL** | RELIABILITY | |
| CY005 | `.show()` in production code | LOW | RELIABILITY | |
| CY008 | `repartition(1)` detected | HIGH | COST | ✓ |
| CY009 | Chained `withColumn()` (5+) | MEDIUM | COST | |
| CY010 | Implicit join — missing `how=` | LOW | CLARITY | |
| CY011 | `select("*")` — broad column selection | LOW | CLARITY | |
| CY012 | `cache()`/`persist()` — verify necessity | LOW | COST | |
| CY013 | Repeated scan of same data source | HIGH | COST | |
| CY014 | `filter()` after `groupBy()` — reorder | MEDIUM | COST | |
| CY016 | `crossJoin()` — explicit CartesianProduct | HIGH | COST | |
| CY017 | `distinct()` — full shuffle | MEDIUM | COST | |
| CY018 | `orderBy()` without `limit()` — global sort | MEDIUM | COST | |
| CY020 | Python UDF — use native SQL functions | HIGH | COST | |
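
To illustrate what a static rule such as CY004 looks for, here is a minimal sketch that flags Spark actions called inside loops. SparkGuardian itself uses `libcst`; this example swaps in the standard-library `ast` module to stay dependency-free, and the visitor class, the `SPARK_ACTIONS` set, and the helper name are illustrative assumptions rather than the project's actual code.

```python
import ast

# Spark actions treated as risky inside loops (illustrative subset).
SPARK_ACTIONS = {"collect", "count", "show", "toPandas"}


class ActionInLoopVisitor(ast.NodeVisitor):
    """Flag Spark action calls nested inside for/while loops (CY004-style)."""

    def __init__(self) -> None:
        self.loop_depth = 0
        self.findings: list[tuple[int, str]] = []  # (line number, method name)

    def _visit_loop(self, node: ast.AST) -> None:
        # Simplification: the loop header (iterable or condition) is also
        # counted as being inside the loop.
        self.loop_depth += 1
        self.generic_visit(node)
        self.loop_depth -= 1

    visit_For = _visit_loop
    visit_While = _visit_loop

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        if (
            self.loop_depth > 0
            and isinstance(func, ast.Attribute)
            and func.attr in SPARK_ACTIONS
        ):
            self.findings.append((node.lineno, func.attr))
        self.generic_visit(node)


def find_actions_in_loops(source: str) -> list[tuple[int, str]]:
    visitor = ActionInLoopVisitor()
    visitor.visit(ast.parse(source))  # parses only; never executes the code
    return visitor.findings


SAMPLE = """\
for day in days:
    n = df.filter(df.day == day).count()
total = df.count()
"""
print(find_actions_in_loops(SAMPLE))  # [(2, 'count')]
```

Only the `.count()` on line 2 is reported; the call on line 3 runs once, outside the loop.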

### Dynamic Rules — Execution Plan (5 rules)

| Rule | Title | Severity | Impact |
|------|-------|----------|--------|
| CY050 | `CartesianProduct` node detected | **CRITICAL** | COST |
| CY051 | Excessive shuffle (`Exchange`) nodes | HIGH | COST |
| CY052 | `SortMergeJoin` — broadcast opportunity | MEDIUM | COST |
| CY053 | Window function without `partitionBy` | HIGH | COST |
| CY054 | AQE (`AdaptiveSparkPlan`) not active | MEDIUM | COST |
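
The plan-based rules can be approximated with plain text matching on `df.explain()` output. The sketch below covers CY050, CY051, CY052, and CY054 only (CY053 needs more context about window specifications). The function name and the `max_exchanges` threshold are hypothetical; the sample plan is illustrative, not captured from a real job.

```python
import re


def analyze_plan(plan_text: str, max_exchanges: int = 3) -> list[str]:
    """Return the rule IDs triggered by a textual physical plan."""
    findings = []
    if "CartesianProduct" in plan_text:
        findings.append("CY050")
    # Count standalone Exchange operators (shuffles). The word boundary keeps
    # BroadcastExchange and ReusedExchange out of the count.
    exchanges = len(re.findall(r"\bExchange\b", plan_text))
    if exchanges > max_exchanges:  # hypothetical default threshold
        findings.append("CY051")
    if "SortMergeJoin" in plan_text:
        findings.append("CY052")
    if "AdaptiveSparkPlan" not in plan_text:
        findings.append("CY054")
    return findings


SAMPLE_PLAN = """\
== Physical Plan ==
*(5) SortMergeJoin [id#0L], [id#5L], Inner
:- *(2) Sort [id#0L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *(1) Range (0, 100, step=1, splits=8)
+- *(4) Sort [id#5L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#5L, 200)
      +- *(3) Range (0, 100, step=1, splits=8)
"""
print(analyze_plan(SAMPLE_PLAN))  # ['CY052', 'CY054']
```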

---

## Ignore System

Suppress rules inline or by block, with a full audit trail in reports.

```python
# Inline — suppress on this line only
df.repartition(1)  # cy:ignore CY008

# Block — suppress across multiple lines
# cy:ignore-start CY008
df_single = df.repartition(1)
df_single.write.parquet("output/")
# cy:ignore-end CY008
```

Ignored findings are not discarded: they are kept in reports as suppressed entries (use `--show-ignored` to display them).
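
A minimal sketch of how such directives could be resolved into suppressed line numbers, assuming the comment syntax shown above. The regular expressions and the `suppressed_lines` helper are illustrative, not the actual `ignore_handler.py` implementation, and scanning raw text like this would also match directives that appear inside string literals.

```python
import re

INLINE = re.compile(r"#\s*cy:ignore\s+(CY\d{3})\b")
BLOCK_START = re.compile(r"#\s*cy:ignore-start\s+(CY\d{3})\b")
BLOCK_END = re.compile(r"#\s*cy:ignore-end\s+(CY\d{3})\b")


def suppressed_lines(source: str) -> dict[str, set[int]]:
    """Map each rule ID to the 1-based line numbers where it is suppressed."""
    suppressed: dict[str, set[int]] = {}
    open_blocks: set[str] = set()
    for lineno, line in enumerate(source.splitlines(), start=1):
        if m := BLOCK_START.search(line):
            open_blocks.add(m.group(1))
        elif m := BLOCK_END.search(line):
            open_blocks.discard(m.group(1))
        elif m := INLINE.search(line):
            suppressed.setdefault(m.group(1), set()).add(lineno)
        # Every line inside an open block (including the start marker) is suppressed.
        for rule_id in open_blocks:
            suppressed.setdefault(rule_id, set()).add(lineno)
    return suppressed


SAMPLE = """\
df.repartition(1)  # cy:ignore CY008
# cy:ignore-start CY008
df_single = df.repartition(1)
df_single.write.parquet("output/")
# cy:ignore-end CY008
df.repartition(1)
"""
print(suppressed_lines(SAMPLE))  # {'CY008': {1, 2, 3, 4}}
```

Line 6 is outside both the inline directive and the block, so a CY008 finding there would still be reported.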

---

## Configuration

Create `.sparkguardian.toml` at the project root:

```toml
[sparkguardian]
fail_on_critical = true
fail_on_high = false
warning_threshold = 10
ignore_paths = ["tests/", "examples/"]

[rules.CY003]
enabled = true
severity = "MEDIUM"

[rules.CY008]
enabled = true
severity = "HIGH"

# Disable a rule project-wide
[rules.CY005]
enabled = false
```

YAML is also supported (`.sparkguardian.yaml`).

---

## Local AI Setup

All inference runs locally via [Ollama](https://ollama.ai). No data leaves your machine.

```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a supported model
ollama pull deepseek-coder:6.7b   # recommended
# or: ollama pull codellama:7b
# or: ollama pull llama3:8b
# or: ollama pull mistral:7b

# Start Ollama server
ollama serve

# Verify setup
sparkguardian ai-status
```

### AI Commands

```bash
# AI explanations with each finding
sparkguardian analyze app.py --ai

# AI-powered prioritized remediation plan
sparkguardian remediation app.py
sparkguardian remediation app.py --json   # machine-readable

# Interactive chat about findings
sparkguardian chat app.py
```

### AI Architecture

```
LocalLLM
├── LRU response cache (128 entries, hit-rate tracking)
├── System prompt injection (role anchor)
├── Structured JSON output parsing (AIAnalysis dataclass)
├── Streaming support (token-by-token terminal output)
└── Auto model selection (deepseek-coder → codellama → llama3 → mistral)

AISession (multi-turn chat)
├── Conversation history with automatic trimming
├── Context injection (RAG knowledge base)
└── Reset / turn counting

RemediationPlanner
├── LLM-driven: priority_order, effort_estimate, quick_wins, blockers
└── Heuristic fallback (deterministic, works without Ollama)

SparkKnowledgeBase (RAG)
├── 24 entries from official Apache Spark 4.1.1 documentation
│   Sources: tuning.html · sql-performance-tuning.html · rdd-programming-guide.html
├── query(question, rule_id) — lookup by rule_id or keyword scoring
├── query_with_config(rule_id) → (content, config_params[]) — includes spark.* params
├── get_config_params(rule_id) → list[str] — exact config keys with defaults
└── sources() → list[str] — traceable documentation URLs per entry
```

---

## Cloud Cost Estimator

Translates detected anti-patterns into **estimated monthly/yearly USD waste** on major cloud providers. Cost models are derived from official pricing (AWS EMR, Databricks, GCP Dataproc, Azure Synapse) and published Spark performance research.

```bash
sparkguardian cost examples/bad_pipeline.py
# 💸 Estimated waste: $2,407/month ($28,884/year)

sparkguardian cost prod_pipeline/ --data-tb 50 --nodes 100 --cloud databricks
# 💸 Estimated waste: $156,420/month ($1,877,040/year)

sparkguardian cost app.py --json   # machine-readable for FinOps integration
```

**Pricing methodology (2025-2026 baseline):**
- AWS EMR on-demand: ~$0.063/vCPU-hour
- Databricks DBU: $0.40/DBU-hour (Jobs Compute) — ×1.3 vs EMR
- GCP Dataproc: ×0.9 vs EMR
- Azure Synapse: ×1.1 vs EMR
- S3: $0.023/GB-month stored + $0.005 per 1k PUT/LIST requests (GET requests are roughly $0.0004 per 1k)
- Data egress to the internet: $0.09/GB (cross-AZ transfer within a region is about $0.01/GB in each direction)

Each rule has a calibrated cost model factoring in severity, occurrence count (sub-linear), and workload scale.
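
To make "sub-linear" concrete, the sketch below scales waste with the square root of the occurrence count, so four occurrences cost about twice as much as one rather than four times. The EMR rate and cloud multipliers come from the list above; the `WASTE_FRACTION` table, the square-root choice, and the default cluster shape are hypothetical values picked for illustration and are not SparkGuardian's calibrated models.

```python
import math

EMR_USD_PER_VCPU_HOUR = 0.063  # from the pricing list above
CLOUD_MULTIPLIER = {"emr": 1.0, "databricks": 1.3, "gcp": 0.9, "azure": 1.1}

# Hypothetical share of cluster time wasted by one occurrence, per severity.
WASTE_FRACTION = {"LOW": 0.005, "MEDIUM": 0.02, "HIGH": 0.05, "CRITICAL": 0.10}


def monthly_waste_usd(
    severity: str,
    occurrences: int,
    nodes: int,
    vcpus_per_node: int = 8,      # hypothetical cluster shape
    hours_per_month: int = 200,   # hypothetical monthly runtime
    cloud: str = "emr",
) -> float:
    """Estimate monthly waste for one rule; occurrences scale sub-linearly."""
    if occurrences <= 0:
        return 0.0
    compute_cost = (
        nodes * vcpus_per_node * hours_per_month
        * EMR_USD_PER_VCPU_HOUR * CLOUD_MULTIPLIER[cloud]
    )
    # Square-root scaling, capped so waste never exceeds total compute cost.
    fraction = min(1.0, WASTE_FRACTION[severity] * math.sqrt(occurrences))
    return round(compute_cost * fraction, 2)


one = monthly_waste_usd("HIGH", 1, nodes=10)
four = monthly_waste_usd("HIGH", 4, nodes=10)
print(one, four)  # four occurrences cost about twice as much as one
```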

---

## Audit History (Longitudinal Analysis)

Track code quality evolution across commits, releases, or daily runs. Supports CSV export for academic and research analysis.

```bash
# Record a snapshot after analysis
sparkguardian analyze . --history

# View evolution
sparkguardian history
# ↘️ Latest delta: -8.5
# ⚠️ REGRESSION DETECTED (delta < -5)

# Export to CSV for plotting / longitudinal study
sparkguardian history --csv evolution.csv

# Filter by file
sparkguardian history --target src/pipeline.py
```

Snapshots are stored in `.sparkguardian-history.json` with git commit/branch metadata for traceability.

---

## SARIF Output (GitHub Code Scanning)

Native integration with GitHub Code Scanning, GitLab SAST, Azure DevOps, and SonarQube:

```bash
sparkguardian analyze . --sarif report.sarif
```

In `.github/workflows/sparkguardian.yml`:

```yaml
- name: SparkGuardian
  run: sparkguardian analyze . --sarif sparkguardian.sarif
- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: sparkguardian.sarif
```

Findings appear directly in the **Security** tab of GitHub with severity-graded annotations on pull requests.
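
For readers unfamiliar with the format, this is roughly what a minimal SARIF v2.1.0 document for one finding looks like. The severity-to-level mapping and the input dictionary keys are assumptions made for this sketch; the actual `sarif_reporter.py` output may include more fields.

```python
import json

# Assumed mapping from SparkGuardian severities to SARIF result levels.
SEVERITY_TO_LEVEL = {
    "CRITICAL": "error",
    "HIGH": "error",
    "MEDIUM": "warning",
    "LOW": "note",
}


def to_sarif(findings: list[dict]) -> dict:
    """Build a minimal SARIF v2.1.0 log from a list of finding dictionaries."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    results = [
        {
            "ruleId": f["rule_id"],
            "level": SEVERITY_TO_LEVEL[f["severity"]],
            "message": {"text": f["message"]},
            "locations": [
                {
                    "physicalLocation": {
                        "artifactLocation": {"uri": f["path"]},
                        "region": {"startLine": f["line"]},
                    }
                }
            ],
        }
        for f in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": "SparkGuardian",
                        "rules": [{"id": rule_id} for rule_id in rule_ids],
                    }
                },
                "results": results,
            }
        ],
    }


FINDING = {
    "rule_id": "CY008",
    "severity": "HIGH",
    "message": "repartition(1) detected",
    "path": "src/pipeline.py",
    "line": 42,
}
print(json.dumps(to_sarif([FINDING]), indent=2))
```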

---

## Architecture

```
sparkguardian/
├── cli.py                    # Typer CLI (9 commands)
├── config.py                 # TOML/YAML loader
├── constants.py              # Exit codes, defaults
│
├── parser/
│   ├── ast_analyzer.py       # libcst static analysis
│   ├── plan_analyzer.py      # Execution plan analysis
│   ├── execution_parser.py   # Plan parser
│   ├── ignore_handler.py     # cy:ignore engine
│   └── node_extractors.py    # NodeType classification
│
├── rules/
│   ├── base.py               # BaseRule ABC
│   ├── registry.py           # @register decorator system
│   ├── static_rules.py       # 16 AST rules
│   ├── dynamic_rules.py      # 5 plan rules
│   ├── categories.py
│   └── severity.py
│
├── models/
│   ├── issue.py              # Issue, Location, Severity, ImpactType
│   ├── report.py             # AnalysisReport
│   └── execution_node.py     # ExecutionNode, NodeType
│
├── rewriter/
│   ├── safe_rewriter.py      # SafeRewriter (backup + apply)
│   ├── transformer.py        # libcst CSTTransformers
│   └── diff_generator.py     # Unified diff output
│
├── ai/
│   ├── __init__.py           # Public exports: LocalLLM, AISession, SparkKnowledgeBase…
│   ├── local_llm.py          # LocalLLM (cache + streaming + structured output)
│   ├── session.py            # AISession (multi-turn history, context injection)
│   ├── remediation.py        # RemediationPlanner (LLM + heuristic fallback)
│   ├── structured.py         # AIAnalysis, RemediationPlan dataclasses + JSON parser
│   ├── cache.py              # LRU AIResponseCache (128 entries, hit-rate tracking)
│   ├── rag.py                # SparkKnowledgeBase (24 entries, official Spark 4.1.1 docs)
│   └── prompts.py            # Versioned prompt library (system, structured, chat, remediation)
│
├── reporting/
│   ├── terminal_reporter.py  # Rich colored output
│   ├── json_reporter.py
│   ├── markdown_reporter.py
│   ├── html_reporter.py      # Dark-theme HTML
│   └── sarif_reporter.py     # SARIF v2.1.0 (GitHub Code Scanning)
│
├── scoring/
│   └── score_engine.py       # 0–100 performance score
│
├── cost/                     # NEW
│   └── estimator.py          # Cloud cost models (EMR/Databricks/GCP/Azure)
│
├── history/                  # NEW
│   └── tracker.py            # Longitudinal audit snapshots + CSV export
│
├── ci/
│   └── exit_codes.py         # resolve_exit_code()
│
└── utils/
    ├── logger.py
    └── file_utils.py         # backup_file, collect_python_files
```
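
The `BaseRule` ABC and `@register` decorator in `rules/` suggest a plugin-style registry. A minimal sketch of that pattern follows; the attribute names, the `check` signature, and the text-based detection are illustrative assumptions (the real rules operate on `libcst` nodes, not raw strings).

```python
from abc import ABC, abstractmethod


class BaseRule(ABC):
    """Common interface for rules; attribute names are hypothetical."""

    rule_id: str
    severity: str

    @abstractmethod
    def check(self, source: str) -> list[int]:
        """Return the 1-based line numbers where the rule fires."""


RULE_REGISTRY: dict[str, type[BaseRule]] = {}


def register(cls: type[BaseRule]) -> type[BaseRule]:
    """Class decorator that adds a rule to the registry under its rule_id."""
    if cls.rule_id in RULE_REGISTRY:
        raise ValueError(f"duplicate rule id: {cls.rule_id}")
    RULE_REGISTRY[cls.rule_id] = cls
    return cls


@register
class RepartitionOneRule(BaseRule):
    rule_id = "CY008"
    severity = "HIGH"

    def check(self, source: str) -> list[int]:
        # Naive substring match, used here only to keep the sketch short.
        return [
            lineno
            for lineno, line in enumerate(source.splitlines(), start=1)
            if "repartition(1)" in line
        ]


rule = RULE_REGISTRY["CY008"]()
print(rule.check("df = spark.read.parquet('in/')\ndf.repartition(1)"))  # [2]
```

With this pattern, adding a rule only requires defining a decorated class; the engine iterates over the registry without knowing individual rules.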

### Design for Future Rust Acceleration

The `parser/` layer is intentionally isolated for future rewrite in Rust via **PyO3 / maturin**. The rule engine and reporting layers have no parser dependency, enabling drop-in acceleration without breaking the public API.

---

## CI/CD Integration

### GitHub Actions

```yaml
- name: SparkGuardian Audit
  run: sparkguardian analyze . --json > report.json
  # Exit 1 on CRITICAL → build fails automatically
```

See [`.github/workflows/sparkguardian.yml`](.github/workflows/sparkguardian.yml) for a full example.

### GitLab CI

```yaml
sparkguardian:
  script: sparkguardian analyze . --json > report.json
```

See [`ci/gitlab-ci.yml`](ci/gitlab-ci.yml).

### Jenkins

See [`ci/Jenkinsfile`](ci/Jenkinsfile) for a pipeline with HTML report archiving.

---

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_ast.py -v

# With coverage
pytest tests/ --cov=sparkguardian --cov-report=term-missing
```

**373 tests | 0 failures | 0 warnings | 0 ruff errors | ~0.5s**

| Test file | Covers |
|-----------|--------|
| `test_ai.py` | LLM cache, session, structured output, remediation, prompts |
| `test_ast.py` | All 16 static rules, analyzer edge cases, ignore system |
| `test_ci.py` | Exit codes (0/1/2) |
| `test_config.py` | TOML, YAML, defaults, rule overrides |
| `test_file_utils.py` | backup_file, collect_python_files |
| `test_ignore.py` | cy:ignore inline and block directives |
| `test_models.py` | Issue, Location, AnalysisReport, ExecutionNode |
| `test_node_extractors.py` | NodeType classification, plan parser |
| `test_plan.py` | Dynamic rules, execution plan analysis |
| `test_rag.py` | Knowledge base entries, keyword search, config params, sources |
| `test_registry.py` | Rule registry, @register decorator |
| `test_reporting.py` | JSON, Markdown, HTML reporters |
| `test_rewriter.py` | CST transformer, diff generator |
| `test_safe_rewriter.py` | SafeRewriter dry-run, apply, backup |
| `test_scoring.py` | 0–100 score engine, penalty matrix |
| `test_cost.py` | Cloud cost estimator (4 clouds, workload scaling, severity weights) |
| `test_sarif.py` | SARIF v2.1.0 format compliance, severity mapping |
| `test_history.py` | Audit history, CSV export, regression detection, persistence |

---

## Supported Models (Local AI)

| Model | Pull command | Notes |
|-------|-------------|-------|
| DeepSeek-Coder 6.7B | `ollama pull deepseek-coder:6.7b` | **Recommended** |
| CodeLlama 7B | `ollama pull codellama:7b` | Good code quality |
| Llama 3 8B | `ollama pull llama3:8b` | General purpose |
| Mistral 7B | `ollama pull mistral:7b` | Fast inference |

SparkGuardian auto-selects the first available model from this list.
