Metadata-Version: 2.4
Name: crdb-sql-audit
Version: 0.2.6
Summary: Analyze PostgreSQL SQL logs for CockroachDB compatibility
Author-email: Virag Tripathi <virag.tripathi@gmail.com>
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: pyyaml
Dynamic: license-file

[![PyPI version](https://img.shields.io/pypi/v/crdb-sql-audit)](https://pypi.org/project/crdb-sql-audit/)
[![Python version](https://img.shields.io/pypi/pyversions/crdb-sql-audit)](https://pypi.org/project/crdb-sql-audit/)
[![License](https://img.shields.io/pypi/l/crdb-sql-audit)](https://pypi.org/project/crdb-sql-audit/)
[![Build status](https://github.com/viragtripathi/crdb-sql-audit/actions/workflows/python-ci.yml/badge.svg)](https://github.com/viragtripathi/crdb-sql-audit/actions)

# crdb-sql-audit

A powerful CLI tool to extract, deduplicate, and analyze SQL logs for **CockroachDB compatibility** using a flexible, rule-based engine.

## 🚀 Features
- Works with **any SQL dialect** (PostgreSQL, MySQL, Oracle, etc.)
- Extracts SQL and function calls using customizable search terms (e.g. `execute`, `pg_`)
- Deduplicates repeated SQL statements from logs
- Analyzes SQL using a **YAML-based rule engine**
- Supports default compatibility rules (PostgreSQL ➜ CockroachDB)
- Allows **custom rule sets** via `--rules`
- Logs analysis output to both terminal and `crdb_sql_audit.log`
- Automatically detects SQL statement types (e.g. SELECT, DELETE)
- Friendly CLI with `--help` and `--version`
- Export full reports in multiple formats:
  - `.sql`: Deduplicated queries
  - `.csv`: Raw compatibility issue list
  - `.md`: Developer-friendly Markdown report
  - `.html`: Interactive browser report with sorting/filtering
  - `.png`: Visual bar chart of issues

## 🖼 Sample Output

| Report Type | Preview                                                                                                               |
|-------------|-----------------------------------------------------------------------------------------------------------------------|
| HTML        | ![HTML Report Screenshot](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_report.png) |
| Chart       | ![Bar Chart](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_chart.png)               |
| CSV         | ![CSV Snippet](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_csv.png)               |
| SQL         | ![SQL Snippet](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_sql.png)               |
| Markdown    | ![Markdown Snippet](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/sample_md.png)           |


## 📦 Installation

### Option A: Local Dev Install
```bash
git clone https://github.com/your-org/crdb-sql-audit.git
cd crdb-sql-audit
python -m venv venv
source venv/bin/activate
pip install .
```

### Option B: Build via `pyproject.toml`
```bash
python -m build
pip install dist/crdb_sql_audit-0.2.0-py3-none-any.whl
```

## 🧪 Usage

```bash
crdb-sql-audit \
  --dir /path/to/logs \
  --terms execute,pg_ \
  --out output/report
```

You can also analyze a single file:

```bash
crdb-sql-audit \
  --file /path/to/logfile.log \
  --terms SELECT,INSERT \
  --raw \
  --out output/single_file_report
```

> ⚠️ You must provide either `--dir` or `--file`, but not both.

### 🔧 Additional Options

```bash
--dir       Directory containing SQL log files (mutually exclusive with --file)
--file      Single SQL log file (mutually exclusive with --dir)
--terms     Comma-separated search keywords to extract SQL (default: 'execute,pg_')
--raw       Treat each matching line as a raw SQL statement (default: False)
--rules     Path to YAML rules file (optional, default: built-in PostgreSQL rules)
--out       Output file prefix (default: crdb_audit_output/report)
--help      Show usage help
--version   Show current version
```

### 📘 CLI Help Example

```bash
crdb-sql-audit --help
```

![CLI help screenshot](https://raw.githubusercontent.com/viragtripathi/crdb-sql-audit/main/docs/cli_help.png)

### Custom Rules Example

```bash
crdb-sql-audit \
  --dir ./logs \
  --terms execute,pg_ \
  --rules ./rules/mysql_to_crdb.yaml \
  --out output/mysql_report
```

> 💡 This tool supports auditing **any SQL dialect** — just provide a rule set for your source database (e.g., PostgreSQL, MySQL, Oracle).

## 📁 Output
```
output/
├── report.sql          # Deduplicated SQL
├── report.csv          # Compatibility issues
├── report.md           # Markdown summary
├── report.html         # Interactive dashboard
├── report_chart.png    # Visual chart of issues
├── crdb_sql_audit.log  # Full run log
```

## 🧹 Preparing Your Log Files

To analyze SQL logs effectively, we recommend the following preprocessing steps:

### 1. Extract SQL-related Lines
```bash
grep "execute" app.log > sql_only.log
# or to include pg_ built-in function usage:
grep -E "execute|pg_" app.log > sql_only.log
```

### 2. Split Into Manageable Chunks (Optional but Recommended)
```bash
split -b 50M sql_only.log chunks/sql_chunk_
```

### 3. Run the Audit
```bash
crdb-sql-audit --dir chunks --terms execute,pg_ --out output/report
```

## 📚 Rule Engine Format

Rules are written in YAML and matched against each SQL line. Example:
> 💡 This is also the default rule if you don't provide `--rules` param.

```yaml
- id: malformed_dml_statements
  match: '^(SELECT|INSERT|UPDATE|DELETE FROM)\s*$'
  message: "Possibly malformed or incomplete SQL statement"
  level: warning
  tags: [syntax]

- id: special_char_in_identifier
  match: '"[^"]*#\w*"'
  message: "Table name contains unsupported special character (#)"
  level: error
  tags: [table, identifier]

- id: pg_builtins
  match: '^.*\bpg_\w+\s*\(.*$'
  message: "PostgreSQL pg_* function not supported in CockroachDB"
  level: error
  tags: [function]

#- id: low_token_sql
  #match: '^(\S+\s*){1,2}$'
  #message: "Extremely short SQL likely malformed"
  #level: warning
  #tags: [syntax]
```

> 📦 Multiple rule sets can be created to target different SQL dialects (e.g., `postgres_to_crdb.yaml`, `mysql_to_crdb.yaml`, etc.)

## 🧪 Validate Your Regex Rules

### 🔍 Online (Recommended)
Use [regex101.com](https://regex101.com/?flavor=python) to test your patterns:
- Set the **flavor to Python**
- Paste your rule into the regex field
- Paste a sample SQL line into the test area

### 🐍 In Python
You can also test your rules directly:
```python
import re
pattern = re.compile(r'^.*\bpg_\w+\s*\(.*$', re.IGNORECASE)
sql = "SELECT pg_backend_pid()"
print(bool(pattern.search(sql)))  # ✅ True
```

### 🛠 Validate with Shell
You can use basic Unix commands to check for patterns like pg_ functions directly in your log chunks:


| Task                               | Command                                                                   |
|------------------------------------|---------------------------------------------------------------------------|
| Total matches across chunks        | `grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* \| wc -l`                       |
| Unique function names              | `grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* \| sort \| uniq`                |
| Count occurrences of each function | `grep -oE '\bpg_[a-zA-Z0-9_]+\(' chunks/* \| sort \| uniq -c \| sort -nr` |
| Full SQL lines containing pg\_\*   | `grep -E '\bpg_[a-zA-Z0-9_]+\(' chunks/*`                                 |


---

## 🧪 Running Tests

This project includes a test suite using sample logs and rules to validate behavior.

### 🔧 To run locally:

```bash
python tests/test_runner.py
```

### 🧪 What it does:

* Runs `crdb-sql-audit` on a small sample of PostgreSQL-style logs
* Uses `tests/rules/test_rules.yaml`
* Verifies that a CSV report is created with expected issues

✅ This runs automatically in GitHub Actions on every commit to `main`.

