Metadata-Version: 2.4
Name: sigmagen
Version: 0.1.0
Summary: AI-powered Sigma rule generator using MITRE ATT&CK and RAG
Author: SigmaGen Contributors
License: MIT
Project-URL: Homepage, https://github.com/sigmagen-project/sigmagen
Project-URL: Issues, https://github.com/sigmagen-project/sigmagen/issues
Keywords: sigma,detection,mitre,att&ck,security,siem
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Security
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click==8.1.7
Requires-Dist: python-dotenv==1.0.1
Requires-Dist: PyYAML==6.0.2
Requires-Dist: requests==2.32.3
Requires-Dist: rich==13.7.1
Requires-Dist: chromadb==0.5.3
Requires-Dist: sentence-transformers==3.1.1
Requires-Dist: anthropic==0.34.0
Requires-Dist: openai==1.40.0
Requires-Dist: fastapi==0.111.1
Requires-Dist: uvicorn[standard]==0.30.3
Requires-Dist: pydantic==2.8.2
Requires-Dist: pydantic-settings==2.4.0
Provides-Extra: dev
Requires-Dist: pytest==8.3.2; extra == "dev"
Requires-Dist: httpx==0.27.0; extra == "dev"

<p align="center">
  <h1 align="center">SigmaGen</h1>
  <p align="center">
    <strong>Generate production-ready Sigma detection rules from MITRE ATT&CK technique IDs or raw attack telemetry — powered by RAG.</strong>
  </p>
  <p align="center">
    <a href="#quickstart">Quickstart</a> &bull;
    <a href="#how-it-works">How It Works</a> &bull;
    <a href="#cli-reference">CLI Reference</a> &bull;
    <a href="#rest-api">REST API</a> &bull;
    <a href="#contributing">Contributing</a>
  </p>
</p>

---

## The Problem

A persistent gap in enterprise security is the time between a new technique appearing in the wild and a detection rule landing in your SIEM. Most SOC teams don't have enough detection engineers, and writing a high-quality Sigma rule from scratch (with the right logsource, field mappings, false positive filters, and ATT&CK tags) takes 30-60 minutes per technique.

## The Solution

SigmaGen closes this gap. Give it a technique ID or describe the attack behavior, and it generates deployable Sigma YAML in seconds:

```bash
$ sigmagen generate --technique T1059.001
```

```
──────────────────── SigmaGen Rule Generation ────────────────────
Retrieving context from knowledge base...
  Retrieved 5 techniques, 5 existing rules
Generating rules via anthropic...
  LLM returned 3 rule(s)
──────────────────────── Rule 1 ────────────────────────────────
  title: Suspicious PowerShell Encoded Command Execution
  id: 7f3a2c1e-84b6-4d9f-a031-5e8c7b2f9d14
  status: experimental
  logsource:
    category: process_creation
    product: windows
  detection:
    selection_image:
      Image|endswith:
        - '\powershell.exe'
        - '\pwsh.exe'
    selection_encoded:
      CommandLine|contains:
        - ' -EncodedCommand '
        - ' -enc '
        - ' -EC '
    filter_known_tools:
      ParentImage|endswith: '\msiexec.exe'
    condition: selection_image and selection_encoded
              and not filter_known_tools
  level: medium

  Validation: PASSED

──────────────────────── Summary ───────────────────────────────
  3 rules generated  |  3 passed  |  0 failed
  Output: output/
    v suspicious_powershell_encoded_command_execution.yml   [medium]
    v powershell_suspicious_download_cradle_execution.yml   [high]
    v powershell_amsi_bypass_attempt_detected.yml           [high]
```

Every generated rule includes specific detection logic with field-value conditions, false positive filters, ATT&CK tags, and logsource mappings — not generic templates.
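
To make the detection logic in Rule 1 concrete, here is a minimal sketch of how its condition evaluates against a process-creation event. It is not SigmaGen code or a Sigma backend: the helper names and the sample events are hypothetical, and only the `endswith` and `contains` modifiers used by this one rule are covered.

```python
# Illustrative only: a tiny evaluator for the sample rule, not part of SigmaGen.

def endswith_any(value, suffixes):
    """Sigma |endswith: case-insensitive, list values are OR-ed."""
    return any(value.lower().endswith(s.lower()) for s in suffixes)


def contains_any(value, needles):
    """Sigma |contains: case-insensitive, list values are OR-ed."""
    return any(n.lower() in value.lower() for n in needles)


def rule_matches(event):
    selection_image = endswith_any(
        event.get("Image", ""), ["\\powershell.exe", "\\pwsh.exe"]
    )
    selection_encoded = contains_any(
        event.get("CommandLine", ""), [" -EncodedCommand ", " -enc ", " -EC "]
    )
    filter_known_tools = endswith_any(
        event.get("ParentImage", ""), ["\\msiexec.exe"]
    )
    # condition: selection_image and selection_encoded and not filter_known_tools
    return selection_image and selection_encoded and not filter_known_tools


# Hypothetical events for illustration.
suspicious = {
    "Image": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe",
    "CommandLine": "powershell.exe -NoProfile -enc SQBFAFgA",
    "ParentImage": "C:\\Windows\\System32\\cmd.exe",
}
installer = dict(suspicious, ParentImage="C:\\Windows\\System32\\msiexec.exe")

print(rule_matches(suspicious))  # True
print(rule_matches(installer))   # False
```

The second event is identical except for its parent process, which shows what the false positive filter buys: the selections still match, but the `not filter_known_tools` clause suppresses the alert.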

---

## How It Works

SigmaGen is **not** a wrapper around "write me a Sigma rule." It's a RAG pipeline that retrieves real ATT&CK detection guidance and existing Sigma rules from a local vector store, then uses that context to ground the LLM's output in production patterns.

```
          User Input                    Knowledge Base
     (T1059.001 or text)            ┌──────────────────┐
              │                     │  ATT&CK (691)    │
              ▼                     │  Sigma  (3110)   │
     ┌────────────────┐             └────────┬─────────┘
     │   Retriever    │◄────────────────────┘
     │  (ChromaDB)    │   cosine similarity
     └───────┬────────┘   + metadata filter
             │
             ▼
     ┌────────────────┐
     │ Prompt Builder │  packs ATT&CK context
     │                │  + 3 best Sigma examples
     └───────┬────────┘
             │
             ▼
     ┌────────────────┐
     │  Claude / GPT  │  generates 1-3 rules
     └───────┬────────┘
             │
             ▼
     ┌────────────────┐
     │   Validator    │  schema + field checks
     └───────┬────────┘
             │
             ▼
       .yml files in output/
```

**Ingestion** — The ATT&CK STIX bundle (690+ techniques with detection guidance, data sources, tactics) and SigmaHQ's stable rules (3100+ community rules) are parsed and embedded into ChromaDB using `all-MiniLM-L6-v2`.
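
As a rough illustration of the parsing half of ingestion, the sketch below turns one object from the ATT&CK STIX bundle into a text document plus flat metadata, ready to be embedded. The function name, the document layout, and the metadata keys are assumptions made for this example; the STIX property names (`external_references`, `kill_chain_phases`, `x_mitre_platforms`, `x_mitre_data_sources`, `x_mitre_detection`) are the ones published in the Enterprise bundle, though individual ATT&CK releases may omit some of them.

```python
from typing import Any, Dict, Optional, Tuple


def technique_to_document(obj: Dict[str, Any]) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (text to embed, metadata) for a usable technique, else None."""
    if obj.get("type") != "attack-pattern":
        return None
    if obj.get("revoked") or obj.get("x_mitre_deprecated"):
        return None

    # The ATT&CK ID (e.g. T1059.001) lives in the mitre-attack external reference.
    technique_id = next(
        (
            ref.get("external_id")
            for ref in obj.get("external_references", [])
            if ref.get("source_name") == "mitre-attack"
        ),
        None,
    )
    if not technique_id:
        return None

    tactics = [phase["phase_name"] for phase in obj.get("kill_chain_phases", [])]
    text = "\n".join(
        [
            f"{technique_id} {obj.get('name', '')}",
            f"Tactics: {', '.join(tactics)}",
            f"Platforms: {', '.join(obj.get('x_mitre_platforms', []))}",
            f"Data sources: {', '.join(obj.get('x_mitre_data_sources', []))}",
            f"Detection: {obj.get('x_mitre_detection', '')}",
        ]
    )
    # Metadata is kept flat (strings only) so it can be used in exact-match filters.
    metadata = {
        "technique_id": technique_id,
        "name": obj.get("name", ""),
        "tactics": ",".join(tactics),
    }
    return text, metadata


# Hypothetical, heavily trimmed STIX object for illustration.
sample = {
    "type": "attack-pattern",
    "name": "PowerShell",
    "external_references": [{"source_name": "mitre-attack", "external_id": "T1059.001"}],
    "kill_chain_phases": [{"kill_chain_name": "mitre-attack", "phase_name": "execution"}],
    "x_mitre_platforms": ["Windows"],
    "x_mitre_detection": "Monitor process creation for powershell.exe.",
}
text, metadata = technique_to_document(sample)
print(metadata["technique_id"])  # T1059.001
```

Skipping revoked and deprecated objects is a design choice in this sketch; it keeps retired techniques out of the retrieval results.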

**Retrieval** — Technique IDs are matched with an exact metadata filter first, then expanded with semantic similarity search for related context. Free-text queries use pure semantic search across both collections. Results are deduplicated and ranked.
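
The ordering described here (exact technique match first, semantic neighbours second, duplicates dropped) can be sketched with an in-memory stand-in for the vector store. Everything in this block is hypothetical: SigmaGen queries ChromaDB rather than a Python list, and the two-dimensional vectors only exist to keep the example readable.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(docs, query_vec, technique_id=None, n_results=3):
    """Exact technique_id matches first, then the rest by similarity, deduplicated."""
    exact = [d for d in docs if technique_id and d["technique_id"] == technique_id]
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)

    results, seen = [], set()
    for doc in exact + ranked:
        if doc["id"] not in seen:  # drop documents already returned by the exact match
            seen.add(doc["id"])
            results.append(doc)
    return results[:n_results]


# Toy documents with made-up embeddings.
docs = [
    {"id": "a", "technique_id": "T1059.001", "vector": [1.0, 0.0]},
    {"id": "b", "technique_id": "T1059", "vector": [0.9, 0.1]},
    {"id": "c", "technique_id": "T1003.001", "vector": [0.0, 1.0]},
]
query = [0.8, 0.2]

print([d["id"] for d in retrieve(docs, query, technique_id="T1003.001", n_results=2)])  # ['c', 'b']
print([d["id"] for d in retrieve(docs, query, n_results=2)])  # ['b', 'a']
```

With a technique ID, document `c` is returned first even though it is the least similar to the query vector; without one, the ranking falls back to similarity alone.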

**Generation** — The prompt includes the ATT&CK technique's detection guidance, data sources, and platforms, plus up to 3 existing Sigma rules as structural examples. The system prompt enforces specific detection conditions — no `selection: *` or match-all patterns.
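
A minimal sketch of this prompt-assembly step follows. The function name, dictionary keys, and instruction wording are assumptions for illustration and are not SigmaGen's actual prompt; the only behaviour taken from the description above is that ATT&CK context comes first and at most three existing rules are attached as examples.

```python
def build_prompt(technique, sigma_examples, n_rules=1, max_examples=3):
    """Assemble a user prompt from ATT&CK context plus a capped list of example rules."""
    parts = [
        f"Technique: {technique['id']} {technique['name']}",
        f"Platforms: {', '.join(technique.get('platforms', []))}",
        f"Data sources: {', '.join(technique.get('data_sources', []))}",
        f"Detection guidance: {technique.get('detection', '')}",
        "",
    ]
    # Only the best-ranked examples are included, to keep the prompt small.
    for i, rule_yaml in enumerate(sigma_examples[:max_examples], start=1):
        parts += [f"Example rule {i}:", rule_yaml.strip(), ""]
    parts.append(
        f"Write {n_rules} new Sigma rule(s) as YAML. "
        "Every selection must name specific fields and values."
    )
    return "\n".join(parts)


# Hypothetical inputs for illustration.
technique = {
    "id": "T1059.001",
    "name": "PowerShell",
    "platforms": ["Windows"],
    "data_sources": ["Process: Process Creation"],
    "detection": "Monitor for encoded PowerShell command lines.",
}
examples = [f"title: Example {n}\n" for n in range(1, 5)]  # four candidates, three are used

prompt = build_prompt(technique, examples, n_rules=2)
print(prompt.splitlines()[0])  # Technique: T1059.001 PowerShell
```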

**Validation** — Every rule is checked for required fields, valid levels/statuses, UUID format, logsource structure, detection logic (must have named selections + condition), and ATT&CK tags.
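
The checks listed above might look roughly like the sketch below, which works on a rule that has already been parsed from YAML into a dictionary. The required-field list, the tag pattern, and the error messages are assumptions; the level and status values are the ones defined by the Sigma specification.

```python
import re
import uuid

# Assumed field list and tag pattern; SigmaGen's real validator may differ.
REQUIRED_FIELDS = ("title", "id", "status", "logsource", "detection", "level")
LEVELS = ("informational", "low", "medium", "high", "critical")
STATUSES = ("stable", "test", "experimental", "deprecated", "unsupported")
ATTACK_TAG = re.compile(r"attack\.t\d{4}(\.\d{3})?", re.IGNORECASE)


def validate_rule(rule):
    """Return a list of problems; an empty list means the rule passed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rule]

    if "level" in rule and rule["level"] not in LEVELS:
        errors.append(f"invalid level: {rule['level']}")
    if "status" in rule and rule["status"] not in STATUSES:
        errors.append(f"invalid status: {rule['status']}")
    if "id" in rule:
        try:
            uuid.UUID(str(rule["id"]))
        except ValueError:
            errors.append("id is not a UUID")

    if "logsource" in rule and not (isinstance(rule["logsource"], dict) and rule["logsource"]):
        errors.append("logsource must be a non-empty mapping")

    detection = rule.get("detection")
    if isinstance(detection, dict):
        if "condition" not in detection:
            errors.append("detection has no condition")
        if not [k for k, v in detection.items() if k != "condition" and v]:
            errors.append("detection has no named selections")
    elif "detection" in rule:
        errors.append("detection must be a mapping")

    if not any(ATTACK_TAG.fullmatch(str(tag)) for tag in rule.get("tags") or []):
        errors.append("no ATT&CK technique tag")
    return errors


good = {
    "title": "Suspicious PowerShell Encoded Command Execution",
    "id": "7f3a2c1e-84b6-4d9f-a031-5e8c7b2f9d14",
    "status": "experimental",
    "logsource": {"category": "process_creation", "product": "windows"},
    "detection": {
        "selection_image": {"Image|endswith": ["\\powershell.exe", "\\pwsh.exe"]},
        "condition": "selection_image",
    },
    "level": "medium",
    "tags": ["attack.execution", "attack.t1059.001"],
}
bad = dict(good, level="severe", detection={"condition": "selection"})

print(validate_rule(good))  # []
print(validate_rule(bad))   # ['invalid level: severe', 'detection has no named selections']
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a generated rule at once.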

---

## Quickstart

### Install

```bash
git clone https://github.com/sigmagen-project/sigmagen.git
cd sigmagen
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

### Setup

The setup wizard checks your API key and knowledge base, then verifies the setup:

```bash
$ sigmagen init
```

```
╭─ SigmaGen Setup Wizard ─╮
╰────────────────────────╯

Step 1/3  Checking API key...
  v ANTHROPIC_API_KEY is set (provider: anthropic)

Step 2/3  Checking knowledge base...
  v ATT&CK techniques: 691
  v Sigma rules: 3110
  Knowledge base is ready.

Step 3/3  Verifying setup...
  v Everything looks good!

╭─ Next steps ──────────────────────────────────────────╮
│ Ready to generate.                                    │
│                                                       │
│   sigmagen generate --technique T1059.001             │
│   sigmagen generate --description "certutil download" │
╰───────────────────────────────────────────────────────╯
```

Or manually:

```bash
cp .env.example .env     # add your ANTHROPIC_API_KEY or OPENAI_API_KEY
sigmagen ingest all      # one-time, ~5 min
```

### Generate

Three ways to generate rules:

```bash
# By technique ID (supports tab completion)
sigmagen generate --technique T1059.001

# By description
sigmagen generate --description "attacker used certutil to download a payload"

# Interactive — just run it, get prompted
sigmagen generate
```

```
  Input type (technique, description, telemetry): technique
  Technique ID (e.g. T1059.001): T1059.001
  How many rules? (1-5) [1]: 3
```

### Search the Knowledge Base

See which ATT&CK techniques and existing Sigma rules match a query:

```bash
$ sigmagen retrieve --query "credential dumping lsass"
```

```
             ATT&CK Techniques
┌───────────┬───────────────────────┬───────────────────┬───────┐
│ ID        │ Name                  │ Tactics           │ Score │
├───────────┼───────────────────────┼───────────────────┼───────┤
│ T1003.001 │ LSASS Memory          │ credential-access │ 0.631 │
│ T1003.004 │ LSA Secrets           │ credential-access │ 0.607 │
│ T1547.008 │ LSASS Driver          │ persistence       │ 0.540 │
│ T1003     │ OS Credential Dumping │ credential-access │ 0.512 │
│ T1110.004 │ Credential Stuffing   │ credential-access │ 0.480 │
└───────────┴───────────────────────┴───────────────────┴───────┘
                 Sigma Rules
┌─────────────────────────────────────┬──────────┬───────────┬───────┐
│ Title                               │ Level    │ Technique │ Score │
├─────────────────────────────────────┼──────────┼───────────┼───────┤
│ Credential Dumping Via LSASS        │ medium   │ T1003.001 │ 0.787 │
│ LSASS Process Clone                 │ critical │ T1003     │ 0.765 │
│ Credential Dumping By Python Tool   │ high     │ T1003.001 │ 0.763 │
│ LSASS SilentProcessExit Technique   │ critical │ T1003.001 │ 0.741 │
│ Password Dumper Activity on LSASS   │ high     │ T1003.001 │ 0.739 │
└─────────────────────────────────────┴──────────┴───────────┴───────┘
```

### Validate

```bash
$ sigmagen validate output/suspicious_powershell_encoded_command_execution.yml
```

```
──── Validating suspicious_powershell_encoded_command_execution.yml ────
Validation PASSED
```

### Status Dashboard

```bash
$ sigmagen status
```

```
SigmaGen v0.1.0
───────────────────────────────────────────────────────
Knowledge Base     Documents  Status
attack_techniques        691  v Ready
sigma_rules             3110  v Ready

LLM Provider     anthropic
API Key          v Set
Model            claude-sonnet-4-6
Embedding Model  all-MiniLM-L6-v2
───────────────────────────────────────────────────────
Ready to generate. Run: sigmagen generate --technique T1059.001
```

### Error Handling

Invalid inputs are caught early with clear guidance:

```
$ sigmagen generate --technique fdsb
x 'fdsb' is not a valid ATT&CK technique ID.
  Expected format: T1059 or T1059.001

$ sigmagen generate --technique T1059.001   # before running ingest
x Knowledge base is not ready.
  - attack_techniques collection is empty
  Run: sigmagen ingest all

$ sigmagen generate                          # without an API key
x No API key found.
  Add ANTHROPIC_API_KEY to your .env file.
  Run: cp .env.example .env
```
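
The technique ID check in the first error can be expressed as a single regular expression. This is a sketch of the format stated in the message (`T1059` or `T1059.001`); the function name is hypothetical, and accepting lowercase input is an assumption rather than documented behaviour.

```python
import re

# "T" + four digits, optionally followed by "." + three digits for a sub-technique.
TECHNIQUE_ID = re.compile(r"T\d{4}(\.\d{3})?")


def is_valid_technique_id(value):
    """True for IDs shaped like T1059 or T1059.001 (case-insensitive by assumption)."""
    return TECHNIQUE_ID.fullmatch(value.strip().upper()) is not None


for candidate in ["T1059", "T1059.001", "fdsb", "T1059.1"]:
    print(candidate, is_valid_technique_id(candidate))
```

Using `fullmatch` instead of `match` rejects inputs with trailing characters, such as `T1059.0011`.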

---

## CLI Reference

```
sigmagen
├── init             First-run setup wizard
├── ingest
│   ├── attack       Download and embed ATT&CK techniques  [--force]
│   ├── sigma        Clone and embed Sigma rules           [--force] [--full-corpus]
│   └── all          Both                                  [--force] [--full-corpus]
├── generate         Generate Sigma rules via RAG + LLM
│   ├── -t T1059.001       by technique ID (tab-completable)
│   ├── -d "description"   by free text
│   ├── -T ./log.xml       by telemetry file
│   ├── -o ./output        output directory
│   ├── -n 3               number of rules (1-5)
│   └── -p openai          override LLM provider
├── retrieve         Search the knowledge base             -q "query" [-n 5]
├── validate         Validate a Sigma YAML file            <path>
├── status           Dashboard: collections + config
├── serve            Start REST API server                 [--host] [--port]
└── setup-shell      Enable tab completion                 [bash|zsh|fish|powershell]
```

---

## REST API

Start the server with `sigmagen serve`. Swagger docs are available at `http://localhost:8000/docs`.

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/generate` | Generate Sigma rules |
| `GET` | `/retrieve?q=...` | Search the knowledge base |
| `GET` | `/status` | Collection stats + config |
| `GET` | `/health` | Health check |

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"technique_id": "T1059.001", "n_rules": 2}'
```
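
The same request can be made from Python using only the standard library. This is a sketch: the helper names are hypothetical, the request body mirrors the `curl` example, and the shape of the JSON response is not documented here, so the result is returned as-is without assuming any keys.

```python
import json
import urllib.request


def build_generate_request(technique_id, n_rules=1, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /generate request."""
    body = json.dumps({"technique_id": technique_id, "n_rules": n_rules}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(technique_id, n_rules=1, timeout=120):
    """Send the request to a running `sigmagen serve` and return the decoded JSON."""
    request = build_generate_request(technique_id, n_rules)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_generate_request("T1059.001", n_rules=2)
print(request.get_method(), request.full_url)  # POST http://localhost:8000/generate
```

With the server running, `generate("T1059.001", n_rules=2)` sends the request. The generous timeout is a guess that allows for LLM latency.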

---

## Tech Stack

| Component | Technology | Why |
|-----------|-----------|-----|
| CLI | Click + Rich | Subcommands, tab completion, syntax-highlighted output |
| Vector store | ChromaDB (local, persistent) | No external database needed |
| Embeddings | sentence-transformers (`all-MiniLM-L6-v2`) | Local, no API key, fast |
| LLM | Anthropic Claude / OpenAI GPT | Swappable via env var |
| API | FastAPI + Pydantic | Type-safe, auto-documented |
| Validation | Pure Python + optional sigma-cli | Zero external deps for core validation |

No LangChain. No LlamaIndex. The RAG pipeline is built directly on ChromaDB queries.

---

## Knowledge Bases

| Source | Documents | What's embedded |
|--------|-----------|-----------------|
| [MITRE ATT&CK Enterprise](https://github.com/mitre/cti) | 691 techniques | ID, name, tactics, platforms, data sources, detection guidance |
| [SigmaHQ/sigma](https://github.com/SigmaHQ/sigma) | 3110 rules | Title, description, logsource, detection logic, level, technique tags |

Both are downloaded and embedded locally during `sigmagen ingest all`. The only data that leaves your machine is the LLM API request made during generation.

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_PROVIDER` | `anthropic` | `anthropic` or `openai` |
| `ANTHROPIC_API_KEY` | — | Required if provider is anthropic |
| `ANTHROPIC_MODEL` | `claude-sonnet-4-6` | Claude model ID |
| `OPENAI_API_KEY` | — | Required if provider is openai |
| `OPENAI_MODEL` | `gpt-4o` | OpenAI model ID |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Local embedding model |
| `SIGMAGEN_DATA_DIR` | `./data` | Where ATT&CK JSON, Sigma repo, and ChromaDB live |
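
To show how these variables and defaults fit together, here is a standard-library sketch of resolving them into one settings object. The package lists `pydantic-settings` as a dependency and presumably uses it for this; the `Settings` class and `load_settings` function below are hypothetical stand-ins, with only the variable names and defaults taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    api_key: Optional[str]
    model: str
    embedding_model: str
    data_dir: str


def load_settings(env=None):
    """Resolve settings from a mapping (defaults to os.environ), applying the documented defaults."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "anthropic").lower()
    if provider == "anthropic":
        api_key = env.get("ANTHROPIC_API_KEY")
        model = env.get("ANTHROPIC_MODEL", "claude-sonnet-4-6")
    elif provider == "openai":
        api_key = env.get("OPENAI_API_KEY")
        model = env.get("OPENAI_MODEL", "gpt-4o")
    else:
        raise ValueError(f"LLM_PROVIDER must be 'anthropic' or 'openai', got {provider!r}")
    return Settings(
        llm_provider=provider,
        api_key=api_key,  # None means the key still has to be set
        model=model,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        data_dir=env.get("SIGMAGEN_DATA_DIR", "./data"),
    )


defaults = load_settings({})
print(defaults.llm_provider, defaults.model)  # anthropic claude-sonnet-4-6
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.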

---

## Contributing

```bash
git clone https://github.com/sigmagen-project/sigmagen.git
cd sigmagen
pip install -e ".[dev]"
pytest tests/ -v   # 38 tests, all passing
```

1. Fork the repo
2. Create a feature branch
3. Make sure tests pass
4. Submit a PR

---

## License

MIT
