Metadata-Version: 2.4
Name: stix2tabular
Version: 0.1.0
Summary: Convert STIX cyber threat intelligence bundles to Pandas DataFrames
Project-URL: Homepage, https://github.com/mabayan/stix2tabular
Project-URL: Repository, https://github.com/mabayan/stix2tabular
Project-URL: Issues, https://github.com/mabayan/stix2tabular/issues
Author-email: Marlon Abayan <mabayan@users.noreply.github.com>
License-Expression: MIT
License-File: LICENSE
Keywords: cti,cybersecurity,dataframe,mitre-attack,pandas,stix,stix2,threat-intelligence
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Security
Requires-Python: >=3.10
Requires-Dist: pandas>=2.0
Requires-Dist: pyarrow>=12.0
Requires-Dist: stix2>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: requests>=2.28; extra == 'dev'
Description-Content-Type: text/markdown

# stix2tabular

Convert STIX cyber threat intelligence bundles to Pandas DataFrames.

## Installation

```bash
pip install stix2tabular
```

## Quick Start

```python
from stix2tabular import stix_to_tables, save_tables

tables = stix_to_tables("enterprise-attack.json")

print(tables.keys())
# → dict_keys(['attack-pattern', 'intrusion-set', 'malware', 'tool', 'relationships', ...])

print(tables["malware"].head())
#                 id     type       name  ...
# 0  malware--abc123  malware  CHOPSTICK  ...
# 1  malware--def456  malware    X-Agent  ...

# Save to Parquet for later use
save_tables(tables, "attack_tables/")
```

## Before / After

**Before (without stix2tabular):**

```python
import json
import pandas as pd

with open("enterprise-attack.json") as f:
    bundle = json.load(f)

objects_by_type = {}
relationships = []

for obj in bundle["objects"]:
    obj_type = obj.get("type")
    if obj_type == "marking-definition":
        continue
    if obj_type == "relationship":
        relationships.append({
            "id": obj["id"],
            "type": obj["type"],
            "relationship_type": obj["relationship_type"],
            "source_ref": obj["source_ref"],
            "target_ref": obj["target_ref"],
            "created": obj.get("created"),
            "modified": obj.get("modified"),
        })
        continue
    if obj_type not in objects_by_type:
        objects_by_type[obj_type] = []
    row = {}
    for key, value in obj.items():
        row[key] = value
    objects_by_type[obj_type].append(row)

tables = {}
for obj_type, rows in objects_by_type.items():
    tables[obj_type] = pd.DataFrame(rows)
tables["relationships"] = pd.DataFrame(relationships)
# Still missing: sightings, SCO handling, STIX 2.0 embedded observables,
# deduplication, multi-bundle merging, error handling...
```

**After (with stix2tabular):**

```python
from stix2tabular import stix_to_tables

tables = stix_to_tables("enterprise-attack.json")
```

## What You Get

```python
tables = stix_to_tables("enterprise-attack.json")

# One DataFrame per STIX type
tables["attack-pattern"]     # 680 rows × 15 columns
tables["intrusion-set"]      # 138 rows × 12 columns
tables["malware"]            # 490 rows × 14 columns
tables["tool"]               # 78 rows × 11 columns
tables["campaign"]           # 23 rows × 10 columns

# Relationships as a lean edge table
tables["relationships"]      # 18,400 rows × 9 columns

# Sightings
tables["sightings"]          # 42 rows × 8 columns

# SCO types (when include_scos=True)
tables["ipv4-addr"]          # 12 rows × 4 columns
```

## API Reference

### `stix_to_tables(source, include_scos=True)`

Convert STIX bundles into a dict of Pandas DataFrames.

- **`source`**: `str | list[str] | list[dict]`
  - File path (`.json`): reads and parses a single file
  - Directory path: globs all `*.json` files, merges into one set of tables
  - `list[str]`: each string is parsed as the JSON text of a complete STIX bundle
  - `list[dict]`: each dict is treated as a parsed STIX bundle
- **`include_scos`**: `bool` (default `True`)
  - When `True`, STIX Cyber-observable Objects (IP addresses, domain names, file hashes, etc.) get their own DataFrames
  - When `False`, only SDOs, relationships, and sightings are included
- **Returns**: `dict[str, pd.DataFrame]`
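
The in-memory input forms can be tried without any files on disk. The sketch below builds a minimal STIX 2.1 bundle by hand and prepares it in both the `list[dict]` and `list[str]` forms. The object names, timestamp, and the `stix_id` helper are made up for illustration. The import is guarded so the snippet still runs where stix2tabular is not installed, and no output is shown because the resulting tables depend on the library.

```python
import json
import uuid


def stix_id(obj_type: str) -> str:
    """Hypothetical helper: build an identifier of the form '<type>--<UUIDv4>'."""
    return f"{obj_type}--{uuid.uuid4()}"


TS = "2024-01-01T00:00:00.000Z"  # placeholder timestamp, not real data

malware = {
    "type": "malware", "spec_version": "2.1", "id": stix_id("malware"),
    "created": TS, "modified": TS,
    "name": "ExampleRAT", "is_family": False,
}
actor = {
    "type": "intrusion-set", "spec_version": "2.1", "id": stix_id("intrusion-set"),
    "created": TS, "modified": TS, "name": "Example Group",
}
uses = {
    "type": "relationship", "spec_version": "2.1", "id": stix_id("relationship"),
    "created": TS, "modified": TS,
    "relationship_type": "uses",
    "source_ref": actor["id"], "target_ref": malware["id"],
}
bundle = {"type": "bundle", "id": stix_id("bundle"), "objects": [malware, actor, uses]}

bundle_dicts = [bundle]                 # list[dict] form: already-parsed bundles
bundle_strings = [json.dumps(bundle)]   # list[str] form: bundles as JSON text

try:
    from stix2tabular import stix_to_tables
except ImportError:  # library not installed; the inputs above are still valid
    stix_to_tables = None

if stix_to_tables is not None:
    tables = stix_to_tables(bundle_dicts, include_scos=False)
    same_tables = stix_to_tables(bundle_strings, include_scos=False)
    print(sorted(tables))  # table names produced from the bundle
```

Both calls describe the same bundle, so they would be expected to produce equivalent tables.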

### `save_tables(tables, directory)`

Save all DataFrames to a directory as Parquet files.

- **`tables`**: dict returned by `stix_to_tables()`
- **`directory`**: path to output directory (created if it doesn't exist)
- Writes one `{type}.parquet` file per key (e.g., `malware.parquet`, `relationships.parquet`)

### `load_tables(directory)`

Load DataFrames from a directory of Parquet files.

- **`directory`**: path to directory containing `.parquet` files from `save_tables()`
- **Returns**: `dict[str, pd.DataFrame]` — dict keys derived from filenames
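
Because the keys come from filenames, a directory can be inspected before loading it. The following is a minimal sketch using only `pathlib`. `table_keys` is a hypothetical helper, not part of stix2tabular, and it assumes each key is the filename without its `.parquet` suffix, as the `{type}.parquet` naming suggests.

```python
import tempfile
from pathlib import Path


def table_keys(directory: str) -> list[str]:
    """List the table names that a directory of '{type}.parquet' files implies."""
    return sorted(p.stem for p in Path(directory).glob("*.parquet"))


# Demo with empty placeholder files (not valid Parquet, only the names matter here)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("malware", "relationships", "attack-pattern"):
        (Path(tmp) / f"{name}.parquet").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not a .parquet file
    keys = table_keys(tmp)

print(keys)
# → ['attack-pattern', 'malware', 'relationships']
```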

## Working with the Data

```python
# All techniques used by APT28
rels = tables["relationships"]
apt28_id = tables["intrusion-set"].query("name == 'APT28'")["id"].iloc[0]
technique_ids = rels.query(
    "source_ref == @apt28_id and relationship_type == 'uses'"
)["target_ref"]
techniques = tables["attack-pattern"][
    tables["attack-pattern"]["id"].isin(technique_ids)
]["name"]
```

```python
# Most common relationship types
tables["relationships"]["relationship_type"].value_counts()
```

```python
# Explode aliases to find all names for threat actors
tables["intrusion-set"].explode("aliases")[["name", "aliases"]]
```

```python
# Merge bundles from a directory of STIX feeds
tables = stix_to_tables("/path/to/stix_feeds/")
```

```python
# Join source and target names onto relationships for a denormalized view
import pandas as pd

rels = tables["relationships"].copy()
named = [df[["id", "name"]] for df in tables.values() if {"id", "name"} <= set(df.columns)]
names = pd.concat(named).drop_duplicates(subset="id")  # one name per STIX id
for side in ("source", "target"):
    # Rename so each merge adds exactly one column: source_name / target_name
    lookup = names.rename(columns={"id": f"{side}_ref", "name": f"{side}_name"})
    rels = rels.merge(lookup, on=f"{side}_ref", how="left")  # keep unnamed endpoints

## Saving and Loading

The library includes built-in Parquet persistence for lossless round-tripping:

```python
from stix2tabular import stix_to_tables, save_tables, load_tables

tables = stix_to_tables("enterprise-attack.json")

# Save all DataFrames to a directory (one .parquet file per type)
save_tables(tables, "output/attack_tables/")
# Creates: attack-pattern.parquet, intrusion-set.parquet, malware.parquet,
#          relationships.parquet, sightings.parquet, ...

# Load them back — identical DataFrames, including list/dict columns
tables = load_tables("output/attack_tables/")
```

Parquet stores list and dict columns as nested types, so they round-trip without the manual JSON serialization that CSV requires.

**CSV note:** If you need CSV, you'll need to serialize list/dict columns yourself before exporting:

```python
import json
df = tables["malware"].copy()
for col in df.columns:
    df[col] = df[col].apply(lambda x: json.dumps(x) if isinstance(x, (list, dict)) else x)
df.to_csv("malware.csv", index=False)
```
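
Reading such a CSV back needs the reverse step. Below is a minimal sketch of paired helpers. `encode_cell` and `decode_cell` are hypothetical names, not part of stix2tabular, and the sample row is made up. The decoder is a heuristic: any string cell that parses as a JSON array or object is converted, so a plain string that happens to look like JSON would be converted too.

```python
import json


def encode_cell(value):
    """Serialize lists and dicts to JSON text; leave other values unchanged."""
    return json.dumps(value) if isinstance(value, (list, dict)) else value


def decode_cell(value):
    """Parse strings that look like a JSON array or object; leave the rest alone."""
    if isinstance(value, str) and value.startswith(("[", "{")):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value  # looked like JSON but was not; keep the raw string
    return value


# Made-up row standing in for one record of a table
row = {"name": "ExampleRAT", "aliases": ["ExampleRAT", "ERAT"], "revoked": False}
encoded = {key: encode_cell(val) for key, val in row.items()}
decoded = {key: decode_cell(val) for key, val in encoded.items()}

print(encoded["aliases"])
# → ["ExampleRAT", "ERAT"]
print(decoded == row)
# → True
```

With pandas, the decoder could be applied per column after `pd.read_csv`, for example `df[col] = df[col].map(decode_cell)`.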

## Comparison with stix2nx

| Need                        | Use           |
|-----------------------------|---------------|
| Graph traversal, centrality | stix2nx       |
| Filtering, aggregation, ML  | stix2tabular  |
| Both                        | Install both  |

Same input API. Same STIX version support. Independent libraries — no cross-dependency.

## Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests (integration test downloads live ATT&CK data, falls back to curated subset if offline)
pytest

# Run in offline mode (uses curated ~1MB ATT&CK subset only, no network needed)
STIX2TABULAR_OFFLINE=true pytest

# Regenerate the curated subset from latest ATT&CK (requires network)
python tests/data/build_subset.py
```

## STIX Version Support

Supports both STIX 2.0 and STIX 2.1 bundles. STIX 2.0 `observed-data` objects with embedded observables are automatically extracted into their respective type DataFrames when `include_scos=True`.
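
For context, STIX 2.0 nests observables inside an `observed-data` object's `objects` dictionary, whereas STIX 2.1 represents them as top-level objects. The sketch below illustrates the general idea of pulling embedded observables out and grouping them by type. `extract_embedded` and the `observed_data_ref` and `object_key` fields are assumptions made for this example, not stix2tabular's implementation or column names, and the sample values are made up.

```python
from collections import defaultdict

# A STIX 2.0 observed-data object with two embedded observables
observed = {
    "type": "observed-data",
    "id": "observed-data--b67d30ff-02ac-498a-92f9-32f845f448cf",
    "created": "2024-01-01T00:00:00.000Z",
    "modified": "2024-01-01T00:00:00.000Z",
    "first_observed": "2024-01-01T00:00:00Z",
    "last_observed": "2024-01-01T00:05:00Z",
    "number_observed": 1,
    "objects": {
        "0": {"type": "ipv4-addr", "value": "198.51.100.3"},
        "1": {"type": "domain-name", "value": "example.com"},
    },
}


def extract_embedded(objects):
    """Group observables embedded in STIX 2.0 observed-data objects by type."""
    rows_by_type = defaultdict(list)
    for obj in objects:
        if obj.get("type") != "observed-data":
            continue
        for key, sco in obj.get("objects", {}).items():
            # Record the parent id and local key so each row stays traceable
            rows_by_type[sco["type"]].append(
                {**sco, "observed_data_ref": obj["id"], "object_key": key}
            )
    return dict(rows_by_type)


rows = extract_embedded([observed])
print(sorted(rows))
# → ['domain-name', 'ipv4-addr']
```

Each list of rows could then be passed to `pd.DataFrame` to form one table per observable type.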

## License

MIT
