Getting Started

Installation

Install from PyPI:

pip install safefeat

Install from a local checkout (editable, with dev tools):

pip install -e ".[dev]"

1. Prepare the spine and events

The spine lists the prediction scenarios, one row per (entity_id, cutoff_time) pair. Events are the historical records tied to those entities.

import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": ["2024-01-10", "2024-01-31"],
})

events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
    "amount": [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})
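The frames above store timestamps as plain strings. Whether safefeat parses them itself is not stated here, so the following is a minimal sketch, using only pandas, of converting both time columns to datetimes up front so that any comparison between them is chronological rather than lexicographic.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": ["2024-01-10", "2024-01-31"],
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
    "amount": [10.0, 20.0, 5.0, 25.0],
})

# Parse ISO date strings into datetime values.
spine["cutoff_time"] = pd.to_datetime(spine["cutoff_time"])
events["event_time"] = pd.to_datetime(events["event_time"])

# Both columns should now report a datetime64 dtype.
print(spine["cutoff_time"].dtype)
print(events["event_time"].dtype)
```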

2. Define the Feature Specification

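Declare features with WindowAgg. Each entry names a source table, one or more lookback windows, and the metrics to compute for each column.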

from safefeat import WindowAgg

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*": ["count"],              # total events
            "amount": ["sum", "mean"],   # numeric aggregations
            "event_type": ["nunique"],   # categorical unique counts
        },
    )
]

3. Build features

from safefeat import build_features

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",  # prevent future leakage
)

print(X)

Expected output (approximate; only two of the generated feature columns are shown):

| entity_id | cutoff_time | events__n_events__7d | events__amount__sum__30d |
| --------- | ----------- | -------------------- | ------------------------ |
| u1        | 2024-01-10  | 2                    | 30                       |
| u2        | 2024-01-31  | 1                    | 30                       |
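The window aggregates can be cross-checked by hand in plain pandas. The sketch below does not use safefeat; it assumes a window covers events with 0 <= cutoff_time - event_time <= window, which may differ from the library's exact boundary handling. The column names count_7d and amount_sum_30d are illustrative only.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]
    ),
    "amount": [10.0, 20.0, 5.0, 25.0],
})

# Pair every spine row with all events of the same entity.
pairs = spine.merge(events, on="entity_id", how="left")
age = pairs["cutoff_time"] - pairs["event_time"]


def in_window(window: str) -> pd.Series:
    # Assumed semantics: the event is at or before the cutoff and
    # no older than the window length.
    return (age >= pd.Timedelta(0)) & (age <= pd.Timedelta(window))


keys = ["entity_id", "cutoff_time"]
count_7d = pairs[in_window("7D")].groupby(keys).size().rename("count_7d")
sum_30d = (
    pairs[in_window("30D")].groupby(keys)["amount"].sum().rename("amount_sum_30d")
)

# Left-join the aggregates back onto the spine; rows with no events get 0.
manual = (
    spine.set_index(keys).join(count_7d).join(sum_30d).fillna(0).reset_index()
)
print(manual)
```

For the spine above, u2's cutoff is 2024-01-31, so both of its events fall inside the 30-day window but only the 2024-01-30 event falls inside the 7-day window.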

How Leakage Prevention Works

safefeat enforces:

event_time <= cutoff_time

This guarantees that no future events are used when building features.

Setting allowed_lag to a non-zero value (e.g. "5s") relaxes this check by a small tolerance, which helps with timestamp precision issues.
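To make the rule concrete, here is a minimal sketch of such a filter in plain pandas. It assumes the tolerance is applied as event_time <= cutoff_time + allowed_lag; that reading is an assumption, and keep_mask is a hypothetical helper, not part of safefeat.

```python
import pandas as pd

cutoff = pd.Timestamp("2024-01-10 00:00:00")
event_times = pd.Series(pd.to_datetime([
    "2024-01-09 23:59:59",  # before the cutoff
    "2024-01-10 00:00:00",  # exactly at the cutoff
    "2024-01-10 00:00:03",  # 3 seconds after the cutoff
    "2024-01-10 00:01:00",  # 1 minute after the cutoff
]))


def keep_mask(times: pd.Series, cutoff_time: pd.Timestamp, allowed_lag: str) -> pd.Series:
    # Assumed rule: keep events no later than cutoff_time + allowed_lag.
    return times <= cutoff_time + pd.Timedelta(allowed_lag)


strict = keep_mask(event_times, cutoff, "0s")
lenient = keep_mask(event_times, cutoff, "5s")

print(strict.tolist())   # [True, True, False, False]
print(lenient.tolist())  # [True, True, True, False]
```

Under this reading, a zero lag keeps only events at or before the cutoff, while a 5-second lag also admits the event that arrived 3 seconds late.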

4. Inspect the AuditReport

If return_report=True, build_features also returns an AuditReport, whose tables attribute maps table names to TableAudit objects. Each audit shows how many (event, cutoff) pairs were joined, how many were kept, how many were dropped for being in the future, and the largest future delta observed.

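# Assumption: with return_report=True, build_features returns (features, report).
X, report = build_features(
    spine=spine, tables={"events": events}, spec=spec,
    event_time_cols={"events": "event_time"}, allowed_lag="0s", return_report=True,
)
events_audit = report.tables.get("events")
print("total joined", events_audit.total_joined_pairs)
print("kept", events_audit.kept_pairs)
print("dropped (future)", events_audit.dropped_future_pairs)
print("max future delta", events_audit.max_future_delta)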
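The audit quantities can be approximated by hand to build intuition. The sketch below uses plain pandas and a zero lag. It adds one hypothetical u1 event dated after its cutoff so that something is actually dropped, and it assumes "joined pairs" means every event matched to a spine row of the same entity. How safefeat counts these internally may differ.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime([
        "2024-01-05",
        "2024-01-06",
        "2024-01-12",  # hypothetical event, 2 days after u1's cutoff
        "2024-01-10",
        "2024-01-30",
    ]),
})

# Every (spine row, event) pair for the same entity.
pairs = spine.merge(events, on="entity_id", how="inner")

# Positive delta means the event happened after the cutoff.
delta = pairs["event_time"] - pairs["cutoff_time"]
is_future = delta > pd.Timedelta(0)

total_joined = len(pairs)
kept = int((~is_future).sum())
dropped_future = int(is_future.sum())
max_future_delta = delta[is_future].max() if dropped_future else pd.Timedelta(0)

print("total joined", total_joined)          # 5
print("kept", kept)                          # 4
print("dropped (future)", dropped_future)    # 1
print("max future delta", max_future_delta)  # 2 days 00:00:00
```

A non-zero dropped count is not an error in itself; it indicates that the events table contains records later than some cutoffs, which the filter excluded.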