{% extends "base.html" %} {% block styles %} {{ super() }} {% endblock %} {% block content %} {% if page.file.src_path == "index.md" %}
SafeFeat builds ML features from event logs using only data available at prediction time, so your model learns only what it would actually know in production.
When you compute features like "total purchases in the last 30 days" without anchoring them to a specific point in time, you can accidentally include events that happened after the prediction point. The model looks great in training, then falls apart in production.
Naively grouping by user leaks future events into past rows. Your validation AUC is 0.91. Production AUC is 0.73.
# ❌ Leaky — uses ALL events, including ones after cutoff
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")
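To make the leak concrete, here is a pandas-only sketch that compares the naive aggregate with a manually filtered one. The toy data and the `amount_sum` column name are made up for illustration, and the snippet does not use SafeFeat.

```python
import pandas as pd

# Hypothetical toy data: one user, one prediction point on Jan 10.
spine = pd.DataFrame({
    "user_id": ["u1"],
    "cutoff_time": pd.to_datetime(["2024-01-10"]),
})
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-20"]),
    "amount": [10.0, 20.0, 100.0],
})

# Leaky: the Jan 20 purchase happens after the cutoff but is still summed.
leaky = events.groupby("user_id")["amount"].sum().rename("amount_sum")
leaky_df = spine.merge(leaky.reset_index(), on="user_id", how="left")

# Point-in-time: join events to each spine row, then keep only past events.
joined = spine.merge(events, on="user_id", how="left")
past = joined[joined["event_time"] <= joined["cutoff_time"]]
safe = (
    past.groupby(["user_id", "cutoff_time"])["amount"]
    .sum()
    .rename("amount_sum")
    .reset_index()
)
safe_df = spine.merge(safe, on=["user_id", "cutoff_time"], how="left")

print(leaky_df["amount_sum"].iloc[0])  # 130.0 (includes the future purchase)
print(safe_df["amount_sum"].iloc[0])   # 30.0  (only what was known on Jan 10)
```

The gap between the two numbers is exactly the information the model would not have at prediction time.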
SafeFeat enforces point-in-time correctness automatically. No manual filtering. No accidental leakage.
# ✅ Safe — only uses events before each cutoff_time
from safefeat import build_features, WindowAgg
X = build_features(spine, tables, spec,
                   event_time_cols={"events": "event_time"})
Every feature is computed relative to each row's cutoff time, so events after the cutoff (plus any allowed_lag you configure) are excluded before aggregation.
See exactly which events were joined, kept, and dropped for each prediction point. Debug in minutes, not days.
Define features once. Run them across any spine. No duplicated filtering logic across your codebase.
Vectorised operations on standard DataFrames. No new infrastructure required.
A DataFrame with entity_id and cutoff_time. Each row is one prediction point — e.g. "what did we know about user u1 on Jan 10?"
A time-series DataFrame with entity IDs, event timestamps, and attributes. Purchases, logins, clicks — anything with a timestamp.
A list of feature blocks like WindowAgg or RecencyBlock. Declarative, readable, and reusable across projects.
Computing 7-day and 30-day purchase features for a set of users, safely anchored to their churn prediction date.
import pandas as pd
from safefeat import build_features, WindowAgg

# Parse timestamps up front so comparisons run on datetimes, not strings.
spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})

events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]
    ),
    "amount": [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*": ["count"],
            "amount": ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)
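As a rough mental model of what a windowed aggregate means, here is a pandas-only sketch that computes a 7-day count and sum by hand on the same toy data. The window boundary convention (later than cutoff minus 7 days, and no later than the cutoff) and the output column names `count_7d` and `amount_sum_7d` are assumptions for illustration, not SafeFeat's documented behavior.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]
    ),
    "amount": [10.0, 20.0, 5.0, 25.0],
})

window = pd.Timedelta("7D")

# Pair every spine row with that entity's events, then keep only events
# inside the assumed window (cutoff - 7D, cutoff].
joined = spine.merge(events, on="entity_id", how="left")
in_window = joined[
    (joined["event_time"] <= joined["cutoff_time"])
    & (joined["event_time"] > joined["cutoff_time"] - window)
]

agg = (
    in_window.groupby(["entity_id", "cutoff_time"])["amount"]
    .agg(count_7d="count", amount_sum_7d="sum")
    .reset_index()
)

# Merge back onto the spine so every prediction point keeps its row.
features_7d = spine.merge(agg, on=["entity_id", "cutoff_time"], how="left")
print(features_7d)
```

For u2, the Jan 10 event falls outside the 7-day window ending Jan 31, so only the Jan 30 event contributes.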
from safefeat import RecencyBlock

spec = [RecencyBlock(table="events")]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff)
# NaN if no events exist before cutoff
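The recency idea described in the comments above (days since the last event before the cutoff, NaN when there is none) can be sketched in plain pandas. This is not SafeFeat's implementation; the `recency_days` column name, the extra user u3, and the inclusive `<=` comparison are assumptions.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2", "u3"],  # u3 has no events at all
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31", "2024-01-15"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]
    ),
})

# Keep only events at or before each row's cutoff.
joined = spine.merge(events, on="entity_id", how="left")
past = joined[joined["event_time"] <= joined["cutoff_time"]]

# Latest qualifying event per prediction point.
last_seen = (
    past.groupby(["entity_id", "cutoff_time"])["event_time"]
    .max()
    .rename("last_event_time")
    .reset_index()
)

out = spine.merge(last_seen, on=["entity_id", "cutoff_time"], how="left")
# Subtracting NaT yields NaT, which .dt.days turns into NaN.
out["recency_days"] = (out["cutoff_time"] - out["last_event_time"]).dt.days
print(out[["entity_id", "recency_days"]])
```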
X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_audit=True,
)
# entity_id | cutoff_time | events_joined | events_kept | events_dropped
# u1 | 2024-01-10 | 4 | 2 | 2 ✓
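One plausible reading of the audit columns, sketched in plain pandas: `events_joined` counts events matched to the entity, `events_kept` counts those at or before the cutoff, and `events_dropped` is the remainder. These definitions are assumptions rather than SafeFeat's documented semantics, and the data is made up so that u1 has two events on each side of the cutoff.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1"],
    "cutoff_time": pd.to_datetime(["2024-01-10"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1", "u1"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-12", "2024-01-20"]
    ),
})

# Match events to prediction points, then flag which ones are usable.
joined = spine.merge(events, on="entity_id", how="inner")
joined["kept"] = joined["event_time"] <= joined["cutoff_time"]

audit_sketch = (
    joined.groupby(["entity_id", "cutoff_time"])["kept"]
    .agg(events_joined="count", events_kept="sum")
    .reset_index()
)
audit_sketch["events_dropped"] = (
    audit_sketch["events_joined"] - audit_sketch["events_kept"]
)
print(audit_sketch)
```

A non-zero `events_dropped` is expected whenever the event table extends past the cutoff; it is a sudden change in that count that usually deserves a closer look.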
What does allowed_lag actually do?
allowed_lag defines a tolerance window for future timestamps when enforcing leakage safety.
In real-world systems, timestamps are messy: for example, database writes can be slightly delayed.
Core rule: event_time <= cutoff_time + allowed_lag. With allowed_lag="5s", events timestamped up to 5 seconds after the cutoff are still included; with allowed_lag="0s", nothing after the cutoff is.
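The core rule can be checked by hand with a few timestamps. This is a pandas-only sketch of the inequality, not SafeFeat's internals; the example times are made up.

```python
import pandas as pd

cutoff_time = pd.Timestamp("2024-01-10 12:00:00")
allowed_lag = pd.Timedelta("5s")

events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2024-01-10 11:59:58",  # before the cutoff
        "2024-01-10 12:00:03",  # 3s late, inside the tolerance
        "2024-01-10 12:00:09",  # 9s late, outside the tolerance
    ]),
})

# Core rule: event_time <= cutoff_time + allowed_lag
events["kept"] = events["event_time"] <= cutoff_time + allowed_lag
print(events["kept"].tolist())  # [True, True, False]
```

Any lag above zero deliberately admits a small amount of post-cutoff data, so it is worth keeping it no larger than the write delay you actually observe.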
Call pd.to_datetime() on both cutoff_time and event_time before calling build_features. Also check timezone consistency: mixing tz-aware and tz-naive timestamps causes silent join failures.
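Before calling build_features, a helper along these lines can normalize both timestamp columns. The name `to_utc` is hypothetical, and treating tz-naive values as UTC is an assumption you should check against how your data was recorded.

```python
import pandas as pd

def to_utc(series: pd.Series) -> pd.Series:
    """Parse a column to datetime and return tz-aware UTC timestamps."""
    ts = pd.to_datetime(series)
    if ts.dt.tz is None:
        # Assumption: naive timestamps were recorded in UTC.
        return ts.dt.tz_localize("UTC")
    return ts.dt.tz_convert("UTC")

spine = pd.DataFrame({
    "entity_id": ["u1"],
    "cutoff_time": ["2024-01-10"],  # plain strings, tz-naive
})
events = pd.DataFrame({
    "entity_id": ["u1"],
    # tz-aware, one hour ahead of UTC
    "event_time": pd.to_datetime(["2024-01-05 09:00+01:00"]),
})

spine["cutoff_time"] = to_utc(spine["cutoff_time"])
events["event_time"] = to_utc(events["event_time"])

# Both columns now share one timezone and can be compared safely.
print(spine["cutoff_time"].dt.tz, events["event_time"].dt.tz)
```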
Pass every table in the tables dict and reference each by name in your spec. All results are merged onto the spine automatically.