{% extends "base.html" %} {% block styles %} {{ super() }} {% endblock %} {% block content %} {% if page.file.src_path == "index.md" %}
✦ v0.1.1 — now on PyPI

Feature engineering
without the leakage

SafeFeat builds ML features from event logs using only data available at prediction time — so your model learns what it'd actually know in production.

Read the docs → View on GitHub
$ pip install safefeat

The problem

Data leakage silently destroys your model

When you compute features like "total purchases in the last 30 days" without anchoring to a specific point in time, you accidentally include future data. Your model looks great in training — then falls apart in production.

⚠ Without SafeFeat

Naively grouping by user leaks future events into past rows, so validation metrics overstate what you will see live: a validation AUC of 0.91 can turn into a production AUC of 0.73.

# ❌ Leaky — uses ALL events, including ones after cutoff
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")
✓ With SafeFeat

SafeFeat enforces point-in-time correctness automatically. No manual filtering. No accidental leakage.

# ✅ Safe: only uses events at or before each cutoff_time
from safefeat import build_features, WindowAgg

X = build_features(spine, tables, spec,
      event_time_cols={"events": "event_time"})
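The difference between the two snippets above can be reproduced with plain pandas. The sketch below is not how SafeFeat is implemented; it is a minimal illustration on hypothetical toy data (the `user_id` rows and amounts are made up) of what changes when events after the cutoff are filtered out before aggregating.

```python
import pandas as pd

# Hypothetical toy data: one user, one prediction point, three purchases.
spine = pd.DataFrame({
    "user_id": ["u1"],
    "cutoff_time": pd.to_datetime(["2024-01-10"]),
})
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-20"]),
    "amount": [10.0, 20.0, 500.0],
})

# Leaky: sums every event for the user, including the one on Jan 20.
leaky_total = events.groupby("user_id")["amount"].sum()

# Point-in-time: join events to the spine, then keep only rows
# whose event_time is at or before that row's cutoff_time.
joined = spine.merge(events, on="user_id", how="left")
visible = joined[joined["event_time"] <= joined["cutoff_time"]]
safe_total = visible.groupby("user_id")["amount"].sum()

print(leaky_total["u1"])  # 530.0
print(safe_total["u1"])   # 30.0
```

The single post-cutoff purchase dominates the leaky total, which is exactly the kind of signal a model cannot have at prediction time.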

Why SafeFeat

Built for real ML pipelines

🔒

Leakage-proof by design

Every feature is computed relative to each entity's cutoff time. Events stamped after the cutoff (plus any allowed_lag you set) never reach a feature.

📋

Audit trail

See exactly which events were joined, kept, and dropped for each prediction point. Debug in minutes, not days.

🧩

Declarative specs

Define features once. Run them across any spine. No duplicated filtering logic across your codebase.

Fast & pandas-native

Vectorised operations on standard DataFrames. No new infrastructure required.


Core concepts

Three things you need to know

spine

When to predict

A DataFrame with entity_id and cutoff_time. Each row is one prediction point — e.g. "what did we know about user u1 on Jan 10?"

events

Your raw data

A time-series DataFrame with entity IDs, event timestamps, and attributes. Purchases, logins, clicks — anything with a timestamp.

spec

What to compute

A list of feature blocks like WindowAgg or RecencyBlock. Declarative, readable, and reusable across projects.
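To build intuition for how the three pieces fit together, here is a pandas-only sketch of a windowed sum over a spine and an events table. It does not call SafeFeat. The `window_sum` helper, its output column name, and the window boundaries (events in `(cutoff - window, cutoff]`) are assumptions made for illustration; SafeFeat's own boundary handling may differ.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]
    ),
    "amount": [10.0, 20.0, 5.0, 25.0],
})


def window_sum(spine, events, window):
    """Sum `amount` per spine row over (cutoff - window, cutoff].

    Hypothetical helper; the boundary convention is an assumption.
    """
    width = pd.Timedelta(window)
    joined = spine.merge(events, on="entity_id", how="left")
    in_window = (
        (joined["event_time"] <= joined["cutoff_time"])
        & (joined["event_time"] > joined["cutoff_time"] - width)
    )
    sums = (
        joined[in_window]
        .groupby(["entity_id", "cutoff_time"])["amount"]
        .sum()
        .rename(f"amount_sum_{window.lower()}")
        .reset_index()
    )
    # Left-merge back so spine rows with no events in the window survive as NaN.
    return spine.merge(sums, on=["entity_id", "cutoff_time"], how="left")


features_7d = window_sum(spine, events, "7D")
features_30d = window_sum(spine, events, "30D")
print(features_7d)
```

For u2, the Jan 10 event falls outside the 7-day window ending Jan 31 but inside the 30-day one, so the two windows give different sums for the same prediction point.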


Quick start

Real-world example: churn prediction

Computing 7-day and 30-day purchase features for a set of users, safely anchored to their churn prediction date.

import pandas as pd
from safefeat import build_features, WindowAgg

spine = pd.DataFrame({
    "entity_id":   ["u1", "u2"],
    # Parse timestamps up front so cutoffs and events share a datetime dtype
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})

events = pd.DataFrame({
    "entity_id":  ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]),
    "amount":     [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)
# Recency: reuses spine and events from the example above
from safefeat import RecencyBlock

spec = [RecencyBlock(table="events")]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff)
# NaN if no events exist before cutoff
# Audit: same inputs, with return_audit=True
X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_audit=True,
)

# entity_id | cutoff_time | events_joined | events_kept | events_dropped
# u1        | 2024-01-10  | 4             | 2           | 2  ✓
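One way to read the audit columns is as counts over the join between the spine and the events table. The pandas sketch below reproduces counts of that shape on hypothetical data (four events for u1, two of them after the cutoff). How SafeFeat actually computes its audit is not specified here, so treat the counting logic and the `audit_sketch` name as assumptions.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1"],
    "cutoff_time": pd.to_datetime(["2024-01-10"]),
})
# Hypothetical events: two before the cutoff, two after it.
events = pd.DataFrame({
    "entity_id": ["u1"] * 4,
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-12", "2024-01-20"]
    ),
})

# Every (spine row, event) pair for the same entity counts as "joined".
joined = spine.merge(events, on="entity_id", how="inner")
# A pair is "kept" when the event is at or before the cutoff.
joined["kept"] = joined["event_time"] <= joined["cutoff_time"]

audit_sketch = (
    joined.groupby(["entity_id", "cutoff_time"])["kept"]
    .agg(events_joined="size", events_kept="sum")
    .reset_index()
)
audit_sketch["events_dropped"] = (
    audit_sketch["events_joined"] - audit_sketch["events_kept"]
)
print(audit_sketch)
```

A non-zero dropped count is the useful signal: it tells you the raw table contained future events for that prediction point and that they were filtered out.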

🏷 Column naming convention

events__amount__sum__7d
events = source table · amount = column (* = row count) · sum = aggregation · 7d = time window
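Because the parts are joined with double underscores, a windowed feature name can be split back into its components when you need to group or filter columns. The `parse_feature_name` helper below is hypothetical and not part of SafeFeat; it only handles the four-part pattern shown above, so two-part names such as events__recency are rejected.

```python
from typing import NamedTuple


class FeatureName(NamedTuple):
    table: str
    column: str
    aggregation: str
    window: str


def parse_feature_name(name: str) -> FeatureName:
    """Split a four-part windowed feature name on double underscores."""
    parts = name.split("__")
    if len(parts) != 4:
        raise ValueError(f"expected table__column__agg__window, got {name!r}")
    return FeatureName(*parts)


parsed = parse_feature_name("events__amount__sum__7d")
print(parsed.table, parsed.column, parsed.aggregation, parsed.window)
# events amount sum 7d
```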

Common questions

Things users run into

What does allowed_lag actually do?
allowed_lag sets how far past the cutoff an event's timestamp may fall and still count as available. Real-world timestamps are messy; for example, database writes can land slightly late. The core rule is event_time <= cutoff_time + allowed_lag, so with allowed_lag="5s" events stamped up to 5 seconds after the cutoff are still included, and with allowed_lag="0s" nothing after the cutoff is.
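The rule is easy to check by hand. This pandas sketch applies event_time <= cutoff_time + allowed_lag to three hypothetical timestamps; the `is_allowed` helper is an illustration of the stated rule, not SafeFeat's internal code.

```python
import pandas as pd

cutoff_time = pd.Timestamp("2024-01-10 12:00:00")
event_times = pd.Series(pd.to_datetime([
    "2024-01-10 11:59:58",  # 2 s before the cutoff
    "2024-01-10 12:00:03",  # 3 s after the cutoff
    "2024-01-10 12:00:09",  # 9 s after the cutoff
]))


def is_allowed(event_times, cutoff_time, allowed_lag):
    """Apply the rule event_time <= cutoff_time + allowed_lag."""
    return event_times <= cutoff_time + pd.Timedelta(allowed_lag)


strict = is_allowed(event_times, cutoff_time, "0s")
lenient = is_allowed(event_times, cutoff_time, "5s")

print(strict.tolist())   # [True, False, False]
print(lenient.tolist())  # [True, True, False]
```

Keep the lag as small as your ingestion delay justifies: every second of tolerance is a second of genuinely future data the model may see.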
My features are all NaN — what's wrong?
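Almost always a datetime format mismatch. Run pd.to_datetime() on both cutoff_time and event_time before calling build_features. Also check timezone consistency: mixing tz-aware and tz-naive timestamps can make joins fail silently. If the dtypes line up, confirm the entity actually has events before its cutoff; recency, for instance, is NaN when none exist.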
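A quick way to rule out both causes is to normalise the two time columns before building features. The sketch below uses only pandas and hypothetical string inputs. It assumes your tz-naive cutoffs are meant as UTC; if they are not, localise them to the correct zone first.

```python
import pandas as pd

# Hypothetical raw inputs: string timestamps, one naive and one with an offset.
spine = pd.DataFrame({
    "entity_id": ["u1"],
    "cutoff_time": ["2024-01-10"],
})
events = pd.DataFrame({
    "entity_id": ["u1"],
    "event_time": ["2024-01-05T08:00:00+00:00"],
})

# utc=True parses strings and puts both columns on the same tz-aware footing.
# Naive values are interpreted as UTC (an assumption about your data).
spine["cutoff_time"] = pd.to_datetime(spine["cutoff_time"], utc=True)
events["event_time"] = pd.to_datetime(events["event_time"], utc=True)

# Sanity checks worth running before any feature build.
print(spine["cutoff_time"].dt.tz, events["event_time"].dt.tz)
comparable = (events["event_time"] <= spine["cutoff_time"]).all()
print(comparable)
```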
Can I use multiple event tables?
Yes — pass multiple tables to the tables dict and reference each by name in your spec. All results are merged onto the spine automatically.
{% else %} {{ super() }} {% endif %} {% endblock %}