Concepts
Point-in-Time
The core principle of SafeFeat is point-in-time correctness: when computing features for a prediction at cutoff time $t$, only use information available before time $t$.
Example: If you predict churn on 2024-02-15, don't use events from 2024-02-16.
Spine
The spine defines your prediction scenarios:
entity_id | cutoff_time
----------|------------
user_1 | 2024-01-31
user_1 | 2024-02-15
user_2 | 2024-02-15
Each row is an "as-of" moment: compute features for that entity looking back from that date.
Aggregations
WindowAgg
Count or summarize events in a rolling time window before the cutoff.
WindowAgg(
table="events",
windows=["7D", "30D"],
metrics={"*": ["count"], "amount": ["sum", "mean"]}
)
For cutoff 2024-02-15: - 7D window: 2024-02-08 to 2024-02-15 - 30D window: 2024-01-16 to 2024-02-15
Only events falling in these windows before the cutoff are included.
Supported Aggregations
"count"— number of events (on wildcard"*"only)"sum"— total of a numeric column"mean"— average of a numeric column"nunique"— count distinct values
RecencyBlock
Time since the last event:
RecencyBlock(
table="events",
filter_col="event_type",
filter_value="purchase"
)
Returns the number of days between the cutoff and the most recent matching event.
Useful for: - Churn modeling (days since purchase) - Fraud detection (days since login) - Activity recency
Data Quality & Leakage
Future Events
Events that occur after the cutoff are automatically excluded. This prevents leakage.
allowed_lag = "0s" # Default: strict point-in-time
# or
allowed_lag = "24h" # Allow events up to 24h after cutoff
Use allowed_lag if your event data has slight delays in ingestion.
Audit Report
The report tracks data quality by table:
features, report = build_features(..., return_report=True)
audit = report.tables["events"]
Available metrics:
- total_joined_pairs — events matched to entities/cutoffs
- kept_pairs — events within cutoff
- dropped_future_pairs — events after cutoff (leakage risk)
- max_future_delta — largest lateness of dropped events
High dropped_future_pairs may indicate:
- Ingestion delays
- Timezone issues
- Data quality problems
Entity-Cutoff Pairs
Each row in the spine creates an entity-cutoff pair. Events are grouped by: - entity_id — which user/customer/entity - cutoff_time — the prediction moment
When computing features, we:
- Join events to entity-cutoff pairs by entity
- Filter to events up to the cutoff time
- Aggregate by entity-cutoff pair
Example:
Spine row: user_1, 2024-02-15
Join events for user_1:
2024-01-10: 10 ← included
2024-01-20: 20 ← included
2024-02-01: 15 ← included
2024-02-20: 5 ← excluded (after cutoff)
30D window (back to 2024-01-16):
Count: 3
Sum: 45