scitex_ml.classification.timeseries
Time series cross-validation utilities for classification.
This module provides specialized cross-validation strategies for time series data, ensuring proper temporal ordering and preventing data leakage.
- class scitex_ml.classification.timeseries.TimeSeriesStratifiedSplit(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)[source]
Time series cross-validation with stratification support.
This splitter ensures:
1. Test data is always chronologically after training data
2. Optional validation set between train and test
3. Class balance preservation in splits
4. Gap period between train and test to avoid leakage
- Parameters:
n_splits (int) – Number of splits (folds)
test_ratio (float) – Proportion of data for test set (default: 0.2)
val_ratio (float) – Proportion of data for validation set (default: 0.1)
gap (int) – Number of samples to exclude between train and test (default: 0)
stratify (bool) – Whether to maintain class proportions (default: True)
random_state (int, optional) – Random seed for reproducibility (default: None)
Examples
>>> from scitex_ml.classification import TimeSeriesStratifiedSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> tscv = TimeSeriesStratifiedSplit(n_splits=3)
>>> for train_idx, test_idx in tscv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
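The interaction between test_ratio and gap can be illustrated with a minimal single-fold sketch. This is not the class's actual implementation: the helper name chronological_split and the rounding rule are assumptions made for illustration, and stratification is left out.

```python
import numpy as np


def chronological_split(timestamps, test_ratio=0.2, gap=0):
    """Hypothetical helper: one train/test split ordered by time."""
    # Indices of the samples in chronological order.
    order = np.argsort(timestamps, kind="stable")
    n_samples = len(order)
    n_test = int(round(n_samples * test_ratio))
    test_start = n_samples - n_test
    # Drop `gap` samples immediately before the test block.
    train_end = max(test_start - gap, 0)
    return order[:train_end], order[test_start:]


timestamps = np.arange(100)
train_idx, test_idx = chronological_split(timestamps, test_ratio=0.2, gap=5)
print(len(train_idx), len(test_idx))  # 75 20
```

With 100 samples, the last 20 form the test set, the 5 samples before them are discarded as the gap, and the remaining 75 are used for training.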
- __init__(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)[source]
- split(X, y=None, timestamps=None, groups=None)[source]
Generate indices to split data into training and test sets.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,)) – Target variable
timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)
groups (array-like, shape (n_samples,), optional) – Group labels for grouped CV
- Yields:
train (ndarray) – Training set indices
test (ndarray) – Test set indices
- Return type:
- split_with_val(X, y=None, timestamps=None, groups=None)[source]
Generate indices with separate validation set.
- get_n_splits(X=None, y=None, groups=None)[source]
Returns the number of splitting iterations in the CV.
- plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)[source]
Visualize the stratified time series splits.
Shows train (blue), validation (green), and test (red) sets. When val_ratio=0, only shows train and test.
- Parameters:
- Returns:
fig – The created figure
- Return type:
matplotlib.figure.Figure
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesStratifiedSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to split.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.timeseries.TimeSeriesBlockingSplit(n_splits=5, test_ratio=0.2, val_ratio=0.0, random_state=None)[source]
Time series split with blocking to handle multiple subjects/groups.
This splitter ensures temporal integrity within each subject while allowing cross-subject generalization. Each subject’s data is kept temporally coherent, but subjects can appear in both training and test sets at different time periods.
Key Features:
- Temporal order preserved within each subject
- No data leakage within individual subject timelines
- Expanding window approach: more training data in later folds
- Cross-subject generalization: subjects can be in both train and test
Use Cases:
- Multiple patients with longitudinal medical data
- Multiple stocks with time series financial data
- Multiple sensors with temporal measurements
- Any scenario with grouped time series data
- Parameters:
Examples
>>> from scitex_ml.classification import TimeSeriesBlockingSplit
>>> import numpy as np
>>>
>>> # Create data: 100 samples, 4 subjects (25 samples each)
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>> groups = np.repeat([0, 1, 2, 3], 25)  # Subject IDs
>>>
>>> # Each subject gets temporal split: early samples → train, later → test
>>> splitter = TimeSeriesBlockingSplit(n_splits=3, test_ratio=0.3)
>>> for train_idx, test_idx in splitter.split(X, y, timestamps, groups):
...     train_subjects = set(groups[train_idx])
...     test_subjects = set(groups[test_idx])
...     print(f"Train subjects: {train_subjects}, Test subjects: {test_subjects}")
...     # Output shows same subjects in both sets but different time periods
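The idea of splitting each subject along its own timeline can be sketched as follows. This is a single-fold illustration under stated assumptions, not the library's code: the helper per_group_temporal_split is hypothetical, and the expanding-window behavior across folds is omitted.

```python
import numpy as np


def per_group_temporal_split(timestamps, groups, test_ratio=0.2):
    """Hypothetical helper: early samples of each group train, later ones test."""
    timestamps = np.asarray(timestamps)
    groups = np.asarray(groups)
    train_parts, test_parts = [], []
    for g in np.unique(groups):
        # Indices of this group's samples, sorted by time.
        idx = np.flatnonzero(groups == g)
        idx = idx[np.argsort(timestamps[idx], kind="stable")]
        n_test = int(round(len(idx) * test_ratio))
        cut = len(idx) - n_test
        train_parts.append(idx[:cut])
        test_parts.append(idx[cut:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


timestamps = np.arange(100)
groups = np.repeat([0, 1, 2, 3], 25)
train_idx, test_idx = per_group_temporal_split(timestamps, groups, test_ratio=0.2)
print(len(train_idx), len(test_idx))  # 80 20
```

Every subject appears in both sets, but within a subject all test samples come later than all training samples.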
- split(X, y=None, timestamps=None, groups=None)[source]
Generate indices respecting group boundaries.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,)) – Target variable
timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)
groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required
- Yields:
train (ndarray) – Training set indices
test (ndarray) – Test set indices
- Return type:
- split_with_val(X, y=None, timestamps=None, groups=None)[source]
Generate indices with separate validation set respecting group boundaries.
Each subject gets its own train/val/test split maintaining temporal order.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,)) – Target variable
timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)
groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required
- Yields:
train (ndarray) – Training set indices
val (ndarray) – Validation set indices
test (ndarray) – Test set indices
- Return type:
- plot_splits(X, y=None, timestamps=None, groups=None, figsize=(12, 6), save_path=None)[source]
Visualize the blocking splits showing subject separation.
This visualization shows how data from different subjects/groups is allocated to training and test sets while maintaining temporal order within each subject.
Color Scheme:
- Rectangle border: Blue = Training set, Red = Test set
- Rectangle fill: Different colors represent different subjects/groups
- Each subject gets a unique color (cycling through colormap)
Key Features:
- No mixing: Each subject’s data stays within temporal boundaries
- Subject separation: Same subject can appear in both train/test but at different times
- Temporal integrity: Time flows left to right for each subject
- Parameters:
X (array-like) – Training data
y (array-like, optional) – Target variable (not used)
timestamps (array-like, optional) – Timestamps (if None, uses sample indices)
groups (array-like) – Group labels (required for blocking split) - each unique value represents a subject
figsize (tuple, default (12, 6)) – Figure size
save_path (str, optional) – Path to save the plot
- Returns:
fig – The created figure with proper legend showing subject colors
- Return type:
matplotlib.figure.Figure
Examples
>>> splitter = TimeSeriesBlockingSplit(n_splits=3)
>>> fig = splitter.plot_splits(X, timestamps=timestamps, groups=subject_ids)
>>> fig.show()  # Will show train (blue border) vs test (red border) by subject
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesBlockingSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to split.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.timeseries.TimeSeriesSlidingWindowSplit(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)[source]
Sliding window cross-validation for time series.
Creates train/test windows that slide through time with configurable behavior.
- Parameters:
window_size (int, optional) – Size of training window (ignored if expanding_window=True or n_splits is set). Required if n_splits is None.
step_size (int, optional) – Step between windows (overridden if overlapping_tests=False)
test_size (int, optional) – Size of test window. Required if n_splits is None.
gap (int, default=0) – Number of samples to skip between train and test windows
val_ratio (float, default=0.0) – Ratio of validation set from training window
random_state (int, optional) – Random seed for reproducibility
overlapping_tests (bool, default=False) – If False, automatically sets step_size=test_size to ensure each sample is tested exactly once (like K-fold for time series)
expanding_window (bool, default=False) – If True, training window grows to include all past data (like sklearn’s TimeSeriesSplit). If False, uses fixed sliding window of size window_size.
undersample (bool, default=False) – If True, balance classes in training sets by randomly undersampling the majority class to match the minority class count. Temporal order is maintained. Requires y labels in split().
n_splits (int, optional) – Number of splits to generate. If specified, window_size and test_size are automatically calculated to create exactly n_splits folds. Cannot be used together with manual window_size/test_size specification.
Examples
>>> from scitex_ml.classification import TimeSeriesSlidingWindowSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> # Fixed window, non-overlapping tests (default)
>>> swcv = TimeSeriesSlidingWindowSplit(window_size=50, test_size=10, gap=5)
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Expanding window (use all past data)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     window_size=50, test_size=10, gap=5, expanding_window=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")  # Train grows!
>>>
>>> # Using n_splits (automatically calculates window and test sizes)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     n_splits=5, gap=0, expanding_window=True, undersample=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Visualize splits
>>> fig = swcv.plot_splits(X, y, timestamps)
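The difference between a fixed sliding window and an expanding window can be made concrete with a small index generator. This is a sketch under assumptions, not the class's implementation: the function sliding_window_indices is hypothetical, it assumes samples are already in chronological order, and it steps by test_size when no step_size is given, mirroring the documented overlapping_tests=False behavior.

```python
import numpy as np


def sliding_window_indices(n_samples, window_size, test_size, gap=0,
                           step_size=None, expanding_window=False):
    """Hypothetical helper: list of (train_idx, test_idx) index arrays."""
    # Non-overlapping tests: advance by one test window per fold.
    step = test_size if step_size is None else step_size
    folds = []
    train_end = window_size
    while train_end + gap + test_size <= n_samples:
        # Expanding windows always start at 0; fixed windows slide forward.
        train_start = 0 if expanding_window else train_end - window_size
        test_start = train_end + gap
        folds.append((np.arange(train_start, train_end),
                      np.arange(test_start, test_start + test_size)))
        train_end += step
    return folds


fixed = sliding_window_indices(100, window_size=50, test_size=10, gap=5)
expanding = sliding_window_indices(100, window_size=50, test_size=10, gap=5,
                                   expanding_window=True)
print([len(tr) for tr, _ in fixed])      # [50, 50, 50, 50]
print([len(tr) for tr, _ in expanding])  # [50, 60, 70, 80]
```

In both modes the test windows are identical; only the start of the training window differs.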
- __init__(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)[source]
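The undersample option described in the parameters above balances the training set by discarding majority-class samples. A minimal sketch of that idea follows; the helper undersample_train is hypothetical, and it assumes that index order corresponds to temporal order, so sorting the kept indices restores chronology.

```python
import numpy as np


def undersample_train(train_idx, y, random_state=None):
    """Hypothetical helper: keep an equal number of samples per class."""
    rng = np.random.default_rng(random_state)
    train_idx = np.asarray(train_idx)
    labels = np.asarray(y)[train_idx]
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    kept = [
        rng.choice(train_idx[labels == c], size=n_keep, replace=False)
        for c in classes
    ]
    # Sorting puts the surviving indices back in chronological order.
    return np.sort(np.concatenate(kept))


y = np.array([0] * 30 + [1] * 10)
balanced_idx = undersample_train(np.arange(40), y, random_state=0)
print(np.bincount(y[balanced_idx]))  # [10 10]
```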
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesSlidingWindowSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to split.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.timeseries.TimeSeriesCalendarSplit(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)[source]
Calendar-based time series cross-validation splitter.
Splits data based on calendar intervals (e.g., months, weeks, days). Ensures temporal order is preserved and no data leakage occurs.
- Parameters:
interval (str) – Time interval for splitting. Options:
- 'D': Daily
- 'W': Weekly
- 'M': Monthly
- 'Q': Quarterly
- 'Y': Yearly
Or any pandas frequency string
n_train_intervals (int) – Number of intervals to use for training
n_test_intervals (int) – Number of intervals to use for testing (default: 1)
gap_intervals (int) – Number of intervals to skip between train and test (default: 0)
step_intervals (int) – Number of intervals to step forward for next fold (default: 1)
Examples
>>> from scitex_ml.classification import TimeSeriesCalendarSplit
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with daily timestamps
>>> dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
>>> X = np.random.randn(len(dates), 10)
>>> y = np.random.randint(0, 2, len(dates))
>>>
>>> # Monthly splits: 6 months train, 1 month test
>>> tscal = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> for train_idx, test_idx in tscal.split(X, y, timestamps=dates):
...     print(f"Train: {dates[train_idx[0]]:%Y-%m} to {dates[train_idx[-1]]:%Y-%m}")
...     print(f"Test: {dates[test_idx[0]]:%Y-%m} to {dates[test_idx[-1]]:%Y-%m}")
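How the interval counts translate into folds can be sketched with the standard library alone. The function monthly_folds below is a hypothetical illustration for monthly intervals only; the real class may bucket timestamps and count folds differently.

```python
from datetime import date, timedelta


def monthly_folds(dates, n_train_intervals=6, n_test_intervals=1,
                  gap_intervals=0, step_intervals=1):
    """Hypothetical helper: rolling (train_idx, test_idx) lists by calendar month."""
    month_of = [(d.year, d.month) for d in dates]  # month bucket per sample
    months = sorted(set(month_of))
    span = n_train_intervals + gap_intervals + n_test_intervals
    folds = []
    for start in range(0, len(months) - span + 1, step_intervals):
        train_months = set(months[start:start + n_train_intervals])
        test_months = set(
            months[start + n_train_intervals + gap_intervals:start + span]
        )
        train_idx = [i for i, m in enumerate(month_of) if m in train_months]
        test_idx = [i for i, m in enumerate(month_of) if m in test_months]
        folds.append((train_idx, test_idx))
    return folds


dates = [date(2023, 1, 1) + timedelta(days=i) for i in range(365)]
folds = monthly_folds(dates, n_train_intervals=6)
print(len(folds))                            # 6
print(len(folds[0][0]), len(folds[0][1]))    # 181 31
```

For one year of daily data, six training months plus one test month fit into the twelve available months in six positions, so this sketch produces six folds: the first trains on January to June and tests on July, the last trains on June to November and tests on December.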
- __init__(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)[source]
- split(X, y=None, timestamps=None, groups=None)[source]
Generate calendar-based train/test splits.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,), optional) – Target variable
timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)
groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)
- Yields:
train (ndarray) – Training set indices
test (ndarray) – Test set indices
- Return type:
- split_with_val(X, y=None, timestamps=None, groups=None)[source]
Generate calendar-based train/validation/test splits.
The validation set comes after training but before test, maintaining temporal order: train < val < test.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,), optional) – Target variable
timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)
groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)
- Yields:
train (ndarray) – Training set indices
val (ndarray) – Validation set indices
test (ndarray) – Test set indices
- Return type:
- get_n_splits(X=None, y=None, timestamps=None)[source]
Calculate number of splits.
- Parameters:
X (array-like, optional) – Not used directly
y (array-like, optional) – Not used
timestamps (array-like or pd.DatetimeIndex, optional) – Timestamps to determine number of possible splits
- Returns:
n_splits – Number of splits. Returns -1 if timestamps is None.
- Return type:
int
- plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)[source]
Visualize the train/test splits as timeline rectangles with scatter plots.
- Parameters:
X (array-like) – Training data (used to determine data size)
y (array-like, optional) – Target variable (used for color-coding scatter points)
timestamps (array-like or pd.DatetimeIndex) – Timestamps for each sample
figsize (tuple, default (12, 6)) – Figure size (width, height)
save_path (str, optional) – Path to save the plot
- Returns:
fig – The created figure
- Return type:
matplotlib.figure.Figure
Examples
>>> splitter = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> fig = splitter.plot_splits(X, timestamps=dates)
>>> fig.savefig('calendar_splits.png')
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesCalendarSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to split.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.timeseries.TimeSeriesStrategy(value)[source]
Available time series CV strategies.
- STRATIFIED = 'stratified'
- BLOCKING = 'blocking'
- SLIDING = 'sliding'
- EXPANDING = 'expanding'
- FIXED = 'fixed'
- classmethod from_string(value)[source]
Create strategy from string value.
- Parameters:
value (str) – String representation of strategy
- Returns:
Corresponding enum value
- Return type:
TimeSeriesStrategy
- Raises:
ValueError – If value doesn’t match any strategy
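The lookup performed by from_string can be approximated with a plain Enum. The class StrategySketch below is a stand-in written for illustration, not the library's class; the case-insensitive matching and the error message wording are assumptions.

```python
from enum import Enum


class StrategySketch(Enum):
    """Stand-in for TimeSeriesStrategy with the same member values."""
    STRATIFIED = "stratified"
    BLOCKING = "blocking"
    SLIDING = "sliding"
    EXPANDING = "expanding"
    FIXED = "fixed"

    @classmethod
    def from_string(cls, value):
        # Assumption: matching ignores case.
        try:
            return cls(value.lower())
        except ValueError:
            valid = [member.value for member in cls]
            raise ValueError(
                f"Unknown strategy {value!r}; expected one of {valid}"
            ) from None


print(StrategySketch.from_string("blocking"))  # StrategySketch.BLOCKING
```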
- class scitex_ml.classification.timeseries.TimeSeriesMetadata(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)[source]
Metadata about the time series data.
This dataclass captures essential characteristics of time series data that inform the selection of appropriate cross-validation strategies.
Examples
>>> import numpy as np
>>> from scitex_ml.classification import TimeSeriesMetadata
>>>
>>> # Create metadata for a dataset
>>> metadata = TimeSeriesMetadata(
...     n_samples=1000,
...     n_features=10,
...     n_classes=2,
...     has_groups=True,
...     group_sizes={0: 250, 1: 250, 2: 250, 3: 250},
...     time_range=(0.0, 999.0),
...     sampling_rate=1.0,
...     has_gaps=False,
...     max_gap_size=None,
...     is_balanced=True,
...     class_distribution={0: 0.5, 1: 0.5}
... )
>>>
>>> print(f"Dataset has {metadata.n_samples} samples")
>>> print(f"Number of groups: {len(metadata.group_sizes) if metadata.group_sizes else 0}")
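Several of these fields can be derived from the label and group arrays rather than typed by hand. The helper describe_labels below is hypothetical and only a sketch; in particular, the balance_tolerance threshold used to decide is_balanced is an assumption, since the library's own criterion is not documented here.

```python
from collections import Counter


def describe_labels(y, groups=None, balance_tolerance=0.1):
    """Hypothetical helper: derive label and group fields as a dict."""
    n_samples = len(y)
    counts = Counter(y)
    class_distribution = {c: k / n_samples for c, k in sorted(counts.items())}
    shares = list(class_distribution.values())
    # Assumption: "balanced" means class shares differ by at most the tolerance.
    is_balanced = max(shares) - min(shares) <= balance_tolerance
    group_sizes = dict(sorted(Counter(groups).items())) if groups is not None else None
    return {
        "n_classes": len(counts),
        "has_groups": groups is not None,
        "group_sizes": group_sizes,
        "is_balanced": is_balanced,
        "class_distribution": class_distribution,
    }


y = [0] * 500 + [1] * 500
groups = [i // 250 for i in range(1000)]
fields = describe_labels(y, groups)
print(fields["class_distribution"])  # {0: 0.5, 1: 0.5}
print(fields["group_sizes"])         # {0: 250, 1: 250, 2: 250, 3: 250}
```

The keys match the keyword arguments of TimeSeriesMetadata, so the dictionary could be unpacked into the constructor together with n_samples and n_features.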
- get_summary()[source]
Generate human-readable summary of the metadata.
- Returns:
Formatted summary string
- Return type:
str
- suggest_strategy()[source]
Suggest appropriate CV strategy based on metadata.
- Returns:
Suggested strategy name
- Return type:
- __init__(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)
- scitex_ml.classification.timeseries.normalize_timestamp(timestamp, return_as='str', normalize_utc=True)[source]
Standardize any timestamp format to requested output type.
- Parameters:
- Returns:
Standardized timestamp in requested format:
- "str": String in standard format
- "datetime": datetime object
- "unix": Unix timestamp (float)
- Return type:
Examples
>>> from datetime import datetime
>>> dt = datetime(2010, 6, 18, 10, 15, 0)
>>> normalize_timestamp(dt, return_as="str")
"2010-06-18 10:15:00.000000"
>>> normalize_timestamp(dt, return_as="datetime")
datetime(2010, 6, 18, 10, 15, 0, tzinfo=timezone.utc)
>>> normalize_timestamp(dt, return_as="unix")
1276856100.0
>>> normalize_timestamp("2010/06/18 10:15:00", return_as="str")
"2010-06-18 10:15:00.000000"
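The three output forms shown above can be reproduced with the standard library. The function to_utc_forms is a hypothetical sketch, not the library's implementation: it accepts only datetime objects and strings in the "YYYY/MM/DD HH:MM:SS" layout, and it assumes that naive inputs are already in UTC.

```python
from datetime import datetime, timezone


def to_utc_forms(timestamp):
    """Hypothetical helper: return the str, datetime and unix forms in UTC."""
    if isinstance(timestamp, str):
        dt = datetime.strptime(timestamp, "%Y/%m/%d %H:%M:%S")
    else:
        dt = timestamp
    if dt.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    else:
        dt = dt.astimezone(timezone.utc)
    return {
        "str": dt.strftime("%Y-%m-%d %H:%M:%S.%f"),
        "datetime": dt,
        "unix": dt.timestamp(),
    }


forms = to_utc_forms(datetime(2010, 6, 18, 10, 15, 0))
print(forms["str"])   # 2010-06-18 10:15:00.000000
print(forms["unix"])  # 1276856100.0
```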