scitex_ml.classification.timeseries

Time series cross-validation utilities for classification.

This module provides specialized cross-validation strategies for time series data, ensuring proper temporal ordering and preventing data leakage.

class scitex_ml.classification.timeseries.TimeSeriesStratifiedSplit(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)

Time series cross-validation with stratification support.

This splitter ensures:

  1. Test data is always chronologically after training data
  2. Optional validation set between train and test
  3. Class balance preservation in splits
  4. Gap period between train and test to avoid leakage

Parameters:
  • n_splits (int) – Number of splits (folds)

  • test_ratio (float) – Proportion of data for test set (default: 0.2)

  • val_ratio (float) – Proportion of data for validation set (default: 0.1)

  • gap (int) – Number of samples to exclude between train and test (default: 0)

  • stratify (bool) – Whether to maintain class proportions (default: True)

  • random_state (int, optional) – Random seed for reproducibility (default: None)
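
The ratio and gap parameters above can be pictured as cutting one time-ordered index into consecutive blocks. The following is a minimal sketch, not the library's implementation: `chronological_split` is a hypothetical helper, it ignores stratification and multiple folds, and it assumes the gap is dropped directly after the training block.

```python
from typing import List, Sequence, Tuple


def chronological_split(
    timestamps: Sequence[float],
    test_ratio: float = 0.2,
    val_ratio: float = 0.1,
    gap: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper: one train/val/test partition in time order."""
    # Sort sample indices by timestamp so list position reflects chronology.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(order)
    n_test = round(n * test_ratio)
    n_val = round(n * val_ratio)

    test_start = n - n_test
    val_start = test_start - n_val
    # Assumption: the gap samples are discarded right after the training block.
    train_end = max(val_start - gap, 0)

    train = order[:train_end]
    val = order[val_start:test_start]
    test = order[test_start:]
    return train, val, test


train, val, test = chronological_split(list(range(100)), gap=5)
print(len(train), len(val), len(test))  # 65 10 20
```

With 100 samples and the default ratios, the last 20 samples form the test block, the 10 before them the validation block, and 5 samples are dropped after the training block.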

Examples

>>> from scitex_ml.classification import TimeSeriesStratifiedSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> tscv = TimeSeriesStratifiedSplit(n_splits=3)
>>> for train_idx, test_idx in tscv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
__init__(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)
split(X, y=None, timestamps=None, groups=None)

Generate indices to split data into training and test sets.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,)) – Target variable

  • timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)

  • groups (array-like, shape (n_samples,), optional) – Group labels for grouped CV

Yields:
  • train (ndarray) – Training set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray]]
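
When consuming the yielded index pairs, it can be useful to confirm the temporal ordering guarantee on your own data. The sketch below shows one way to do that; `check_temporal_fold` is a hypothetical helper and not part of this module.

```python
from typing import Sequence


def check_temporal_fold(
    timestamps: Sequence[float],
    train_idx: Sequence[int],
    test_idx: Sequence[int],
) -> bool:
    """Hypothetical check: True if every training timestamp precedes every test timestamp."""
    if len(train_idx) == 0 or len(test_idx) == 0:
        return False
    latest_train = max(timestamps[i] for i in train_idx)
    earliest_test = min(timestamps[i] for i in test_idx)
    return latest_train < earliest_test


timestamps = list(range(10))
print(check_temporal_fold(timestamps, [0, 1, 2], [5, 6]))  # True
print(check_temporal_fold(timestamps, [0, 6], [5]))        # False
```

The same check can be applied inside a loop over any splitter's folds, since it only needs the timestamps and the two index arrays.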

split_with_val(X, y=None, timestamps=None, groups=None)

Generate indices with separate validation set.

Yields:
  • train (ndarray) – Training set indices

  • val (ndarray) – Validation set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray, ndarray]]

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the CV.

plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)

Visualize the stratified time series splits.

Shows train (blue), validation (green), and test (red) sets. When val_ratio=0, only shows train and test.

Parameters:
  • X (array-like) – Training data

  • y (array-like, optional) – Target variable

  • timestamps (array-like, optional) – Timestamps (if None, uses sample indices)

  • figsize (tuple, default (12, 6)) – Figure size

  • save_path (str, optional) – Path to save the plot

Returns:

fig – The created figure

Return type:

matplotlib.figure.Figure

set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesStratifiedSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.timeseries.TimeSeriesBlockingSplit(n_splits=5, test_ratio=0.2, val_ratio=0.0, random_state=None)

Time series split with blocking to handle multiple subjects/groups.

This splitter ensures temporal integrity within each subject while allowing cross-subject generalization. Each subject’s data is kept temporally coherent, but subjects can appear in both training and test sets at different time periods.

Key Features:

  • Temporal order preserved within each subject

  • No data leakage within individual subject timelines

  • Expanding window approach: more training data in later folds

  • Cross-subject generalization: subjects can be in both train and test

Use Cases:

  • Multiple patients with longitudinal medical data

  • Multiple stocks with time series financial data

  • Multiple sensors with temporal measurements

  • Any scenario with grouped time series data

Parameters:
  • n_splits (int, default=5) – Number of splits (folds)

  • test_ratio (float, default=0.2) – Proportion of data for test set per subject
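
The idea of applying test_ratio per subject can be sketched as splitting each group's own timeline into an early and a late part. This is an illustration under stated assumptions, not the library's algorithm: `per_group_temporal_split` is a hypothetical helper that produces a single fold and ignores the expanding-window behavior across folds.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def per_group_temporal_split(
    timestamps: Sequence[float],
    groups: Sequence[Hashable],
    test_ratio: float = 0.2,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split each group's timeline into early/late parts."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    train, test = [], []
    for members in by_group.values():
        # Order this subject's samples by time before cutting.
        members.sort(key=lambda i: timestamps[i])
        # Assumption: every group contributes at least one test sample.
        n_test = max(1, round(len(members) * test_ratio))
        train.extend(members[:-n_test])
        test.extend(members[-n_test:])
    return sorted(train), sorted(test)


groups = [g for g in range(4) for _ in range(25)]
timestamps = list(range(100))
train, test = per_group_temporal_split(timestamps, groups, test_ratio=0.2)
print(len(train), len(test))  # 80 20
```

Every subject appears in both sets, but within each subject all training samples precede all test samples.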

Examples

>>> from scitex_ml.classification import TimeSeriesBlockingSplit
>>> import numpy as np
>>>
>>> # Create data: 100 samples, 4 subjects (25 samples each)
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>> groups = np.repeat([0, 1, 2, 3], 25)  # Subject IDs
>>>
>>> # Each subject gets temporal split: early samples → train, later → test
>>> splitter = TimeSeriesBlockingSplit(n_splits=3, test_ratio=0.3)
>>> for train_idx, test_idx in splitter.split(X, y, timestamps, groups):
...     train_subjects = set(groups[train_idx])
...     test_subjects = set(groups[test_idx])
...     print(f"Train subjects: {train_subjects}, Test subjects: {test_subjects}")
...     # Output shows same subjects in both sets but different time periods
__init__(n_splits=5, test_ratio=0.2, val_ratio=0.0, random_state=None)
split(X, y=None, timestamps=None, groups=None)

Generate indices respecting group boundaries.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,)) – Target variable

  • timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)

  • groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required

Yields:
  • train (ndarray) – Training set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, timestamps=None, groups=None)

Generate indices with separate validation set respecting group boundaries.

Each subject gets its own train/val/test split maintaining temporal order.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,)) – Target variable

  • timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)

  • groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required

Yields:
  • train (ndarray) – Training set indices

  • val (ndarray) – Validation set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray, ndarray]]

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations.

plot_splits(X, y=None, timestamps=None, groups=None, figsize=(12, 6), save_path=None)

Visualize the blocking splits showing subject separation.

This visualization shows how data from different subjects/groups is allocated to training and test sets while maintaining temporal order within each subject.

Color Scheme:

  • Rectangle border: Blue = Training set, Red = Test set

  • Rectangle fill: Different colors represent different subjects/groups

  • Each subject gets a unique color (cycling through colormap)

Key Features:

  • No mixing: Each subject’s data stays within temporal boundaries

  • Subject separation: Same subject can appear in both train/test but at different times

  • Temporal integrity: Time flows left to right for each subject

Parameters:
  • X (array-like) – Training data

  • y (array-like, optional) – Target variable (not used)

  • timestamps (array-like, optional) – Timestamps (if None, uses sample indices)

  • groups (array-like) – Group labels (required for blocking split) - each unique value represents a subject

  • figsize (tuple, default (12, 6)) – Figure size

  • save_path (str, optional) – Path to save the plot

Returns:

fig – The created figure with proper legend showing subject colors

Return type:

matplotlib.figure.Figure

Examples

>>> # X, timestamps, and groups as defined in the class-level example
>>> splitter = TimeSeriesBlockingSplit(n_splits=3)
>>> fig = splitter.plot_splits(X, timestamps=timestamps, groups=groups)
>>> fig.show()  # Will show train (blue border) vs test (red border) by subject
set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesBlockingSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.timeseries.TimeSeriesSlidingWindowSplit(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)

Sliding window cross-validation for time series.

Creates train/test windows that slide through time with configurable behavior.

Parameters:
  • window_size (int, optional) – Size of training window (ignored if expanding_window=True or n_splits is set). Required if n_splits is None.

  • step_size (int, optional) – Step between windows (overridden if overlapping_tests=False)

  • test_size (int, optional) – Size of test window. Required if n_splits is None.

  • gap (int, default=0) – Number of samples to skip between train and test windows

  • val_ratio (float, default=0.0) – Ratio of validation set from training window

  • random_state (int, optional) – Random seed for reproducibility

  • overlapping_tests (bool, default=False) – If False, automatically sets step_size=test_size to ensure each sample is tested exactly once (like K-fold for time series)

  • expanding_window (bool, default=False) – If True, training window grows to include all past data (like sklearn’s TimeSeriesSplit). If False, uses fixed sliding window of size window_size.

  • undersample (bool, default=False) – If True, balance classes in training sets by randomly undersampling the majority class to match the minority class count. Temporal order is maintained. Requires y labels in split().

  • n_splits (int, optional) – Number of splits to generate. If specified, window_size and test_size are automatically calculated to create exactly n_splits folds. Cannot be used together with manual window_size/test_size specification.
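
How window_size, test_size, gap, and expanding_window interact can be sketched with plain index arithmetic. The generator below is a hypothetical illustration, not the library's code: it assumes samples are already in time order, assumes non-overlapping tests (so the step equals test_size), and does not cover val_ratio, undersample, or n_splits.

```python
from typing import Iterator, List, Tuple


def sliding_window_folds(
    n_samples: int,
    window_size: int,
    test_size: int,
    gap: int = 0,
    expanding_window: bool = False,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical generator of (train, test) positions over time-ordered samples."""
    step = test_size  # non-overlapping tests: advance by one test window
    train_end = window_size
    while train_end + gap + test_size <= n_samples:
        # Expanding mode keeps all past data; sliding mode keeps a fixed window.
        train_start = 0 if expanding_window else train_end - window_size
        test_start = train_end + gap
        yield (
            list(range(train_start, train_end)),
            list(range(test_start, test_start + test_size)),
        )
        train_end += step


for train, test in sliding_window_folds(100, window_size=50, test_size=10, gap=5):
    print(f"train {train[0]}-{train[-1]}, test {test[0]}-{test[-1]}")
```

In this sketch, passing expanding_window=True keeps the test windows identical while the training window grows from its initial size by test_size samples per fold.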

Examples

>>> from scitex_ml.classification import TimeSeriesSlidingWindowSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> # Fixed window, non-overlapping tests (default)
>>> swcv = TimeSeriesSlidingWindowSplit(window_size=50, test_size=10, gap=5)
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Expanding window (use all past data)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     window_size=50, test_size=10, gap=5, expanding_window=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")  # Train grows!
>>>
>>> # Using n_splits (automatically calculates window and test sizes)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     n_splits=5, gap=0, expanding_window=True, undersample=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Visualize splits
>>> fig = swcv.plot_splits(X, y, timestamps)
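
The undersample option used above balances classes while keeping temporal order. One way this could work is sketched below; `undersample_in_order` is a hypothetical helper, and it assumes that ascending index order equals time order.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def undersample_in_order(
    train_idx: Sequence[int],
    y: Sequence[Hashable],
    seed: int = 0,
) -> List[int]:
    """Hypothetical helper: equalize class counts, then restore index order."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[y[i]].append(i)
    if not by_class:
        return []
    # Every class is reduced to the size of the rarest class.
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, n_min))
    # Sorting restores chronological order under the stated assumption.
    return sorted(kept)


y = [0] * 8 + [1] * 2
balanced = undersample_in_order(range(10), y)
print(len(balanced))  # 4
```
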
__init__(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)
set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesSlidingWindowSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.timeseries.TimeSeriesCalendarSplit(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)

Calendar-based time series cross-validation splitter.

Splits data based on calendar intervals (e.g., months, weeks, days). Ensures temporal order is preserved and no data leakage occurs.

Parameters:
  • interval (str) – Time interval for splitting. Any pandas frequency string is accepted; common options:

      - 'D': Daily
      - 'W': Weekly
      - 'M': Monthly
      - 'Q': Quarterly
      - 'Y': Yearly

  • n_train_intervals (int) – Number of intervals to use for training

  • n_test_intervals (int) – Number of intervals to use for testing (default: 1)

  • gap_intervals (int) – Number of intervals to skip between train and test (default: 0)

  • step_intervals (int) – Number of intervals to step forward for next fold (default: 1)
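
The interval parameters above describe a rolling window measured in calendar units rather than sample counts. A minimal sketch for the monthly case follows. It is not the library's implementation: `monthly_folds` is a hypothetical generator that uses only the standard library and assumes that only months actually present in the data count as intervals.

```python
from datetime import date, timedelta
from typing import Iterator, List, Sequence, Tuple


def monthly_folds(
    dates: Sequence[date],
    n_train_intervals: int = 6,
    n_test_intervals: int = 1,
    gap_intervals: int = 0,
    step_intervals: int = 1,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical generator: rolling train/test folds over calendar months."""
    # Label each sample with its (year, month) bucket.
    keys = [(d.year, d.month) for d in dates]
    months = sorted(set(keys))
    span = n_train_intervals + gap_intervals + n_test_intervals
    for start in range(0, len(months) - span + 1, step_intervals):
        train_months = set(months[start:start + n_train_intervals])
        test_start = start + n_train_intervals + gap_intervals
        test_months = set(months[test_start:test_start + n_test_intervals])
        train = [i for i, k in enumerate(keys) if k in train_months]
        test = [i for i, k in enumerate(keys) if k in test_months]
        yield train, test


days = [date(2023, 1, 1) + timedelta(days=k) for k in range(365)]
folds = list(monthly_folds(days, n_train_intervals=6))
print(len(folds))  # 6
```

For one year of daily data with six training months and one test month, the first fold trains on January through June and tests on July, and the last fold tests on December.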

Examples

>>> from scitex_ml.classification import TimeSeriesCalendarSplit
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with daily timestamps
>>> dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
>>> X = np.random.randn(len(dates), 10)
>>> y = np.random.randint(0, 2, len(dates))
>>>
>>> # Monthly splits: 6 months train, 1 month test
>>> tscal = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> for train_idx, test_idx in tscal.split(X, y, timestamps=dates):
...     print(f"Train: {dates[train_idx[0]]:%Y-%m} to {dates[train_idx[-1]]:%Y-%m}")
...     print(f"Test:  {dates[test_idx[0]]:%Y-%m} to {dates[test_idx[-1]]:%Y-%m}")
__init__(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)
split(X, y=None, timestamps=None, groups=None)

Generate calendar-based train/test splits.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,), optional) – Target variable

  • timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)

  • groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)

Yields:
  • train (ndarray) – Training set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, timestamps=None, groups=None)

Generate calendar-based train/validation/test splits.

The validation set comes after training but before test, maintaining temporal order: train < val < test.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,), optional) – Target variable

  • timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)

  • groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)

Yields:
  • train (ndarray) – Training set indices

  • val (ndarray) – Validation set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray, ndarray]]

get_n_splits(X=None, y=None, timestamps=None)

Calculate number of splits.

Parameters:
  • X (array-like, optional) – Not used directly

  • y (array-like, optional) – Not used

  • timestamps (array-like or pd.DatetimeIndex, optional) – Timestamps to determine number of possible splits

Returns:

n_splits – Number of splits. Returns -1 if timestamps is None.

Return type:

int

plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)

Visualize the train/test splits as timeline rectangles with scatter plots.

Parameters:
  • X (array-like) – Training data (used to determine data size)

  • y (array-like, optional) – Target variable (used for color-coding scatter points)

  • timestamps (array-like or pd.DatetimeIndex) – Timestamps for each sample

  • figsize (tuple, default (12, 6)) – Figure size (width, height)

  • save_path (str, optional) – Path to save the plot

Returns:

fig – The created figure

Return type:

matplotlib.figure.Figure

Examples

>>> splitter = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> fig = splitter.plot_splits(X, timestamps=dates)
>>> fig.savefig('calendar_splits.png')
set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') → TimeSeriesCalendarSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.timeseries.TimeSeriesStrategy(value)

Available time series CV strategies.

STRATIFIED

Single time series with class balance preservation

Type:

str

BLOCKING

Multiple independent time series (e.g., different patients)

Type:

str

SLIDING

Sliding window approach with fixed-size windows

Type:

str

EXPANDING

Expanding window where training set grows over time

Type:

str

FIXED

Fixed train/test split at specific time point

Type:

str

STRATIFIED = 'stratified'
BLOCKING = 'blocking'
SLIDING = 'sliding'
EXPANDING = 'expanding'
FIXED = 'fixed'
classmethod from_string(value)

Create strategy from string value.

Parameters:

value (str) – String representation of strategy

Returns:

Corresponding enum value

Return type:

TimeSeriesStrategy

Raises:

ValueError – If value doesn’t match any strategy
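
The lookup-or-raise behavior described above is a common enum pattern. The sketch below mirrors the documented member values in a stand-in class; `StrategySketch` is hypothetical, and the case-insensitive matching is an assumption that the source does not confirm.

```python
from enum import Enum


class StrategySketch(Enum):
    """Hypothetical stand-in mirroring the documented member values."""

    STRATIFIED = "stratified"
    BLOCKING = "blocking"
    SLIDING = "sliding"
    EXPANDING = "expanding"
    FIXED = "fixed"

    @classmethod
    def from_string(cls, value: str) -> "StrategySketch":
        # Assumption: matching ignores case and surrounding whitespace.
        key = value.strip().lower()
        for member in cls:
            if member.value == key:
                return member
        valid = ", ".join(m.value for m in cls)
        raise ValueError(f"Unknown strategy {value!r}; expected one of: {valid}")


print(StrategySketch.from_string("Sliding"))  # StrategySketch.SLIDING
```
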

get_description()

Get human-readable description of the strategy.

Returns:

Description of the strategy

Return type:

str

class scitex_ml.classification.timeseries.TimeSeriesMetadata(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)

Metadata about the time series data.

This dataclass captures essential characteristics of time series data that inform the selection of appropriate cross-validation strategies.

n_samples

Total number of samples in the dataset

Type:

int

n_features

Number of features per sample

Type:

int

n_classes

Number of unique classes (None for regression)

Type:

Optional[int]

has_groups

Whether data contains group/subject identifiers

Type:

bool

group_sizes

Mapping of group IDs to their sample counts

Type:

Optional[Dict[Any, int]]

time_range

Minimum and maximum timestamp values

Type:

Optional[Tuple[float, float]]

sampling_rate

Samples per time unit (e.g., Hz for sensor data)

Type:

Optional[float]

has_gaps

Whether the time series has temporal gaps

Type:

bool

max_gap_size

Maximum gap between consecutive timestamps

Type:

Optional[float]

is_balanced

Whether classes are balanced (for classification)

Type:

bool

class_distribution

Mapping of class labels to their proportions

Type:

Optional[Dict[Any, float]]
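
Several of the fields above can be derived directly from labels and timestamps. The sketch below shows one possible derivation; `describe_series` is a hypothetical helper, the balance tolerance and the expected sampling step are assumed thresholds, and the input is assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence, Tuple


def describe_series(
    y: Sequence[Hashable],
    timestamps: Sequence[float],
    expected_step: float = 1.0,
    balance_tolerance: float = 0.1,
) -> Tuple[Dict[Hashable, float], bool, bool, Optional[float]]:
    """Hypothetical helper: (class_distribution, is_balanced, has_gaps, max_gap_size)."""
    counts = Counter(y)
    n = len(y)
    distribution = {label: c / n for label, c in counts.items()}
    # Assumption: "balanced" means class proportions differ by at most the tolerance.
    spread = max(distribution.values()) - min(distribution.values())
    is_balanced = spread <= balance_tolerance

    ordered = sorted(timestamps)
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    largest = max(diffs) if diffs else None
    # Assumption: a gap is any step larger than the expected sampling interval.
    has_gaps = largest is not None and largest > expected_step
    return distribution, is_balanced, has_gaps, largest if has_gaps else None


dist, balanced, has_gaps, max_gap = describe_series([0, 0, 1, 1], [0, 1, 2, 5])
print(dist, balanced, has_gaps, max_gap)  # {0: 0.5, 1: 0.5} True True 3
```
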

Examples

>>> import numpy as np
>>> from scitex_ml.classification import TimeSeriesMetadata
>>>
>>> # Create metadata for a dataset
>>> metadata = TimeSeriesMetadata(
...     n_samples=1000,
...     n_features=10,
...     n_classes=2,
...     has_groups=True,
...     group_sizes={0: 250, 1: 250, 2: 250, 3: 250},
...     time_range=(0.0, 999.0),
...     sampling_rate=1.0,
...     has_gaps=False,
...     max_gap_size=None,
...     is_balanced=True,
...     class_distribution={0: 0.5, 1: 0.5}
... )
>>>
>>> print(f"Dataset has {metadata.n_samples} samples")
>>> print(f"Number of groups: {len(metadata.group_sizes) if metadata.group_sizes else 0}")
n_samples: int
n_features: int
n_classes: int | None = None
has_groups: bool = False
group_sizes: Dict[Any, int] | None = None
time_range: Tuple[float, float] | None = None
sampling_rate: float | None = None
has_gaps: bool = False
max_gap_size: float | None = None
is_balanced: bool = True
class_distribution: Dict[Any, float] | None = None
get_summary()

Generate human-readable summary of the metadata.

Returns:

Formatted summary string

Return type:

str

suggest_strategy()

Suggest appropriate CV strategy based on metadata.

Returns:

Suggested strategy name

Return type:

str

__init__(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)
scitex_ml.classification.timeseries.normalize_timestamp(timestamp, return_as='str', normalize_utc=True)

Standardize any timestamp format to requested output type.

Parameters:
  • timestamp (datetime, str, int, or float) – Timestamp in any supported format

  • return_as (str) – Output format: “str” (default), “datetime”, or “unix”

  • normalize_utc (bool) – If True, normalize to UTC timezone

Returns:

Standardized timestamp in the requested format:

  • "str": String in standard format

  • "datetime": datetime object

  • "unix": Unix timestamp (float)

Return type:

str, datetime, or float
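
The conversions described above can be approximated with the standard library alone. The following is a sketch, not this function's implementation: `to_utc_datetime` is a hypothetical helper that handles only two string layouts and assumes naive datetimes are already in UTC.

```python
from datetime import datetime, timezone
from typing import Union


def to_utc_datetime(timestamp: Union[datetime, str, int, float]) -> datetime:
    """Hypothetical helper: coerce a few common inputs to an aware UTC datetime."""
    if isinstance(timestamp, datetime):
        dt = timestamp
    elif isinstance(timestamp, (int, float)):
        # Numbers are read as Unix seconds.
        return datetime.fromtimestamp(timestamp, tz=timezone.utc)
    else:
        # Accept "YYYY-MM-DD HH:MM:SS" and the slash-separated variant.
        dt = datetime.strptime(timestamp.replace("/", "-"), "%Y-%m-%d %H:%M:%S")
    # Assumption: naive datetimes are already expressed in UTC.
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


dt = to_utc_datetime("2010/06/18 10:15:00")
print(dt.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 2010-06-18 10:15:00.000000
print(dt.timestamp())                        # 1276856100.0
```
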

Examples

>>> from datetime import datetime
>>> from scitex_ml.classification.timeseries import normalize_timestamp
>>> dt = datetime(2010, 6, 18, 10, 15, 0)
>>> normalize_timestamp(dt, return_as="str")
'2010-06-18 10:15:00.000000'
>>> normalize_timestamp(dt, return_as="datetime")
datetime.datetime(2010, 6, 18, 10, 15, tzinfo=datetime.timezone.utc)
>>> normalize_timestamp(dt, return_as="unix")
1276856100.0
>>> normalize_timestamp("2010/06/18 10:15:00", return_as="str")
'2010-06-18 10:15:00.000000'