scitex_ml.classification

Classification utilities with a unified API.

class scitex_ml.classification.ClassificationReporter(output_dir, tasks=None, precision=3, required_metrics=['balanced_accuracy', 'mcc', 'confusion_matrix', 'classification_report', 'roc_auc', 'roc_curve', 'pre_rec_auc', 'pre_rec_curve'], verbose=True, **kwargs)[source]

Unified classification reporter for single and multi-task scenarios.

This reporter automatically adapts to your use case:

  • Single task: Just use it without specifying tasks

  • Multiple tasks: Specify tasks upfront or create them dynamically

  • Seamless switching between single and multi-task workflows

Features:

  • Comprehensive metrics calculation (balanced accuracy, MCC, ROC-AUC, PR-AUC, etc.)

  • Automated visualization generation: confusion matrices, ROC and Precision-Recall curves, feature importance plots (via plotter), CV aggregation plots with faded fold lines, and a comprehensive metrics dashboard

  • Multi-format report generation (Org, Markdown, LaTeX, HTML, DOCX, PDF)

  • Cross-validation support with automatic fold aggregation

  • Multi-task classification tracking

Parameters:
  • output_dir (Union[str, Path]) – Base directory for outputs. If None, creates timestamped directory.

  • tasks (List[str], optional) – List of task names. If None, tasks are created dynamically as needed.

  • precision (int, default 3) – Number of decimal places for numerical outputs

  • required_metrics (List[str], optional) – List of metrics to calculate. Defaults to comprehensive set.

  • verbose (bool, default True) – Whether to print initialization messages

  • **kwargs – Additional arguments passed to base class

Examples

>>> # Single task usage (no tasks specified)
>>> reporter = ClassificationReporter("./results")
>>> reporter.calculate_metrics(y_true, y_pred, y_proba)
>>> # Multi-task with predefined tasks
>>> reporter = ClassificationReporter("./results", tasks=["binary", "multiclass"])
>>> reporter.calculate_metrics(y_true, y_pred, task="binary")
>>> # Dynamic task creation
>>> reporter = ClassificationReporter("./results")
>>> reporter.calculate_metrics(y_true1, y_pred1, task="task1")
>>> reporter.calculate_metrics(y_true2, y_pred2, task="task2")
>>> # Feature importance visualization (via plotter)
>>> reporter._single_reporter.plotter.create_feature_importance_plot(
...     feature_importance=importances,
...     feature_names=feature_names,
...     save_path="./results/feature_importance.png"
... )
>>> # CV aggregation plots (automatically created on save_summary)
>>> for fold in range(5):
...     metrics = reporter.calculate_metrics(y_true, y_pred, y_proba, fold=fold)
>>> reporter.save_summary()  # Creates CV aggregation plots with faded fold lines
__init__(output_dir, tasks=None, precision=3, required_metrics=['balanced_accuracy', 'mcc', 'confusion_matrix', 'classification_report', 'roc_auc', 'roc_curve', 'pre_rec_auc', 'pre_rec_curve'], verbose=True, **kwargs)[source]
calculate_metrics(y_true, y_pred, y_proba=None, labels=None, fold=None, task=None, verbose=True, model=None, feature_names=None)[source]

Calculate metrics for classification.

Automatically handles single vs multi-task scenarios:

  • If no task specified and no tasks defined: creates “default” task

  • If no task specified but tasks exist: uses first task

  • If task specified: uses/creates that specific task
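
The three task-resolution rules can be sketched in a few lines. This is only an illustration of the documented behaviour: resolve_task is a hypothetical helper, not a scitex_ml function, and the way the reporter stores its task list is assumed.

```python
from typing import List, Optional


def resolve_task(task: Optional[str], tasks: List[str]) -> str:
    """Pick the task name to use, registering it if it is new.

    Hypothetical helper that mirrors the three documented rules only.
    """
    if task is None:
        # No task requested: use the first known task, or fall back to
        # a "default" task when none exist yet.
        task = tasks[0] if tasks else "default"
    if task not in tasks:
        tasks.append(task)  # dynamic task creation
    return task


tasks: List[str] = []
print(resolve_task(None, tasks))      # default
print(resolve_task("binary", tasks))  # binary
print(resolve_task(None, tasks))      # default (first registered task)
print(tasks)                          # ['default', 'binary']
```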

Parameters:
  • y_true (np.ndarray) – True class labels

  • y_pred (np.ndarray) – Predicted class labels

  • y_proba (np.ndarray, optional) – Prediction probabilities (required for AUC metrics)

  • labels (List[str], optional) – Class labels for display

  • fold (int, optional) – Fold index for cross-validation

  • task (str, optional) – Task identifier. If None and no tasks exist, creates “default” task.

  • verbose (bool, default True) – Whether to print progress

  • model (object, optional) – Trained model for automatic feature importance extraction

  • feature_names (List[str], optional) – Feature names for feature importance (required if model is provided)

Returns:

Dictionary of calculated metrics

Return type:

Dict[str, Any]
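
For reference, two of the default metrics (balanced accuracy and MCC) can be computed by hand for a binary problem. The sketch below uses the standard textbook definitions and plain Python; it is not the reporter's implementation, and the function names are illustrative only.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall; assumes both classes occur in y_true."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.733
print(round(mcc(y_true, y_pred), 3))                # 0.467
```

Rounding to three decimals here matches the reporter's default precision=3.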

save(data, relative_path, task=None, fold=None)[source]

Save custom data with automatic task/fold organization.

Parameters:
  • data (Any) – Data to save (any format supported by scitex_io.save)

  • relative_path (Union[str, Path]) – Relative path from output directory

  • task (Optional[str], default None) – Task name. If provided, saves to task-specific directory

  • fold (Optional[int], default None) – If provided, automatically prepends “fold_{fold:02d}/” to path

Returns:

Absolute path to the saved file

Return type:

Path

Examples

>>> # Single task mode (no task specified)
>>> reporter.save({"accuracy": 0.95}, "metrics.json")
>>> # Multi-task mode
>>> reporter.save(results, "results.csv", task="binary", fold=0)
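
The task/fold path organization can be pictured as follows. Only the “fold_{fold:02d}/” prefix is documented; placing the task name as a directory directly under the output directory is an assumption, and build_save_path is a hypothetical helper rather than part of the library.

```python
from pathlib import Path
from typing import Optional, Union


def build_save_path(
    output_dir: Union[str, Path],
    relative_path: Union[str, Path],
    task: Optional[str] = None,
    fold: Optional[int] = None,
) -> Path:
    """Compose output_dir / [task] / [fold_XX] / relative_path."""
    path = Path(relative_path)
    if fold is not None:
        # Documented behaviour: prepend "fold_{fold:02d}/" to the path.
        path = Path(f"fold_{fold:02d}") / path
    if task is not None:
        # Assumption: the task-specific directory sits above the fold directory.
        path = Path(task) / path
    return Path(output_dir) / path


print(build_save_path("./results", "metrics.json").as_posix())
# results/metrics.json
print(build_save_path("./results", "results.csv", task="binary", fold=0).as_posix())
# results/binary/fold_00/results.csv
```

Unlike save(), this sketch does not write anything or resolve the result to an absolute path.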
get_summary()[source]

Get summary of all calculated metrics.

Returns:

Summary of metrics across all tasks and folds

Return type:

Dict[str, Any]

save_summary(filename='summary.json', verbose=True)[source]

Save summary to file.

Parameters:
  • filename (str) – Filename for summary

  • verbose (bool) – Whether to print summary

Returns:

Path to saved summary file

Return type:

Path

save_feature_importance(model, feature_names, fold=None, task=None)[source]

Calculate and save feature importance for tree-based models.

Parameters:
  • model (object) – Fitted classifier (must have feature_importances_)

  • feature_names (List[str]) – Names of features

  • fold (int, optional) – Fold number for tracking

  • task (str, optional) – Task name for multi-task mode

Returns:

Dictionary of feature importances {feature_name: importance}

Return type:

Dict[str, float]
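
The returned mapping has the shape {feature_name: importance}. A rough sketch of how such a dictionary could be built from any object exposing feature_importances_ is shown below; FittedTreeStub and extract_feature_importance are stand-ins invented for illustration, and sorting by importance is an assumption.

```python
from typing import Dict, Sequence


class FittedTreeStub:
    """Stand-in for a fitted tree-based model (hypothetical)."""

    feature_importances_ = [0.1, 0.6, 0.3]


def extract_feature_importance(
    model: object, feature_names: Sequence[str], precision: int = 3
) -> Dict[str, float]:
    """Map feature names to importances, highest importance first."""
    if not hasattr(model, "feature_importances_"):
        raise AttributeError("model must expose feature_importances_")
    importances = list(model.feature_importances_)
    if len(importances) != len(feature_names):
        raise ValueError("feature_names length does not match the model")
    pairs = sorted(zip(feature_names, importances), key=lambda kv: kv[1], reverse=True)
    return {name: round(float(value), precision) for name, value in pairs}


print(extract_feature_importance(FittedTreeStub(), ["age", "dose", "weight"]))
# {'dose': 0.6, 'weight': 0.3, 'age': 0.1}
```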

save_feature_importance_summary(all_importances, task=None)[source]

Create summary visualization of feature importances across all folds.

Parameters:
  • all_importances (List[Dict[str, float]]) – List of feature importance dicts from each fold

  • task (str, optional) – Task name for multi-task mode

Return type:

None
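
Before plotting, the per-fold dictionaries have to be combined in some way. One plausible aggregation (mean and standard deviation per feature) is sketched here; the statistics the library actually plots are not documented above, so treat this purely as an illustration with a hypothetical helper name.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_importances(all_importances: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-fold importances into mean/std per feature."""
    names = sorted({name for fold in all_importances for name in fold})
    summary = {}
    for name in names:
        # A feature may be missing from some folds; use the folds that have it.
        values = [fold[name] for fold in all_importances if name in fold]
        summary[name] = {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
            "n_folds": len(values),
        }
    return summary


folds = [{"age": 0.2, "dose": 0.8}, {"age": 0.4, "dose": 0.6}]
for name, stats in summarize_importances(folds).items():
    print(name, round(stats["mean"], 3), round(stats["std"], 3))
```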

class scitex_ml.classification.SingleTaskClassificationReporter(output_dir, config=None, verbose=True, **kwargs)[source]

Improved single-task classification reporter with unified API.

Key improvements:

  • Inherits from BaseClassificationReporter for consistent API

  • Lazy directory creation (no empty folders)

  • Numerical precision control

  • Graceful plotting with proper error handling

  • Consistent parameter names across all methods

Features:

  • Comprehensive metrics calculation (balanced accuracy, MCC, ROC-AUC, PR-AUC, etc.)

  • Automated visualization generation: confusion matrices, ROC and Precision-Recall curves, feature importance plots, CV aggregation plots with faded fold lines, and a comprehensive metrics dashboard

  • Multi-format report generation (Org, Markdown, LaTeX, HTML, DOCX, PDF)

  • Cross-validation support with automatic fold aggregation

Parameters:
  • output_dir (Union[str, Path]) – Base directory for outputs. If None, creates timestamped directory.

  • config (ReporterConfig, optional) – Configuration object for advanced settings

  • verbose (bool, default True) – Print initialization message

  • **kwargs – Additional arguments passed to base class

Examples

>>> # Basic usage
>>> reporter = SingleTaskClassificationReporter("./results")
>>> metrics = reporter.calculate_metrics(y_true, y_pred, y_proba, labels=['A', 'B'])
>>> reporter.save_summary()
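>>> # Cross-validation with automatic CV aggregation plots
>>> # (model is a placeholder for any estimator with fit/predict/predict_proba)
>>> for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
...     model.fit(X[train_idx], y[train_idx])
...     y_pred = model.predict(X[test_idx])
...     y_proba = model.predict_proba(X[test_idx])
...     metrics = reporter.calculate_metrics(
...         y[test_idx], y_pred, y_proba, fold=fold
...     )
>>> reporter.save_summary()  # Automatically creates CV aggregation visualizations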
>>> # Feature importance visualization
>>> reporter.plotter.create_feature_importance_plot(
...     feature_importance=importances,
...     feature_names=feature_names,
...     save_path=output_dir / "feature_importance.png"
... )
__init__(output_dir, config=None, verbose=True, **kwargs)[source]
set_session_config(config)[source]

Set the SciTeX session CONFIG object for inclusion in reports.

Parameters:

config (Any) – The SciTeX session CONFIG object

Return type:

None

save_summary(filename='cv_summary/summary.json', verbose=True)[source]

Save summary to file, create CV summary visualizations, and generate reports.

Parameters:
  • filename (str, default "cv_summary/summary.json") – Filename for summary, relative to the output directory (saved under the cv_summary directory by default)

  • verbose (bool, default True) – Print summary to console

Returns:

Path to saved summary file

Return type:

Path

class scitex_ml.classification.Classifier(class_weight=None, random_state=42)[source]

Server for initializing various scikit-learn classifiers with a consistent interface.

Example

>>> clf_server = Classifier(class_weight={0: 1.0, 1: 2.0}, random_state=42)
>>> clf = clf_server("SVC", scaler=_StandardScaler())
>>> print(clf_server.list)
['CatBoostClassifier', 'Perceptron', ...]
Parameters:
  • class_weight (Optional[Dict[int, float]]) – Class weights for handling imbalanced datasets

  • random_state (int) – Random seed for reproducibility

__init__(class_weight=None, random_state=42)[source]
property list: List[str]
class scitex_ml.classification.CrossValidationExperiment(name, model_fn, cv=None, output_dir=None, metrics=None, save_models=True, verbose=True)[source]

Streamlined cross-validation experiment runner.

This class handles:

  • Cross-validation splitting

  • Model training and evaluation

  • Automatic metric calculation

  • Hyperparameter tracking

  • Progress monitoring

  • Report generation

Parameters:
  • name (str) – Experiment name

  • model_fn (Callable) – Function that returns a model instance

  • cv (BaseCrossValidator, optional) – Cross-validation splitter (default: 5-fold stratified)

  • output_dir (Union[str, Path], optional) – Output directory for results

  • metrics (List[str], optional) – List of metrics to calculate

  • save_models (bool) – Whether to save trained models

  • verbose (bool) – Whether to print progress
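
The role of model_fn (a callable that returns a fresh model for every fold) can be seen in a stripped-down cross-validation loop. This is a conceptual sketch in plain Python, not the experiment runner itself: MajorityClassifier, k_fold_indices, and run_cv are hypothetical names, and the folds here are contiguous and unstratified, whereas the documented default splitter is 5-fold stratified.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


class MajorityClassifier:
    """Toy model: always predicts the most frequent training label."""

    def fit(self, X: Sequence, y: Sequence) -> "MajorityClassifier":
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: Sequence) -> List:
        return [self.label_ for _ in X]


def k_fold_indices(n_samples: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0) for i in range(n_folds)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def run_cv(X: Sequence, y: Sequence, model_fn: Callable[[], object], n_folds: int = 5) -> Dict:
    """Train a fresh model per fold and collect per-fold accuracy."""
    scores = []
    for train, test in k_fold_indices(len(X), n_folds):
        model = model_fn()  # new, unfitted instance for every fold
        model.fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in test])
        scores.append(mean([1.0 if p == y[i] else 0.0 for p, i in zip(preds, test)]))
    return {"fold_accuracy": scores, "mean_accuracy": mean(scores)}


X = list(range(10))
y = [0] * 7 + [1] * 3
print(run_cv(X, y, MajorityClassifier, n_folds=5))
```

Passing a factory instead of a fitted instance keeps state from one fold from leaking into the next.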

__init__(name, model_fn, cv=None, output_dir=None, metrics=None, save_models=True, verbose=True)[source]
set_hyperparameters(**kwargs)[source]

Set hyperparameters for tracking.

Parameters:

**kwargs – Hyperparameter key-value pairs

Return type:

None

describe_dataset(X, y, feature_names=None, class_names=None)[source]

Record dataset information.

Parameters:
  • X (np.ndarray) – Features

  • y (np.ndarray) – Labels

  • feature_names (List[str], optional) – Feature names

  • class_names (List[str], optional) – Class names

Return type:

None

run(X, y, feature_names=None, class_names=None, calculate_curves=True)[source]

Run complete cross-validation experiment.

Parameters:
  • X (np.ndarray) – Features

  • y (np.ndarray) – Labels

  • feature_names (List[str], optional) – Feature names

  • class_names (List[str], optional) – Class names

  • calculate_curves (bool) – Whether to calculate and plot ROC/PR curves

Returns:

Experiment results and paths

Return type:

Dict[str, Any]

get_summary()[source]

Get summary statistics across folds.

Return type:

DataFrame

get_validation_report()[source]

Get validation report.

Return type:

Dict[str, Any]

scitex_ml.classification.CVExperiment

alias of CrossValidationExperiment

scitex_ml.classification.quick_experiment(X, y, model, name='quick_experiment', n_folds=5, **kwargs)[source]

Run a quick cross-validation experiment.

This is a convenience function for rapid experimentation.

Parameters:
  • X (np.ndarray) – Features

  • y (np.ndarray) – Labels

  • model (sklearn estimator or callable) – Model instance or function that returns model

  • name (str) – Experiment name

  • n_folds (int) – Number of CV folds

  • **kwargs – Additional arguments for CrossValidationExperiment

Returns:

Experiment results

Return type:

Dict[str, Any]

Examples

>>> from sklearn.svm import SVC
>>> results = quick_experiment(X, y, SVC(), name="svm_test")
>>> print(f"Report saved to: {results['paths']['final_report']}")
class scitex_ml.classification.TimeSeriesStratifiedSplit(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)[source]

Time series cross-validation with stratification support.

This splitter ensures:

  1. Test data is always chronologically after training data

  2. Optional validation set between train and test

  3. Class balance preservation in splits

  4. Gap period between train and test to avoid leakage

Parameters:
  • n_splits (int) – Number of splits (folds)

  • test_ratio (float) – Proportion of data for test set (default: 0.2)

  • val_ratio (float) – Proportion of data for validation set (default: 0.1)

  • gap (int) – Number of samples to exclude between train and test (default: 0)

  • stratify (bool) – Whether to maintain class proportions (default: True)

  • random_state (int, optional) – Random seed for reproducibility (default: None)

Examples

>>> from scitex_ml.classification import TimeSeriesStratifiedSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> tscv = TimeSeriesStratifiedSplit(n_splits=3)
>>> for train_idx, test_idx in tscv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
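
The chronological layout behind these guarantees can be sketched for a single split. The code below is an assumption-laden illustration, not the splitter's algorithm: it ignores stratification and n_splits, and it places the gap immediately before the test block (whether the library puts it there or before the validation block is not stated above). chronological_split is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def chronological_split(
    timestamps: Sequence[float],
    test_ratio: float = 0.2,
    val_ratio: float = 0.1,
    gap: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, val, test) indices ordered train < val < gap < test."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(order)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    test = order[n - n_test:]
    val_end = n - n_test - gap        # samples in the gap are dropped
    val = order[val_end - n_val:val_end]
    train = order[:val_end - n_val]
    return train, val, test


timestamps = list(range(100))
train, val, test = chronological_split(timestamps, gap=5)
print(len(train), len(val), len(test))  # 65 10 20
```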
__init__(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)[source]
split(X, y=None, timestamps=None, groups=None)[source]

Generate indices to split data into training and test sets.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,)) – Target variable

  • timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)

  • groups (array-like, shape (n_samples,), optional) – Group labels for grouped CV

Yields:
  • train (ndarray) – Training set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, timestamps=None, groups=None)[source]

Generate indices with separate validation set.

Yields:
  • train (ndarray) – Training set indices

  • val (ndarray) – Validation set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray, ndarray]]

get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the CV.

plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)[source]

Visualize the stratified time series splits.

Shows train (blue), validation (green), and test (red) sets. When val_ratio=0, only shows train and test.

Parameters:
  • X (array-like) – Training data

  • y (array-like, optional) – Target variable

  • timestamps (array-like, optional) – Timestamps (if None, uses sample indices)

  • figsize (tuple, default (12, 6)) – Figure size

  • save_path (str, optional) – Path to save the plot

Returns:

fig – The created figure

Return type:

matplotlib.figure.Figure

set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') TimeSeriesStratifiedSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.TimeSeriesBlockingSplit(n_splits=5, test_ratio=0.2, val_ratio=0.0, random_state=None)[source]

Time series split with blocking to handle multiple subjects/groups.

This splitter ensures temporal integrity within each subject while allowing cross-subject generalization. Each subject’s data is kept temporally coherent, but subjects can appear in both training and test sets at different time periods.

Key Features:

  • Temporal order preserved within each subject

  • No data leakage within individual subject timelines

  • Expanding window approach: more training data in later folds

  • Cross-subject generalization: subjects can be in both train and test

Use Cases:

  • Multiple patients with longitudinal medical data

  • Multiple stocks with time series financial data

  • Multiple sensors with temporal measurements

  • Any scenario with grouped time series data

Parameters:
  • n_splits (int, default=5) – Number of splits (folds)

  • test_ratio (float, default=0.2) – Proportion of data for test set per subject

Examples

>>> from scitex_ml.classification import TimeSeriesBlockingSplit
>>> import numpy as np
>>>
>>> # Create data: 100 samples, 4 subjects (25 samples each)
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>> groups = np.repeat([0, 1, 2, 3], 25)  # Subject IDs
>>>
>>> # Each subject gets temporal split: early samples → train, later → test
>>> splitter = TimeSeriesBlockingSplit(n_splits=3, test_ratio=0.3)
>>> for train_idx, test_idx in splitter.split(X, y, timestamps, groups):
...     train_subjects = set(groups[train_idx])
...     test_subjects = set(groups[test_idx])
...     print(f"Train subjects: {train_subjects}, Test subjects: {test_subjects}")
...     # Output shows same subjects in both sets but different time periods
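
The core idea (each subject contributes its early samples to training and its late samples to testing) can be sketched for one fold as follows. This is a simplified illustration: the real splitter yields n_splits expanding folds, and blocking_split is a hypothetical helper whose rounding of the per-subject test size is an assumption.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def blocking_split(
    timestamps: Sequence[float],
    groups: Sequence[Hashable],
    test_ratio: float = 0.2,
) -> Tuple[List[int], List[int]]:
    """Split every group chronologically: early samples train, late samples test."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    train, test = [], []
    for indices in by_group.values():
        indices.sort(key=lambda i: timestamps[i])  # temporal order within subject
        n_test = max(1, int(len(indices) * test_ratio))
        train.extend(indices[:-n_test])
        test.extend(indices[-n_test:])
    return sorted(train), sorted(test)


timestamps = list(range(100))
groups = [g for g in (0, 1, 2, 3) for _ in range(25)]  # 4 subjects, 25 samples each
train_idx, test_idx = blocking_split(timestamps, groups, test_ratio=0.3)
print(len(train_idx), len(test_idx))  # 72 28
```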
__init__(n_splits=5, test_ratio=0.2, val_ratio=0.0, random_state=None)[source]
split(X, y=None, timestamps=None, groups=None)[source]

Generate indices respecting group boundaries.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,)) – Target variable

  • timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)

  • groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required

Yields:
  • train (ndarray) – Training set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, timestamps=None, groups=None)[source]

Generate indices with separate validation set respecting group boundaries.

Each subject gets its own train/val/test split maintaining temporal order.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,)) – Target variable

  • timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)

  • groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required

Yields:
  • train (ndarray) – Training set indices

  • val (ndarray) – Validation set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray, ndarray]]

get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations.

plot_splits(X, y=None, timestamps=None, groups=None, figsize=(12, 6), save_path=None)[source]

Visualize the blocking splits showing subject separation.

This visualization shows how data from different subjects/groups is allocated to training and test sets while maintaining temporal order within each subject.

Color Scheme:

  • Rectangle border: Blue = Training set, Red = Test set

  • Rectangle fill: Different colors represent different subjects/groups

  • Each subject gets a unique color (cycling through colormap)

Key Features:

  • No mixing: Each subject’s data stays within temporal boundaries

  • Subject separation: Same subject can appear in both train/test but at different times

  • Temporal integrity: Time flows left to right for each subject

Parameters:
  • X (array-like) – Training data

  • y (array-like, optional) – Target variable (not used)

  • timestamps (array-like, optional) – Timestamps (if None, uses sample indices)

  • groups (array-like) – Group labels (required for blocking split) - each unique value represents a subject

  • figsize (tuple, default (12, 6)) – Figure size

  • save_path (str, optional) – Path to save the plot

Returns:

fig – The created figure with proper legend showing subject colors

Return type:

matplotlib.figure.Figure

Examples

>>> splitter = TimeSeriesBlockingSplit(n_splits=3)
>>> fig = splitter.plot_splits(X, timestamps=timestamps, groups=subject_ids)
>>> fig.show()  # Will show train (blue border) vs test (red border) by subject
set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') TimeSeriesBlockingSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.TimeSeriesSlidingWindowSplit(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)[source]

Sliding window cross-validation for time series.

Creates train/test windows that slide through time with configurable behavior.

Parameters:
  • window_size (int, optional) – Size of training window (ignored if expanding_window=True or n_splits is set). Required if n_splits is None.

  • step_size (int, optional) – Step between windows (overridden if overlapping_tests=False)

  • test_size (int, optional) – Size of test window. Required if n_splits is None.

  • gap (int, default=0) – Number of samples to skip between train and test windows

  • val_ratio (float, default=0.0) – Ratio of validation set from training window

  • random_state (int, optional) – Random seed for reproducibility

  • overlapping_tests (bool, default=False) – If False, automatically sets step_size=test_size to ensure each sample is tested exactly once (like K-fold for time series)

  • expanding_window (bool, default=False) – If True, training window grows to include all past data (like sklearn’s TimeSeriesSplit). If False, uses fixed sliding window of size window_size.

  • undersample (bool, default=False) – If True, balance classes in training sets by randomly undersampling the majority class to match the minority class count. Temporal order is maintained. Requires y labels in split().

  • n_splits (int, optional) – Number of splits to generate. If specified, window_size and test_size are automatically calculated to create exactly n_splits folds. Cannot be used together with manual window_size/test_size specification.
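
What undersample=True describes (randomly dropping majority-class samples until classes are equal, then keeping temporal order) might look like the sketch below. undersample_indices is a hypothetical helper; it assumes that ascending index order equals temporal order, which the library may handle differently.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence


def undersample_indices(
    train_idx: Sequence[int],
    y: Sequence[Hashable],
    random_state: Optional[int] = None,
) -> List[int]:
    """Keep an equal number of samples per class, in ascending index order."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[y[i]].append(i)
    n_min = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))  # random subset of each class
    return sorted(kept)  # sorting restores temporal (index) order


y = [0] * 8 + [1] * 2
balanced = undersample_indices(range(10), y, random_state=0)
print(len(balanced), [y[i] for i in balanced].count(1))  # 4 2
```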

Examples

>>> from scitex_ml.classification import TimeSeriesSlidingWindowSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> # Fixed window, non-overlapping tests (default)
>>> swcv = TimeSeriesSlidingWindowSplit(window_size=50, test_size=10, gap=5)
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Expanding window (use all past data)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     window_size=50, test_size=10, gap=5, expanding_window=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")  # Train grows!
>>>
>>> # Using n_splits (automatically calculates window and test sizes)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     n_splits=5, gap=0, expanding_window=True, undersample=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Visualize splits
>>> fig = swcv.plot_splits(X, y, timestamps)
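
The difference between a fixed sliding window and an expanding window is easiest to see on bare index ranges. The generator below is a sketch under stated assumptions: sliding_window_splits is a hypothetical function, the step defaults to test_size (the documented non-overlapping behaviour), and the exact fold boundaries produced by the library may differ.

```python
from typing import Iterator, List, Optional, Tuple


def sliding_window_splits(
    n_samples: int,
    window_size: int,
    test_size: int,
    gap: int = 0,
    step_size: Optional[int] = None,
    expanding_window: bool = False,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists that move forward through time."""
    # Non-overlapping tests: advance by exactly one test window per fold.
    step = test_size if step_size is None else step_size
    train_end = window_size
    while train_end + gap + test_size <= n_samples:
        # Expanding: always start at 0. Fixed: keep the last window_size samples.
        train_start = 0 if expanding_window else train_end - window_size
        test_start = train_end + gap
        yield (
            list(range(train_start, train_end)),
            list(range(test_start, test_start + test_size)),
        )
        train_end += step


fixed = list(sliding_window_splits(100, window_size=50, test_size=10, gap=5))
print([len(tr) for tr, _ in fixed])      # [50, 50, 50, 50]

growing = list(sliding_window_splits(100, 50, 10, gap=5, expanding_window=True))
print([len(tr) for tr, _ in growing])    # [50, 60, 70, 80]
```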
__init__(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)[source]
set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') TimeSeriesSlidingWindowSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object

class scitex_ml.classification.TimeSeriesCalendarSplit(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)[source]

Calendar-based time series cross-validation splitter.

Splits data based on calendar intervals (e.g., months, weeks, days). Ensures temporal order is preserved and no data leakage occurs.

Parameters:
  • interval (str) – Time interval for splitting: 'D' (daily), 'W' (weekly), 'M' (monthly), 'Q' (quarterly), 'Y' (yearly), or any pandas frequency string

  • n_train_intervals (int) – Number of intervals to use for training

  • n_test_intervals (int) – Number of intervals to use for testing (default: 1)

  • gap_intervals (int) – Number of intervals to skip between train and test (default: 0)

  • step_intervals (int) – Number of intervals to step forward for next fold (default: 1)

Examples

>>> from scitex_ml.classification import TimeSeriesCalendarSplit
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with daily timestamps
>>> dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
>>> X = np.random.randn(len(dates), 10)
>>> y = np.random.randint(0, 2, len(dates))
>>>
>>> # Monthly splits: 6 months train, 1 month test
>>> tscal = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> for train_idx, test_idx in tscal.split(X, y, timestamps=dates):
...     print(f"Train: {dates[train_idx[0]]:%Y-%m} to {dates[train_idx[-1]]:%Y-%m}")
...     print(f"Test:  {dates[test_idx[0]]:%Y-%m} to {dates[test_idx[-1]]:%Y-%m}")
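
Conceptually, a calendar split assigns each sample to an interval and then slides a block of train/gap/test intervals forward. The sketch below does this for monthly intervals with the standard library only. It is not the splitter's implementation: monthly_splits is a hypothetical function, and it counts only months that contain samples.

```python
from datetime import date, timedelta
from typing import Iterator, List, Sequence, Tuple


def monthly_splits(
    dates: Sequence[date],
    n_train_intervals: int = 12,
    n_test_intervals: int = 1,
    gap_intervals: int = 0,
    step_intervals: int = 1,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) indices using whole calendar months as intervals."""
    keys = [(d.year, d.month) for d in dates]      # interval label per sample
    months = sorted(set(keys))                     # observed months, in order
    span = n_train_intervals + gap_intervals + n_test_intervals
    start = 0
    while start + span <= len(months):
        train_months = set(months[start:start + n_train_intervals])
        test_first = start + n_train_intervals + gap_intervals
        test_months = set(months[test_first:test_first + n_test_intervals])
        train = [i for i, k in enumerate(keys) if k in train_months]
        test = [i for i, k in enumerate(keys) if k in test_months]
        yield train, test
        start += step_intervals


days = [date(2023, 1, 1) + timedelta(days=i) for i in range(365)]
folds = list(monthly_splits(days, n_train_intervals=6))
print(len(folds))                              # 6
print(len(folds[0][0]), len(folds[0][1]))      # 181 31
```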
__init__(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)[source]
split(X, y=None, timestamps=None, groups=None)[source]

Generate calendar-based train/test splits.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,), optional) – Target variable

  • timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)

  • groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)

Yields:
  • train (ndarray) – Training set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray]]

split_with_val(X, y=None, timestamps=None, groups=None)[source]

Generate calendar-based train/validation/test splits.

The validation set comes after training but before test, maintaining temporal order: train < val < test.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data

  • y (array-like, shape (n_samples,), optional) – Target variable

  • timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)

  • groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)

Yields:
  • train (ndarray) – Training set indices

  • val (ndarray) – Validation set indices

  • test (ndarray) – Test set indices

Return type:

Iterator[Tuple[ndarray, ndarray, ndarray]]

get_n_splits(X=None, y=None, timestamps=None)[source]

Calculate number of splits.

Parameters:
  • X (array-like, optional) – Not used directly

  • y (array-like, optional) – Not used

  • timestamps (array-like or pd.DatetimeIndex, optional) – Timestamps to determine number of possible splits

Returns:

n_splits – Number of splits. Returns -1 if timestamps is None.

Return type:

int

plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)[source]

Visualize the train/test splits as timeline rectangles with scatter plots.

Parameters:
  • X (array-like) – Training data (used to determine data size)

  • y (array-like, optional) – Target variable (used for color-coding scatter points)

  • timestamps (array-like or pd.DatetimeIndex) – Timestamps for each sample

  • figsize (tuple, default (12, 6)) – Figure size (width, height)

  • save_path (str, optional) – Path to save the plot

Returns:

fig – The created figure

Return type:

matplotlib.figure.Figure

Examples

>>> splitter = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> fig = splitter.plot_splits(X, timestamps=dates)
>>> fig.savefig('calendar_splits.png')
set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') TimeSeriesCalendarSplit

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

timestamps (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for timestamps parameter in split.

Returns:

self – The updated object.

Return type:

object
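
The four request options listed above can be summarised with a small toy model. This is a plain-Python sketch of the semantics only, not scikit-learn's routing code; the helper name `route_metadata` and the use of `ValueError` as the error type are placeholders chosen for this example.

```python
def route_metadata(request, provided, name="timestamps"):
    """Return the keyword arguments a meta-estimator would forward to split.

    Toy model of the four request options; not scikit-learn's implementation.
    """
    if request is True:
        # Requested: forward it if the user supplied it, otherwise ignore.
        return {name: provided[name]} if name in provided else {}
    if request is False:
        # Not requested: never forwarded.
        return {}
    if request is None:
        # Not requested, and supplying it is treated as an error.
        if name in provided:
            raise ValueError(f"{name!r} was provided but not requested")
        return {}
    # String alias: read the value under the alias, forward it under `name`.
    return {name: provided[request]} if request in provided else {}


# The user passes metadata under the alias "dates";
# split receives it under its original name, "timestamps".
forwarded = route_metadata("dates", {"dates": [1, 2, 3]})
```

In real use the request is set on the splitter itself, and metadata routing must first be enabled through `sklearn.set_config(enable_metadata_routing=True)`, as noted above.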

class scitex_ml.classification.TimeSeriesStrategy(value)[source]

Available time series CV strategies.

STRATIFIED

Single time series with class balance preservation

Type:

str

BLOCKING

Multiple independent time series (e.g., different patients)

Type:

str

SLIDING

Sliding window approach with fixed-size windows

Type:

str

EXPANDING

Expanding window where training set grows over time

Type:

str

FIXED

Fixed train/test split at specific time point

Type:

str

STRATIFIED = 'stratified'
BLOCKING = 'blocking'
SLIDING = 'sliding'
EXPANDING = 'expanding'
FIXED = 'fixed'

classmethod from_string(value)[source]

Create strategy from string value.

Parameters:

value (str) – String representation of strategy

Returns:

Corresponding enum value

Return type:

TimeSeriesStrategy

Raises:

ValueError – If value doesn’t match any strategy

get_description()[source]

Get human-readable description of the strategy.

Returns:

Description of the strategy

Return type:

str
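
The string-to-member lookup can be illustrated with a stand-in enum built on the standard library. `StrategySketch` is a hypothetical class that mirrors the documented member names and values; its `from_string` relies on ordinary `Enum` lookup by value, which may differ from how the library normalises its input.

```python
from enum import Enum


class StrategySketch(Enum):
    """Stand-in enum mirroring the documented member names and values."""

    STRATIFIED = "stratified"
    BLOCKING = "blocking"
    SLIDING = "sliding"
    EXPANDING = "expanding"
    FIXED = "fixed"

    @classmethod
    def from_string(cls, value):
        # Enum lookup by value raises ValueError for unknown strings.
        return cls(value)


strategy = StrategySketch.from_string("expanding")

try:
    StrategySketch.from_string("unknown")
except ValueError as error:
    message = str(error)  # describes the invalid value
```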

class scitex_ml.classification.TimeSeriesMetadata(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)[source]

Metadata about the time series data.

This dataclass captures essential characteristics of time series data that inform the selection of appropriate cross-validation strategies.

n_samples

Total number of samples in the dataset

Type:

int

n_features

Number of features per sample

Type:

int

n_classes

Number of unique classes (None for regression)

Type:

Optional[int]

has_groups

Whether data contains group/subject identifiers

Type:

bool

group_sizes

Mapping of group IDs to their sample counts

Type:

Optional[Dict[Any, int]]

time_range

Minimum and maximum timestamp values

Type:

Optional[Tuple[float, float]]

sampling_rate

Samples per time unit (e.g., Hz for sensor data)

Type:

Optional[float]

has_gaps

Whether the time series has temporal gaps

Type:

bool

max_gap_size

Maximum gap between consecutive timestamps

Type:

Optional[float]

is_balanced

Whether classes are balanced (for classification)

Type:

bool

class_distribution

Mapping of class labels to their proportions

Type:

Optional[Dict[Any, float]]
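
Several of the fields above can be derived directly from raw labels and timestamps. The helper below is a minimal sketch using only the standard library; the function name `describe_series` and the `balance_tolerance` threshold used to decide `is_balanced` are assumptions for illustration, not part of the library.

```python
from collections import Counter


def describe_series(labels, timestamps, balance_tolerance=0.1):
    """Derive a few metadata fields from labels and timestamps (illustrative)."""
    counts = Counter(labels)
    n_samples = len(labels)
    # Proportion of samples per class label.
    class_distribution = {
        label: count / n_samples for label, count in counts.items()
    }
    # Assumption: "balanced" means each class is within a tolerance of uniform.
    uniform = 1.0 / len(counts)
    is_balanced = all(
        abs(share - uniform) <= balance_tolerance
        for share in class_distribution.values()
    )
    ordered = sorted(timestamps)
    time_range = (float(ordered[0]), float(ordered[-1]))
    # Largest step between consecutive timestamps.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    max_gap_size = float(max(gaps)) if gaps else None
    return {
        "n_classes": len(counts),
        "class_distribution": class_distribution,
        "is_balanced": is_balanced,
        "time_range": time_range,
        "max_gap_size": max_gap_size,
    }


fields = describe_series(labels=[0, 0, 1, 1], timestamps=[0.0, 1.0, 2.0, 5.0])
```

The returned dictionary uses the documented field names, so it could be combined with `n_samples` and `n_features` when constructing a `TimeSeriesMetadata` instance.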

Examples

>>> import numpy as np
>>> from scitex_ml.classification import TimeSeriesMetadata
>>>
>>> # Create metadata for a dataset
>>> metadata = TimeSeriesMetadata(
...     n_samples=1000,
...     n_features=10,
...     n_classes=2,
...     has_groups=True,
...     group_sizes={0: 250, 1: 250, 2: 250, 3: 250},
...     time_range=(0.0, 999.0),
...     sampling_rate=1.0,
...     has_gaps=False,
...     max_gap_size=None,
...     is_balanced=True,
...     class_distribution={0: 0.5, 1: 0.5}
... )
>>>
>>> print(f"Dataset has {metadata.n_samples} samples")
>>> print(f"Number of groups: {len(metadata.group_sizes) if metadata.group_sizes else 0}")

n_samples: int
n_features: int
n_classes: int | None = None
has_groups: bool = False
group_sizes: Dict[Any, int] | None = None
time_range: Tuple[float, float] | None = None
sampling_rate: float | None = None
has_gaps: bool = False
max_gap_size: float | None = None
is_balanced: bool = True
class_distribution: Dict[Any, float] | None = None

get_summary()[source]

Generate human-readable summary of the metadata.

Returns:

Formatted summary string

Return type:

str

suggest_strategy()[source]

Suggest appropriate CV strategy based on metadata.

Returns:

Suggested strategy name

Return type:

str
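
The selection rule itself is not documented here. As one plausible reading of the strategy descriptions given for `TimeSeriesStrategy`, the sketch below maps metadata flags to strategy names. The function `suggest_strategy_sketch`, its decision order, and the `"sliding"` fallback are all hypothetical and may not match what `suggest_strategy()` actually returns.

```python
def suggest_strategy_sketch(has_groups, is_balanced, n_classes=None):
    """Map metadata flags to a strategy name (hypothetical rule set)."""
    if has_groups:
        # Multiple independent series, e.g. one per patient.
        return "blocking"
    if n_classes is not None and not is_balanced:
        # Single series where class balance should be preserved across folds.
        return "stratified"
    # Assumed fallback for a single, balanced series.
    return "sliding"


grouped = suggest_strategy_sketch(has_groups=True, is_balanced=True, n_classes=2)
imbalanced = suggest_strategy_sketch(has_groups=False, is_balanced=False, n_classes=2)
```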

__init__(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)

Modules

Classifier([class_weight, random_state])

Server for initializing various scikit-learn classifiers with consistent interface.

CrossValidationExperiment(name, model_fn[, ...])

Streamlined cross-validation experiment runner.

reporters

Reporter implementations for classification.

timeseries

Time series cross-validation utilities for classification.