scitex_ml.classification
Classification utilities with unified API.
- class scitex_ml.classification.ClassificationReporter(output_dir, tasks=None, precision=3, required_metrics=['balanced_accuracy', 'mcc', 'confusion_matrix', 'classification_report', 'roc_auc', 'roc_curve', 'pre_rec_auc', 'pre_rec_curve'], verbose=True, **kwargs)[source]
Unified classification reporter for single and multi-task scenarios.
This reporter automatically adapts to your use case:
- Single task: Just use it without specifying tasks
- Multiple tasks: Specify tasks upfront or create them dynamically
- Seamless switching between single and multi-task workflows
Features:
- Comprehensive metrics calculation (balanced accuracy, MCC, ROC-AUC, PR-AUC, etc.)
- Automated visualization generation:
  - Confusion matrices
  - ROC and Precision-Recall curves
  - Feature importance plots (via plotter)
  - CV aggregation plots with faded fold lines
  - Comprehensive metrics dashboard
- Multi-format report generation (Org, Markdown, LaTeX, HTML, DOCX, PDF)
- Cross-validation support with automatic fold aggregation
- Multi-task classification tracking
- Parameters:
output_dir (Union[str, Path]) – Base directory for outputs. If None, creates timestamped directory.
tasks (List[str], optional) – List of task names. If None, tasks are created dynamically as needed.
precision (int, default 3) – Number of decimal places for numerical outputs
required_metrics (List[str], optional) – List of metrics to calculate. Defaults to comprehensive set.
verbose (bool, default True) – Whether to print initialization messages
**kwargs – Additional arguments passed to base class
Examples
>>> # Single task usage (no tasks specified)
>>> reporter = ClassificationReporter("./results")
>>> reporter.calculate_metrics(y_true, y_pred, y_proba)
>>>
>>> # Multi-task with predefined tasks
>>> reporter = ClassificationReporter("./results", tasks=["binary", "multiclass"])
>>> reporter.calculate_metrics(y_true, y_pred, task="binary")
>>>
>>> # Dynamic task creation
>>> reporter = ClassificationReporter("./results")
>>> reporter.calculate_metrics(y_true1, y_pred1, task="task1")
>>> reporter.calculate_metrics(y_true2, y_pred2, task="task2")
>>>
>>> # Feature importance visualization (via plotter)
>>> reporter._single_reporter.plotter.create_feature_importance_plot(
...     feature_importance=importances,
...     feature_names=feature_names,
...     save_path="./results/feature_importance.png"
... )
>>>
>>> # CV aggregation plots (automatically created on save_summary)
>>> for fold in range(5):
...     metrics = reporter.calculate_metrics(y_true, y_pred, y_proba, fold=fold)
>>> reporter.save_summary()  # Creates CV aggregation plots with faded fold lines
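The fold aggregation mentioned in the examples above can be pictured as collapsing per-fold scalar metrics into a mean and a spread. The following is a minimal sketch of that idea, not the reporter's actual implementation: the function name `aggregate_folds`, the output keys, and the use of the sample standard deviation are all assumptions; only the default `precision=3` comes from the signature above.

```python
from statistics import mean, stdev


def aggregate_folds(fold_metrics, precision=3):
    """Collapse a list of per-fold metric dicts into mean/std per metric.

    Hypothetical helper: the real reporter may aggregate differently.
    """
    summary = {}
    for name in fold_metrics[0]:
        values = [m[name] for m in fold_metrics]
        summary[name] = {
            "mean": round(mean(values), precision),
            # Sample standard deviation needs at least two folds.
            "std": round(stdev(values), precision) if len(values) > 1 else 0.0,
            "n_folds": len(values),
        }
    return summary


# Illustrative numbers only, not real results.
folds = [
    {"balanced_accuracy": 0.80, "mcc": 0.60},
    {"balanced_accuracy": 0.85, "mcc": 0.70},
    {"balanced_accuracy": 0.90, "mcc": 0.80},
]
summary = aggregate_folds(folds)
print(summary["balanced_accuracy"])
```

Rounding at the end, rather than per fold, avoids accumulating rounding error across folds.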
- __init__(output_dir, tasks=None, precision=3, required_metrics=['balanced_accuracy', 'mcc', 'confusion_matrix', 'classification_report', 'roc_auc', 'roc_curve', 'pre_rec_auc', 'pre_rec_curve'], verbose=True, **kwargs)[source]
- calculate_metrics(y_true, y_pred, y_proba=None, labels=None, fold=None, task=None, verbose=True, model=None, feature_names=None)[source]
Calculate metrics for classification.
Automatically handles single vs multi-task scenarios:
- If no task specified and no tasks defined: creates “default” task
- If no task specified but tasks exist: uses first task
- If task specified: uses/creates that specific task
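The three task-resolution rules above can be sketched as a small pure function. This is an illustration of the documented behaviour only; `resolve_task` is a hypothetical name and the reporter's internals may be organised differently.

```python
def resolve_task(task, tasks):
    """Return the task name to use, registering it in `tasks` if it is new.

    Hypothetical helper mirroring the three documented rules.
    """
    if task is None:
        if tasks:
            return tasks[0]   # no task given, tasks exist: use the first one
        task = "default"      # no task given, none defined: create "default"
    if task not in tasks:
        tasks.append(task)    # task given but unknown: create it on first use
    return task


tasks = []
print(resolve_task(None, tasks))      # first call creates the default task
print(resolve_task("binary", tasks))  # explicit task is created dynamically
print(resolve_task(None, tasks))      # falls back to the first registered task
```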
- Parameters:
y_true (np.ndarray) – True class labels
y_pred (np.ndarray) – Predicted class labels
y_proba (np.ndarray, optional) – Prediction probabilities (required for AUC metrics)
labels (List[str], optional) – Class labels for display
fold (int, optional) – Fold index for cross-validation
task (str, optional) – Task identifier. If None and no tasks exist, creates “default” task.
verbose (bool, default True) – Whether to print progress
model (object, optional) – Trained model for automatic feature importance extraction
feature_names (List[str], optional) – Feature names for feature importance (required if model is provided)
- Returns:
Dictionary of calculated metrics
- Return type:
Dict[str, Any]
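For readers unfamiliar with two of the default metrics, the sketch below computes balanced accuracy and the Matthews correlation coefficient (MCC) by hand for a binary problem. It uses the standard textbook definitions and plain Python lists; it is not the code path `calculate_metrics` uses, and the helper names are made up for this example. It assumes both classes occur in `y_true`.

```python
from math import sqrt


def binary_confusion(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
counts = binary_confusion(y_true, y_pred)
print(counts)                              # (3, 2, 2, 1)
print(balanced_accuracy(*counts))          # 0.625
print(round(mcc(*counts), 3))              # 0.258
```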
- save(data, relative_path, task=None, fold=None)[source]
Save custom data with automatic task/fold organization.
- Parameters:
data (Any) – Data to save (any format supported by scitex_io.save)
relative_path (Union[str, Path]) – Relative path from output directory
task (Optional[str], default None) – Task name. If provided, saves to task-specific directory
fold (Optional[int], default None) – If provided, automatically prepends “fold_{fold:02d}/” to path
- Returns:
Absolute path to the saved file
- Return type:
Path
Examples
>>> # Single task mode (no task specified)
>>> reporter.save({"accuracy": 0.95}, "metrics.json")
>>>
>>> # Multi-task mode
>>> reporter.save(results, "results.csv", task="binary", fold=0)
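The task/fold organization described for `save` can be sketched with `pathlib`. The `fold_{fold:02d}` prefix is taken from the parameter description above; placing the task directory directly under the output directory, and the helper name `resolve_save_path`, are assumptions made for illustration.

```python
from pathlib import Path


def resolve_save_path(output_dir, relative_path, task=None, fold=None):
    """Build the destination path for a saved artifact (hypothetical helper)."""
    path = Path(output_dir)
    if task is not None:
        path = path / task                # assumed: one directory per task
    if fold is not None:
        path = path / f"fold_{fold:02d}"  # zero-padded fold prefix from the docs
    return path / relative_path


print(resolve_save_path("results", "metrics.json").as_posix())
print(resolve_save_path("results", "results.csv", task="binary", fold=0).as_posix())
```

Zero-padding the fold number keeps directory listings sorted in fold order up to 100 folds.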
- get_summary()[source]
Get summary of all calculated metrics.
- Returns:
Summary of metrics across all tasks and folds
- Return type:
Dict[str, Any]
- save_feature_importance(model, feature_names, fold=None, task=None)[source]
Calculate and save feature importance for tree-based models.
- Parameters:
model (object) – Fitted classifier (must have feature_importances_)
feature_names (List[str]) – Names of features
fold (int, optional) – Fold number for tracking
task (str, optional) – Task name for multi-task mode
- Returns:
Dictionary of feature importances {feature_name: importance}
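A minimal sketch of turning a fitted model's `feature_importances_` into the `{feature_name: importance}` mapping described above. The stub model, the helper name, and the descending sort order are assumptions; the real method also saves the result, which is omitted here.

```python
from types import SimpleNamespace


def feature_importance_dict(model, feature_names):
    """Map feature names to importances, most important first (hypothetical)."""
    importances = getattr(model, "feature_importances_", None)
    if importances is None:
        raise AttributeError("model has no feature_importances_ attribute")
    if len(importances) != len(feature_names):
        raise ValueError("feature_names length does not match importances")
    pairs = zip(feature_names, (float(v) for v in importances))
    # dict preserves insertion order, so the sorted order is kept.
    return dict(sorted(pairs, key=lambda kv: kv[1], reverse=True))


# Stand-in for a fitted tree-based classifier.
model = SimpleNamespace(feature_importances_=[0.2, 0.5, 0.3])
ranking = feature_importance_dict(model, ["a", "b", "c"])
print(ranking)  # {'b': 0.5, 'c': 0.3, 'a': 0.2}
```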
- class scitex_ml.classification.SingleTaskClassificationReporter(output_dir, config=None, verbose=True, **kwargs)[source]
Improved single-task classification reporter with unified API.
Key improvements:
- Inherits from BaseClassificationReporter for consistent API
- Lazy directory creation (no empty folders)
- Numerical precision control
- Graceful plotting with proper error handling
- Consistent parameter names across all methods
Features:
- Comprehensive metrics calculation (balanced accuracy, MCC, ROC-AUC, PR-AUC, etc.)
- Automated visualization generation:
  - Confusion matrices
  - ROC and Precision-Recall curves
  - Feature importance plots
  - CV aggregation plots with faded fold lines
  - Comprehensive metrics dashboard
- Multi-format report generation (Org, Markdown, LaTeX, HTML, DOCX, PDF)
- Cross-validation support with automatic fold aggregation
Examples
>>> # Basic usage
>>> reporter = SingleTaskClassificationReporter("./results")
>>> metrics = reporter.calculate_metrics(y_true, y_pred, y_proba, labels=['A', 'B'])
>>> reporter.save_summary()
>>>
>>> # Cross-validation with automatic CV aggregation plots
>>> for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
...     metrics = reporter.calculate_metrics(
...         y_test, y_pred, y_proba, fold=fold
...     )
>>> reporter.save_summary()  # Automatically creates CV aggregation visualizations
>>>
>>> # Feature importance visualization
>>> reporter.plotter.create_feature_importance_plot(
...     feature_importance=importances,
...     feature_names=feature_names,
...     save_path=output_dir / "feature_importance.png"
... )
- set_session_config(config)[source]
Set the SciTeX session CONFIG object for inclusion in reports.
- Parameters:
config (Any) – The SciTeX session CONFIG object
- class scitex_ml.classification.Classifier(class_weight=None, random_state=42)[source]
Server for initializing various scikit-learn classifiers with a consistent interface.
Example
>>> clf_server = Classifier(class_weight={0: 1.0, 1: 2.0}, random_state=42)
>>> clf = clf_server("SVC", scaler=_StandardScaler())
>>> print(clf_server.list)
['CatBoostClassifier', 'Perceptron', ...]
- class scitex_ml.classification.CrossValidationExperiment(name, model_fn, cv=None, output_dir=None, metrics=None, save_models=True, verbose=True)[source]
Streamlined cross-validation experiment runner.
This class handles:
- Cross-validation splitting
- Model training and evaluation
- Automatic metric calculation
- Hyperparameter tracking
- Progress monitoring
- Report generation
- Parameters:
name (str) – Experiment name
model_fn (Callable) – Function that returns a model instance
cv (BaseCrossValidator, optional) – Cross-validation splitter (default: 5-fold stratified)
output_dir (Union[str, Path], optional) – Output directory for results
metrics (List[str], optional) – List of metrics to calculate
save_models (bool) – Whether to save trained models
verbose (bool) – Whether to print progress
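The reason `model_fn` is a factory rather than a model instance is that each fold should train a fresh, unfitted model. The sketch below shows that pattern with a toy classifier and plain, unshuffled k-fold indices. It is only an illustration: `MajorityClassifier`, `kfold_indices`, and `run_cv` are hypothetical names, and the documented default splitter is 5-fold stratified, which this sketch does not reproduce.

```python
from collections import Counter


class MajorityClassifier:
    """Toy model that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        stop = n_samples if k == n_folds - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx


def run_cv(model_fn, X, y, n_folds=5):
    """Train a fresh model per fold and return per-fold accuracy."""
    accuracies = []
    for train_idx, test_idx in kfold_indices(len(X), n_folds):
        model = model_fn()  # new, unfitted model: no state leaks across folds
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(preds, test_idx))
        accuracies.append(hits / len(test_idx))
    return accuracies


X = [[i] for i in range(10)]
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
scores = run_cv(MajorityClassifier, X, y, n_folds=5)
print(scores)  # [1.0, 0.5, 1.0, 0.5, 1.0]
```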
- __init__(name, model_fn, cv=None, output_dir=None, metrics=None, save_models=True, verbose=True)[source]
- set_hyperparameters(**kwargs)[source]
Set hyperparameters for tracking.
- Parameters:
**kwargs – Hyperparameter key-value pairs
- scitex_ml.classification.CVExperiment
alias of CrossValidationExperiment
- scitex_ml.classification.quick_experiment(X, y, model, name='quick_experiment', n_folds=5, **kwargs)[source]
Run a quick cross-validation experiment.
This is a convenience function for rapid experimentation.
- Returns:
Experiment results
- Return type:
Dict[str, Any]
Examples
>>> from sklearn.svm import SVC
>>> results = quick_experiment(X, y, SVC(), name="svm_test")
>>> print(f"Report saved to: {results['paths']['final_report']}")
- class scitex_ml.classification.TimeSeriesStratifiedSplit(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)[source]
Time series cross-validation with stratification support.
This splitter ensures:
1. Test data is always chronologically after training data
2. Optional validation set between train and test
3. Class balance preservation in splits
4. Gap period between train and test to avoid leakage
- Parameters:
n_splits (int) – Number of splits (folds)
test_ratio (float) – Proportion of data for test set (default: 0.2)
val_ratio (float) – Proportion of data for validation set (default: 0.1)
gap (int) – Number of samples to exclude between train and test (default: 0)
stratify (bool) – Whether to maintain class proportions (default: True)
random_state (int, optional) – Random seed for reproducibility (default: None)
Examples
>>> from scitex_ml.classification import TimeSeriesStratifiedSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> tscv = TimeSeriesStratifiedSplit(n_splits=3)
>>> for train_idx, test_idx in tscv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
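To make the "test after train, separated by a gap" guarantee concrete, here is a minimal single-split sketch in plain Python. It is not the splitter's algorithm: it produces one split instead of `n_splits`, ignores stratification and the validation set, and the function name `chronological_split` is hypothetical.

```python
def chronological_split(timestamps, test_ratio=0.2, gap=0):
    """Return (train_idx, test_idx) with the test block last in time.

    `gap` samples immediately before the test block are dropped from
    training so that nothing adjacent to the test period is seen.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, int(round(len(order) * test_ratio)))
    train_end = len(order) - n_test - gap
    if train_end <= 0:
        raise ValueError("not enough samples for the requested test_ratio and gap")
    return order[:train_end], order[-n_test:]


timestamps = list(range(100))
train_idx, test_idx = chronological_split(timestamps, test_ratio=0.2, gap=5)
print(len(train_idx), len(test_idx))     # 75 20
print(max(train_idx), min(test_idx))     # 74 80
```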
- __init__(n_splits=5, test_ratio=0.2, val_ratio=0.1, gap=0, stratify=True, random_state=None)[source]
- split(X, y=None, timestamps=None, groups=None)[source]
Generate indices to split data into training and test sets.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,)) – Target variable
timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)
groups (array-like, shape (n_samples,), optional) – Group labels for grouped CV
- Yields:
train (ndarray) – Training set indices
test (ndarray) – Test set indices
- split_with_val(X, y=None, timestamps=None, groups=None)[source]
Generate indices with separate validation set.
- get_n_splits(X=None, y=None, groups=None)[source]
Returns the number of splitting iterations in the CV.
- plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)[source]
Visualize the stratified time series splits.
Shows train (blue), validation (green), and test (red) sets. When val_ratio=0, only shows train and test.
- Returns:
fig – The created figure
- Return type:
matplotlib.figure.Figure
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') -> TimeSeriesStratifiedSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to split.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.TimeSeriesBlockingSplit(n_splits=5, test_ratio=0.2, val_ratio=0.0, random_state=None)[source]
Time series split with blocking to handle multiple subjects/groups.
This splitter ensures temporal integrity within each subject while allowing cross-subject generalization. Each subject’s data is kept temporally coherent, but subjects can appear in both training and test sets at different time periods.
Key Features:
- Temporal order preserved within each subject
- No data leakage within individual subject timelines
- Expanding window approach: more training data in later folds
- Cross-subject generalization: subjects can be in both train and test
Use Cases:
- Multiple patients with longitudinal medical data
- Multiple stocks with time series financial data
- Multiple sensors with temporal measurements
- Any scenario with grouped time series data
Examples
>>> from scitex_ml.classification import TimeSeriesBlockingSplit
>>> import numpy as np
>>>
>>> # Create data: 100 samples, 4 subjects (25 samples each)
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>> groups = np.repeat([0, 1, 2, 3], 25)  # Subject IDs
>>>
>>> # Each subject gets temporal split: early samples → train, later → test
>>> splitter = TimeSeriesBlockingSplit(n_splits=3, test_ratio=0.3)
>>> for train_idx, test_idx in splitter.split(X, y, timestamps, groups):
...     train_subjects = set(groups[train_idx])
...     test_subjects = set(groups[test_idx])
...     print(f"Train subjects: {train_subjects}, Test subjects: {test_subjects}")
...     # Output shows same subjects in both sets but different time periods
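The per-subject idea in the example above (early samples of each subject go to train, later ones to test) can be sketched without the library as follows. This is a single-fold illustration under assumptions: the helper name `per_subject_split` is hypothetical, the test block size is truncated with `int()`, and the expanding-window behaviour across folds is not modelled.

```python
from collections import defaultdict


def per_subject_split(timestamps, groups, test_ratio=0.3):
    """Split every subject's own timeline into an early train part and a late test part."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    train_idx, test_idx = [], []
    for indices in by_group.values():
        indices.sort(key=lambda i: timestamps[i])        # temporal order within the subject
        n_test = max(1, int(len(indices) * test_ratio))  # at least one test sample per subject
        train_idx.extend(indices[:-n_test])
        test_idx.extend(indices[-n_test:])
    return sorted(train_idx), sorted(test_idx)


groups = [g for g in range(4) for _ in range(25)]  # 4 subjects, 25 samples each
timestamps = list(range(100))
train_idx, test_idx = per_subject_split(timestamps, groups, test_ratio=0.3)
print(len(train_idx), len(test_idx))  # 72 28
```

Every subject appears in both sets, but within any one subject all training samples precede all test samples.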
- split(X, y=None, timestamps=None, groups=None)[source]
Generate indices respecting group boundaries.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,)) – Target variable
timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)
groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required
- Yields:
train (ndarray) – Training set indices
test (ndarray) – Test set indices
- split_with_val(X, y=None, timestamps=None, groups=None)[source]
Generate indices with separate validation set respecting group boundaries.
Each subject gets its own train/val/test split maintaining temporal order.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,)) – Target variable
timestamps (array-like, shape (n_samples,)) – Timestamps for temporal ordering (required)
groups (array-like, shape (n_samples,)) – Group labels (e.g., patient IDs) - required
- Yields:
train (ndarray) – Training set indices
val (ndarray) – Validation set indices
test (ndarray) – Test set indices
- plot_splits(X, y=None, timestamps=None, groups=None, figsize=(12, 6), save_path=None)[source]
Visualize the blocking splits showing subject separation.
This visualization shows how data from different subjects/groups is allocated to training and test sets while maintaining temporal order within each subject.
Color Scheme:
- Rectangle border: Blue = Training set, Red = Test set
- Rectangle fill: Different colors represent different subjects/groups
- Each subject gets a unique color (cycling through colormap)
Key Features:
- No mixing: Each subject’s data stays within temporal boundaries
- Subject separation: Same subject can appear in both train/test but at different times
- Temporal integrity: Time flows left to right for each subject
- Parameters:
X (array-like) – Training data
y (array-like, optional) – Target variable (not used)
timestamps (array-like, optional) – Timestamps (if None, uses sample indices)
groups (array-like) – Group labels (required for blocking split) - each unique value represents a subject
figsize (tuple, default (12, 6)) – Figure size
save_path (str, optional) – Path to save the plot
- Returns:
fig – The created figure with proper legend showing subject colors
- Return type:
matplotlib.figure.Figure
Examples
>>> splitter = TimeSeriesBlockingSplit(n_splits=3)
>>> fig = splitter.plot_splits(X, timestamps=timestamps, groups=subject_ids)
>>> fig.show()  # Will show train (blue border) vs test (red border) by subject
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') -> TimeSeriesBlockingSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to split.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.TimeSeriesSlidingWindowSplit(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)[source]
Sliding window cross-validation for time series.
Creates train/test windows that slide through time with configurable behavior.
- Parameters:
window_size (int, optional) – Size of training window (ignored if expanding_window=True or n_splits is set). Required if n_splits is None.
step_size (int, optional) – Step between windows (overridden if overlapping_tests=False)
test_size (int, optional) – Size of test window. Required if n_splits is None.
gap (int, default=0) – Number of samples to skip between train and test windows
val_ratio (float, default=0.0) – Ratio of validation set from training window
random_state (int, optional) – Random seed for reproducibility
overlapping_tests (bool, default=False) – If False, automatically sets step_size=test_size to ensure each sample is tested exactly once (like K-fold for time series)
expanding_window (bool, default=False) – If True, training window grows to include all past data (like sklearn’s TimeSeriesSplit). If False, uses fixed sliding window of size window_size.
undersample (bool, default=False) – If True, balance classes in training sets by randomly undersampling the majority class to match the minority class count. Temporal order is maintained. Requires y labels in split().
n_splits (int, optional) – Number of splits to generate. If specified, window_size and test_size are automatically calculated to create exactly n_splits folds. Cannot be used together with manual window_size/test_size specification.
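How `window_size`, `test_size`, `gap`, `step_size`, and `expanding_window` interact can be sketched with a small index generator. This is an illustration under assumptions, not the class's implementation: the first training window is assumed to start at index 0, `step_size` falls back to `test_size` (the non-overlapping case described above), and `n_splits`, `val_ratio`, and `undersample` are not modelled.

```python
def sliding_windows(n_samples, window_size, test_size, gap=0,
                    step_size=None, expanding_window=False):
    """Yield (train_idx, test_idx) lists for a sliding or expanding window."""
    # Non-overlapping tests: advance by exactly one test window.
    step = step_size if step_size is not None else test_size
    train_end = window_size
    while train_end + gap + test_size <= n_samples:
        # Expanding windows keep all past data; fixed windows drop the oldest.
        train_start = 0 if expanding_window else train_end - window_size
        test_start = train_end + gap
        yield (list(range(train_start, train_end)),
               list(range(test_start, test_start + test_size)))
        train_end += step


fixed = list(sliding_windows(100, window_size=50, test_size=10, gap=5))
growing = list(sliding_windows(100, window_size=50, test_size=10, gap=5,
                               expanding_window=True))
print([len(tr) for tr, _ in fixed])    # [50, 50, 50, 50]
print([len(tr) for tr, _ in growing])  # [50, 60, 70, 80]
```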
Examples
>>> from scitex_ml.classification import TimeSeriesSlidingWindowSplit
>>> import numpy as np
>>>
>>> X = np.random.randn(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> timestamps = np.arange(100)
>>>
>>> # Fixed window, non-overlapping tests (default)
>>> swcv = TimeSeriesSlidingWindowSplit(window_size=50, test_size=10, gap=5)
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Expanding window (use all past data)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     window_size=50, test_size=10, gap=5, expanding_window=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")  # Train grows!
>>>
>>> # Using n_splits (automatically calculates window and test sizes)
>>> swcv = TimeSeriesSlidingWindowSplit(
...     n_splits=5, gap=0, expanding_window=True, undersample=True
... )
>>> for train_idx, test_idx in swcv.split(X, y, timestamps):
...     print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
>>>
>>> # Visualize splits
>>> fig = swcv.plot_splits(X, y, timestamps)
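The `undersample` option is described as random undersampling of the majority class while keeping temporal order. A minimal sketch of that idea in plain Python follows; the helper name is hypothetical, and it assumes the training indices are already in temporal order so that sorting the kept indices restores chronology.

```python
import random
from collections import defaultdict


def undersample_indices(train_idx, y, random_state=None):
    """Keep an equal number of training indices per class, in temporal order."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[y[i]].append(i)

    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))  # random subset of each class
    return sorted(kept)  # sorting restores the original temporal order


y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # 8 samples of class 0, 2 of class 1
balanced = undersample_indices(list(range(10)), y, random_state=0)
print(len(balanced))  # 4: two samples per class
```

Only the training indices are balanced; test windows keep their natural class distribution.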
- __init__(window_size=None, step_size=None, test_size=None, gap=0, val_ratio=0.0, random_state=None, overlapping_tests=False, expanding_window=False, undersample=False, n_splits=None)[source]
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') -> TimeSeriesSlidingWindowSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to split.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.TimeSeriesCalendarSplit(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)[source]
Calendar-based time series cross-validation splitter.
Splits data based on calendar intervals (e.g., months, weeks, days). Ensures temporal order is preserved and no data leakage occurs.
- Parameters:
interval (str) – Time interval for splitting. Options:
- ‘D’: Daily
- ‘W’: Weekly
- ‘M’: Monthly
- ‘Q’: Quarterly
- ‘Y’: Yearly
Or any pandas frequency string
n_train_intervals (int) – Number of intervals to use for training
n_test_intervals (int) – Number of intervals to use for testing (default: 1)
gap_intervals (int) – Number of intervals to skip between train and test (default: 0)
step_intervals (int) – Number of intervals to step forward for next fold (default: 1)
Examples
>>> from scitex_ml.classification import TimeSeriesCalendarSplit
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with daily timestamps
>>> dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
>>> X = np.random.randn(len(dates), 10)
>>> y = np.random.randint(0, 2, len(dates))
>>>
>>> # Monthly splits: 6 months train, 1 month test
>>> tscal = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> for train_idx, test_idx in tscal.split(X, y, timestamps=dates):
...     print(f"Train: {dates[train_idx[0]]:%Y-%m} to {dates[train_idx[-1]]:%Y-%m}")
...     print(f"Test: {dates[test_idx[0]]:%Y-%m} to {dates[test_idx[-1]]:%Y-%m}")
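The monthly example above can be mimicked with the standard library to show how interval counts translate into folds. This sketch handles only monthly intervals and uses a hypothetical function name; how the real splitter bins timestamps and treats partial intervals is not documented here, so treat the fold count as a property of this sketch only.

```python
from datetime import date, timedelta


def monthly_splits(dates, n_train_intervals=6, n_test_intervals=1,
                   gap_intervals=0, step_intervals=1):
    """Yield (train_idx, test_idx) where each interval is one calendar month."""
    months = sorted({(d.year, d.month) for d in dates})
    span = n_train_intervals + gap_intervals + n_test_intervals
    for start in range(0, len(months) - span + 1, step_intervals):
        train_months = set(months[start:start + n_train_intervals])
        test_start = start + n_train_intervals + gap_intervals
        test_months = set(months[test_start:test_start + n_test_intervals])
        train_idx = [i for i, d in enumerate(dates)
                     if (d.year, d.month) in train_months]
        test_idx = [i for i, d in enumerate(dates)
                    if (d.year, d.month) in test_months]
        yield train_idx, test_idx


dates = [date(2023, 1, 1) + timedelta(days=i) for i in range(365)]
folds = list(monthly_splits(dates, n_train_intervals=6))
print(len(folds))                                  # 6
print(dates[folds[0][1][0]], dates[folds[0][1][-1]])  # 2023-07-01 2023-07-31
```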
- __init__(interval='M', n_train_intervals=12, n_test_intervals=1, n_val_intervals=0, gap_intervals=0, step_intervals=1, random_state=None)[source]
- split(X, y=None, timestamps=None, groups=None)[source]
Generate calendar-based train/test splits.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,), optional) – Target variable
timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)
groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)
- Yields:
train (ndarray) – Training set indices
test (ndarray) – Test set indices
- split_with_val(X, y=None, timestamps=None, groups=None)[source]
Generate calendar-based train/validation/test splits.
The validation set comes after training but before test, maintaining temporal order: train < val < test.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data
y (array-like, shape (n_samples,), optional) – Target variable
timestamps (array-like or pd.DatetimeIndex, shape (n_samples,)) – Timestamps for each sample (required)
groups (array-like, shape (n_samples,), optional) – Group labels (not used in this splitter)
- Yields:
train (ndarray) – Training set indices
val (ndarray) – Validation set indices
test (ndarray) – Test set indices
- get_n_splits(X=None, y=None, timestamps=None)[source]
Calculate number of splits.
- Parameters:
X (array-like, optional) – Not used directly
y (array-like, optional) – Not used
timestamps (array-like or pd.DatetimeIndex, optional) – Timestamps to determine number of possible splits
- Returns:
n_splits – Number of splits. Returns -1 if timestamps is None.
- plot_splits(X, y=None, timestamps=None, figsize=(12, 6), save_path=None)[source]
Visualize the train/test splits as timeline rectangles with scatter plots.
- Parameters:
X (array-like) – Training data (used to determine data size)
y (array-like, optional) – Target variable (used for color-coding scatter points)
timestamps (array-like or pd.DatetimeIndex) – Timestamps for each sample
figsize (tuple, default (12, 6)) – Figure size (width, height)
save_path (str, optional) – Path to save the plot
- Returns:
fig – The created figure
- Return type:
matplotlib.figure.Figure
Examples
>>> splitter = TimeSeriesCalendarSplit(interval='M', n_train_intervals=6)
>>> fig = splitter.plot_splits(X, timestamps=dates)
>>> fig.savefig('calendar_splits.png')
- set_split_request(*, timestamps: bool | None | str = '$UNCHANGED$') -> TimeSeriesCalendarSplit
Configure whether metadata should be requested to be passed to the split method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to split.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- class scitex_ml.classification.TimeSeriesStrategy(value)[source]
Available time series CV strategies.
- STRATIFIED = 'stratified'
- BLOCKING = 'blocking'
- SLIDING = 'sliding'
- EXPANDING = 'expanding'
- FIXED = 'fixed'
- classmethod from_string(value)[source]
Create strategy from string value.
- Parameters:
value (str) – String representation of strategy
- Returns:
Corresponding enum value
- Raises:
ValueError – If value doesn’t match any strategy
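A string-to-enum constructor of this kind is commonly written as below. The member values are copied from the listing above; the class name `Strategy`, the case-insensitive matching, and the error message are assumptions of this sketch rather than documented behaviour of `TimeSeriesStrategy.from_string`.

```python
from enum import Enum


class Strategy(Enum):
    """Stand-in enum with the same member values as listed above."""
    STRATIFIED = "stratified"
    BLOCKING = "blocking"
    SLIDING = "sliding"
    EXPANDING = "expanding"
    FIXED = "fixed"

    @classmethod
    def from_string(cls, value):
        """Look up a member by value, ignoring case and surrounding whitespace."""
        try:
            return cls(value.strip().lower())
        except ValueError:
            valid = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unknown strategy {value!r}; expected one of: {valid}"
            ) from None


print(Strategy.from_string("Blocking"))  # Strategy.BLOCKING
```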
- class scitex_ml.classification.TimeSeriesMetadata(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)[source]
Metadata about the time series data.
This dataclass captures essential characteristics of time series data that inform the selection of appropriate cross-validation strategies.
Examples
>>> import numpy as np
>>> from scitex_ml.classification import TimeSeriesMetadata
>>>
>>> # Create metadata for a dataset
>>> metadata = TimeSeriesMetadata(
...     n_samples=1000,
...     n_features=10,
...     n_classes=2,
...     has_groups=True,
...     group_sizes={0: 250, 1: 250, 2: 250, 3: 250},
...     time_range=(0.0, 999.0),
...     sampling_rate=1.0,
...     has_gaps=False,
...     max_gap_size=None,
...     is_balanced=True,
...     class_distribution={0: 0.5, 1: 0.5}
... )
>>>
>>> print(f"Dataset has {metadata.n_samples} samples")
>>> print(f"Number of groups: {len(metadata.group_sizes) if metadata.group_sizes else 0}")
- get_summary()[source]
Generate human-readable summary of the metadata.
- Returns:
Formatted summary string
- suggest_strategy()[source]
Suggest appropriate CV strategy based on metadata.
- Returns:
Suggested strategy name
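The rules behind `suggest_strategy` are not documented on this page. Purely to illustrate how metadata flags could drive a strategy choice, here is a hypothetical heuristic over a reduced dataclass; the class name `Metadata`, the rule order, and the returned names are all assumptions and should not be read as the library's behaviour.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    """Reduced stand-in carrying only the fields this sketch uses."""
    n_samples: int
    n_features: int
    has_groups: bool = False
    is_balanced: bool = True


def suggest(meta):
    """Hypothetical heuristic: groups first, then class imbalance, else sliding."""
    if meta.has_groups:
        return "blocking"    # keep each subject's timeline intact
    if not meta.is_balanced:
        return "stratified"  # preserve class proportions in each split
    return "sliding"         # plain temporal windows otherwise


print(suggest(Metadata(n_samples=1000, n_features=10, has_groups=True)))
```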
- __init__(n_samples, n_features, n_classes=None, has_groups=False, group_sizes=None, time_range=None, sampling_rate=None, has_gaps=False, max_gap_size=None, is_balanced=True, class_distribution=None)
Modules
Server for initializing various scikit-learn classifiers with a consistent interface.
Streamlined cross-validation experiment runner.
Reporter implementations for classification.
Time series cross-validation utilities for classification.