scitex_ml.feature_selection
Feature selection utilities for machine learning.
This module provides comprehensive feature selection and importance analysis:
- Feature importance extraction from various model types
- Univariate feature selection (ANOVA F-test, chi2, mutual_info)
- Model-based feature selection (tree importances, L1 coefficients)
- Cross-fold feature consistency analysis
- Feature importance aggregation and visualization
- scitex_ml.feature_selection.extract_feature_importance(model, feature_names, method='auto')
Extract feature importance from trained model.
- Parameters:
model – Trained classifier
method (str) – Method to extract importance:
  - “auto”: Automatically detect best method
  - “tree”: Use feature_importances_ (tree-based models)
  - “coef”: Use coefficients (linear models)
  - “permutation”: Use permutation importance (any model)
- Returns:
Dictionary mapping feature names to importance values, or None if extraction fails
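The dispatch behind method='auto' is not spelled out on this page. The following is a minimal sketch of one plausible reading, assuming that tree-style feature_importances_ is preferred, that absolute coefficients (averaged over classes) are used otherwise, and that None is returned when neither attribute exists. The function sketch_extract_importance and the stub classes are hypothetical and are not part of scitex_ml; permutation importance is left out.

```python
from typing import Dict, Optional, Sequence


def sketch_extract_importance(
    model, feature_names: Sequence[str]
) -> Optional[Dict[str, float]]:
    """Hypothetical stand-in for an "auto" dispatch; not the library's code."""
    if hasattr(model, "feature_importances_"):
        # Tree-based models expose one importance per feature.
        values = [float(v) for v in model.feature_importances_]
    elif hasattr(model, "coef_"):
        rows = list(model.coef_)
        if rows and hasattr(rows[0], "__len__"):
            # Linear models may expose one coefficient row per class:
            # average the absolute values across rows.
            values = [
                sum(abs(row[j]) for row in rows) / len(rows)
                for j in range(len(rows[0]))
            ]
        else:
            values = [abs(float(v)) for v in rows]
    else:
        return None  # nothing to extract from this model
    if len(values) != len(feature_names):
        return None  # names and importances do not line up
    return dict(zip(feature_names, values))


# Stub models used only for illustration (assumptions, not real estimators).
class TreeLike:
    feature_importances_ = [0.7, 0.2, 0.1]


class LinearLike:
    coef_ = [[-2.0, 0.5, 0.0]]


class Opaque:
    pass


names = ["a", "b", "c"]
tree_result = sketch_extract_importance(TreeLike(), names)
linear_result = sketch_extract_importance(LinearLike(), names)
opaque_result = sketch_extract_importance(Opaque(), names)
```

Returning None instead of raising mirrors the documented "or None if extraction fails" behavior, so callers can skip a fold without a try/except.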
- scitex_ml.feature_selection.select_features_univariate(X_train, y_train, X_val, feature_names, k=10, score_func='f_classif', impute_strategy='median')
Select top k features using univariate statistical tests.
This prevents data leakage by:
1. Fitting the selector only on training data
2. Transforming validation/test data with the fitted selector
- Parameters:
X_train (ndarray) – Training features
y_train (ndarray) – Training labels
X_val (ndarray) – Validation features
k (int) – Number of features to select
score_func (str) – Scoring function:
  - “f_classif”: ANOVA F-test (default)
  - “chi2”: Chi-squared test (requires non-negative features)
  - “mutual_info”: Mutual information
impute_strategy (str) – Strategy for imputing missing values: “median” (default), “mean”, “most_frequent”, “constant”
- Returns:
X_train_selected: Selected training features
X_val_selected: Selected validation features
feature_indices: Indices of selected features
selected_names: Names of selected features
imputer: Fitted imputer for test data
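To make the leakage argument concrete, here is a dependency-free sketch of the fit-on-train, transform-on-validation pattern. It is an illustration under stated assumptions, not the library's implementation: missing values are represented as None, the median fill values stand in for the fitted imputer, and the names anova_f and sketch_select_univariate are hypothetical. In practice the same pattern is usually written with scikit-learn's SimpleImputer and SelectKBest(f_classif).

```python
from statistics import fmean, median
from typing import Dict, List, Optional, Sequence

Row = List[Optional[float]]  # None marks a missing value in this sketch


def anova_f(column: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F statistic for a single feature column."""
    groups: Dict[int, List[float]] = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    grand = fmean(column)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(column) - len(groups)
    if ss_within == 0 or df_between == 0 or df_within <= 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def sketch_select_univariate(X_train: List[Row], y_train: Sequence[int],
                             X_val: List[Row], feature_names: Sequence[str],
                             k: int = 2):
    n_features = len(feature_names)
    # 1. Learn the fill values from the training split only.
    fill = [median(r[j] for r in X_train if r[j] is not None)
            for j in range(n_features)]

    def impute(X: List[Row]) -> List[List[float]]:
        return [[fill[j] if r[j] is None else r[j] for j in range(n_features)]
                for r in X]

    X_train_imp, X_val_imp = impute(X_train), impute(X_val)
    # 2. Score every feature on the training split only.
    scores = [anova_f([r[j] for r in X_train_imp], y_train)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    top = sorted(ranked[:k])

    # 3. Apply the same column subset to both splits.
    def take(X: List[List[float]]) -> List[List[float]]:
        return [[r[j] for j in top] for r in X]

    return (take(X_train_imp), take(X_val_imp), top,
            [feature_names[j] for j in top], fill)


X_train = [[1.0, 5.0, 0.1], [1.2, 5.1, 0.9], [3.0, 4.9, 0.2], [3.1, None, 0.8]]
y_train = [0, 0, 1, 1]
X_val = [[2.0, None, 0.5]]
X_tr_sel, X_val_sel, idx, names, fill = sketch_select_univariate(
    X_train, y_train, X_val, ["signal", "weak", "noise"], k=2
)
```

The validation row never influences the fill values or the scores; it only receives the transformation learned from the training split, which is the property the two steps above describe.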
- scitex_ml.feature_selection.analyze_feature_consistency(selected_features_per_fold)
Analyze feature selection consistency across CV folds.
- Parameters:
selected_features_per_fold (List[List[str]]) – List of feature lists, one per fold
- Returns:
Dictionary containing:
  - “feature_frequency”: Dict mapping features to selection count
  - “n_folds”: Total number of folds
  - “n_unique_features”: Number of unique features selected
  - “consistency_score”: Average selection frequency (0-1)
  - “stable_features”: Features selected in all folds
  - “unstable_features”: Features selected in only one fold
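The returned keys can be reproduced with a few lines of standard-library Python. This is a sketch of one reading of those definitions, not the library's code: it assumes a feature is counted at most once per fold and that the consistency score is the mean of (selection count / n_folds) over all unique features. The name sketch_consistency is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def sketch_consistency(selected_features_per_fold: List[List[str]]) -> Dict[str, object]:
    n_folds = len(selected_features_per_fold)
    # Count each feature at most once per fold.
    counts = Counter(
        name for fold in selected_features_per_fold for name in set(fold)
    )
    if counts and n_folds:
        # Mean selection frequency over unique features, in the range (0, 1].
        score = sum(counts.values()) / (len(counts) * n_folds)
    else:
        score = 0.0
    return {
        "feature_frequency": dict(counts),
        "n_folds": n_folds,
        "n_unique_features": len(counts),
        "consistency_score": score,
        "stable_features": sorted(f for f, c in counts.items() if c == n_folds),
        "unstable_features": sorted(f for f, c in counts.items() if c == 1),
    }


folds = [["age", "bmi"], ["age", "hr"], ["age", "bmi"]]
report = sketch_consistency(folds)
# "age" appears in 3 of 3 folds, "bmi" in 2, "hr" in 1,
# so the score is (3 + 2 + 1) / (3 features * 3 folds) = 2/3.
```

A score near 1 means nearly the same feature set was chosen in every fold; a low score signals that the selection depends heavily on the particular split.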
- scitex_ml.feature_selection.aggregate_feature_importances(importances_per_fold, method='mean')
Aggregate feature importances across CV folds.
- Returns:
Dictionary containing:
  - “mean”: Mean importance per feature
  - “std”: Standard deviation per feature
  - “min”: Minimum importance per feature
  - “max”: Maximum importance per feature
  - “cv”: Coefficient of variation (std/mean) per feature
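The per-feature statistics can be sketched as follows. Several details are assumptions, since this page does not document them: the input is taken to be a list of {feature name: importance} dictionaries (one per fold), the standard deviation is the population form, a feature missing from a fold is simply skipped for that fold, and cv is reported as NaN when the mean is zero. The name sketch_aggregate is hypothetical, and the role of the method argument is not modelled.

```python
from statistics import fmean, pstdev
from typing import Dict, List


def sketch_aggregate(
    importances_per_fold: List[Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    names = sorted({name for fold in importances_per_fold for name in fold})
    out: Dict[str, Dict[str, float]] = {
        "mean": {}, "std": {}, "min": {}, "max": {}, "cv": {}
    }
    for name in names:
        # Use only the folds that actually report this feature.
        values = [fold[name] for fold in importances_per_fold if name in fold]
        mean = fmean(values)
        std = pstdev(values)  # population standard deviation (assumption)
        out["mean"][name] = mean
        out["std"][name] = std
        out["min"][name] = min(values)
        out["max"][name] = max(values)
        # Coefficient of variation: spread relative to the mean.
        out["cv"][name] = std / mean if mean != 0 else float("nan")
    return out


fold_importances = [
    {"age": 0.25, "bmi": 0.5},
    {"age": 0.75, "bmi": 0.5},
]
summary = sketch_aggregate(fold_importances)
# "age": mean 0.5, std 0.25, cv 0.5; "bmi": mean 0.5, std 0.0, cv 0.0
```

The cv column is the useful one for comparing stability: two features with the same mean importance can differ widely in how much that importance moves between folds.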
- scitex_ml.feature_selection.create_feature_importance_dataframe(aggregated_importances)
Create a formatted DataFrame from aggregated feature importances.