scitex_ml.feature_selection

Feature selection utilities for machine learning.

This module provides feature selection and importance analysis:

  • Feature importance extraction from various model types

  • Univariate feature selection (ANOVA F-test, chi2, mutual_info)

  • Model-based feature selection (tree importances, L1 coefficients)

  • Cross-fold feature consistency analysis

  • Feature importance aggregation and visualization

scitex_ml.feature_selection.extract_feature_importance(model, feature_names, method='auto')

Extract feature importance from a trained model.

Parameters:
  • model – Trained classifier

  • feature_names (List[str]) – List of feature names

  • method (str) – Method used to extract importance:

      - “auto”: Automatically detect the best method

      - “tree”: Use feature_importances_ (tree-based models)

      - “coef”: Use coefficients (linear models)

      - “permutation”: Use permutation importance (any model)

Return type:

Optional[Dict[str, float]]

Returns:

Dictionary mapping feature names to importance values, or None if extraction fails
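The dispatch described by the method parameter can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: the name sketch_extract_importance is hypothetical, averaging absolute coefficients across classes is an assumption, and the “permutation” branch is omitted because it would need evaluation data that the documented signature does not take.

```python
from types import SimpleNamespace
from typing import Dict, List, Optional

import numpy as np


def sketch_extract_importance(
    model, feature_names: List[str], method: str = "auto"
) -> Optional[Dict[str, float]]:
    """Hypothetical sketch of the 'auto' / 'tree' / 'coef' dispatch."""
    if method == "auto":
        # Prefer tree importances, then linear coefficients (assumed order).
        if hasattr(model, "feature_importances_"):
            method = "tree"
        elif hasattr(model, "coef_"):
            method = "coef"
        else:
            return None

    if method == "tree":
        raw = getattr(model, "feature_importances_", None)
        if raw is None:
            return None
        values = np.asarray(raw, dtype=float)
    elif method == "coef":
        raw = getattr(model, "coef_", None)
        if raw is None:
            return None
        # Assumption: mean absolute coefficient over classes/outputs.
        values = np.abs(np.atleast_2d(np.asarray(raw, dtype=float))).mean(axis=0)
    else:
        # "permutation" (or an unknown method) is not covered by this sketch.
        return None

    if len(values) != len(feature_names):
        return None
    return {name: float(v) for name, v in zip(feature_names, values)}


# Stand-in objects that only mimic the fitted attributes of real estimators.
tree_like = SimpleNamespace(feature_importances_=[0.7, 0.3])
linear_like = SimpleNamespace(coef_=[[-2.0, 0.5], [1.0, -1.5]])
unsupported = SimpleNamespace()

tree_result = sketch_extract_importance(tree_like, ["a", "b"])
coef_result = sketch_extract_importance(linear_like, ["a", "b"], method="coef")
none_result = sketch_extract_importance(unsupported, ["a", "b"])
```

Returning None instead of raising mirrors the documented Optional return type; callers should check for None before using the result.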

scitex_ml.feature_selection.select_features_univariate(X_train, y_train, X_val, feature_names, k=10, score_func='f_classif', impute_strategy='median')

Select the top k features using univariate statistical tests.

This prevents data leakage by:

  1. Fitting the selector only on training data

  2. Transforming validation/test data with the fitted selector

Parameters:
  • X_train (ndarray) – Training features

  • y_train (ndarray) – Training labels

  • X_val (ndarray) – Validation features

  • feature_names (List[str]) – List of feature names

  • k (int) – Number of features to select

  • score_func (str) – Scoring function:

      - “f_classif”: ANOVA F-test (default)

      - “chi2”: Chi-squared test (requires non-negative features)

      - “mutual_info”: Mutual information

  • impute_strategy (str) – Strategy for imputing missing values: “median” (default), “mean”, “most_frequent”, or “constant”

Returns:

  • X_train_selected – Selected training features

  • X_val_selected – Selected validation features

  • feature_indices – Indices of selected features

  • selected_names – Names of selected features

  • imputer – Fitted imputer for test data
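The leakage-safe pattern described above (fit on training data only, then transform validation data) can be sketched directly with scikit-learn. This illustrates the technique under assumptions and is not the function's actual body: the synthetic data, the variable names, and the choice to impute before selecting are all illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

# Synthetic data (illustrative only): column 2 fully determines the label.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = (X_train[:, 2] > 0).astype(int)
X_train[0, 1] = np.nan  # one missing value to exercise the imputer
X_val = rng.normal(size=(20, 5))
feature_names = [f"f{i}" for i in range(5)]

# 1. Fit the imputer on training data only, then reuse it for validation.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_val_imp = imputer.transform(X_val)

# 2. Fit the selector on training data only (ANOVA F-test, k=2).
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train_imp, y_train)

# 3. Apply the already-fitted selector to validation data.
X_val_selected = selector.transform(X_val_imp)

feature_indices = selector.get_support(indices=True)
selected_names = [feature_names[i] for i in feature_indices]
```

Because neither the imputer nor the selector ever sees X_val during fitting, validation scores are not influenced by the feature ranking.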

scitex_ml.feature_selection.analyze_feature_consistency(selected_features_per_fold)

Analyze feature selection consistency across CV folds.

Parameters:

selected_features_per_fold (List[List[str]]) – List of feature lists, one per fold

Returns:

Dictionary containing:

  • “feature_frequency”: Dict mapping features to selection count

  • “n_folds”: Total number of folds

  • “n_unique_features”: Number of unique features selected

  • “consistency_score”: Average selection frequency (0-1)

  • “stable_features”: Features selected in all folds

  • “unstable_features”: Features selected in only one fold
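A small worked example helps make the returned keys concrete. The sketch below recomputes them in plain Python. The helper name sketch_consistency is hypothetical, and the consistency score formula (mean over unique features of count / n_folds) is an assumption inferred from the description "average selection frequency"; the library may compute it differently.

```python
from collections import Counter
from typing import Any, Dict, List


def sketch_consistency(selected_features_per_fold: List[List[str]]) -> Dict[str, Any]:
    """Hypothetical recomputation of the documented summary keys."""
    n_folds = len(selected_features_per_fold)
    # Count each feature at most once per fold.
    counts = Counter(
        name for fold in selected_features_per_fold for name in set(fold)
    )
    # Assumed definition: selection frequency = count / n_folds per feature.
    freqs = [count / n_folds for count in counts.values()]
    return {
        "feature_frequency": dict(counts),
        "n_folds": n_folds,
        "n_unique_features": len(counts),
        "consistency_score": sum(freqs) / len(freqs) if freqs else 0.0,
        "stable_features": sorted(n for n, c in counts.items() if c == n_folds),
        "unstable_features": sorted(n for n, c in counts.items() if c == 1),
    }


folds = [["age", "bmi"], ["age", "hr"], ["age", "bmi"]]
report = sketch_consistency(folds)
```

Here "age" appears in all three folds, "bmi" in two, and "hr" in one, so the per-feature frequencies are 1, 2/3, and 1/3.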

scitex_ml.feature_selection.aggregate_feature_importances(importances_per_fold, method='mean')

Aggregate feature importances across CV folds.

Parameters:
  • importances_per_fold (List[Dict[str, float]]) – List of importance dicts, one per fold

  • method (str) – Aggregation method:

      - “mean”: Average importance across folds

      - “median”: Median importance across folds

      - “max”: Maximum importance across folds

Returns:

Dictionary containing:

  • “mean”: Mean importance per feature

  • “std”: Standard deviation per feature

  • “min”: Minimum importance per feature

  • “max”: Maximum importance per feature

  • “cv”: Coefficient of variation (std/mean) per feature
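The per-feature statistics listed above can be illustrated with the standard library alone. This is a sketch under assumptions: sketch_aggregate is a hypothetical name, the population standard deviation is assumed (the documentation does not say which variant is used), features missing from a fold are simply skipped for that fold, and the method parameter is not modelled.

```python
from statistics import mean, pstdev
from typing import Dict, List


def sketch_aggregate(
    importances_per_fold: List[Dict[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Hypothetical per-feature summary across folds."""
    features = sorted({name for fold in importances_per_fold for name in fold})
    out: Dict[str, Dict[str, float]] = {
        "mean": {}, "std": {}, "min": {}, "max": {}, "cv": {},
    }
    for name in features:
        # Assumption: use only the folds in which the feature appears.
        vals = [fold[name] for fold in importances_per_fold if name in fold]
        m = mean(vals)
        s = pstdev(vals)  # population std (assumed)
        out["mean"][name] = m
        out["std"][name] = s
        out["min"][name] = min(vals)
        out["max"][name] = max(vals)
        # Coefficient of variation is undefined when the mean is zero.
        out["cv"][name] = s / m if m != 0 else float("nan")
    return out


fold_importances = [{"age": 0.6, "bmi": 0.4}, {"age": 0.8, "bmi": 0.2}]
agg = sketch_aggregate(fold_importances)
```

A high "cv" value flags a feature whose importance varies strongly between folds relative to its average.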

scitex_ml.feature_selection.create_feature_importance_dataframe(aggregated_importances)

Create a formatted DataFrame from aggregated feature importances.

Parameters:

aggregated_importances (Dict[str, Dict[str, float]]) – Output from aggregate_feature_importances()

Returns:

DataFrame with columns feature, mean, std, min, max, cv, sorted by mean importance (descending)
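One way to build a table with this layout using pandas is sketched below. It assumes the input is a dict of per-statistic dicts keyed by feature name, as the type Dict[str, Dict[str, float]] suggests; the sample values and the variable names are illustrative, and this is not necessarily how the function constructs its result.

```python
import pandas as pd

# Illustrative input shaped like the documented aggregation output.
aggregated = {
    "mean": {"bmi": 0.3, "age": 0.7},
    "std": {"bmi": 0.1, "age": 0.1},
    "min": {"bmi": 0.2, "age": 0.6},
    "max": {"bmi": 0.4, "age": 0.8},
    "cv": {"bmi": 1 / 3, "age": 1 / 7},
}

# Outer keys become columns; inner keys (feature names) become the index.
df = (
    pd.DataFrame(aggregated)
    .rename_axis("feature")
    .reset_index()[["feature", "mean", "std", "min", "max", "cv"]]
    .sort_values("mean", ascending=False)
    .reset_index(drop=True)
)
```

Sorting by mean in descending order puts the most important features at the top of the table.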
