Example
This page walks through an example of how to use the MIset library.
First, let's load the Breast Cancer Wisconsin (Diagnostic) dataset.
>>> from ucimlrepo import fetch_ucirepo
>>> breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)
>>> X = breast_cancer_wisconsin_diagnostic.data.features
>>> y = breast_cancer_wisconsin_diagnostic.data.targets
Let us combine the features and the target variable into a single DataFrame:
>>> import pandas as pd
>>> pd.set_option('display.max_columns',None)
>>> df=pd.concat([X,y],axis=1)
>>> df.head()
radius1 texture1 perimeter1 area1 smoothness1 compactness1 concavity1 concave_points1 symmetry1 fractal_dimension1 radius2 texture2 perimeter2 area2 smoothness2 compactness2 concavity2 concave_points2 symmetry2 fractal_dimension2 radius3 texture3 perimeter3 area3 smoothness3 compactness3 concavity3 concave_points3 symmetry3 fractal_dimension3 Diagnosis
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 M
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 M
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 M
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 M
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 M
This dataset contains many columns with continuous values. Before running the feature selection algorithm, let us first discretize these features using equal-width binning with 10 bins per feature.
NOTE - It is generally recommended to apply binning to both continuous and discrete variables before using this feature selection technique, as this was the approach taken by its authors.
>>> from sklearn.preprocessing import KBinsDiscretizer
>>> kbd=KBinsDiscretizer(n_bins=10,encode='ordinal',strategy='uniform')
>>> cont_columns=list(df.columns)
>>> cont_columns.remove('Diagnosis') # This is our target variable
>>> df[cont_columns]=kbd.fit_transform(df[cont_columns])
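To make the binning step concrete, here is a minimal pure-Python sketch of what equal-width (uniform) binning does to a single column. The helper name uniform_bins and the toy values are hypothetical and chosen for illustration; this is not scikit-learn's implementation, only an approximation of the idea behind strategy='uniform' with encode='ordinal'.

```python
def uniform_bins(values, n_bins):
    """Map each value to an equal-width bin index in the range 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so everything lands in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    codes = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would otherwise fall just past the last bin.
        codes.append(min(idx, n_bins - 1))
    return codes


# Toy column (hypothetical values), split into 4 bins of width 2.5.
toy_column = [0.0, 1.0, 2.5, 4.0, 7.5, 10.0]
print(uniform_bins(toy_column, n_bins=4))  # [0, 0, 1, 1, 3, 3]
```

Each continuous value is replaced by the index of the interval it falls into, which is the kind of ordinal code that the binned columns of df hold after the step above.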
Now, let us perform feature selection. We use the Joint Mutual Information Maximization (JMIM) algorithm, chosen with variant='jmim', and ask for the top 3 most relevant features in the dataset with max_features=3.
We create an MIset object and call the MIset.fit() method to run the feature selection algorithm on our dataset.
>>> from MIset import MIset
>>> miset=MIset(max_features=3,variant='jmim',verbose=False,n_jobs=-1)
>>> miset.fit(df=df,feature_list=cont_columns,class_feature_name='Diagnosis')
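For intuition about what a JMIM-style selection computes, the sketch below implements the commonly published criterion on toy data: pick the feature with the highest mutual information with the class first, then repeatedly add the candidate whose worst-case joint mutual information with an already selected feature is largest. The function names, the toy features, and the pure-Python estimator are all assumptions made for illustration; MIset's internal implementation may differ.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X; Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def jmim_select(features, target, k):
    """Greedy JMIM-style selection; features maps name -> list of discrete values."""
    remaining = dict(features)
    # Step 1: start with the feature most informative about the target on its own.
    first = max(remaining, key=lambda f: mutual_information(remaining[f], target))
    selected = [first]
    del remaining[first]
    # Step 2: add the candidate maximizing min over selected s of I(f, s; target).
    while remaining and len(selected) < k:
        def score(f):
            return min(
                mutual_information(list(zip(remaining[f], features[s])), target)
                for s in selected
            )
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical toy data: 'redundant' mostly repeats 'strong', while
# 'complementary' is weaker alone but errs on different samples.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "strong":        [0, 0, 0, 1, 1, 1, 1, 1],
    "redundant":     [0, 0, 1, 1, 1, 1, 1, 1],
    "complementary": [0, 1, 0, 0, 1, 1, 0, 1],
}
print(jmim_select(features, target, k=2))  # ['strong', 'complementary']
```

In this toy case 'redundant' has higher individual mutual information with the target than 'complementary', yet it is passed over in the second iteration because it adds nothing beyond 'strong'. This is the behavior that motivates a joint criterion over ranking features one at a time.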
Now that feature selection is done, we use the MIset.feature_selection_order() method to get the top 3 relevant features that the algorithm found in our dataset.
MIset.feature_selection_order() returns a dictionary in which each key is an iteration number and each value is the most important feature found in that iteration.
>>> miset.feature_selection_order()
{1: 'concave_points3', 2: 'radius3', 3: 'texture3'}
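A natural next step is to turn this dictionary into an ordered list of column names. The snippet below is a small sketch that copies the dictionary shown above into a variable named order (a hypothetical name) so that it runs on its own; the final commented line assumes the df DataFrame from this page is still in scope.

```python
# Ranking copied from the output above: iteration number -> selected feature.
order = {1: 'concave_points3', 2: 'radius3', 3: 'texture3'}

# Sort by iteration number so the list preserves the selection order.
selected = [order[i] for i in sorted(order)]
print(selected)  # ['concave_points3', 'radius3', 'texture3']

# With the DataFrame from this page in scope, the reduced dataset would be:
# df_selected = df[selected + ['Diagnosis']]
```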