discrimintools package

Submodules

discrimintools.candisc module

class discrimintools.candisc.CANDISC(n_components=None, target=None, features=None, priors=None, parallelize=False)[source]

Bases: BaseEstimator, TransformerMixin

Canonical Discriminant Analysis (CANDISC)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Performs a Canonical Discriminant Analysis, computes squared Mahalanobis distances between class means, and performs both univariate and multivariate one-way analyses of variance

Parameters:

n_components : number of dimensions kept in the results

target : string, target variable

features : list of quantitative variables to be included in the analysis

priors : class prior probabilities (must sum to 1)

parallelize : boolean, default = False
Whether the computations should be parallelized
  • If True : parallelize using mapply

  • If False : parallelize using apply

returns:
  • summary_information_ (summary information about the variables in the analysis – the number of observations, the number of quantitative variables in the analysis, and the number of classes in the classification variable; the frequency of each class is also displayed)

  • eig_ (a pandas dataframe containing all the eigenvalues, the difference between each eigenvalue, the percentage of variance and the cumulative percentage of variance)

  • ind_ (a dictionary of pandas dataframes containing all the results for the active individuals (coordinates))

  • statistics_ (statistics)

  • classes_ (classes information)

  • cov_ (covariances)

  • corr_ (correlations)

  • coef_ (pandas dataframe, weight vector(s))

  • intercept_ (pandas dataframe, intercept term)

  • score_coef_ (classification score coefficients)

  • score_intercept_ (classification score intercepts)

  • svd_ (eigenvalue decomposition)

  • call_ (a dictionary with some statistics)

  • model_ (string. The model fitted = ‘candisc’)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

References

SAS Documentation, https://documentation.sas.com/doc/en/statug/15.2/statug_candisc_toc.htm

https://www.rdocumentation.org/packages/candisc/versions/0.8-6/topics/candisc

https://www.rdocumentation.org/packages/candisc/versions/0.8-6

Ricco Rakotomalala, Pratique de l’analyse discriminante linéaire, Version 1.0, 2020
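A minimal usage sketch (not part of the original documentation): it follows the constructor and method descriptions above and assumes a pandas DataFrame whose columns hold the quantitative predictors plus a categorical column named "group"; the data and column names are hypothetical.

    import pandas as pd
    from discrimintools.candisc import CANDISC

    # hypothetical training data: numeric predictors plus a categorical target column "group"
    D = pd.DataFrame({
        "x1": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 6.0, 7.0],
        "x2": [3.5, 3.0, 2.9, 2.7, 3.0, 3.2, 3.4, 2.8, 3.1],
        "group": ["a", "a", "b", "b", "c", "c", "a", "b", "c"],
    })

    # fit the canonical discriminant analysis; the target column is named via `target`
    ca = CANDISC(n_components=2, target="group", parallelize=False)
    ca.fit(D)

    print(ca.eig_)                              # eigenvalue table
    coords = ca.transform(D[["x1", "x2"]])      # coordinates on the canonical axes
    labels = ca.predict(D[["x1", "x2"]])        # predicted class labels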

decision_function(X)[source]

Apply decision function to a pandas dataframe of samples

The decision function is equal (up to a constant factor) to the log-posterior of the model, i.e. log p(y = k | x). In a binary classification setting this instead corresponds to the difference log p(y = 1 | x) - log p(y = 0 | x).

param X:

DataFrame of samples (test vectors).

type X:

DataFrame of shape (n_samples_, n_features)

returns:

C – Decision function values related to each class, per sample. In the two-class case, the shape is (n_samples_,), giving the log likelihood ratio of the positive class.

rtype:

DataFrame of shape (n_samples_,) or (n_samples_, n_classes)

fit(X, y=None)[source]

Fit the Canonical Discriminant Analysis model

param X:

Training Data

type X:

pandas/polars DataFrame,

Returns:

self : object

Fitted estimator

fit_transform(X)[source]

Fit to data, then transform it

Fits the transformer to X and returns a transformed version of X.

Parameters:

X : DataFrame of shape (n_samples_, n_features_)

Input samples

returns:

X_new – Transformed data.

rtype:

DataFrame of shape (n_rows, n_features_)

pred_table()[source]

Prediction table

Notes

pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. Correct predictions are along the diagonal.

predict(X)[source]

Predict class labels for samples in X

param X:

The data matrix for which we want to get the predictions.

type X:

DataFrame of shape (n_samples_, n_features_)

Returns:

y_pred : ndarray of shape (n_samples,)

Vector containing the class label for each sample

predict_proba(X)[source]

Estimate probability

param X:

Input data.

type X:

DataFrame of shape (n_samples_,n_features_)

Returns:

C : DataFrame of shape (n_samples_, n_classes_)

Estimated probabilities

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

param X:

Test samples.

type X:

DataFrame of shape (n_samples_, n_features)

param y:

True labels for X.

type y:

array-like of shape (n_samples,) or (n_samples, n_outputs)

param sample_weight:

Sample weights.

type sample_weight:

array-like of shape (n_samples,), default=None

returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

rtype:

float

transform(X)[source]

Project data to maximize class separation

param X:

Input data

type X:

DataFrame of shape (n_samples_, n_features_)

Returns:

X_new : DataFrame of shape (n_samples_, n_components_)

Transformed data

discrimintools.datasets module

discrimintools.datasets.load_vins()[source]
discrimintools.datasets.load_wine()[source]
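A hedged loading sketch: the return type of these loaders is not documented here; a pandas DataFrame containing the predictors and the class column is assumed.

    from discrimintools.datasets import load_wine

    wine = load_wine()
    print(wine.head())   # assuming a pandas DataFrame is returned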

discrimintools.disca module

class discrimintools.disca.DISCA(n_components=None, target=None, features=None, priors=None, parallelize=False)[source]

Bases: BaseEstimator, TransformerMixin

Discriminant Correspondence Analysis (DISCA)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Performs Discriminant Correspondence Analysis

Parameters:

n_components : number of dimensions kept in the results

target : string, target variable

features : list of qualitative variables to be included in the analysis

priors : The priors statement specifies the prior probabilities of group membership.
  • “equal” to set the prior probabilities equal,

  • “proportional” or “prop” to set the prior probabilities proportional to the sample sizes

  • a pandas series which specifies the prior probability for each level of the classification variable.

parallelize : boolean, default = False
Whether the model should be parallelized
  • If True : parallelize using mapply

  • If False : parallelize using apply

returns:
  • call_ (a dictionary with some statistics)

  • ind_ (a dictionary of pandas dataframes containing all the results for the active individuals (coordinates))

  • var_ (a dictionary of pandas dataframes containing all the results for the active variables (coordinates, correlation between variables and axes, square cosine, contributions))

  • statistics_ (statistics)

  • classes_ (classes information)

  • anova_ (analysis of variance)

  • factor_model_ (correspondence analysis model)

  • coef_ (discriminant correspondence analysis coefficients)

  • model_ (string. The model fitted = ‘disca’)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

Notes

https://bookdown.org/teddyswiebold/multivariate_statistical_analysis_using_r/discriminant-correspondence-analysis.html

https://search.r-project.org/CRAN/refmans/TExPosition/html/tepDICA.html

http://pbil.univ-lyon1.fr/ADE-4/ade4-html/discrimin.coa.html

https://rdrr.io/cran/ade4/man/discrimin.coa.html

https://stat.ethz.ch/pipermail/r-help/2010-December/263170.html

https://www.sciencedirect.com/science/article/pii/S259026012200011X
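A minimal usage sketch based on the parameter descriptions above; the categorical data and column names are hypothetical.

    import pandas as pd
    from discrimintools.disca import DISCA

    # hypothetical qualitative predictors plus a class column "group"
    D = pd.DataFrame({
        "color": ["red", "red", "white", "white", "rose", "rose"],
        "body":  ["full", "light", "light", "full", "light", "full"],
        "group": ["a", "a", "b", "b", "c", "c"],
    })

    disca = DISCA(n_components=2, target="group", priors="prop")
    disca.fit(D)

    print(disca.classes_)                          # class information
    pred = disca.predict(D[["color", "body"]])     # predicted labels for the rows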

decision_function(X)[source]

Apply decision function to an array of samples

param X:

DataFrame of samples (test vectors).

type X:

DataFrame of shape (n_samples_, n_features)

returns:

C – Decision function values related to each class, per sample.

rtype:

DataFrame of shape (n_samples_,) or (n_samples_, n_classes)

fit(X)[source]

Fit the Discriminant Correspondence Analysis model

param X:

Training Data

type X:

pandas/polars DataFrame,

Returns:

self : object

Fitted estimator

fit_transform(X)[source]

Fit to data, then transform it

Fits transformer to X and returns a transformed version of X.

param X:

Input samples.

type X:

DataFrame of shape (n_samples, n_features+1)

returns:

X_new – Transformed array.

rtype:

DataFrame of shape (n_samples, n_features_new)

pred_table()[source]

Prediction table

Notes

pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. Correct predictions are along the diagonal.

predict(X)[source]

Predict class labels for samples in X

param X:

The data matrix for which we want to get the predictions.

type X:

DataFrame of shape (n_samples_, n_features_)

Returns:

y_pred : ndarray of shape (n_samples,)

Vector containing the class label for each sample

predict_proba(X)[source]

Estimate probability

param X:

Input data.

type X:

DataFrame of shape (n_samples_,n_features_)

Returns:

C : DataFrame of shape (n_samples_, n_classes_)

Estimated probabilities

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

param X:

Test samples.

type X:

array-like of shape (n_samples, n_features)

param y:

True labels for X.

type y:

array-like of shape (n_samples,) or (n_samples, n_outputs)

param sample_weight:

Sample weights.

type sample_weight:

array-like of shape (n_samples,), default=None

returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

rtype:

float

transform(X, y=None)[source]

Apply the dimensionality reduction on X

X is projected onto the axes previously extracted from a training set.

param X:

New data, where n_rows_sup is the number of supplementary row points and n_vars is the number of variables. X is a data table containing a category in each cell. Categories can be coded by strings or numeric values. X rows correspond to supplementary row points that are projected onto the axes.

type X:

array of string, int or float, shape (n_rows_sup, n_vars)

param y:

y is ignored.

type y:

None

returns:

X_new – Coordinates of the projections of the supplementary row points onto the axes.

rtype:

array of float, shape (n_rows_sup, n_components_)

discrimintools.dismix module

class discrimintools.dismix.DISMIX(n_components=None, target=None, features=None, priors=None, parallelize=False)[source]

Bases: BaseEstimator, TransformerMixin

Discriminant Analysis of Mixed Data (DISMIX)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Performs linear discriminant analysis with both continuous and categorical variables

Parameters:

n_components : number of dimensions kept in the results

target : The values of the classification variable define the groups for analysis.

features : list of mixed variables to be included in the analysis

priors : The priors statement specifies the prior probabilities of group membership.
  • “equal” to set the prior probabilities equal,

  • “proportional” or “prop” to set the prior probabilities proportional to the sample sizes

  • a pandas series which specifies the prior probability for each level of the classification variable.

parallelize : boolean, default = False
Whether the model should be parallelized
  • If True : parallelize using mapply

  • If False : parallelize using apply

returns:
  • call_ (a dictionary with some statistics)

  • coef_ (DataFrame of shape (n_features, n_classes_))

  • intercept_ (DataFrame of shape (1, n_classes))

  • lda_model_ (linear discriminant analysis model)

  • factor_model_ (factor analysis of mixed data model)

  • projection_function_ (projection function)

  • coef_ (pandas dataframe of shape (n_categories, n_classes))

  • intercept_ (pandas dataframe of shape (1, n_classes))

  • model_ (string. The model fitted = ‘dismix’)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

References

Ricco Rakotomalala, Pratique de l’analyse discriminante linéaire, Version 1.0, 2020
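A minimal usage sketch with hypothetical mixed (numeric and categorical) predictors; column names and values are assumptions.

    import pandas as pd
    from discrimintools.dismix import DISMIX

    # hypothetical mixed data: one numeric and one categorical predictor plus the class column "group"
    D = pd.DataFrame({
        "alcohol": [11.2, 12.0, 12.8, 13.1, 10.9, 13.5],
        "color":   ["red", "red", "white", "white", "red", "white"],
        "group":   ["a", "a", "b", "b", "a", "b"],
    })

    dismix = DISMIX(n_components=2, target="group", priors="prop")
    dismix.fit(D)

    proba = dismix.predict_proba(D[["alcohol", "color"]])   # class membership probabilities
    print(proba)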

fit(X, y=None)[source]

Fit the Linear Discriminant Analysis of Mixed Data model

Parameters:

X : DataFrame of shape (n_samples, n_features+1)

Training data

y : None

Returns:

self : object

Fitted estimator

fit_transform(X)[source]

Fit to data, then transform it

Fits transformer to X and returns a transformed version of X.

param X:

Input samples.

type X:

DataFrame of shape (n_samples, n_features+1)

returns:

X_new – Transformed array.

rtype:

DataFrame of shape (n_samples, n_features_new)

pred_table()[source]

Prediction table

Notes

pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. Correct predictions are along the diagonal.

predict(X)[source]

Predict class labels for samples in X

Parameters:

X : DataFrame of shape (n_samples, n_features)

The dataframe for which we want to get the predictions

Returns:

y_pred : DataFrame of shape (n_samples, 1)

DataFrame containing the class labels for each sample.

predict_proba(X)[source]

Estimate probability

param X:

Input data

type X:

DataFrame of shape (n_samples, n_features)

Returns:

C : DataFrame of shape (n_samples, n_classes)

Estimated probabilities

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

param X:

Test samples.

type X:

array-like of shape (n_samples, n_features)

param y:

True labels for X.

type y:

array-like of shape (n_samples,) or (n_samples, n_outputs)

param sample_weight:

Sample weights.

type sample_weight:

array-like of shape (n_samples,), default=None

returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

rtype:

float

transform(X)[source]

Project data to maximize class separation

Parameters:

X : DataFrame of shape (n_samples, n_features)

Input data

Returns:

X_new : DataFrame of shape (n_samples, n_components_)

discrimintools.disqual module

class discrimintools.disqual.DISQUAL(n_components=None, target=None, features=None, priors=None, parallelize=False)[source]

Bases: BaseEstimator, TransformerMixin

Discriminant Analysis for qualitative/categorical variables (DISQUAL)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Performs discriminant analysis for categorical variables using multiple correspondence analysis (MCA) and linear discriminant analysis

Parameters:

n_components : number of dimensions kept in the results

target : The values of the classification variable define the groups for analysis.

features : list of qualitative variables to be included in the analysis

priors : The priors statement specifies the prior probabilities of group membership.
  • “equal” to set the prior probabilities equal,

  • “proportional” or “prop” to set the prior probabilities proportional to the sample sizes

  • a pandas series which specifies the prior probability for each level of the classification variable.

parallelize : boolean, default = False
Whether the model should be parallelized
  • If True : parallelize using mapply

  • If False : parallelize using apply

Returns:

call_ : a dictionary with some statistics

statistics_ : Chi-square test of independence of variables in a contingency table.

coef_ : DataFrame of shape (n_features,n_classes_)

intercept_ : DataFrame of shape (1, n_classes)

lda_model_ : linear discriminant analysis model

factor_model_ : multiple correspondence analysis model

projection_function_ : projection function

coef_ : pandas dataframe of shape (n_categories, n_classes)

intercept_ : pandas dataframe of shape (1, n_classes)

model_ : string. The model fitted = ‘disqual’

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

References:

https://lemakistatheux.wordpress.com/category/outils-danalyse-supervisee/la-methode-disqual/

Ricco Rakotomalala, Pratique de l’analyse discriminante linéaire, Version 1.0, 2020

Saporta G., Probabilité, analyse des données et Statistique, Technip, 2006

Tufféry S., Data Mining et statistique décisionnelle - L’intelligence des données, Technip, 2012

SAS procedure (macro): http://od-datamining.com/download/#macro

R package and function: http://finzi.psych.upenn.edu/library/DiscriMiner/html/disqual.html and https://github.com/gastonstat/DiscriMiner
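A minimal usage sketch with hypothetical categorical predictors, following the DISQUAL approach described above (MCA followed by LDA).

    import pandas as pd
    from discrimintools.disqual import DISQUAL

    # hypothetical qualitative predictors plus the class column "group"
    D = pd.DataFrame({
        "color": ["red", "red", "white", "white", "red", "white"],
        "body":  ["full", "light", "light", "full", "full", "light"],
        "group": ["a", "a", "b", "b", "a", "b"],
    })

    disqual = DISQUAL(n_components=2, target="group", priors="prop")
    disqual.fit(D)

    # mean accuracy of the predictions on the training rows
    print(disqual.score(D[["color", "body"]], D["group"]))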

fit(X, y=None)[source]

Fit the Linear Discriminant Analysis model with categorical variables

Parameters:

X : pandas/polars DataFrame of shape (n_samples, n_features+1)

Training data

y : None

Returns:

self : object

Fitted estimator

fit_transform(X)[source]

Fit to data, then transform it

Fits transformer to X and returns a transformed version of X.

param X:

Input samples.

type X:

DataFrame of shape (n_samples, n_features+1)

returns:

X_new – Transformed array.

rtype:

DataFrame of shape (n_samples, n_features_new)

pred_table()[source]

Prediction table

Notes

pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. Correct predictions are along the diagonal.

predict(X)[source]

Predict class labels for samples in X

Parameters:

X : DataFrame of shape (n_samples, n_features)

The dataframe for which we want to get the predictions

Returns:

y_pred : DataFrame of shape (n_samples, 1)

DataFrame containing the class labels for each sample.

predict_proba(X)[source]

Estimate probability

param X:

Input data

type X:

DataFrame of shape (n_samples, n_features)

Returns:

C : DataFrame of shape (n_samples, n_classes)

Estimated probabilities

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

param X:

Test samples.

type X:

array-like of shape (n_samples, n_features)

param y:

True labels for X.

type y:

array-like of shape (n_samples,) or (n_samples, n_outputs)

param sample_weight:

Sample weights.

type sample_weight:

array-like of shape (n_samples,), default=None

returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

rtype:

float

transform(X)[source]

Project data to maximize class separation

Parameters:

X : DataFrame of shape (n_samples, n_features)

Input data

Returns:

X_new : DataFrame of shape (n_samples, n_components_)

discrimintools.eta2 module

discrimintools.eta2.eta2(categories, value, digits=4)[source]

Compute the squared correlation ratio (eta squared)

Description

This function computes the squared correlation ratio (eta squared), an important measure of association between a quantitative variable and a qualitative variable.

param categories:

a factor associated with the qualitative variable

param value:

a vector associated with the quantitative variable

param digits:

int, default = 4. Number of decimals printed

returns:
  • a dictionary of numeric elements

  • Sum. Intra (the within-class sum of squares)

  • Sum. Inter (the between-class sum of squares)

  • Correlation ratio (the value of the empirical correlation ratio)

  • F-stats (the Fisher F test statistic)

  • pvalue (the p-value of the test)

References

  1. Bertrand, M. Maumy-Bertrand, Initiation à la Statistique avec R, Dunod, 4ème édition, 2023.

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

see also https://stackoverflow.com/questions/52083501/how-to-compute-correlation-ratio-or-eta-in-python
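A small usage sketch; the dictionary key names are taken from the listing above, and passing pandas Series as inputs is an assumption.

    import pandas as pd
    from discrimintools.eta2 import eta2

    # hypothetical data: a qualitative factor and a quantitative measurement
    group = pd.Series(["a", "a", "a", "b", "b", "b"])
    value = pd.Series([1.2, 1.5, 1.1, 3.4, 3.1, 3.6])

    res = eta2(group, value, digits=4)
    # keys as listed above: "Sum. Intra", "Sum. Inter", "Correlation ratio", "F-stats", "pvalue"
    print(res["Correlation ratio"])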

discrimintools.fviz_candisc module

discrimintools.fviz_candisc.fviz_candisc(self, axis=[0, 1], x_label=None, y_label=None, x_lim=None, y_lim=None, title=None, geom=['point', 'text'], point_size=1.5, text_size=8, text_type='text', add_grid=True, add_hline=True, add_vline=True, repel=False, hline_color='black', hline_style='dashed', vline_color='black', vline_style='dashed', ha='center', va='center', ggtheme=plotnine.themes.theme_minimal())[source]

Draw the Canonical Discriminant Analysis (CANDISC) individuals graphs

Author:

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com
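A plotting sketch: it fits a CANDISC model on hypothetical data and draws the individuals map; the data, column names and title are assumptions.

    import pandas as pd
    from discrimintools.candisc import CANDISC
    from discrimintools.fviz_candisc import fviz_candisc

    D = pd.DataFrame({
        "x1": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 6.0, 7.0],
        "x2": [3.5, 3.0, 2.9, 2.7, 3.0, 3.2, 3.4, 2.8, 3.1],
        "group": ["a", "a", "b", "b", "c", "c", "a", "b", "c"],
    })
    ca = CANDISC(n_components=2, target="group").fit(D)

    p = fviz_candisc(ca, axis=[0, 1], title="CANDISC - individuals")
    print(p)   # a plotnine graph; print() renders it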

discrimintools.fviz_disca module

discrimintools.fviz_disca.fviz_disca_ind(self, axis=[0, 1], x_lim=None, y_lim=None, x_label=None, y_label=None, title=None, geom=['point', 'text'], repel=True, point_size=1.5, text_size=8, text_type='text', add_grid=True, add_hline=True, add_vline=True, ha='center', va='center', hline_color='black', hline_style='dashed', vline_color='black', vline_style='dashed', add_group=True, center_marker_size=5, ggtheme=plotnine.themes.theme_minimal())[source]

Draw the Discriminant Correspondence Analysis (DISCA) individuals graph

Author:

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.fviz_disca.fviz_disca_mod(self, axis=[0, 1], x_lim=None, y_lim=None, x_label=None, y_label=None, title=None, color='black', geom=['point', 'text'], text_type='text', marker='o', point_size=1.5, text_size=8, add_grid=True, add_group=True, color_sup='blue', marker_sup='^', add_hline=True, add_vline=True, ha='center', va='center', hline_color='black', hline_style='dashed', vline_color='black', vline_style='dashed', repel=False, ggtheme=plotnine.themes.theme_minimal())[source]

Visualize Discriminant Correspondence Analysis - Graph of variables/categories

Description

param self:

type self:

an object of class DISCA

param axis:

type axis:

a numeric list or vector of length 2 specifying the dimensions to be plotted, default = [0,1]

returns:
  • a plotnine graph

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_candisc module

discrimintools.get_candisc.get_candisc(self, choice='ind')[source]

Extract the results - CANDISC

param self:

type self:

an object of class CANDISC

param choice:

returns:
  • a dictionary or a pandas dataframe

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_candisc.get_candisc_coef(self, choice='absolute')[source]

Extract coefficients - CANDISC

param self:

type self:

an object of class CANDISC

param choice:

type choice:

the element to subset from the output. Allowed values are “absolute” (for canonical coefficients) or “score” (for class coefficients)

returns:
  • a pandas dataframe containing coefficients

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_candisc.get_candisc_ind(self)[source]

Extract the results for individuals - CANDISC

param self:

type self:

an object of class CANDISC

returns:
  • a dictionary of dataframes containing all the results for the active individuals including

  • - coord (coordinates for the individuals)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_candisc.get_candisc_var(self, choice='correlation')[source]

Extract the results for variables - CANDISC

param self:

type self:

an object of class CANDISC

param choice:

type choice:

the element to subset from the output. Allowed values are “correlation” (for canonical correlation) or “covariance” (for covariance).

returns:
  • a dictionary of dataframes containing all the results for the variables

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_candisc.summaryCANDISC(self, digits=3, nb_element=10, ncp=3, to_markdown=False, tablefmt='pipe', **kwargs)[source]

Printing summaries of Canonical Discriminant Analysis model

param self:

type self:

an object of class CANDISC

param digits:

type digits:

int, default = 3. Number of decimals printed

param nb_element:

type nb_element:

int, default = 10. Number of elements

param ncp:

type ncp:

int, default = 3. Number of components

param to_markdown:

type to_markdown:

Print DataFrame in Markdown-friendly format.

param tablefmt:

type tablefmt:

Table format. For more about tablefmt, see : https://pypi.org/project/tabulate/

param **kwargs:

type **kwargs:

These parameters will be passed to tabulate.

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com
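A sketch of the extraction helpers applied to a fitted CANDISC model; the data are hypothetical and the choice values follow the descriptions above.

    import pandas as pd
    from discrimintools.candisc import CANDISC
    from discrimintools.get_candisc import get_candisc_ind, get_candisc_coef, summaryCANDISC

    D = pd.DataFrame({
        "x1": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
        "x2": [3.5, 3.0, 2.9, 2.7, 3.0, 3.2],
        "group": ["a", "a", "b", "b", "c", "c"],
    })
    ca = CANDISC(n_components=2, target="group").fit(D)

    ind = get_candisc_ind(ca)                        # dictionary with the individuals' coordinates
    coef = get_candisc_coef(ca, choice="absolute")   # canonical coefficients
    summaryCANDISC(ca, digits=3, nb_element=5)       # printed summary of the model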

discrimintools.get_disca module

discrimintools.get_disca.get_disca(self, choice='ind')[source]

Extract the results - DISCA

param self:

type self:

an object of class DISCA

param choice:

returns:
  • a dictionary or a pandas dataframe

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_disca.get_disca_classes(self)[source]

Extract the results for groups - DISCA

param self:

type self:

an object of class DISCA

returns:
  • a dictionary of dataframes containing all the results for the groups including

  • - coord (coordinates for the groups)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_disca.get_disca_coef(self)[source]

Extract coefficients - DISCA

param self:

type self:

an object of class DISCA

returns:
  • a pandas dataframe containing coefficients

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_disca.get_disca_ind(self)[source]

Extract the results for individuals - DISCA

param self:

type self:

an object of class DISCA

returns:
  • a dictionary of dataframes containing all the results for the active individuals including

  • - coord (coordinates for the individuals)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_disca.get_disca_var(self)[source]

Extract the results for variables/categories - DISCA

param self:

type self:

an object of class DISCA

returns:
  • a dictionary of dataframes containing all the results for the active variables including

  • - coord (coordinates for the variables/categories)

  • - contrib (contributions for the variables/categories)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_disca.summaryDISCA(self, digits=3, nb_element=10, ncp=3, to_markdown=False, tablefmt='pipe', **kwargs)[source]

Printing summaries of Discriminant Correspondence Analysis model

param self:

type self:

an object of class DISCA

param digits:

type digits:

int, default = 3. Number of decimals printed

param nb_element:

type nb_element:

int, default = 10. Number of elements

param ncp:

type ncp:

int, default = 3. Number of components

param to_markdown:

type to_markdown:

Print DataFrame in Markdown-friendly format.

param tablefmt:

type tablefmt:

Table format. For more about tablefmt, see : https://pypi.org/project/tabulate/

param **kwargs:

type **kwargs:

These parameters will be passed to tabulate.

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_lda module

discrimintools.get_lda.get_lda(self, choice='ind')[source]

Extract the results - LDA

param self:

type self:

an object of class LDA

param choice:

returns:
  • a dictionary or a pandas dataframe

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_lda.get_lda_coef(self)[source]

Extract coefficients - LDA

param self:

type self:

an object of class LDA

returns:
  • a pandas dataframe containing coefficients

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_lda.get_lda_cov(self)[source]

Extract the results for variables - LDA

param self:

type self:

an object of class LDA

returns:
  • a dictionary of dataframes containing all the results for the variables

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_lda.get_lda_ind(self)[source]

Extract the results for individuals - LDA

param self:

type self:

an object of class LDA

returns:
  • a dictionary of dataframes containing all the results for the active individuals including

  • - scores (scores for the individuals)

  • - generalied_dist2 (generalized distance)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.get_lda.summaryLDA(self, digits=3, nb_element=10, to_markdown=False, tablefmt='pipe', **kwargs)[source]

Printing summaries of Linear Discriminant Analysis model

param self:

type self:

an object of class LDA

param digits:

type digits:

int, default = 3. Number of decimals printed

param nb_element:

type nb_element:

int, default = 10. Number of elements

param to_markdown:

type to_markdown:

Print DataFrame in Markdown-friendly format.

param tablefmt:

type tablefmt:

Table format. For more about tablefmt, see : https://pypi.org/project/tabulate/

param **kwargs:

type **kwargs:

These parameters will be passed to tabulate.

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

discrimintools.lda module

class discrimintools.lda.LDA(target=None, features=None, priors=None)[source]

Bases: BaseEstimator, TransformerMixin

Linear Discriminant Analysis (LDA)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Develops a discriminant criterion to classify each observation into groups

Parameters:

target : The values of the classification variable define the groups for analysis.

features : list of quantitative variables to be included in the analysis. The default is all numeric variables in dataset

priors : The priors statement specifies the prior probabilities of group membership.
  • “equal” to set the prior probabilities equal,

  • “proportional” or “prop” to set the prior probabilities proportional to the sample sizes

  • a pandas series which specifies the prior probability for each level of the classification variable.

returns:
  • call_ (a dictionary with some statistics)

  • coef_ (DataFrame of shape (n_features,n_classes_))

  • intercept_ (DataFrame of shape (1, n_classes))

  • summary_information_ (summary information about the variables in the analysis – the number of observations, the number of quantitative variables in the analysis, and the number of classes in the classification variable; the frequency of each class is also displayed)

  • ind_ (a dictionary of pandas dataframes containing all the results for the active individuals (coordinates))

  • statistics_ (statistics)

  • classes_ (classes information)

  • cov_ (covariances)

  • model_ (string. The model fitted = ‘lda’)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

References

SAS Documentation, https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_discrim_overview.htm

Ricco Rakotomalala, Pratique de l’analyse discriminante linéaire, Version 1.0, 2020
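A minimal usage sketch for LDA on hypothetical quantitative data; column names and values are assumptions.

    import pandas as pd
    from discrimintools.lda import LDA

    # hypothetical quantitative predictors plus the class column "group"
    D = pd.DataFrame({
        "x1": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
        "x2": [3.5, 3.0, 2.9, 2.7, 3.0, 3.2],
        "group": ["a", "a", "b", "b", "a", "b"],
    })

    lda = LDA(target="group", priors="prop")
    lda.fit(D)

    print(lda.coef_)          # discriminant coefficients, one column per class
    print(lda.pred_table())   # confusion matrix of the training predictions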

decision_function(X)[source]

Apply decision function to an array of samples

The decision function is equal (up to a constant factor) to the log-posterior of the model, i.e. log p(y = k | x). In a binary classification setting this instead corresponds to the difference log p(y = 1 | x) - log p(y = 0 | x).

param X:

DataFrame of samples (test vectors).

type X:

DataFrame of shape (n_samples_, n_features)

returns:

C – Decision function values related to each class, per sample. In the two-class case, the shape is (n_samples_,), giving the log likelihood ratio of the positive class.

rtype:

DataFrame of shape (n_samples_,) or (n_samples_, n_classes)

fit(X, y=None)[source]

Fit the Linear Discriminant Analysis model

param X:

Training Data

type X:

pandas/polars DataFrame,

Returns:

self : object

Fitted estimator

fit_transform(X)[source]

Fit to data, then transform it

Fits the transformer to X and returns a transformed version of X.

Parameters:

X : DataFrame of shape (n_samples_, n_features_+1)

Input samples

returns:

X_new – Transformed data.

rtype:

DataFrame of shape (n_rows, n_classes_)

pred_table()[source]

Prediction table

Notes

pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. Correct predictions are along the diagonal.

predict(X)[source]

Predict class labels for samples in X

param X:

The data matrix for which we want to get the predictions.

type X:

DataFrame of shape (n_samples_, n_features_)

Returns:

y_pred : ndarray of shape (n_samples,)

Vector containing the class label for each sample

predict_proba(X)[source]

Estimate probability

param X:

Input data.

type X:

DataFrame of shape (n_samples_,n_features_)

Returns:

C : DataFrame of shape (n_samples_, n_classes_)

Estimated probabilities

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

param X:

Test samples.

type X:

DataFrame of shape (n_samples_, n_features)

param y:

True labels for X.

type y:

array-like of shape (n_samples,) or (n_samples, n_outputs)

param sample_weight:

Sample weights.

type sample_weight:

array-like of shape (n_samples,), default=None

returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

rtype:

float

transform(X)[source]

Project data to maximize class separation

param X:

Input data

type X:

DataFrame of shape (n_samples_, n_features_)

Returns:

X_new : DataFrame of shape (n_samples_, n_classes_)

Transformed data

discrimintools.pcada module

class discrimintools.pcada.PCADA(n_components=None, target=None, features=None, priors=None, parallelize=False)[source]

Bases: BaseEstimator, TransformerMixin

Principal Components Analysis - Discriminant Analysis (PCADA)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Performs principal components analysis - discriminant analysis

Parameters:

n_components : number of dimensions kept in the results

target : The values of the classification variable define the groups for analysis.

features : list of quantitative variables to be included in the analysis. The default is all numeric variables in dataset

priors : The priors statement specifies the prior probabilities of group membership.
  • “equal” to set the prior probabilities equal,

  • “proportional” or “prop” to set the prior probabilities proportional to the sample sizes

  • a pandas series which specifies the prior probability for each level of the classification variable.

parallelize : boolean, default = False
Whether the model should be parallelized
  • If True : parallelize using mapply

  • If False : parallelize using apply

Returns:

coef_ : DataFrame of shape (n_features,n_classes_)

intercept_ : DataFrame of shape (1, n_classes)

lda_model_ : linear discriminant analysis model

factor_model_ : principal components analysis model

projection_function_ : projection function

coef_ : pandas dataframe of shape (n_categories, n_classes)

intercept_ : pandas dataframe of shape (1, n_classes)

model_ : string. The model fitted = ‘pcada’

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

References:

Ricco Rakotomalala, Pratique de l’analyse discriminante linéaire, Version 1.0, 2020
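A minimal usage sketch: principal components analysis followed by LDA on hypothetical quantitative data.

    import pandas as pd
    from discrimintools.pcada import PCADA

    # hypothetical quantitative predictors plus the class column "group"
    D = pd.DataFrame({
        "x1": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
        "x2": [3.5, 3.0, 2.9, 2.7, 3.0, 3.2],
        "group": ["a", "a", "b", "b", "a", "b"],
    })

    pcada = PCADA(n_components=2, target="group", priors="prop")
    pcada.fit(D)

    print(pcada.lda_model_)                      # the LDA fitted on the principal components
    scores = pcada.transform(D[["x1", "x2"]])    # coordinates used for classification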

fit(X, y=None)[source]

Fit the Principal Components Analysis - Discriminant Analysis model

Parameters:

X : pandas/polars DataFrame of shape (n_samples, n_features+1)

Training data

y : None

Returns:

self : object

Fitted estimator

fit_transform(X)[source]

Fit to data, then transform it

Fits transformer to X and returns a transformed version of X.

param X:

Input samples.

type X:

DataFrame of shape (n_samples, n_features+1)

returns:

X_new – Transformed array.

rtype:

DataFrame of shape (n_samples, n_features_new)

pred_table()[source]

Prediction table

Notes

pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. Correct predictions are along the diagonal.

predict(X)[source]

Predict class labels for samples in X

Parameters:

X : DataFrame of shape (n_samples, n_features)

The dataframe for which we want to get the predictions

Returns:

y_pred : DataFrame of shape (n_samples, 1)

DataFrame containing the class labels for each sample.

predict_proba(X)[source]

Estimate probability

param X:

Input data

type X:

DataFrame of shape (n_samples, n_features)

Returns:

C : DataFrame of shape (n_samples, n_classes)

Estimated probabilities

score(X, y, sample_weight=None)[source]

Return the mean accuracy on the given test data and labels

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

param X:

Test samples.

type X:

array-like of shape (n_samples, n_features)

param y:

True labels for X.

type y:

array-like of shape (n_samples,) or (n_samples, n_outputs)

param sample_weight:

Sample weights.

type sample_weight:

array-like of shape (n_samples,), default=None

returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

rtype:

float

transform(X)[source]

Project data to maximize class separation

Parameters:

X : DataFrame of shape (n_samples, n_features)

Input data

Returns:

X_new : DataFrame of shape (n_samples, n_components_)

discrimintools.revaluate_cat_variable module

discrimintools.revaluate_cat_variable.revaluate_cat_variable(X)[source]

Revaluate Categorical variable

param X:

type X:

pandas DataFrame of shape (n_rows, n_columns)

returns:

X

rtype:

pandas DataFrame of shape (n_rows, n_columns)
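A small call sketch; the exact relabelling rule is not documented above, so only the input and output types are shown (hypothetical data).

    import pandas as pd
    from discrimintools.revaluate_cat_variable import revaluate_cat_variable

    X = pd.DataFrame({
        "color": ["red", "white", "red"],
        "body":  ["full", "light", "light"],
    })

    X_new = revaluate_cat_variable(X)   # DataFrame of the same shape with revalued categories
    print(X_new)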

discrimintools.stepdisc module

class discrimintools.stepdisc.STEPDISC(model=None, method='forward', alpha=0.01, lambda_init=None, model_train=False, verbose=True)[source]

Bases: BaseEstimator, TransformerMixin

Stepwise Discriminant Analysis (STEPDISC)

Description

This class inherits from sklearn BaseEstimator and TransformerMixin class

Performs a stepwise discriminant analysis to select a subset of the quantitative variables for use in discriminating among the classes. It can be used for forward selection or backward elimination.

param model:

an object of class LDA or CANDISC

param method:

the feature selection method to be used:
  • “forward” for forward selection,

  • “backward” for backward elimination

param alpha:

the significance level for adding or retaining variables in stepwise variable selection, default = 0.01

param lambda_init:

initial Wilks' Lambda, default = None

param model_train:

boolean; whether the model should be retrained with the selected variables

param verbose:

boolean, default = True
  • if True, print intermediary steps during feature selection (default)

  • if False, do not print intermediary steps

returns:
  • call_ (a dictionary with some statistics)

  • results_ (a dictionary with stepwise results)

  • model_ (string. The model fitted = ‘stepdisc’)

Author(s)

Duvérier DJIFACK ZEBAZE duverierdjifack@gmail.com

References

SAS documentation, https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_stepdisc_overview.htm

Ricco Rakotomalala, Pratique de l’analyse discriminante linéaire, Version 1.0, 2020
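A hedged selection sketch: an LDA model is fitted first and then passed to STEPDISC; the data are hypothetical, and triggering the selection through fit() is an assumption (no method list is shown above).

    import pandas as pd
    from discrimintools.lda import LDA
    from discrimintools.stepdisc import STEPDISC

    D = pd.DataFrame({
        "x1": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.2, 6.1],
        "x2": [3.5, 3.0, 2.9, 2.7, 3.0, 3.2, 3.3, 2.8],
        "x3": [1.4, 1.3, 4.7, 4.1, 5.9, 5.1, 1.5, 4.5],
        "group": ["a", "a", "b", "b", "a", "b", "a", "b"],
    })

    lda = LDA(target="group", priors="prop").fit(D)

    step = STEPDISC(model=lda, method="backward", alpha=0.01, model_train=True, verbose=True)
    step.fit(D)              # assumed entry point; the stepwise results are stored in results_
    print(step.results_)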

discrimintools.text_label module

discrimintools.text_label.text_label(texttype, **kwargs)[source]

Function to choose between geom_text and geom_label

param text_type:

{“text”, “label”}, default = “text”

param **kwargs:

geom parameters

Module contents