pydiodon module

what it is:

a python library for linear dimension reduction.

Content:

The library is organised into several functional blocks, as follows:

block

functions

what it does

Core

mat_evd() svd_grp()

eigen-values -vectors of a matrix SVD with random projection

Methods

coa() mds() pca() pca_core() pca_met() pca_iv()

Correspondence Analysis Multidimensional Scaling Principal Component Analysis core method for PCA PCA with metrics PCA with latent variables

Pretreatments

bicenter() center_col() center_row() dis3gram() scale()

bicentering columnwise centering rowise centering computes the Gram matrix from distances scales columnwise

Plotting

Interpretation

qual_proj()

quality of a projection

I/O

loadfile()

generic function for loading datasets

Identity card

maintainer

Alain Franc

mail

alain.franc@inrae.fr

contributors
  • Olivier Coulaud

  • Alain Franc

  • Jean-Marc Frigerio

  • Romain Peressoni

  • Florent Pruvost

started

21/02/17

version

22.10.28

release

0.0.1

licence

GPL-3

This file is part of diodon project. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

pydiodon.bicentering(A)[source]

bicentering a matrix

Parameters
Aa n x p numpy array

the matrix to be bicentered

Returns
ma float

global mean of the matrix

ra 1D numpy array (n values)

rowise means

ca 1D numpy arrau (p values)

columnwise means

Ra 2D numpy array (n x p)

bicentered matrix

Notes

If A is a matrix, builds a matrix R of same dimension centered on rows and columns.

\(m = (1/np) \sum_{i,j} a_{ij}\)

\(r = (r_1,...,r_n)\) with \(r_i = (1/p)\sum_j a_{ij} - m\)

\(c = (c_1,...,c_p)\) with \(c_j = (1/n)\sum_i a_{ij} - m\)

such that

\(\sum_ir_i = \sum_jc_j=0\)

and

for any row i, \(\sum_j R_{ij}=0\) and for any column j, \(\sum_i R_{ij}=0\)

The matrix R can describe interactions in a additive model with two categorical variables, as in

\(a_{ij} = m + r_i + c_j + R_{ij}\)

example

>>> import pydiodon as dio
>>> import numpy as np
>>> n = 4 ; p = 3
>>> A = np.arange(n*p)
>>> A.shape=(n,p)
>>> m, r, c, R = dio.bicenter(A)

af, revised 21.02.19, 22.09.21, 22.10.13

pydiodon.center_col(A)[source]

centers a matrix columnwise

Parameters
Aa 2D numpy array n x p

the matrix to be centered

Returns
Aca 2D numpy array

n x p, centered columnwise

m1D numpy array, p values

means per column

example
af, rp, 21.02.19, 22.10.13
pydiodon.center_row(A)[source]

centers a matrix row-wise

Parameters
Aa 2D numpy array, n x p

the matrix to be centered

Returns
Aca 2D n x p numpy array

A centered row-wise

ma 1D numpy array (n values)

means per row

Example
af & rp, 21.02.19, 22.10.13
pydiodon.centering_operator(m)[source]

Centering operator for a matrix

will probably be deprecated (22.09.21)

argument

m : an integer ; vector or matrix dimension

returns

H: a numpy array, m x m

Notes

If A is a n x p matrix

  • \(HA\) is A with centered columns (H is n x n)

  • \(AH\) is A with centered row (H is p x p)

example

>>> import pydiodon as dio
>>> import numpy as np
>>> n = 4 ; p = 3
>>> A = np.arange(n*p)
>>> A.shape = (n,p)
>>> H_row = dio.centering_operator(p)
>>> H_col = diod.centering_operator(n)
>>> Ar = A @ H_row
>>> Ac = H_col @ A

revised 21.02.27

pydiodon.coa(X, k=- 1, meth='svd', transpose=False)[source]

what it does

Correspondance Analysis of an array X

Parameters
Xa 2D numpy array, n x p

array to be analysed

kan integer

number of axis to compute

metha string

method for numerical computing

Returns
La 1D numpy array

vector of eigenvalues

Y_ra 2D numpy array,`n x k`

coordinates of row points

Y_ca 2D numpy array, p x k

coordinates of column points

Notes

If \(k=-1\), all axis are computed. If \(k > 0\), only k first axis and components are computed.

methods for PCA core

evd

EVD

svd

SVD with numpy.linalg.svd()

grp

SVD with gaussian random projection

Example

>>> import pydiodon as dio
>>> import numpy as np
>>> dataset = "../data4tests/coa4tests.txt"
>>> A, headers, rownames = dio.load(dataset)
>>> L, Y_r, Y_c = dio.coa(A)

note on the example

The dataset is in text format with tabs as delimiters. It contains headers and row names. These are default parameters for fucntion dio.load()

Reference:

Nenadic & Greenacre, Journal of Statistical Software, 20(3): 2-13, 2007

Lebart, Morineau & Fénelon, 1982, pp. 305-320

revised 21.02.21

pydiodon.dis2gram(dis, no_loop=True)[source]

Computes a Gram matrix gram knowing a distance matrix dis

Parameters
dis: a 2D numpy array, size `n` x `n`, of floats

is a distance or dissimilarity matrix to be analyzed

Returns
grama 2D numpy array, size n x n, of floats

the associated Gram matrix

Notes

if \(d_{i.}^2 = (1/n) \sum_j d_{ij}^2\) and \(d_{..}^2 = (1/n^2) \sum_{i,j} d_{ij}^2\)

then

\(gram[i,j] = -(1/2) (d_{ij}^2 - d_{i.}^2 - d_{.j}^2 + d_{..}^2)\)

af, 07/11/2017, 22.10.13

pydiodon.hovering(F1, F2, label, prefix=None, c='b', s=20)[source]

what it does

display a label of one dot selected with the mouse in a scatter plot

Parameters
F11D numy array

first axis

F21D numpy array

second axis

prefixforgotten
cstring

forgotten

sinteger

forgotten

Notes

A scatter plot with F1 on axis 1 and F2 on axis 2 is displayed. When the mouse clicks on a point, a comment is displayed on the console with the index of the item cliked, and the value of the label for this item

This is event based, i.e. cliking with the mouse can be repeated as many times as wished.

Adapted from http://matplotlib.sourceforge.net/examples/event_handling/pick_event_demo.html

af, 25/10/2017

pydiodon.load_ascii(filename, delimiter, colnames, rownames, dtype)[source]

Loads an ascii file as a numpy array

Parameters
filenamestring

the name of the file to load

delimitercharacter

delimiters between values in a row

colnamesboolean

True if there are column names in the first row of the file False otherwise

rownamesboolean

True if there are rownames in the first column of the file False otherwise

dtypestring

as dtype in numpy.loadtxt (type of values in the numpy array)

Returns
Aa 2D numpy array

the values in the file

rnlist of strings

row names (if rownames==True)

cnlist of strings

column names (if colnames==True)

pydiodon.load_hdf5(filename, datasetname, colnames, rownames)[source]

loads a hdf5 file as a numpy array

Parameters
filenamestring

the name of the file

datasetnamestring

the name of the hdf5 dataset with data values

Returns
see loadfile
pydiodon.loadfile(filename, fmt=None, delimiter='\t', rownames=False, colnames=False, datasetname='values', dtype='float32')[source]

A generic function for loading datasets as numpy arrays

Parameters
filenamestring

contains the data set to be loaded (compulsory)

fmtstring

explicit format of the file (optional); if it is not given the format will be guessed from the suffix (see notes)

delimitercharacter

the delimiter between values in a row

colnamesboolean or string; whether column names
  • for an ascii file, it is boolean, as True if column names are as first row in the file, and False otherwise;

  • for an hdf5 file, gives the name of the dataset with the column names

  • optional, default value is False

rownamesboolean or string; whether row names
  • for an ascii file, it is boolean, as True if row names are as first column in the file, and False otherwise

  • for an hdf5 file, gives the name of the dataset with the row names

  • optional, default value is None.

datasetnamestring
  • for hdf5 files : hdf5 dataset for values

  • optional, default value is “value”

Returns
Aa numpy array

the values of the data set

rnlist of strings

row names (optional)

cnlist of strings

column names (optional)

Notes

  • Recognized formats are: ascii, hdf5 and compressed ascii.

  • Delimiters in ascii format can be blanks, comma, semi-columns, tabulations

  • Ascii data sets with tab delimiters are expected to be with suffix .txt or .tsv.

  • Ascii data sets with other delimiters are expected to be with siffix .csv.

When the filename is read, the function splits the name on the last dot, and interprets the string after as the suffix. Then, there is a call

  • to load_ascii() if the suffix is .txt, tsv, .gz or .bz2,

  • to load_hdf5 if the suffix is h5 or hdf5,

  • and unzips the file before a call to load_ascii() if the sufix is zip.

Examples

Here is a call for loading an ascii file with extension .txt hence with tab as delimiters, and with rownames and colnames. In such a case, the call must specifiy that there are colnames and rownames to be read:

>>> import pydiodon as dio
>>> filename = "pca_template_withnames.txt"
>>> A, colnames, rownames = dio.loadfile(filename, colnames=True, rownames=True)
>>> print(A)
>>> print(colnames)
>>> print(rownames)

If it is not specified (default values), an array without colnames and rownames will be loaded, as in

>>> import pydiodon as dio
>>> filename = "pca_template_nonames.txt"
>>> A = dio.loadfile(filename)

Here is a call of a .csv file where delimiters have to be specified:

>>> import pydiodon as dio
>>> filename = "pca_template_nonames.csv"
>>> A = dio.loadfile(filename, delimiter = ";") 

Here is how to load a zip file from a .txt file:

>>> import pydiodon as dio
>>> filename = "pca_template_nonames.txt.zip"
>>> A = dio.loadfile(filename)
>>> print(A)

and from a .csv file with semi-column as delimiter

>>> import pydiodon as dio
>>> filename = "pca_template_nonames.csv.zip"
>>> A = dio.loadfile(filename, delimiter=";")

Here is an example for loading a hdf5 file with values, colnames and rownames as datasets:

>>> import pydiodon as dio
>>> filename = "pca_template_withnames.h5"
>>> A, colnames, rownames = dio.loadfile(filename, colnames='colnames', rownames='rownames') 

version 21.03.23

pydiodon.mat_evd(mat, k=- 1)[source]

Computes the eigenvalues and eigenvectors of a matrix

Parameters
mata numpy array, n x n (a matrix)
kan integer

the number of axis to be computed

Returns
Va 2D numpy array

of size n x k: the k eigenvectors

La 1D numpy array

a vector ; the first k eigenvalues

Notes

if k=-1, all eigenvalues and eigenvectors are returned.

This simply uses numpy.linalg.eigh(), and adds some features like

  • sorts eigenvalues in decreasing order

  • sorts eigenvectors in corresponding order

  • sets to zero the imaginary parts (numerical approximation)

Example

>>> import pydiodon as dio
>>> import numpy as np
>>> # building a symmetric matrix
>>> n = 5
>>> A = np.random.random((n,n))
>>> A = (A + A.T)/2
>>> # computing eigenvalues and eigenvectors
>>> V, L = dio.mat_evd(A)

version 21.03.21, 22.09.21, 22.10.13, 22.10.25

pydiodon.mds(dis, k=- 1, meth='svd', no_loop=True)[source]

what it does

Multidimensional Scaling of a distance or dissimilarity array

Parameters
disnumpy array, n x n

distances or dissimilarities

kinteger

number of axis

methstring

method for MDS (see notes)

no_loopboolean

see notes

Returns
X2D numpy array,

n x k, coordinates of points

S1D numpy array

k values, singular values |

Notes

Objective: builds a point cloud where one point represents an item, such that the discrepancies beween distances in a matrix dis and between associated points in \(R^k\) is as small as possible. It is classical MDS, not Least Square Scaling.

Procedure: There are three steps
1. computes the Gram matrix G between items from distance matrix dis
2. runs a SVD or EVD of G
3. computes the coordinates from the SVD (or EVD) in X
argument no_loop : it is used in the computation of the Gram matrix:
- if n is large, no_loop=False is recommended
- if not, no_loop=True is recommended (default value)
methods for MDS: the argument meth specifies which method is selected for the core of MDS. Default value is svd. Let G be the Gram matrix associated to the distance array.
- if meth=svd, the a SVD of G is run
- if meth=grp, the SVD is run with Gaussian Random Projection
- if meth=evd, the eigenvalues and eigenvectors of G are computed
Here are some suggestions for selection of the method:
- if n is not too large (\(n < 10,000\)), svd method is recommended
- if n is large, grp method is recommended, and compulsory if n is very large
- if k is small, \(k \simeq 10\), evd method is recommended.
computations : EVD or SVD of the Gram matrix are two equivalent ways to compute a solution. Let G be the Gram matrix. They are linked by
- EVD: \(GU = US\) ; U is the matrix of columnwise eigenvectors of G and S the diagonal matrix of its eigenvalues
- SVD: \(G = USU'\) is the SVD of G (which is symmetric, U’ is the transpose of U)
Then, in both cases, if X is the matrix of coordinates of n items in \(R^r\), \(X = US^{1/2}\)

Examples

This is a first toy example for running a MDS.

First, building a Euclidean distance matrix.

>>> import pydiodon as dio
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import scipy.spatial.distance as ssd # for computing distances
>>> # creates a m x n point cloud
>>> m = 10 ; n = 4
>>> A = np.random.randn(m,n)
>>> # computes the pairwise distances
>>> D = ssd.pdist(A)
>>> D = ssd.squareform(D)       # distance matrix in square form

Second, run the MDS

>>> # runs the MDS
>>> X, S = dio.mds(D)

Third, interpret the quality and displays the result

>>> # plotting the quality
>>> plt.plot(S, color="red")
>>> plt.title("Eigenvalues")
>>> plt.show()
>>> # plotting the point cloud
>>> F1 = X[:,0] ; F2 = X[:,1]
>>> plt.scatter(F1, F2, c="chartreuse", s=20)
>>> plt.xlabel("axis 1")
>>> plt.ylabel("axis 2")
>>> plt.show()

This example is in mds_xmpl_1.py

Here is now a real life example.

>>> import pydiodon as dio
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> #
>>> rank = 200
>>> filename = "guiana_trees.dissw"
>>> Dis, rn, cn = dio.load(filename, rownames=True, colnames=True)
>>> X, S = dio.mds(Dis, k=rank)

This is focused on running the MDS of Dis . Here is a complementary example for visualizing the results.

>>> # plotting the singular values
>>> plt.plot(S, color="red")
>>> plt.title("Quality along the axis")
>>> plt.show()
>>> # plotting the point cloud, axis 1 and 2
>>> F1 = X[:,0]; F2 = X[:,1]; F3 = X[:,2]
>>> plt.scatter(F1,F2,c="blue", s=5)
>>> plt.xlabel("axis 1")
>>> plt.ylabel("axis 2")
>>> plt.title("data set: Guiana trees")
>>> plt.show()

This example is in mds_xmpl_guiana_trees.py

References

[1] T.F. Cox and M. A. A. Cox. Multidimensional Scaling - Second edition, volume 88 of Monographs on Statistics and Applied Probability. Chapman & al., 2001.
[2] I. Borg and P. J. F. Groenen. Modern Multidimensional Scaling. Springer Series in Statistics. Springer, second edition, 2005.
[3] K. V. Mardia, J.T. Kent, and J. M. Bibby. Multivariate Analysis. Probability and Mathematical Statistics. Academic Press, 1979.

Diodon notes: section 10

07/11/2017 - revised 21.03.20, 22.10.13

pydiodon.pca(A, pretreatment='standard', k=- 1, meth='svd')[source]

Principal Component Analysis

Parameters
Aa numpy array

the array to be analyzed

kinteger

number of axes to be computed

methstring

method for numerical calculation (see notes)

pretreatmentstring

which pretreatment to apply ;

accepted values are: standard, bicentering, col_centering, row_centering, scaling

see notes for details

Notes

The method runs as follows:

  • first it implements the required pretreatments

  • second: it runs the function pca_core on the transformed matrix

  • third: it returns the eigenvalues, the principal axis, the principal components and the correlation matrix if required

methods for PCA: the argument meth specifies which method is selected for the core of MDS. Default value is svd. Let A be the the matrix to analyse.

  • if meth=svd, the a SVD of A is run

  • if meth=grp, the SVD is run with Gaussian Random Projection

  • if meth=evd, the eigenvalues and eigenvectors of A are computed

pretreatments: here are the accepted pretreatments:

  • standard: the matrix is centered and scaled columnwise

  • bicentering: Matrix is centered rwowise and columnwise; it is a useful alternative to CoA known as “double averaging”

Examples

This is an example of a standard PCA of a random matrix, with \(m\) rows and \(n\) columns, with elements as realisation of a uniform law between 0 and 1.

First build the random matrix

>>> import pydiodon as dio
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> m = 200 ; n = 50
>>> A = np.random.random((m,n))

Second, run the PCA

>>> L, V, Y = dio.pca(A)

Third, plots some results (eigenvectors and point cloud)

>>> plt.plot(L) ; plt.show()
>>> plt.scatter(Y[:,0],Y[:,1]) ; plt.show()

The above program runs centered scaled PCA, with here default option pretreatment="standard". For PCA without centering nor scaling, the command is

>>> L, V, Y, C = dio.pca(A, standard=False)

For PCA with column centering but without scaling, the command is

>>> L, V, Y, C = dio.pca(A, standard=False, col_centering=True) 

(in such a case, the argument standard must be set to False. If not, the array will be scaled as well). Scaling without centering is quite unusual.

For bicentering, the command is

firefox
>>> L, V, Y, C = dio.pca(A, bicenter=True)      

These are the most usuful options for pretreatment.

Prescribed rank is simply called by (with standard pretreatment)

>>> rank = 10
>>> L, V, Y = dio.pca(A, k=rank)

for having the 10 first components and axis only.

revised 21.03.03 - 21.04.20

pydiodon.pca_core(A, k=- 1, meth='svd')[source]

what it does

core method for PCA (Principal Component Analysis) of an array A

Parameters
Aa n x p numpy array,

array to be analysed

kan integer

number of axis to compute

metha string

method for numerical computing

Returns
Ya 2D numpy array, n x k

matrix of principal components

La 1D numpy array

vector of eigenvalues

Va 2D numpy array, p x k

matrix of eigenvectors (new basis)

Notes

A is an array, and pca_core computes the PCA of A, without any centering nor scaling nor weights nor constraints. PCA with centering, scaling, weights or constraints is called by function pca() which in turns calls this function pca_core().

if k = -1, all axis are computed. If k > 0, only k first axis and components are computed.

if meth is

  • evd, runs by eigendecomposition of A’A

  • svd, runs by singular value decomposition of A

  • grp, runs by SVD of A with Gaussian random projection

Default value is svd.

With EVD, it runs as follows:
1. Computes the correlation matrix \(C=A'A\)
2. Computes the eigevalues and eigevectors of \(C\): \(Cv_j = \lambda_j v_j\)
3. Computes the principal components as \(y_j=Av_j\), or, globally, \(Y=AV\)
with SVD, it runs as follows:
1. \(U,\Sigma,V = SVD(A)\)
2. \(Y = U\Sigma\)
3. \(L=\Sigma^2\)

Example

This is an example of a standard PCA of a random matrix, with \(m\) rows and \(n\) columns, with elements as realisation of a Gaussian law with mean 0 and standrd deviation 1. As such, there is no need to center nor to scale, and the use of pca_core() is relevant.

>>> import pydiodon as dio
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> m = 200 ; n = 50
>>> A = np.random.randn(m,n)
>>> L, V, Y = dio.pca_core(A)
>>> plt.plot(L) ; plt.show()
>>> plt.scatter(Y[:,0],Y[:,1]) ; plt.show()

af, revised 21.03.01; 21.06.27; 22.09.23, 22.10.13

pydiodon.pca_evd(A)[source]

runs the PCA with eigendecomposition of A’A

Notes

No documentation (used behind pca_core) 21/06/27

pydiodon.pca_grp(A, k)[source]

runs the PCA with SVD with gaussian random projection

Notes

No documentation (used behind pca_core) 21.06.27

pydiodon.pca_iv(A, B, k=- 1, meth='svd', pretreatment='col_centering', transpose=False)[source]

what it does

PCA of an array A with instrumental variables

Parameters
Aa numpy array, the items/variable to be analyzed
Ba numpy array, the intrumental variables
kan integer
metha string ; method for numerical calculation
pretreatmenta string, the pretreatment of A and B

currently, only `col_centering`is implemented

Returns
Ya n x k 2D numpy array

the components

La k 1D numpy array

the eigenvalues

Va p x k 2D numpy array

the new basis

A_proja n x p 2D numpy array

the projection of A on the subspace of \(R^n\) spanned by the columns of B

Notes

A_pre is a copy of A for tracking preatreatments, and recover the matrix analyzed after preatreatments, without impacting A

The algorithm is as follows:

  • Build \(P = B(B'B)^{-1}B'\) which is the projector in \(R^n\) on the subspace spanned by the columns of B.

  • build A_proj = PA which is the projection of A on the subspace spanned by the columns of B

  • do PCA of A_proj : Y,L,V = PCA(A_proj)

warning

This has not been tested; no guarantee on the quality of the result.

21/02/2018, révised 22.09.28, 22.10.14

pydiodon.pca_iv_qual_proj(A, A_proj)[source]

what it does

Computes the quality of the projection of A on the space spanned by the columns of B

Parameters
Aa n x p 2D numpy array

the matrix to be analyzed by PVA_iv

A_proja n x p 2D numpy array

the projection of A on the subspace of \(R^n\) spanned by the columns of B

Returns
rhoa real

the quality of the projection

Notes

rho is the ratio of the square norm of A_proj upon the square norm of A

22.09.28

pydiodon.pca_iv_qual_proj_axis(A, A_proj, title=None, x11=True, imfile=None, fmt='png')[source]

what it does Computes and plots the quality of projection of each column of A on Ap

Notes

Documentation ongoing

af, @ Cayenne, 21/02/2018

pydiodon.pca_iv_regressors(A, Q, V, varnames, n_axis=3, x11=True, imfile=None, fmt='png')[source]

what it does

will probably be deprecated

Notes

Documentation ongoing

22.10.26

pydiodon.pca_met(A, M, N, k=- 1, pretreatment='col_centering', meth='svd')[source]

what it does

PCA of an array A with metrics on rows and/or columns

Parameters
Aa \(n \times p\) 2D numpy array

the array to be analysed

Na 2D numpy array

a Symmetric Definite Positive (SDP) matrix defining an inner product on \(R^n\)

Pa 2D numpy array

a SDP matrix defining an inner product on \(R^p\)

kan integer

the number of axis to be computed

if \(k = -1\), all axis are computed

pretreatmenta string

the pretreatment to apply

possible values are: ``col_centering ``

metha string

method for numerical calculation of PCA

Returns
Ya \(n \times k\) 2D numpy array

the princial components

La k 1D numpy array

the eigenvalues

Va \(p \times k\) 2D numpy array

the new basis

Notes

If \(A\) has \(n\) rows and p columns, we assume that an inner product has been defined on \(R^n\) by a SDP matrix N and on \(R^p\) by a SDP matrix P.

This is how the algorithm works:

  • first: computes the square roots of N and P; this is done with a SVD. If

\(N = M^2\) and \(N = U \Sigma U^t\) (\(U=V\) in the SVD because N is symmetric), then \(M = U \Sigma^{1/2} U^t\); simlarily, if \(P=Q^2\), Then \(Q = V\Phi^{1/2} V^t\) if the SVD of P is \(P = V \Phi V^t\).

  • second: computes \(R = MAQ\)

  • third: run the PCA of R: \((Z, \Lambda, X) = PCA(R)\)

  • fourth: computes \(M^{-1}\) and \(Q^{-1}\) as \(M^{-1}=U \Sigma^{-1/2} U^t\) and

\(Q^{-1}=V \Phi^{-1/2} V^t\)

  • fifth: computes \(Y = M^{-1}Z\) and \(V = Q^{-1}X\)

The result is \((Y, V, \Lambda)\)

Let us note that numpy contains a function, called scipy.linalg.sqrtm which computes the square root of a square n x n matrix, from algorithm published in Deadman, E., Higham, N. J. and Ralha, R. (2013) “Blocked Schur Algorithms for Computing the Matrix Square Root, Lecture Notes in Computer Science, 7782. pp. 171-182, which is not used here. Indeed, using the SVD permits to compute easily \(M=N^{1/2}\) and \(M^{-1}=N^{-1/2}\), and the same for P.

Warnings

  • Pretreatment has not been implemented, but centering by column

  • still under work - not frozen for a release

Diodon Documentation

  • see section 6, and algorithm pca_met() in section 11.

28/12/2017, revised 22.09.28

pydiodon.pca_svd(A)[source]

runs the PCA with SVD of A

Notes

No documentation (used behind pca_core) 21/06/27

pydiodon.plot_coa(Y_r, Y_c, axis_1=1, axis_2=2, col_dsize=20, row_dsize=20, row_col='blue', col_col='red', rownames=[], colnames=None, title=None, x11=True, plotfile=None, fmt='png')[source]

what it does

Scatter plot of the result of a Correspondence Analysis

Parameters
Y_ra 2D numpy array

the principal components for the rows

Y_ca 2D numpy array

the principal components for the columns

axis_1an integer

the first axis of the scatter plot (starting at 1)

axis_2an integer

the second axis of the scatter plot (starting at 1)

col_dsizean integer
dot size for dots associated with columns
default value is 20
row_dsizean integer
dot size for dots associated with rows
default value is 20
row_cola string
color for dots associated with rows
default value is blue
col_cola string
color for dots associated with columns
default value is red
rownamesa list
list of names of rows to be plotted
default value is [] (no name is plotted)
colnamesa list
list of names of columns to be plotted
defalt value is [] (no name is plotted)
titlea string
the title of the plot
default value is None (no title)
x11a boolean
whether to display the plot on the screen
default value is True (plot displayed)
plotfilea string
the name of the file where to save the plot
default value is None (plot not saved)
fmta string
format of the file where the plot is saved
possible values are : png, eps, pdf
default value is png
Returns
a scatter plot

Notes

The numbering of the axis in calling the function is natural, and bagins at 1. It is translated to start at 0 as in python in the function itself.

af, 25/10/2017, revised 22.10.26

pydiodon.plot_components_heatmap(Y, axis_1=1, axis_2=2, bins=256, cmap='ocean', range=None, log=True, scale=False, title=None, x11=True, imfile=None, fmt='png')[source]

what it does:

builds the density heatmap of the point cloud of the components of a PCA

Parameters
Ya \(n \times k\) 2D numpy array

array of the components

binsan integer

the size of the image, in pixels

axis_1an integer
first axis to be plotted
columns in the image
axis_2an integer
second axis of Y
rows in the image
binsan integer
the size of the image
it is also the number of classes per axis in the 2D histogram
cmapa string
the matplotlib colormap for the heatmap
default value is ocean
classical default value in matplotlib is viridis
rangeis None

compulsory for callind 2D histogram

loga boolean
whether to translate the densities in a logarithmic scale
default value is True
scalea boolean
deprecated
default value is False
titlea string
the title of the heatmap
default value is None (no title)
x11a boolean
whether to display the plot on the screen
default value is True (plot displayed)
plotfilea string
the name of the file where to save the plot
default value is None (plot not saved)
fmta string
format of the file where the plot is saved
possible values are : png, eps, pdf
default value is png
Returns
a heatmap

Notes

Y is a numpy array of coordinates, as produced by the MDS

when two axis are selected

  • the two components are extracted

  • a 2D histogram of their values is computed with a call to np.histogram2d()

  • the heatmap is drawn with the values of the 2D histogram

af, 25/01/2020, revised 22.10.36

pydiodon.plot_components_quality(Y, Qual, axis_1=1, axis_2=2, r=2, cmap='ocean', diam=50, title=None, x11=True, plotfile=None, fmt='png')[source]

what it does Scatter plot of the components with each dot colored according to the cumulated quality of the item for a given axis

pydiodon.plot_components_scatter(Y, axis_1=1, axis_2=2, dot_size=20, color='red', cmap=None, names=[], title=None, x11=True, plotfile=None, fmt='png')[source]

Scatter plot of the result of a Principal Component Analysis

Parameters
Ya \(n \times k\) 2D numpy array,

the matrix of columnwise principal components

axis_1an integer, the first axis

axis are labelled from 1 to k

axis_2an integer, the second axis

second axis is labelled from axis_1 + 1 to k

dot_sizean integer

the size of dots in the plot, one dot per item

colora string

color of doats in the plot

namesa list of n strings

labels of dots in ythe plot

titlestring

title of the plot

x11boolean

the plot is displayed on the scvreen if x11=True

plotfilestring

name of the file to save the plot

the plot is not saved if plotfile=None

fmtstring

format of the file to save the plot.

accepted values are png, eps, pdf.

Returns
a scatter plot

Notes

None

af, 21/02/2018, revised 22.10.14

pydiodon.plot_correlation_matrix(C, x11=True, plotfile=None, fmt='png')[source]

plots a heatmap of the correlation matrix !!! ongoing work !!!

af, 22.10.27

pydiodon.plot_cumulated_quality_per_item(Y, r=2, col='blue', title=None, x11=True, plotfile=None, fmt='png')[source]

plots the quality per item

Parameters
Ya 2D \(n \times p\) numpy array

a matrix of principal components

ran integer, the axis for plotting cumulated quality

default value is 2

cola string
the color to use for plotting the quality
default value is blue
titlestring

title of the plot

x11boolean

the plot is displayed on the scvreen if x11=True

plotfilestring

name of the file to save the plot

the plot is not saved if plotfile=None

fmtstring

format of the file to save the plot.

accepted values are png, eps, pdf.

Returns
Qual1D \(r\) numpy array

the rowise cumulated quality per axis up to r.

Notes

For each row i of the matrix of components Y

  • computes \(Qual[i] = \sum_{j \leq r} Y[i,j]^2 / \sum_{k \leq p} Y[i,k]^2\)

  • once an axis r has been selected, draws the plot of \((i, Q(i,r))\) with i on x axis.

af, révisé 22.10.28

pydiodon.plot_eig(L, frac=True, cum=False, Log=False, dot_size=- 1, nb_val=- 1, col='red', title=None, x11=True, plotfile=None, fmt='png')[source]

Plots the singular or eigenvalues

Parameters
L1D numpy array

the eigenvalues of A’A

fracboolean

if True, the fraction of the norm of A’A is displayed per eigenvalue

cumboolean

if true, the cumulated eigenvalues are displayed can be combined with frac=True

LogBoolean

if True, the Log of the genvalues is displayed

dot_sizeinteger

if > 0, a dot is displayed per eigenvalue

nb_valinteger

number of eigen,values to display;

if nb_val=-1, all eigenvalues are displayed

colstring

color for the plot

titlestring

title on top of the plot;

if None, no title is displayed

x11Boolean

if True, the plot is displayed on screen

plotfilestring

if not None, the save is save in a file named plotfle

fmtstring

the format of the plot;

accepted formats are png, eps, pdf

Notes

  • The eigenvalues \(L_i\) are such that \(\sum_i L_i = ||A'A||^2\). Then, the quality of representation in axis i is \(L_i/\sum_j L_j\). This is displayed by setting frac=True.

  • the cumulated eigenvalues are simply given by \(\psi_i = \sum_{j \leq i}L_j\)

af, 25/10/2017 - revised 19.09.18, 22.10.14

pydiodon.plot_var(V, varnames=None, axis_1=1, axis_2=2, title=None, x11=True, plotfile=None, fmt='png')[source]

*what it does* Plots the correlations between the variables

Parameters
Va \(p \times p\) 2D numpy array

column k of V is the `k-`th principal axis

varnamesa list of strings
the names of the variables (the columns of A on which the PCA has been done)
default value is None (no names are displayed)
axis_1an integer
the first principal axis to be displayed
starts at 1
axis_2an integer

the second principal axis to be displayed

titlestring

title of the plot

x11boolean

the plot is displayed on the scvreen if x11=True

plotfilestring

name of the file to save the plot

the plot is not saved if plotfile=None

fmtstring

format of the file to save the plot.

accepted values are png, eps, pdf.

Returns
a plot
af, 22.10.28
pydiodon.plot_var_heatmap(V, varnames, cmap='viridis', title=None, x11=True, plotfile=None, fmt='png')[source]

what it does Plots a heatmap of the coordinates of the principal axis

Parameters
Va \(p \times p\) 2D numpy array

the coordonates of the principal axis in the basis of the variables

varnamesa list of strings
the names of the variables
(names of the columns of A on which PCA has been done)
cmapa string
the matplotlib colormap for drawing the heatmap
default value is viridis
titlestring

title of the plot

x11boolean

the plot is displayed on the scvreen if x11=True

plotfilestring

name of the file to save the plot

the plot is not saved if plotfile=None

fmtstring

format of the file to save the plot.

accepted values are png, eps, pdf.

Returns
a heatmap
af, 22.10.28
pydiodon.project_on(A)[source]

what it does:

Builds the projection operator on space spanned by the columns of A

Parameters
Aa \(n \times p\) 2D numpy array

the projector is on the space in \(\mathbf{R}^n\) spanned by the columns of A

Returns
Pa 2D numpy array, \(n \times n\)

the projector

Notes

This function uses the QR decomposition of A, which avoids to compute \((A'A)^{-1}\) which is costly (one product, one inverse). It proceeds as

  • A = QR

  • P = QQ’

Example

import numpy as np
import pydiodon as dio
n = 10 ; p = 5
A = np.random.randn(n,p)
P = project_on(A)

af, 22/02/2018, 22.10.13, 22.10.28

pydiodon.scale(A)[source]

what it does:

scales a matrix columnwise

Parameters
Aa 2D numpy array, n x p

the matrix to be scaled

Returns
Asa 2D numpy array

matrix A with scaled columns

aa 1D numpy array, p values

the norm of each column of A

Notes

It is recommended to center A columnwise before scaling.

Each coordinates \(a_{ij}\) of A is replaced by \(a_{ij} / ||a_{.j}||\) where \(||a_{.j}||\) is the norm of column j: \(||a_{.j}||^2 = \sum_i a_{ij}^2\)

Example

>>> import pydiodon as dio
>>> import numpy as np
>>> n = 4 ; p = 3
>>> A = np.arange(n*p)
>>> A.shape =(n,p)
>>> Ac, m = dio.center_col(A)  
>>> As, s = dio.scale(Ac)

af, 21.02.19, 22.10.13

pydiodon.svd_grp(A, k)[source]

SVD of a (large) matrix A by Gaussian Random Projection

Parameters
Aa 2D numpy array

the matrix to be studied (dimension \(n \times p\))

kan integer

the number of components to be computed

Returns
Ua 2D numpy array

(dimension \(n \times k\))

Sa 1D numpy array

(k values)

Va 2D numpy array

(dimension \(p \times k\))

Notes

Builds the SVD of A as \(A = U\Sigma V^t\) where \(\Sigma\) is the diagonal matrix with diagonal S and \(V^t\) is the transpose of V.

Here are the different steps of the computation:

  1. A is a \(n \times p\) matrix

  2. builds \(\Omega\), a \(p \times k\) random matrix (Gaussian)

  3. computes Y = A\(\Omega\) (dimension \(n \times k\))

  4. computes a QR decomposition of Y
    Y = QR, with
    Q is orthonormal (\(n \times k\))
    R is upper triangular (:math:`k times k)
  5. computes B = Q^t A

    B is :math:`k times p (with k < p) and its SVD is supposed to be close to the one of A

  6. \(U_B\), S, and V are SVD of B:

    \(B = U_B S V'\), S is diagonal

  7. then, \(U_A\), S and V are close to svd of A : \(A = U_A S V'\)

Sizes and computation of the matrices

matrix

value

size

A

input

\(n \times p\)

\(\Omega\)

random

\(p \times k\)

Y

\(A\Omega\)

\(n \times k\)

Q

Y = QR

\(n \times k\)

B

B = Q^t A

\(k \times p\)

\(U_B\)

\(B = U_BSV^t\)

\(k \times k\)

U

\(U = QU_B\)

\(n \times k\)

References

[1] Halko & al., 2011, SIAM Reviews, 53(2): 217-288

af, rp, fp 21.03.21; 22.06.16, 22.10.28

pydiodon.write_ascii(A, filename, delimiter='\t', colnames=False, rownames=False, np_fmt='%.18e', dtype='float32')[source]

Write an ascii file

Notes

Not expected to be used by the user (see writefile()) 22.10.28

pydiodon.write_hdf5(A, filename, colnames=False, rownames=False, datasetname='values', dtype='float32')[source]

Write a hdf5 file

Notes

No documentation because is not expected to be used by the user (see writefile()) 22.10.28

pydiodon.writefile(A, filename, fmt=None, delimiter='\t', colnames=False, rownames=False, datasetname='values', np_fmt='%.18e', dtype='float32')[source]

what it does

A generic function for saving a numpy array as a file

Parameters
Aa numpy array

the array to be saved as a file

filenamestring

the name of the file for saving the array (compulsory)

fmtstring

explicit format of the file (compulsory) one item among ascii, hdf5, zip ; see notes

delimitercharacter

the delimiter between values in a row

colnamesboolean or string

column names

for an ascii file, it is boolean, with True if column names are as first row in the file, and False otherwise

for an hdf5 file, gives the name of the dataset with the column names

optional
default value is None.
rownamesboolean or string

row names

for an ascii file, it is boolean, as True is row names are as first column in the file, and False otherwise

for an hdf5 file, gives the name of the dataset with the row names

optional
default value is None.
datasetnamestring
for writing files in hdf5 format only : hdf5 dataset for values
optional
default value is “value”
fmtstring

is exactly the parameter fmt of numpy.savetxt() for print format in ascii files to be done

dtypeto be done
Returns
writes a file

Notes

It writes the array in a file and returns no value. Parameters and choices are mirrored from function loadfile() as it is the reverse operation.

Possible formats are: ascii, hdf5 and compressed ascii. Delimiters in ascii format can be blanks, comma, semi-columns, tabulations. Ascii data sets with tab delimiters are expected to be with suffix .txt. Ascii data sets with other delimiters are expected to be with siffix .csv.

jmf & af, 21.04.22, 22.10.28