pydiodon module¶
what it is:
a python library for linear dimension reduction.
Content:
The library is organised into several functional blocks, as follows:
block |
functions |
what it does |
|---|---|---|
Core |
mat_evd() svd_grp() |
eigen-values -vectors of a matrix SVD with random projection |
Methods |
coa() mds() pca() pca_core() pca_met() pca_iv() |
Correspondence Analysis Multidimensional Scaling Principal Component Analysis core method for PCA PCA with metrics PCA with latent variables |
Pretreatments |
bicenter() center_col() center_row() dis3gram() scale() |
bicentering columnwise centering rowise centering computes the Gram matrix from distances scales columnwise |
Plotting |
||
Interpretation |
qual_proj() |
quality of a projection |
I/O |
loadfile() |
generic function for loading datasets |
Identity card
- maintainer
Alain Franc
- contributors
Olivier Coulaud
Alain Franc
Jean-Marc Frigerio
Romain Peressoni
Florent Pruvost
- started
21/02/17
- version
22.10.28
- release
0.0.1
- licence
GPL-3
This file is part of diodon project. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
- pydiodon.bicentering(A)[source]¶
bicentering a matrix
- Parameters
- Aa n x p numpy array
the matrix to be bicentered
- Returns
- ma float
global mean of the matrix
- ra 1D numpy array (n values)
rowise means
- ca 1D numpy arrau (p values)
columnwise means
- Ra 2D numpy array (n x p)
bicentered matrix
Notes
If A is a matrix, builds a matrix R of same dimension centered on rows and columns.
\(m = (1/np) \sum_{i,j} a_{ij}\)
\(r = (r_1,...,r_n)\) with \(r_i = (1/p)\sum_j a_{ij} - m\)
\(c = (c_1,...,c_p)\) with \(c_j = (1/n)\sum_i a_{ij} - m\)
such that
\(\sum_ir_i = \sum_jc_j=0\)
and
for any row i, \(\sum_j R_{ij}=0\) and for any column j, \(\sum_i R_{ij}=0\)
The matrix R can describe interactions in a additive model with two categorical variables, as in
\(a_{ij} = m + r_i + c_j + R_{ij}\)
example
>>> import pydiodon as dio >>> import numpy as np >>> n = 4 ; p = 3 >>> A = np.arange(n*p) >>> A.shape=(n,p) >>> m, r, c, R = dio.bicenter(A)
af, revised 21.02.19, 22.09.21, 22.10.13
- pydiodon.center_col(A)[source]¶
centers a matrix columnwise
- Parameters
- Aa 2D numpy array n x p
the matrix to be centered
- Returns
- Aca 2D numpy array
n x p, centered columnwise
- m1D numpy array, p values
means per column
- example
- af, rp, 21.02.19, 22.10.13
- pydiodon.center_row(A)[source]¶
centers a matrix row-wise
- Parameters
- Aa 2D numpy array, n x p
the matrix to be centered
- Returns
- Aca 2D n x p numpy array
A centered row-wise
- ma 1D numpy array (n values)
means per row
- Example
- af & rp, 21.02.19, 22.10.13
- pydiodon.centering_operator(m)[source]¶
Centering operator for a matrix
will probably be deprecated (22.09.21)
argument
m : an integer ; vector or matrix dimension
returns
H: a numpy array, m x m
Notes
If A is a n x p matrix
\(HA\) is A with centered columns (H is n x n)
\(AH\) is A with centered row (H is p x p)
example
>>> import pydiodon as dio >>> import numpy as np >>> n = 4 ; p = 3 >>> A = np.arange(n*p) >>> A.shape = (n,p) >>> H_row = dio.centering_operator(p) >>> H_col = diod.centering_operator(n) >>> Ar = A @ H_row >>> Ac = H_col @ A
revised 21.02.27
- pydiodon.coa(X, k=- 1, meth='svd', transpose=False)[source]¶
what it does
Correspondance Analysis of an array X
- Parameters
- Xa 2D numpy array, n x p
array to be analysed
- kan integer
number of axis to compute
- metha string
method for numerical computing
- Returns
- La 1D numpy array
vector of eigenvalues
- Y_ra 2D numpy array,`n x k`
coordinates of row points
- Y_ca 2D numpy array, p x k
coordinates of column points
Notes
If \(k=-1\), all axis are computed. If \(k > 0\), only k first axis and components are computed.
methods for PCA core
evd
EVD
svd
SVD with numpy.linalg.svd()
grp
SVD with gaussian random projection
Example
>>> import pydiodon as dio >>> import numpy as np >>> dataset = "../data4tests/coa4tests.txt" >>> A, headers, rownames = dio.load(dataset) >>> L, Y_r, Y_c = dio.coa(A)
note on the example
The dataset is in text format with tabs as delimiters. It contains headers and row names. These are default parameters for fucntion dio.load()
Reference:
Nenadic & Greenacre, Journal of Statistical Software, 20(3): 2-13, 2007
Lebart, Morineau & Fénelon, 1982, pp. 305-320
revised 21.02.21
- pydiodon.dis2gram(dis, no_loop=True)[source]¶
Computes a Gram matrix gram knowing a distance matrix dis
- Parameters
- dis: a 2D numpy array, size `n` x `n`, of floats
is a distance or dissimilarity matrix to be analyzed
- Returns
- grama 2D numpy array, size n x n, of floats
the associated Gram matrix
Notes
if \(d_{i.}^2 = (1/n) \sum_j d_{ij}^2\) and \(d_{..}^2 = (1/n^2) \sum_{i,j} d_{ij}^2\)
then
\(gram[i,j] = -(1/2) (d_{ij}^2 - d_{i.}^2 - d_{.j}^2 + d_{..}^2)\)
af, 07/11/2017, 22.10.13
- pydiodon.hovering(F1, F2, label, prefix=None, c='b', s=20)[source]¶
what it does
display a label of one dot selected with the mouse in a scatter plot
- Parameters
- F11D numy array
first axis
- F21D numpy array
second axis
- prefixforgotten
- cstring
forgotten
- sinteger
forgotten
Notes
A scatter plot with F1 on axis 1 and F2 on axis 2 is displayed. When the mouse clicks on a point, a comment is displayed on the console with the index of the item cliked, and the value of the label for this item
This is event based, i.e. cliking with the mouse can be repeated as many times as wished.
Adapted from http://matplotlib.sourceforge.net/examples/event_handling/pick_event_demo.html
af, 25/10/2017
- pydiodon.load_ascii(filename, delimiter, colnames, rownames, dtype)[source]¶
Loads an ascii file as a numpy array
- Parameters
- filenamestring
the name of the file to load
- delimitercharacter
delimiters between values in a row
- colnamesboolean
True if there are column names in the first row of the file False otherwise
- rownamesboolean
True if there are rownames in the first column of the file False otherwise
- dtypestring
as dtype in numpy.loadtxt (type of values in the numpy array)
- Returns
- Aa 2D numpy array
the values in the file
- rnlist of strings
row names (if
rownames==True)- cnlist of strings
column names (if
colnames==True)
- pydiodon.load_hdf5(filename, datasetname, colnames, rownames)[source]¶
loads a hdf5 file as a numpy array
- Parameters
- filenamestring
the name of the file
- datasetnamestring
the name of the hdf5 dataset with data values
- Returns
- see loadfile
- pydiodon.loadfile(filename, fmt=None, delimiter='\t', rownames=False, colnames=False, datasetname='values', dtype='float32')[source]¶
A generic function for loading datasets as numpy arrays
- Parameters
- filenamestring
contains the data set to be loaded (compulsory)
- fmtstring
explicit format of the file (optional); if it is not given the format will be guessed from the suffix (see notes)
- delimitercharacter
the delimiter between values in a row
- colnamesboolean or string; whether column names
for an ascii file, it is boolean, as True if column names are as first row in the file, and False otherwise;
for an hdf5 file, gives the name of the dataset with the column names
optional, default value is False
- rownamesboolean or string; whether row names
for an ascii file, it is boolean, as True if row names are as first column in the file, and False otherwise
for an hdf5 file, gives the name of the dataset with the row names
optional, default value is None.
- datasetnamestring
for hdf5 files : hdf5 dataset for values
optional, default value is “value”
- Returns
- Aa numpy array
the values of the data set
- rnlist of strings
row names (optional)
- cnlist of strings
column names (optional)
Notes
Recognized formats are:
ascii,hdf5andcompressed ascii.Delimiters in ascii format can be blanks, comma, semi-columns, tabulations
Ascii data sets with
tabdelimiters are expected to be with suffix.txtor.tsv.Ascii data sets with other delimiters are expected to be with siffix
.csv.
When the filename is read, the function splits the name on the last dot, and interprets the string after as the suffix. Then, there is a call
to
load_ascii()if the suffix is.txt,tsv,.gzor.bz2,to
load_hdf5if the suffix ish5orhdf5,and unzips the file before a call to
load_ascii()if the sufix iszip.
Examples
Here is a call for loading an
asciifile with extension.txthence with tab as delimiters, and with rownames and colnames. In such a case, the call must specifiy that there are colnames and rownames to be read:>>> import pydiodon as dio >>> filename = "pca_template_withnames.txt" >>> A, colnames, rownames = dio.loadfile(filename, colnames=True, rownames=True) >>> print(A) >>> print(colnames) >>> print(rownames)
If it is not specified (default values), an array without colnames and rownames will be loaded, as in
>>> import pydiodon as dio >>> filename = "pca_template_nonames.txt" >>> A = dio.loadfile(filename)
Here is a call of a
.csvfile where delimiters have to be specified:>>> import pydiodon as dio >>> filename = "pca_template_nonames.csv" >>> A = dio.loadfile(filename, delimiter = ";")
Here is how to load a
zipfile from a.txtfile:>>> import pydiodon as dio >>> filename = "pca_template_nonames.txt.zip" >>> A = dio.loadfile(filename) >>> print(A)
and from a
.csvfile with semi-column as delimiter>>> import pydiodon as dio >>> filename = "pca_template_nonames.csv.zip" >>> A = dio.loadfile(filename, delimiter=";")
Here is an example for loading a
hdf5file with values, colnames and rownames as datasets:>>> import pydiodon as dio >>> filename = "pca_template_withnames.h5" >>> A, colnames, rownames = dio.loadfile(filename, colnames='colnames', rownames='rownames')
version 21.03.23
- pydiodon.mat_evd(mat, k=- 1)[source]¶
Computes the eigenvalues and eigenvectors of a matrix
- Parameters
- mata numpy array, n x n (a matrix)
- kan integer
the number of axis to be computed
- Returns
- Va 2D numpy array
of size n x k: the k eigenvectors
- La 1D numpy array
a vector ; the first k eigenvalues
Notes
if k=-1, all eigenvalues and eigenvectors are returned.
This simply uses
numpy.linalg.eigh(), and adds some features likesorts eigenvalues in decreasing order
sorts eigenvectors in corresponding order
sets to zero the imaginary parts (numerical approximation)
Example
>>> import pydiodon as dio >>> import numpy as np >>> # building a symmetric matrix >>> n = 5 >>> A = np.random.random((n,n)) >>> A = (A + A.T)/2 >>> # computing eigenvalues and eigenvectors >>> V, L = dio.mat_evd(A)
version 21.03.21, 22.09.21, 22.10.13, 22.10.25
- pydiodon.mds(dis, k=- 1, meth='svd', no_loop=True)[source]¶
what it does
Multidimensional Scaling of a distance or dissimilarity array
- Parameters
- disnumpy array, n x n
distances or dissimilarities
- kinteger
number of axis
- methstring
method for MDS (see notes)
- no_loopboolean
see notes
- Returns
- X2D numpy array,
n x k, coordinates of points
- S1D numpy array
k values, singular values |
Notes
Objective: builds a point cloud where one point represents an item, such that the discrepancies beween distances in a matrix dis and between associated points in \(R^k\) is as small as possible. It is classical MDS, not Least Square Scaling.
Procedure: There are three steps1. computes the Gram matrix G between items from distance matrix dis2. runs a SVD or EVD of G3. computes the coordinates from the SVD (or EVD) in Xargument no_loop : it is used in the computation of the Gram matrix:- if n is large,no_loop=Falseis recommended- if not,no_loop=Trueis recommended (default value)methods for MDS: the argumentmethspecifies which method is selected for the core of MDS. Default value issvd. Let G be the Gram matrix associated to the distance array.- ifmeth=svd, the a SVD of G is run- ifmeth=grp, the SVD is run with Gaussian Random Projection- ifmeth=evd, the eigenvalues and eigenvectors of G are computedHere are some suggestions for selection of the method:- if n is not too large (\(n < 10,000\)),svdmethod is recommended- if n is large,grpmethod is recommended, and compulsory if n is very large- if k is small, \(k \simeq 10\),evdmethod is recommended.computations : EVD or SVD of the Gram matrix are two equivalent ways to compute a solution. Let G be the Gram matrix. They are linked by- EVD: \(GU = US\) ; U is the matrix of columnwise eigenvectors of G and S the diagonal matrix of its eigenvalues- SVD: \(G = USU'\) is the SVD of G (which is symmetric, U’ is the transpose of U)Then, in both cases, if X is the matrix of coordinates of n items in \(R^r\), \(X = US^{1/2}\)Examples
This is a first toy example for running a MDS.
First, building a Euclidean distance matrix.
>>> import pydiodon as dio >>> import numpy as np >>> import matplotlib.pyplot as plt >>> import scipy.spatial.distance as ssd # for computing distances >>> # creates a m x n point cloud >>> m = 10 ; n = 4 >>> A = np.random.randn(m,n) >>> # computes the pairwise distances >>> D = ssd.pdist(A) >>> D = ssd.squareform(D) # distance matrix in square form
Second, run the MDS
>>> # runs the MDS >>> X, S = dio.mds(D)
Third, interpret the quality and displays the result
>>> # plotting the quality >>> plt.plot(S, color="red") >>> plt.title("Eigenvalues") >>> plt.show() >>> # plotting the point cloud >>> F1 = X[:,0] ; F2 = X[:,1] >>> plt.scatter(F1, F2, c="chartreuse", s=20) >>> plt.xlabel("axis 1") >>> plt.ylabel("axis 2") >>> plt.show()
This example is in
mds_xmpl_1.pyHere is now a real life example.
>>> import pydiodon as dio >>> import numpy as np >>> import matplotlib.pyplot as plt >>> # >>> rank = 200 >>> filename = "guiana_trees.dissw" >>> Dis, rn, cn = dio.load(filename, rownames=True, colnames=True) >>> X, S = dio.mds(Dis, k=rank)
This is focused on running the MDS of Dis . Here is a complementary example for visualizing the results.
>>> # plotting the singular values >>> plt.plot(S, color="red") >>> plt.title("Quality along the axis") >>> plt.show() >>> # plotting the point cloud, axis 1 and 2 >>> F1 = X[:,0]; F2 = X[:,1]; F3 = X[:,2] >>> plt.scatter(F1,F2,c="blue", s=5) >>> plt.xlabel("axis 1") >>> plt.ylabel("axis 2") >>> plt.title("data set: Guiana trees") >>> plt.show()
This example is in
mds_xmpl_guiana_trees.pyThe data set can be downloaded from https://doi.org/10.15454/XSJ079 where it is dataset
guiana_trees.dissw.Related publication: Caron, H, Molino, J‐F, Sabatier, D, et al. Chloroplast DNA variation in a hyperdiverse tropical tree community. Ecol Evol. 2019; 9: 4897– 4905. https://doi.org/10.1002/ece3.5096
Can be downloaded from Wiley website at https://onlinelibrary.wiley.com/doi/pdf/10.1002/ece3.5096 doi: DOI: 10.1002/ece3.5096
References
[1] T.F. Cox and M. A. A. Cox. Multidimensional Scaling - Second edition, volume 88 of Monographs on Statistics and Applied Probability. Chapman & al., 2001.[2] I. Borg and P. J. F. Groenen. Modern Multidimensional Scaling. Springer Series in Statistics. Springer, second edition, 2005.[3] K. V. Mardia, J.T. Kent, and J. M. Bibby. Multivariate Analysis. Probability and Mathematical Statistics. Academic Press, 1979.Diodon notes: section 10
07/11/2017 - revised 21.03.20, 22.10.13
- pydiodon.pca(A, pretreatment='standard', k=- 1, meth='svd')[source]¶
Principal Component Analysis
- Parameters
- Aa numpy array
the array to be analyzed
- kinteger
number of axes to be computed
- methstring
method for numerical calculation (see notes)
- pretreatmentstring
which pretreatment to apply ;
accepted values are:
standard,bicentering,col_centering,row_centering,scalingsee notes for details
Notes
The method runs as follows:
first it implements the required pretreatments
second: it runs the function pca_core on the transformed matrix
third: it returns the eigenvalues, the principal axis, the principal components and the correlation matrix if required
methods for PCA: the argument
methspecifies which method is selected for the core of MDS. Default value issvd. Let A be the the matrix to analyse.if
meth=svd, the a SVD of A is runif
meth=grp, the SVD is run with Gaussian Random Projectionif
meth=evd, the eigenvalues and eigenvectors of A are computed
pretreatments: here are the accepted pretreatments:
standard: the matrix is centered and scaled columnwisebicentering: Matrix is centered rwowise and columnwise; it is a useful alternative to CoA known as “double averaging”
Examples
This is an example of a standard PCA of a random matrix, with \(m\) rows and \(n\) columns, with elements as realisation of a uniform law between 0 and 1.
First build the random matrix
>>> import pydiodon as dio >>> import numpy as np >>> import matplotlib.pyplot as plt >>> m = 200 ; n = 50 >>> A = np.random.random((m,n))
Second, run the PCA
>>> L, V, Y = dio.pca(A)
Third, plots some results (eigenvectors and point cloud)
>>> plt.plot(L) ; plt.show() >>> plt.scatter(Y[:,0],Y[:,1]) ; plt.show()
The above program runs centered scaled PCA, with here default option
pretreatment="standard". For PCA without centering nor scaling, the command is>>> L, V, Y, C = dio.pca(A, standard=False)
For PCA with column centering but without scaling, the command is
>>> L, V, Y, C = dio.pca(A, standard=False, col_centering=True)
(in such a case, the argument
standardmust be set toFalse. If not, the array will be scaled as well). Scaling without centering is quite unusual.For bicentering, the command is
- firefox
>>> L, V, Y, C = dio.pca(A, bicenter=True)
These are the most usuful options for pretreatment.
Prescribed rank is simply called by (with standard pretreatment)
>>> rank = 10 >>> L, V, Y = dio.pca(A, k=rank)
for having the 10 first components and axis only.
revised 21.03.03 - 21.04.20
- pydiodon.pca_core(A, k=- 1, meth='svd')[source]¶
what it does
core method for PCA (Principal Component Analysis) of an array A
- Parameters
- Aa n x p numpy array,
array to be analysed
- kan integer
number of axis to compute
- metha string
method for numerical computing
- Returns
- Ya 2D numpy array, n x k
matrix of principal components
- La 1D numpy array
vector of eigenvalues
- Va 2D numpy array, p x k
matrix of eigenvectors (new basis)
Notes
A is an array, and
pca_corecomputes the PCA of A, without any centering nor scaling nor weights nor constraints. PCA with centering, scaling, weights or constraints is called by functionpca()which in turns calls this function pca_core().if k = -1, all axis are computed. If k > 0, only k first axis and components are computed.
if meth is
evd, runs by eigendecomposition of A’A
svd, runs by singular value decomposition of A
grp, runs by SVD of A with Gaussian random projection
Default value is svd.
With EVD, it runs as follows:1. Computes the correlation matrix \(C=A'A\)2. Computes the eigevalues and eigevectors of \(C\): \(Cv_j = \lambda_j v_j\)3. Computes the principal components as \(y_j=Av_j\), or, globally, \(Y=AV\)with SVD, it runs as follows:1. \(U,\Sigma,V = SVD(A)\)2. \(Y = U\Sigma\)3. \(L=\Sigma^2\)Example
This is an example of a standard PCA of a random matrix, with \(m\) rows and \(n\) columns, with elements as realisation of a Gaussian law with mean 0 and standrd deviation 1. As such, there is no need to center nor to scale, and the use of
pca_core()is relevant.>>> import pydiodon as dio >>> import numpy as np >>> import matplotlib.pyplot as plt >>> m = 200 ; n = 50 >>> A = np.random.randn(m,n) >>> L, V, Y = dio.pca_core(A) >>> plt.plot(L) ; plt.show() >>> plt.scatter(Y[:,0],Y[:,1]) ; plt.show()
af, revised 21.03.01; 21.06.27; 22.09.23, 22.10.13
- pydiodon.pca_evd(A)[source]¶
runs the PCA with eigendecomposition of A’A
Notes
No documentation (used behind pca_core) 21/06/27
- pydiodon.pca_grp(A, k)[source]¶
runs the PCA with SVD with gaussian random projection
Notes
No documentation (used behind pca_core) 21.06.27
- pydiodon.pca_iv(A, B, k=- 1, meth='svd', pretreatment='col_centering', transpose=False)[source]¶
what it does
PCA of an array A with instrumental variables
- Parameters
- Aa numpy array, the items/variable to be analyzed
- Ba numpy array, the intrumental variables
- kan integer
- metha string ; method for numerical calculation
- pretreatmenta string, the pretreatment of A and B
currently, only `col_centering`is implemented
- Returns
- Ya n x k 2D numpy array
the components
- La k 1D numpy array
the eigenvalues
- Va p x k 2D numpy array
the new basis
- A_proja n x p 2D numpy array
the projection of A on the subspace of \(R^n\) spanned by the columns of B
Notes
A_pre is a copy of A for tracking preatreatments, and recover the matrix analyzed after preatreatments, without impacting A
The algorithm is as follows:
Build \(P = B(B'B)^{-1}B'\) which is the projector in \(R^n\) on the subspace spanned by the columns of B.
build A_proj = PA which is the projection of A on the subspace spanned by the columns of B
do PCA of A_proj : Y,L,V = PCA(A_proj)
warning
This has not been tested; no guarantee on the quality of the result.
21/02/2018, révised 22.09.28, 22.10.14
- pydiodon.pca_iv_qual_proj(A, A_proj)[source]¶
what it does
Computes the quality of the projection of A on the space spanned by the columns of B
- Parameters
- Aa n x p 2D numpy array
the matrix to be analyzed by PVA_iv
- A_proja n x p 2D numpy array
the projection of A on the subspace of \(R^n\) spanned by the columns of B
- Returns
- rhoa real
the quality of the projection
Notes
rho is the ratio of the square norm of A_proj upon the square norm of A
22.09.28
- pydiodon.pca_iv_qual_proj_axis(A, A_proj, title=None, x11=True, imfile=None, fmt='png')[source]¶
what it does Computes and plots the quality of projection of each column of A on Ap
Notes
Documentation ongoing
af, @ Cayenne, 21/02/2018
- pydiodon.pca_iv_regressors(A, Q, V, varnames, n_axis=3, x11=True, imfile=None, fmt='png')[source]¶
what it does
will probably be deprecated
Notes
Documentation ongoing
22.10.26
- pydiodon.pca_met(A, M, N, k=- 1, pretreatment='col_centering', meth='svd')[source]¶
what it does
PCA of an array A with metrics on rows and/or columns
- Parameters
- Aa \(n \times p\) 2D numpy array
the array to be analysed
- Na 2D numpy array
a Symmetric Definite Positive (SDP) matrix defining an inner product on \(R^n\)
- Pa 2D numpy array
a SDP matrix defining an inner product on \(R^p\)
- kan integer
the number of axis to be computed
if \(k = -1\), all axis are computed
- pretreatmenta string
the pretreatment to apply
possible values are: ``col_centering ``
- metha string
method for numerical calculation of PCA
- Returns
- Ya \(n \times k\) 2D numpy array
the princial components
- La k 1D numpy array
the eigenvalues
- Va \(p \times k\) 2D numpy array
the new basis
Notes
If \(A\) has \(n\) rows and p columns, we assume that an inner product has been defined on \(R^n\) by a SDP matrix N and on \(R^p\) by a SDP matrix P.
This is how the algorithm works:
first: computes the square roots of N and P; this is done with a
SVD. If
\(N = M^2\) and \(N = U \Sigma U^t\) (\(U=V\) in the SVD because N is symmetric), then \(M = U \Sigma^{1/2} U^t\); simlarily, if \(P=Q^2\), Then \(Q = V\Phi^{1/2} V^t\) if the SVD of P is \(P = V \Phi V^t\).
second: computes \(R = MAQ\)
third: run the
PCAof R: \((Z, \Lambda, X) = PCA(R)\)fourth: computes \(M^{-1}\) and \(Q^{-1}\) as \(M^{-1}=U \Sigma^{-1/2} U^t\) and
\(Q^{-1}=V \Phi^{-1/2} V^t\)
fifth: computes \(Y = M^{-1}Z\) and \(V = Q^{-1}X\)
The result is \((Y, V, \Lambda)\)
Let us note that
numpycontains a function, calledscipy.linalg.sqrtmwhich computes the square root of a square n x n matrix, from algorithm published in Deadman, E., Higham, N. J. and Ralha, R. (2013) “Blocked Schur Algorithms for Computing the Matrix Square Root, Lecture Notes in Computer Science, 7782. pp. 171-182, which is not used here. Indeed, using the SVD permits to compute easily \(M=N^{1/2}\) and \(M^{-1}=N^{-1/2}\), and the same for P.Warnings
Pretreatment has not been implemented, but centering by column
still under work - not frozen for a release
Diodon Documentation
see section 6, and algorithm
pca_met()in section 11.
28/12/2017, revised 22.09.28
- pydiodon.pca_svd(A)[source]¶
runs the PCA with SVD of A
Notes
No documentation (used behind pca_core) 21/06/27
- pydiodon.plot_coa(Y_r, Y_c, axis_1=1, axis_2=2, col_dsize=20, row_dsize=20, row_col='blue', col_col='red', rownames=[], colnames=None, title=None, x11=True, plotfile=None, fmt='png')[source]¶
what it does
Scatter plot of the result of a Correspondence Analysis
- Parameters
- Y_ra 2D numpy array
the principal components for the rows
- Y_ca 2D numpy array
the principal components for the columns
- axis_1an integer
the first axis of the scatter plot (starting at 1)
- axis_2an integer
the second axis of the scatter plot (starting at 1)
- col_dsizean integer
- dot size for dots associated with columnsdefault value is 20
- row_dsizean integer
- dot size for dots associated with rowsdefault value is 20
- row_cola string
- color for dots associated with rowsdefault value is blue
- col_cola string
- color for dots associated with columnsdefault value is red
- rownamesa list
- list of names of rows to be plotteddefault value is [] (no name is plotted)
- colnamesa list
- list of names of columns to be plotteddefalt value is [] (no name is plotted)
- titlea string
- the title of the plotdefault value is None (no title)
- x11a boolean
- whether to display the plot on the screendefault value is True (plot displayed)
- plotfilea string
- the name of the file where to save the plotdefault value is None (plot not saved)
- fmta string
- format of the file where the plot is savedpossible values are : png, eps, pdfdefault value is png
- Returns
- a scatter plot
Notes
The numbering of the axis in calling the function is natural, and bagins at 1. It is translated to start at 0 as in python in the function itself.
af, 25/10/2017, revised 22.10.26
- pydiodon.plot_components_heatmap(Y, axis_1=1, axis_2=2, bins=256, cmap='ocean', range=None, log=True, scale=False, title=None, x11=True, imfile=None, fmt='png')[source]¶
what it does:
builds the density heatmap of the point cloud of the components of a PCA
- Parameters
- Ya \(n \times k\) 2D numpy array
array of the components
- binsan integer
the size of the image, in pixels
- axis_1an integer
- first axis to be plottedcolumns in the image
- axis_2an integer
- second axis of Yrows in the image
- binsan integer
- the size of the imageit is also the number of classes per axis in the 2D histogram
- cmapa string
- the matplotlib colormap for the heatmapdefault value is oceanclassical default value in matplotlib is viridissee https://matplotlib.org/stable/tutorials/colors/colormaps.html for a choice
- rangeis None
compulsory for callind 2D histogram
- loga boolean
- whether to translate the densities in a logarithmic scaledefault value is True
- scalea boolean
- deprecateddefault value is False
- titlea string
- the title of the heatmapdefault value is None (no title)
- x11a boolean
- whether to display the plot on the screendefault value is True (plot displayed)
- plotfilea string
- the name of the file where to save the plotdefault value is None (plot not saved)
- fmta string
- format of the file where the plot is savedpossible values are : png, eps, pdfdefault value is png
- Returns
- a heatmap
Notes
Y is a numpy array of coordinates, as produced by the MDS
when two axis are selected
the two components are extracted
a 2D histogram of their values is computed with a call to np.histogram2d()
the heatmap is drawn with the values of the 2D histogram
af, 25/01/2020, revised 22.10.36
- pydiodon.plot_components_quality(Y, Qual, axis_1=1, axis_2=2, r=2, cmap='ocean', diam=50, title=None, x11=True, plotfile=None, fmt='png')[source]¶
what it does Scatter plot of the components with each dot colored according to the cumulated quality of the item for a given axis
- pydiodon.plot_components_scatter(Y, axis_1=1, axis_2=2, dot_size=20, color='red', cmap=None, names=[], title=None, x11=True, plotfile=None, fmt='png')[source]¶
Scatter plot of the result of a Principal Component Analysis
- Parameters
- Ya \(n \times k\) 2D numpy array,
the matrix of columnwise principal components
- axis_1an integer, the first axis
axis are labelled from 1 to k
- axis_2an integer, the second axis
second axis is labelled from axis_1 + 1 to k
- dot_sizean integer
the size of dots in the plot, one dot per item
- colora string
color of doats in the plot
- namesa list of n strings
labels of dots in ythe plot
- titlestring
title of the plot
- x11boolean
the plot is displayed on the scvreen if x11=True
- plotfilestring
name of the file to save the plot
the plot is not saved if plotfile=None
- fmtstring
format of the file to save the plot.
accepted values are png, eps, pdf.
- Returns
- a scatter plot
Notes
None
af, 21/02/2018, revised 22.10.14
- pydiodon.plot_correlation_matrix(C, x11=True, plotfile=None, fmt='png')[source]¶
plots a heatmap of the correlation matrix !!! ongoing work !!!
af, 22.10.27
- pydiodon.plot_cumulated_quality_per_item(Y, r=2, col='blue', title=None, x11=True, plotfile=None, fmt='png')[source]¶
plots the quality per item
- Parameters
- Ya 2D \(n \times p\) numpy array
a matrix of principal components
- ran integer, the axis for plotting cumulated quality
default value is 2
- cola string
- the color to use for plotting the qualitydefault value is blue
- titlestring
title of the plot
- x11boolean
the plot is displayed on the scvreen if x11=True
- plotfilestring
name of the file to save the plot
the plot is not saved if plotfile=None
- fmtstring
format of the file to save the plot.
accepted values are png, eps, pdf.
- Returns
- Qual1D \(r\) numpy array
the rowise cumulated quality per axis up to r.
Notes
For each row i of the matrix of components Y
computes \(Qual[i] = \sum_{j \leq r} Y[i,j]^2 / \sum_{k \leq p} Y[i,k]^2\)
once an axis r has been selected, draws the plot of \((i, Q(i,r))\) with i on x axis.
af, révisé 22.10.28
- pydiodon.plot_eig(L, frac=True, cum=False, Log=False, dot_size=- 1, nb_val=- 1, col='red', title=None, x11=True, plotfile=None, fmt='png')[source]¶
Plots the singular or eigenvalues
- Parameters
- L1D numpy array
the eigenvalues of A’A
- fracboolean
if True, the fraction of the norm of A’A is displayed per eigenvalue
- cumboolean
if true, the cumulated eigenvalues are displayed can be combined with frac=True
- LogBoolean
if True, the Log of the genvalues is displayed
- dot_sizeinteger
if > 0, a dot is displayed per eigenvalue
- nb_valinteger
number of eigen,values to display;
if nb_val=-1, all eigenvalues are displayed
- colstring
color for the plot
- titlestring
title on top of the plot;
if None, no title is displayed
- x11Boolean
if True, the plot is displayed on screen
- plotfilestring
if not None, the save is save in a file named plotfle
- fmtstring
the format of the plot;
accepted formats are png, eps, pdf
Notes
The eigenvalues \(L_i\) are such that \(\sum_i L_i = ||A'A||^2\). Then, the quality of representation in axis i is \(L_i/\sum_j L_j\). This is displayed by setting frac=True.
the cumulated eigenvalues are simply given by \(\psi_i = \sum_{j \leq i}L_j\)
af, 25/10/2017 - revised 19.09.18, 22.10.14
- pydiodon.plot_var(V, varnames=None, axis_1=1, axis_2=2, title=None, x11=True, plotfile=None, fmt='png')[source]¶
*what it does* Plots the correlations between the variables
- Parameters
- Va \(p \times p\) 2D numpy array
- varnamesa list of strings
- the names of the variables (the columns of A on which the PCA has been done)default value is None (no names are displayed)
- axis_1an integer
- the first principal axis to be displayedstarts at 1
- axis_2an integer
the second principal axis to be displayed
- titlestring
title of the plot
- x11boolean
the plot is displayed on the scvreen if x11=True
- plotfilestring
name of the file to save the plot
the plot is not saved if plotfile=None
- fmtstring
format of the file to save the plot.
accepted values are png, eps, pdf.
- Returns
- a plot
- af, 22.10.28
- pydiodon.plot_var_heatmap(V, varnames, cmap='viridis', title=None, x11=True, plotfile=None, fmt='png')[source]¶
what it does Plots a heatmap of the coordinates of the principal axis
- Parameters
- Va \(p \times p\) 2D numpy array
the coordonates of the principal axis in the basis of the variables
- varnamesa list of strings
- the names of the variables(names of the columns of A on which PCA has been done)
- cmapa string
- the matplotlib colormap for drawing the heatmapdefault value is viridis
- titlestring
title of the plot
- x11boolean
the plot is displayed on the scvreen if x11=True
- plotfilestring
name of the file to save the plot
the plot is not saved if plotfile=None
- fmtstring
format of the file to save the plot.
accepted values are png, eps, pdf.
- Returns
- a heatmap
- af, 22.10.28
- pydiodon.project_on(A)[source]¶
what it does:
Builds the projection operator on space spanned by the columns of A
- Parameters
- Aa \(n \times p\) 2D numpy array
the projector is on the space in \(\mathbf{R}^n\) spanned by the columns of A
- Returns
- Pa 2D numpy array, \(n \times n\)
the projector
Notes
This function uses the QR decomposition of A, which avoids to compute \((A'A)^{-1}\) which is costly (one product, one inverse). It proceeds as
A = QR
P = QQ’
Example
import numpy as np import pydiodon as dio n = 10 ; p = 5 A = np.random.randn(n,p) P = project_on(A)
af, 22/02/2018, 22.10.13, 22.10.28
- pydiodon.scale(A)[source]¶
what it does:
scales a matrix columnwise
- Parameters
- Aa 2D numpy array, n x p
the matrix to be scaled
- Returns
- Asa 2D numpy array
matrix A with scaled columns
- aa 1D numpy array, p values
the norm of each column of A
Notes
It is recommended to center A columnwise before scaling.
Each coordinates \(a_{ij}\) of A is replaced by \(a_{ij} / ||a_{.j}||\) where \(||a_{.j}||\) is the norm of column j: \(||a_{.j}||^2 = \sum_i a_{ij}^2\)
Example
>>> import pydiodon as dio >>> import numpy as np >>> n = 4 ; p = 3 >>> A = np.arange(n*p) >>> A.shape =(n,p) >>> Ac, m = dio.center_col(A) >>> As, s = dio.scale(Ac)
af, 21.02.19, 22.10.13
- pydiodon.svd_grp(A, k)[source]¶
SVD of a (large) matrix A by Gaussian Random Projection
- Parameters
- Aa 2D numpy array
the matrix to be studied (dimension \(n \times p\))
- kan integer
the number of components to be computed
- Returns
- Ua 2D numpy array
(dimension \(n \times k\))
- Sa 1D numpy array
(k values)
- Va 2D numpy array
(dimension \(p \times k\))
Notes
Builds the SVD of A as \(A = U\Sigma V^t\) where \(\Sigma\) is the diagonal matrix with diagonal S and \(V^t\) is the transpose of V.
Here are the different steps of the computation:
A is a \(n \times p\) matrix
builds \(\Omega\), a \(p \times k\) random matrix (Gaussian)
computes Y = A\(\Omega\) (dimension \(n \times k\))
- computes a QR decomposition of Y
- Y = QR, withQ is orthonormal (\(n \times k\))R is upper triangular (:math:`k times k)
- computes B = Q^t A
B is :math:`k times p (with k < p) and its SVD is supposed to be close to the one of A
- \(U_B\), S, and V are SVD of B:
\(B = U_B S V'\), S is diagonal
then, \(U_A\), S and V are close to svd of A : \(A = U_A S V'\)
Sizes and computation of the matrices
matrix
value
size
A
input
\(n \times p\)
\(\Omega\)
random
\(p \times k\)
Y
\(A\Omega\)
\(n \times k\)
Q
Y = QR
\(n \times k\)
B
B = Q^t A
\(k \times p\)
\(U_B\)
\(B = U_BSV^t\)
\(k \times k\)
U
\(U = QU_B\)
\(n \times k\)
References
[1] Halko & al., 2011, SIAM Reviews, 53(2): 217-288
af, rp, fp 21.03.21; 22.06.16, 22.10.28
- pydiodon.write_ascii(A, filename, delimiter='\t', colnames=False, rownames=False, np_fmt='%.18e', dtype='float32')[source]¶
Write an ascii file
Notes
Not expected to be used by the user (see writefile()) 22.10.28
- pydiodon.write_hdf5(A, filename, colnames=False, rownames=False, datasetname='values', dtype='float32')[source]¶
Write a hdf5 file
Notes
No documentation because is not expected to be used by the user (see writefile()) 22.10.28
- pydiodon.writefile(A, filename, fmt=None, delimiter='\t', colnames=False, rownames=False, datasetname='values', np_fmt='%.18e', dtype='float32')[source]¶
what it does
A generic function for saving a numpy array as a file
- Parameters
- Aa numpy array
the array to be saved as a file
- filenamestring
the name of the file for saving the array (compulsory)
- fmtstring
explicit format of the file (compulsory) one item among
ascii,hdf5,zip; see notes- delimitercharacter
the delimiter between values in a row
- colnamesboolean or string
column names
for an ascii file, it is boolean, with True if column names are as first row in the file, and False otherwise
for an hdf5 file, gives the name of the dataset with the column names
optionaldefault value is None.- rownamesboolean or string
row names
for an ascii file, it is boolean, as True is row names are as first column in the file, and False otherwise
for an hdf5 file, gives the name of the dataset with the row names
optionaldefault value is None.- datasetnamestring
- for writing files in hdf5 format only : hdf5 dataset for valuesoptionaldefault value is “value”
- fmtstring
is exactly the parameter
fmtofnumpy.savetxt()for print format in ascii files to be done- dtypeto be done
- Returns
- writes a file
Notes
It writes the array in a file and returns no value. Parameters and choices are mirrored from function
loadfile()as it is the reverse operation.Possible formats are:
ascii,hdf5andcompressed ascii. Delimiters in ascii format can be blanks, comma, semi-columns, tabulations. Ascii data sets withtabdelimiters are expected to be with suffix.txt. Ascii data sets with other delimiters are expected to be with siffix.csv.jmf & af, 21.04.22, 22.10.28