This is an elementary tutorial for PCA with Diodon.
Olivier Coulaud
Alain Franc
Jean-Marc Frigerio
Rémy Peressoni
Florent Pruvost
Contact: Alain Franc, alain.franc@inrae.fr
started: October, 11th, 2022
version: 22.10.28
The tutorial program for running a PCA is very short, and is given here. It will be explained step by step along this notebook.
# importing library
import pydiodon as dio
# loading dataset
dataset = "diatoms_sweden"
A, colnames, rownames = dio.load_dataset(dataset)
# running PCA
Y, L, V = dio.pca(A, pretreatment="standard", k=-1, meth="svd")
It is followed by a few functions for plotting the results
# plotting the results
dio.plot_eig(L, frac=True, cum=True, dot_size=20, title = "cumulated eigenvalues")
dio.plot_components_scatter(Y, dot_size=5, title="Principal components")
dio.plot_var(V, varnames=colnames)
and by functions for interpreting the quality of the projection on the principal axes.
import pydiodon as dio
and, as it will be useful too for side interpretations and checks, numpy:
import numpy as np
Loading the dataset is done simply by a call to the function load_dataset():
A, colnames, rownames = dio.load_dataset("diatoms_sweden")
This yields:
- A: a numpy array representing an $m \times n$ matrix, the values in the dataset
- colnames: a list of strings, the names of the column headers
- rownames: a list of strings, the row names

Then check it has been correctly loaded:
m = A.shape[0]
n = A.shape[1]
print("data loaded ...")
print("Data array has " + str(m) + " rows and " + str(n) + " columns.")
The function to run a PCA is pca(). The arguments are:
- pretreatment: a string; the pretreatment to run, among:
  - standard: centering and scaling
  - col_centering: centering columnwise (the associated point cloud is translated to its barycenter)
  - scaling: each element is divided by the standard deviation of its column; it is meaningful after columnwise centering only (this is the standard pretreatment)
  - bicentering: centering rowwise and columnwise
- k: an integer; the number of axes to compute; default value is $k=-1$, which means that all axes are computed
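As an illustration, here is a minimal numpy sketch of what the standard pretreatment computes: centering each column, then scaling it by its standard deviation. This is only a sketch; pydiodon's exact conventions (e.g. the degrees of freedom used for the standard deviation) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))          # toy data: 6 items, 4 variables

# "standard" pretreatment: center each column, then divide it by its standard deviation
A_std = (A - A.mean(axis=0)) / A.std(axis=0)

# each column now has zero mean and unit variance
print(np.allclose(A_std.mean(axis=0), 0.0))   # True
print(np.allclose(A_std.std(axis=0), 1.0))    # True
```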
- meth: a string; the method to compute axes and components in the PCA. Choices are:
  - evd: computing the eigenvectors and eigenvalues of $A^t A$
  - svd: computing the SVD $A = U\Sigma V^t$ of $A$
  - grp: computing the first singular values and vectors of the SVD with Gaussian Random Projection

In what follows, the PCA is run with pretreatment standard, computing all axes, through the SVD of $A$.
Y, L, V = dio.pca(A, pretreatment="standard", k=-1, meth="svd")
print("PCA done!")
The result of a call to pca() is a set of three arrays:
- Y: a $m \times k$ 2D numpy array (if $k=-1$, a $m \times n$ array) with the components, one set of components per column
- L: a 1D numpy array; the eigenvalues of $A^t A$, or the squares of the singular values of $A$, up to $k$
- V: a $n \times k$ numpy array, with column $j$ being the $j^{th}$ principal axis

Eigenvalues, principal components and principal axes are sorted according to decreasing values of the eigenvalues.
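The relationships between these three arrays can be checked with plain numpy. The sketch below uses numpy's svd and eigh to illustrate the underlying linear algebra (the equivalence between the evd and svd routes); it is not pydiodon's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
A = (A - A.mean(axis=0)) / A.std(axis=0)     # standard pretreatment

# SVD route: A = U diag(s) V^t; the components are Y = A V, and L = s^2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Y = A @ Vt.T
L = s ** 2

# EVD route: the eigenvalues of A^t A are the squared singular values of A
L_evd = np.linalg.eigh(A.T @ A)[0][::-1]     # eigh sorts ascending, so reverse

print(np.allclose(L, L_evd))                 # True
print(Y.shape)                               # (8, 3)
```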
There are different options to plot the eigenvalues. Let us suppose they are labeled as $$ \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_j \geq \ldots \geq \lambda_n $$ Then, the cumulated eigenvalues are $$ \lambda_1 \leq \lambda_1 + \lambda_2 \leq \ldots \leq \sum_{i \leq j}\lambda_i \leq \ldots \leq \sum_j \lambda_j = \mathrm{Tr}\: A^t A $$ They are related to the quality of the representation of the point cloud on its $j$ first axes.
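For instance, with hypothetical eigenvalues, the fractions of variance and their cumulated values (what the frac and cum options plot) can be computed as:

```python
import numpy as np

L = np.array([5.0, 3.0, 1.5, 0.5])   # hypothetical eigenvalues, sorted decreasingly

frac = L / L.sum()                    # fraction of the variance carried by each axis
cum = np.cumsum(frac)                 # cumulated fraction up to each axis

print(frac)   # 0.5, 0.3, 0.15, 0.05
print(cum)    # the last value is 1.0: all axes together carry the whole variance
```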
The arguments of plot_eig() are:
- frac: boolean; whether to plot the fraction of variance per axis (instead of the eigenvalue itself)
- cum: boolean; whether to plot the cumulated value up to a given axis instead of the value per axis
- dot_size: each value is represented by a dot of size dot_size; no dot is displayed if $dot\_size=-1$
- title: the title of the plot

dio.plot_eig(L, frac=True, cum=True, dot_size=20, title="cumulated eigenvalues")
Each point $A[i,:]$ is projected on the principal axes as $Y[i,:]$, with $Y=AV$. If axes $j$ and $\ell$ are selected as axis_1 and axis_2, a scatterplot with coordinates $(Y[i,j],Y[i,\ell])$ for point $i$ is displayed.
dio.plot_components_scatter(Y, dot_size=5, title="Principal components")
The plot can be saved by giving a filename and a format as arguments of plot_components_scatter(), as in
(check first that the directory plots exists)
dio.plot_components_scatter(Y, dot_size=5, title="Principal components", plotfile="plots/pca_components", fmt="png")
Other pairs of components can be plotted, as in
dio.plot_components_scatter(Y, axis_1=2, axis_2=3, dot_size=5, color="blue", title="Principal components")
Each principal axis (new axis) is a linear combination of the variables (the old axes, with components as the columns of $A$). This is expressed by $V$ as $Y=AV$. $V$ is orthonormal ($V^t V = I$). Hence, $V^{-1}=V^t$. The column $j$ of $V^t$, i.e. the row $j$ of $V$, is the expression of the (old) variable $j$ in the (new) basis of the principal axes. This can be displayed graphically as follows, where both the correlations between the variables and between each variable and each new axis can be seen and interpreted. Blue segments represent the projection of the variables (old axes) on the principal (new) axes. The length of a projection represents its quality (how much the new axes encompass the information in the old axis). If the projections of variables $j$ and $j'$ are both good, their cosine is an indicator of their correlation.
dio.plot_var(V, varnames=colnames)
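The quantities behind this plot can be sketched with numpy, on a hypothetical orthonormal $V$ (built here with a QR decomposition, only for the sake of the example):

```python
import numpy as np

rng = np.random.default_rng(2)
V = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # hypothetical orthonormal V (columns = axes)

# row j of V = coordinates of the (old) variable j in the basis of the principal axes
P = V[:, :2]                                   # projections of the variables on axes 1 and 2

quality = np.linalg.norm(P, axis=1)            # length of each blue segment in plot_var()
cosine = (P[0] @ P[1]) / (quality[0] * quality[1])  # correlation indicator, variables 0 and 1

# the rows of V are unit vectors, so the projected lengths are at most 1
print(np.all(quality <= 1.0 + 1e-12))
```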
A second way to visualize the correlations between the variables and the principal axes is to draw a heatmap of matrix $V$. Indeed, we have $$ V[:,k] = \sum_j V[j,k] \mathbf{e}_j $$ by construction. Then, $$ V[j,k] = \langle V[:,k], \mathbf{e}_j \rangle $$ is the correlation between the (new) principal axis $k$ and the (old) variable $j$. This can be seen for all variables and axes by visualizing the heatmap of $V$, done by a call to the function plot_var_heatmap(). The matrix $V$ is printed first to facilitate the understanding of the process.
print("V =")
print(np.round(V, 2))
dio.plot_var_heatmap(V, varnames=colnames, cmap="viridis")
Variables are in rows, and principal axes in columns.
The matrix $Y$ is the matrix of the coordinates of the items in the new basis, defined by the principal axes (the columns of $V$). An item like $i$ is row $i$ of matrix $A$, i.e. $a_{i*} = A[i,:]$. Its coordinates in the basis of principal axes are $y_{i*} = Y[i,:]$. Let us denote by $y_{ir}$ the projection of $y_{i*}$ on the subspace defined by the first $r$ principal axes, i.e. $y_{ir} = (Y[i,1], \ldots, Y[i,r])$. Then, the quality of projection of item $i$ on the first $r$ axes is given by $$ Q(i,r) = \frac{\|y_{ir}\|^2}{\|y_{i*}\|^2} $$
Qual = dio.plot_cumulated_quality_per_item(Y,r=3)
The matrix $Q$, with the cumulated quality of projection of all items on the first $r$ axes in column $r$, is given by a call to dio.plot_cumulated_quality_per_item(). In this plot, once an axis $r$ has been selected, the cumulated quality of projection of each item on the first $r$ axes can be read.
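The formula for $Q(i,r)$ translates directly into numpy. Here is a sketch with stand-in components (the array below is random, only a placeholder for the components returned by pca()):

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(5, 4))      # stand-in for the components returned by pca()

r = 3
# Q(i, r) = ||y_{i, 1..r}||^2 / ||y_{i*}||^2
Q = (Y[:, :r] ** 2).sum(axis=1) / (Y ** 2).sum(axis=1)

print(np.all((0.0 <= Q) & (Q <= 1.0)))   # True: a quality is a fraction
```

With $r$ equal to the number of columns of $Y$, every quality is exactly 1: the full set of axes carries all the information on each item.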
The cumulated quality per axis can be displayed in a second way, as a scatterplot. Here is an example with the scatter plot of axes 1 and 2.
r = 2
dio.plot_components_quality(Y, Qual, r=r)
This is the scatter plot of the point cloud of the components on axes 1 and 2. The color and the size of a dot (one item) indicate the cumulated quality on the $r$ first principal axes (here, $r=2$). The next plot is the scatter plot on axes 1 and 3, with the cumulated quality up to axis 3.
r = 3
dio.plot_components_quality(Y, Qual, axis_2=3, r=r)
F. Keck, A. Franc, and M. Kahlert. Disentangling processes driving freshwater diatoms biogeography: a multiscale approach. Journal of Biogeography, 2018.
pydiodon can be downloaded from XXXX. Cppdiodon can be downloaded from XXX. The methods used in those libraries and their pseudocodes are explained in