PCA tutorial for Diodon (python)

This is an elementary tutorial for PCA with Diodon.

Authors

Olivier Coulaud
Alain Franc
Jean-Marc Frigerio
Rémy Peressoni
Florent Pruvost

Contact and maintainer

Alain Franc, alain.franc@inrae.fr

Version

started: October, 11th, 2022
version: 22.10.28

Overview

The tutorial programm for running PCA is very short, and given here. It will be explained step by step along this notebok.

# importing library
import pydiodon as dio 

# loading dataset
dataset = "diatoms_sweden"
A, colnames, rownames = dio.load_dataset(dataset)

# running PCA
Y, L, V = dio.pca(A, pretreatment="standard", k=-1, meth="svd")

It is followed by a few functions for plotting the results

# plotting the results
dio.plot_eig(L, frac=True, cum=True, dot_size=20, title = "cumulated eigenvalues")
dio.plot_components_scatter(Y, dot_size=5, title="Principal components")
dio.plot_var(V, varnames=colnames)

and interpreting the quality of the projection on the princial axis.

Importing python version of diodon

In [1]:
import pydiodon as dio 
loading pydiodon - version 22.10.29

and as it will be useful too for side interpretatons and understandings

In [2]:
import numpy as np

Loading dataset

This is made simply by a call to function load_dataset()

In [3]:
A, rownames, colnames = dio.load_dataset("diatoms_sweden") 

This yields:

  • A, a numpy array representing a $m \times n$ matrix, the values in the dataset
  • colnames: a list of strings, representing the names of the headers for columns
  • rownames: a lits of strings, representing the row names

Then check it has been correctly uploaded

In [5]:
m	= A.shape[0]
n	= A.shape[1]
print("data loaded ...")
print("Data array has " + str(m) + " rows and " + str(n) + " columns.")	
data loaded ...
Data array has 616 rows and 10 columns.

Analyzing dataset

The function to run a PCA is pca(). Here, we

  • present the arguments
  • indicate the line to type to run the method
  • present the results

Arguments

The arguments are:

  • the array to analyze
  • pretreatment
  • k
  • meth

pretreatment: a string: the pretreatment to run, among

  • standard: centering and scaling
  • col_centering: centering columbwise (the associated point cloud is trnslated to its barycenter)
  • scaling: each lement is divided by he standrd deviation of the column; it is meaningful after columnwise centering only (this is the standard pretreatment)
  • bicentering: centering rowise and columnwise

k: an integer; the number of axis to compute;
default value is $k=-1$, which means that all axis are computed

meth: a string: the method to compute axis and components in the PCA. Choices are:

  • evd: computing the eigevectors and eigenvalues of $A'A$
  • svd: computing the SVD $A = U\Sigma V'$ of $A$
  • grp: computing the first singular values and vectors of the SVD with Gaussian Random Projection

Calculation

In what follows, the PCA is run with pretreatment standard, computing all axis, and through the SVD of $A$.

In [6]:
Y, L, V	= dio.pca(A, pretreatment="standard", k=-1, meth="svd")
print("PCA done!")
-----------------------------------------
running pca(), version 21.05.05
Matrix A has 616 rows and 10 columns
rank is -1 (full rank if k = -1)
pretreatment is standard
method is svd
------------------------------------------

PCA done!

Results

The result of a call to pca() is a set of three arrays:

  • Y: a $m \times k$ 2D numpy array (if $k=-1$, a $m \times n$ array) with components, one set of components per column
  • L: a 1D numpy array; the eigenvalues of $A^t A$, or the square of the singular values of $A$, up to $k$
  • V: a $n \times k$ numpy array, with column $j$ being the $j^{th}$ principal axis.

Eigenvalues, principal components and principal axis are sorted according to decreasing values of eigenvalues.

Plotting some results

Plotting eigenvalues

There are different options to plot the eigenvalues. Let us suppose they are labeled as $$ \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_j \geq \ldots \geq \lambda_n $$ Then, cumulated eigenvalues are $$ \lambda_1 \leq \lambda_1 + \lambda_2 \leq \ldots \sum_{i \leq j}\lambda_j \leq \ldots \leq \sum_j \lambda_j= Tr\: A^t A $$ Thre are related to the quality of the representation of the point cloud from its $j$ first axis.

Main parameters

  • frac: boolean; whether to plot or use the fraction of variance per axis (instead of the eigenvalue itself)
  • cum: boolean; whether to plot the cumulated value up to a given axis instead of the value per axis
  • dot_size: each value is represented by a dot of size dot_size; no dotis displays if $dot\_size=-1$
  • title: title of the plot
In [7]:
dio.plot_eig(L, frac=True, cum=True, dot_size=20, title = "cumulated eigenvalues")
-> pydiodon:plot_eig()

Plotting components

Each point $A[i,:]$ is projected on the principal axis as $Y[i,:]$ with $Y=AV$. If axis $j$ and $\ell$ are selected as axis_1 and axis_2, a scatterplot with coordinates $(Y[i,j],Y[i,\ell])$ for point $i$ is displayed.

In [8]:
dio.plot_components_scatter(Y, dot_size=5, title="Principal components")

The plot can be saved by giving a filename and a format as arguments of plot_components_scatter()as in
$\rightarrow$ (check that directory plots exists)

In [9]:
dio.plot_components_scatter(Y, dot_size=5, title="Principal components", plotfile="plots/pca_components", fmt="png")

Other pairs of components can be plotted, as in

In [10]:
dio.plot_components_scatter(Y, axis_1=2, axis_2=3, dot_size=5, color="blue", title="Principal components")

Plotting correlation between variables

Each principal axis (new axis) is a linear combination of the variables (the old axis, with components as the columns of $A$). This is expressed by $V$ as $Y=AV$. $V$ is orthonormal $(V^t V = I_p)$. Hence, $V^{-1}=V^t$. The column $j$ of $V^{-1}$, i.e. the row $j$ of $V^t$, is the expression of the (old) variable in the (new) basis of the principal axis. This can be displayed graphically as follows, where both the correlations between the variables and between each variable and new axis can be seen and interpreted. Blue segments represent the projection of the variables (old axis) on the prncpal (new) axis. The length of the projection represents its quality (how much the new axis encompasses the informtion in the old axis). If the projection of varables $j$ and $j'$ are both good, their cosine in an indicator of their correlation.

In [11]:
dio.plot_var(V, varnames=colnames)

A second way to visualize the correlations between the variables and the principal axis is to daw a heatmap of matrix $V$. Indeed, we have $$ V[:,k] = \sum_j V[j,k] \mathbf{e}_j $$ by construction. Then, $$ V[j,k] = \langle V[:,k], \mathbf{e}_j \rangle $$ and is the correlation between the (new) principal axis $k$ and the (old) variable $j$. This can be seen for all variables and axis by visualizing the heatmap of $V$, which can be done by a call to function $\texttt{plot_var_heatmap()}$. The matrix $V$ is prnted first to facilitate the understanding of the process.

In [12]:
print("V =")
print(np.round(100*V)/100)
V =
[[-0.12  0.59 -0.02  0.07 -0.21  0.19 -0.09  0.16 -0.39  0.6 ]
 [ 0.49 -0.12 -0.05 -0.16 -0.06 -0.01 -0.34 -0.1  -0.72 -0.27]
 [-0.35 -0.23 -0.03 -0.31 -0.21  0.8  -0.05 -0.17 -0.04 -0.12]
 [ 0.08 -0.15 -0.48  0.74 -0.37  0.12  0.07 -0.17 -0.01 -0.05]
 [ 0.38 -0.35 -0.09 -0.08 -0.17  0.17 -0.24  0.67  0.28  0.28]
 [ 0.09  0.59  0.07  0.03 -0.29  0.1  -0.24  0.23  0.28 -0.6 ]
 [ 0.47  0.14  0.08 -0.16 -0.15  0.07 -0.19 -0.63  0.4   0.32]
 [ 0.43  0.15 -0.04 -0.17 -0.06  0.2   0.84  0.09 -0.07 -0.07]
 [ 0.09  0.23 -0.65 -0.07  0.66  0.22 -0.13  0.01  0.09 -0.03]
 [ 0.2   0.    0.57  0.51  0.44  0.42 -0.05  0.03 -0.03 -0.01]]
In [13]:
dio.plot_var_heatmap(V, varnames=colnames, cmap="viridis")
varnames are
abs.f420.5 ; Alk ; Altitude ; ARO_uppstream ; pH ; TOC ; Tot.N_ps ; TP ; Turbidity ; LC_Urban

Variables are in rows, and principal axis in colums.

Plotting cumulated quality per axis

The matrix $Y$ is the matrix of the coordinates of the items in the new basis, defined by principal axis (the columns of $V$). An item like $i$ is row $i$ of matrix $A$, i.e. $a_{i*} = A[i,:]$. Its coordinates in the basis of principal axis are $y_{i*} = Y[i,:]$. Let us denote by $y_{ir}$ the projection of $y_{i*}$ on the subspace defined by the first $r$ principal axis, i.e. $y_{ir} = (Y[i,1], \ldots,Y[1,r])$. Then, the quality of projection of item $i$ on the first $r$ axis is given by $$ Q(i,r) = \frac{\|y_{ir}\|^2}{\|y_{i*}\|^2} $$

In [14]:
Qual = dio.plot_cumulated_quality_per_item(Y,r=3)

The matrix $Q$ with $Q[:,r]$ i.e. the matrix with the cumulated quality of projection for all items in column $r$, is given by a call to $\texttt{dio.plot_cumulated_quality_per_item()}$. In this plot, we have, once the axis $r$ has been selected

  • the items $i$ on $x$ axis
  • $Q[i,r]$ on the $y$ axis

Scatter plot with components quality

The cumulated quality per axis can be displayed in a second way, in a scatterplot. Here is an example with scatter plot of axis 1 and 2.

In [15]:
r = 2
dio.plot_components_quality(Y, Qual, r=r)

This is the scatter plot of the point cloud of components on axis 1 and 2. The color and the size of one dot (one item) indicate the cumulated quality on the $r$ first principlal axis (here, $r=2$). Next plot is the scatter plot on axis 1 and 3, with cumulated quality on axis 3.

In [16]:
r = 3
dio.plot_components_quality(Y, Qual, axis_2=3, r=r)

References

For PCA

  • T. W. Anderson. An introduction to Multivariate Statistical Analysis. John Wiley & Sons, 1958.
  • A. J. Izenman. Modern Multivariate Statistical Techniques. Springer, NY, 2008.
  • I. T. Jolliffe. Principal Component Analysis. Springer, second edition, 2002.
  • L. Lebart, A. Morineau, and N. Tabard. Techniques de la description statistique. Bordas - Dunod, 1977.
  • L. Lebart, A. Morineau, and J.-P. Fénelon. Traitement des données statistiques. Dunod, Paris, 1982.
  • K. V. Mardia, J.T. Kent, and J. M. Bibby. Multivariate Analysis. Probability and Mathematical Statistics. Academic Press, 1979.

For dataset

F. Keck, A. Franc, and M. Kahlert. Disentangling processes driving freshwater diatoms biogeography: a multiscale approach - J. of Biogeography. 2018

For softwares

  • the python library is called pydiodon and can be downloaded from XXXX
  • the Cpp library is called Cppdiodon and can be downloaded from XXX
  • the "binding" of Cppdiodon with pydiodon is called Cppydiodon and can be downloaded from XXX

Further documentation

The methods used in those libraries and their pseudocodes are explained in