import timeit
import numpy as np
import scipy.spatial.distance as ssd
import matplotlib.pyplot as plt
import pydiodon as dio
The dataset analyzed here is a $1,458 \times 1,458$ matrix of dissimilarities between molecular markers of trees in French Guiana from a study on barcopding Amazonian trees. See guiana trees for a presentation of the dataset. Note that the dataset is available in the git data4test but with another suffix (.sw.dis). So, you are advised to load the dataset with the function load_dataset() if the library.
First load the dataset by using the function load_dataset(). The name of the dataset is guiana_trees. As it is a set of dissimilarities, it is named Dis. These are pairwise Smith-Waterman scores of local alignment between molecular markers of taxonomic interest.
Dis, rn, cn = dio.load_dataset("guiana_trees")
Now,
X, S = dio.mds(Dis)
dio.plot_components_scatter(X)
dio.plot_eig(S)
We see that most of the inertia of the point cloud is concentrated in the first axis (the $x$ axis is the ranks of the eigenvalues, and $y$ axis is the fraction of the inertia carried by each axis, in decreasing order). We just adjust the call to dio.plot_eig() for a clearer display. Therefore, we add two arguments:
dio.plot_eig(S, k=100, cum=True)
dio.plot_eig(S, k=30, cum=True, pr=True, x11=False)
One sees that the plane $(1,2)$ of the point cloud represents only $16 \%$ of the inertia of the cloud. Hence, studies on its shape should take into account more dimensions. This can be done with visualization of the point cloud with function plot_components_splines() which implements parallel coordinates.
Parallel coordinates is a way to visualize simultaneously several coordinates of a point clouds in a high dimensional space. Let us suppose we have a point cloud $X$ of $n$ points with coordinate on axis $j$ for point $i$ being $x_{ij}$ , and that we wish to visualize coordinates from 1 to $p$. To do this, we set a plot with labels 1 to $p$ on $x$ axis. Each point $i \in \{1, . . . , n\}$ is represented by a curve running through all points $(j, x_{ij})$ with $j \in {1, . . . , p}. The curves are smoothed with cubic splines (hence the name of the method). See parallel coordinates for further information on this method.
Let us start with the visualisation of variatons of 30 first coordinates ($67 \%$ of the inertia). On can see that several items (one item is one curve) have significant coodinates even on axis with large rank.
dio.plot_components_splines(X, n_axis=30)
To get a better idea of the concentration of the coordinates on the first xis with more details than just the cumulated fractions of inertia from eigenvalues, let us visualize all axis, by
dio.plot_components_splines(X)
One can see that the dispersion around the centroid of the cloud (here, $(0, \ldots,0)$) decreases smoothly, and that we have to wait for large ranks for the coordinates to be non significant.
n = 2000
p = 1000
A = np.random.random((n,p))
dis = ssd.pdist(A)
Dis = ssd.squareform(dis)
print(Dis.shape)
print("\nmds with numpy, method svd")
t_start = timeit.default_timer()
Xrs, Srs = dio.mds(Dis)
t_stop = timeit.default_timer()
print("\nDuration with numpy, SVD is", t_stop-t_start)
dio.plot_components_splines(Xrs)
print("\nmds with numpy, method grp")
t_start = timeit.default_timer()
Xrg, Srg = dio.mds(Dis, k=800, meth="grp")
t_stop = timeit.default_timer()
print("\nDuration with numpy, rSVD is", t_stop-t_start)
k = 20
plt.plot(Srs[0:k], c="green")
plt.plot(Srg[0:k], c="blue")
plt.show()