Documentation for Diffusor

topo.tpgraph.diffusion.Diffusor

Sklearn-compatible estimator for fast anisotropic diffusion with an adaptive neighborhood search algorithm. The Diffusion Maps algorithm was initially proposed by Coifman et al. in 2005 and has since been extended by many others. This implementation aggregates recent advances in diffusion harmonics; it innovates only by implementing an adaptively decaying kernel (the rate of decay depends on neighborhood density) and an adaptive neighborhood estimation approach.
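
In schematic form (following the fit() source below, with sigma_i the distance from sample i to its median nearest neighbor and beta_i the adaptive decay exponent), the operator built here is

$$
K_{ij} = \exp\!\left[-\left(\frac{d(x_i, x_j)}{\sigma_i}\right)^{\beta_i}\right], \qquad
K^{(\alpha)} = D^{-\alpha} K D^{-\alpha}, \qquad
T = \tilde{D}^{-1} K^{(\alpha)},
$$

where $D$ and $\tilde{D}$ are the diagonal degree (row-sum) matrices of the symmetrized $K$ and of $K^{(\alpha)}$, respectively. Eigendecomposing $T$ (if transitions=True) or $K$ and rescaling the top eigenvectors yields the multiscale maps returned by transform().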

Parameters

n_eigs : int (optional, default 50) Number of diffusion components to compute. This number can be iterated over to obtain views of the data at distinct spectral resolutions.

use_eigs : int or str (optional, default 'knee') Number of eigenvectors to use. If 'max', uses the maximum number of positive eigenvalues (up to the limit of numerical precision), or else the maximum number of computed components. If 'knee', uses the Kneedle algorithm to find an optimal cutoff point. If 'comp_gap', tries to find a discrete eigengap during the computation process.

n_neighbors : int (optional, default 10) Number of k-nearest-neighbors to compute. The adaptive kernel will normalize distances by each sample's distance to its median neighbor. Nonetheless, this hyperparameter remains a user input setting the minimal sample neighborhood resolution that drives the computation of the diffusion metrics: for practical purposes, the minimum number of samples one would expect to constitute a neighborhood of its own. Increasing k can generate more globally-comprehensive metrics and maps, up to a certain extent, though at the expense of fine-grained resolution. More generally, consider this a calculus discretization threshold.

ann : bool (optional, default True) Whether to use approximate nearest neighbors for graph construction with NMSLib. If False, uses sklearn's default implementation.

metric : str (optional, default 'cosine') Distance metric for building an approximate kNN graph. Users are encouraged to explore different metrics, such as 'cosine' and 'jaccard'. The 'hamming' and 'jaccard' distances are also available for string vectors. Accepted metrics include NMSLib and sklearn metrics, for example:

- 'sqeuclidean'
- 'euclidean'
- 'l1'
- 'lp' (requires setting the parameter p)
- 'cosine'
- 'angular'
- 'negdotprod'
- 'levenshtein'
- 'hamming'
- 'jaccard'
- 'jansen-shan'

p : int or float (optional, default 11/16) P for the Lp metric, when metric='lp'. Can be fractional. The default 11/16 approximates an astroid norm with some computational efficiency (exponents with 2^n bases are less painstakingly slow to compute); see the sketch below. See https://en.wikipedia.org/wiki/Lp_space for some context.
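
For intuition, a minimal sketch (not part of Diffusor; lp_distance is a hypothetical helper) of what a fractional Lp "distance" computes:

import numpy as np

# For p < 1 this is a quasi-norm (the triangle inequality fails), which
# tends to emphasize sparse differences between high-dimensional vectors.
def lp_distance(a, b, p=11/16):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.random.rand(64), np.random.rand(64)
print(lp_distance(a, b))         # fractional Lp with p = 11/16
print(lp_distance(a, b, p=2.0))  # ordinary Euclidean, for comparison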

transitions : bool (optional, default False) Whether to use the diffusion transitions graph when transforming. If True, decomposes a basis encoding neighborhood transition probabilities during eigendecomposition. If False (default), decomposes the diffusion kernel.

alpha : int or float (optional, default 1) Alpha in the diffusion maps literature. Controls how much the results are biased by data distribution. Defaults to 1, which is suitable for normalized data.

kernel_use : str (optional, default 'decay_adaptive') Which type of kernel to use. Four are implemented, covering the combinations of adaptive decay and neighborhood expansion: 'simple', 'decay', 'simple_adaptive' and 'decay_adaptive'. The first, 'simple', is a locally-adaptive kernel similar to that proposed by Nadler et al. (https://doi.org/10.1016/j.acha.2005.07.004) and implemented in Setty et al. (https://doi.org/10.1038/s41587-019-0068-4). The 'decay' option applies an adaptive decay rate, but no neighborhood expansion. The options suffixed with '_adaptive' apply the neighborhood expansion process. The default and recommended option is 'decay_adaptive'; a sketch of the two kernel families follows below. The neighborhood expansion can impact runtime, although this is usually not significant for datasets under 10e6 samples.
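
As a rough sketch of the difference between the two kernel families (mirroring the kernel construction in the fit() source below; variable names are illustrative):

import numpy as np

# d: distances from sample i to its neighbors; sigma_i: distance to its
# median nearest neighbor; pm_i: interpolated pseudomedian neighborhood
# size (between 2 and n_neighbors); k: n_neighbors.
def simple_kernel(d, sigma_i):
    return np.exp(-d / (sigma_i + 1e-10))

def decay_kernel(d, sigma_i, pm_i, k):
    # denser neighborhoods (smaller pm_i) get a larger exponent and thus
    # a faster-decaying, more local kernel
    return np.exp(-(d / (sigma_i + 1e-10)) ** np.power(2, (k - pm_i) / pm_i))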

norm : bool (optional, default True) Whether to normalize the kernel transition probabilities to approximate the Laplace-Beltrami Operator (LBO).

eigengap : bool (optional, default True) Whether to expand the eigendecomposition slightly and stop if the eigenvalues change sign (the limit of float64 precision). Used to guarantee numerical stability.

n_jobs : int (optional, default 4) Number of threads to use in calculations. Defaults to 4 for safety, but performance scales dramatically when using more threads.

plot_spectrum : bool (optional, default False) Whether to plot the spectrum decay analysis.

verbose : bool (optional, default False) Controls verbosity.

cache : bool (optional, default True) Whether to cache nearest-neighbors (before fit) and to store diffusion matrices after mapping (before transform).

Example

import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
from topo.tpgraph.diffusion import Diffusor

# Load example data and convert it to a sparse matrix for top performance
digits = load_digits()
data = csr_matrix(digits.data)

# Fit the diffusion process and obtain the multiscale diffusion maps
diff = Diffusor().fit(data)

msdiffmap = diff.transform(data)

fit(self, X)

Fits an adaptive anisotropic diffusion kernel to the data.

Parameters

X : input data. Takes in numpy arrays and scipy csr sparse matrices. Use with sparse data for top performance. You can adjust a series of parameters that can make the process faster and more informative depending on your dataset.

Returns
Diffusor object with the kernel Diffusor.K and the transition potential Diffusor.T.
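
Both matrices are stored as scipy sparse matrices after fitting and can be inspected directly:

diff = Diffusor().fit(data)
diff.K  # symmetrized adaptive kernel
diff.T  # diffusion operator (transition potential)
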
Source code in topo/tpgraph/diffusion.py
def fit(self, X):

    """
    Fits an adaptive anisotropic diffusion kernel to the data.

    Parameters
    ----------
    X :
        input data. Takes in numpy arrays and scipy csr sparse matrices.
        Use with sparse data for top performance. You can adjust a series of
        parameters that can make the process faster and more informative
        depending on your dataset.

    Returns
    -------
        Diffusor object with the kernel Diffusor.K and the transition potential Diffusor.T.

    """
    data = X
    self.start_time = time.time()
    self.N = data.shape[0]
    self.M = data.shape[1]
    if self.kernel_use not in ['simple', 'simple_adaptive', 'decay', 'decay_adaptive']:
        raise Exception('Kernel must be either \'simple\', \'simple_adaptive\', \'decay\' or \'decay_adaptive\'.') 
    if self.ann:
        # Construct an approximate k-nearest-neighbors graph
        anbrs = ann.NMSlibTransformer(n_neighbors=self.n_neighbors,
                                      metric=self.metric,
                                      p=self.p,
                                      method='hnsw',
                                      n_jobs=self.n_jobs,
                                      M=self.M,
                                      efC=self.efC,
                                      efS=self.efS,
                                      verbose=self.verbose).fit(data)
        knn = anbrs.transform(data)
        # X, y specific stds: Normalize by the distance of median nearest neighbor to account for neighborhood size.
        median_k = int(np.floor(self.n_neighbors / 2))  # np.int was removed in NumPy >= 1.24
        adap_sd = np.zeros(self.N)
        for i in np.arange(len(adap_sd)):
            adap_sd[i] = np.sort(knn.data[knn.indptr[i]: knn.indptr[i + 1]])[
                median_k - 1
                ]
    else:
        if self.metric == 'lp':
            raise Exception('Generalized Lp distances are available only with `ann` set to True.')

        # Construct a k-nearest-neighbors graph
        nbrs = NearestNeighbors(n_neighbors=int(self.n_neighbors), metric=self.metric, n_jobs=self.n_jobs).fit(
            data)
        knn = nbrs.kneighbors_graph(data, mode='distance')
        # X, y specific stds: Normalize by the distance of median nearest neighbor to account for neighborhood size.
        median_k = int(np.floor(self.n_neighbors / 2))  # np.int was removed in NumPy >= 1.24
        adap_sd = np.zeros(self.N)
        for i in np.arange(len(adap_sd)):
            adap_sd[i] = np.sort(knn.data[knn.indptr[i]: knn.indptr[i + 1]])[
                median_k - 1
                ]

    # Distance metrics
    x, y, dists = find(knn)  # k-nearest-neighbor distances

    if self.cache:
        self.dists = knn
        self.adap_sd = adap_sd

    # Neighborhood graph expansion
    # define decay as sample's pseudomedian k-nearest-neighbor
    pm = np.interp(adap_sd, (adap_sd.min(), adap_sd.max()), (2, self.n_neighbors))

    self.omega = pm
    # adaptive neighborhood size
    if self.kernel_use == 'simple_adaptive' or self.kernel_use == 'decay_adaptive':
        self.new_k = int(self.n_neighbors + (self.n_neighbors - pm.max()))
        # increase neighbor search:
        anbrs_new = ann.NMSlibTransformer(n_neighbors=self.new_k,
                                          metric=self.metric,
                                          method='hnsw',
                                          n_jobs=self.n_jobs,
                                          p=self.p,
                                          M=self.M,
                                          efC=self.efC,
                                          efS=self.efS).fit(data)
        knn_new = anbrs_new.transform(data)

        x_new, y_new, dists_new = find(knn_new)

        # adaptive neighborhood size
        adap_nbr = np.zeros(self.N)
        for i in np.arange(len(adap_nbr)):
            adap_k = int(np.floor(pm[i]))
            adap_nbr[i] = np.sort(knn_new.data[knn_new.indptr[i]: knn_new.indptr[i + 1]])[
                adap_k - 1
            ]

        if self.cache:
            self.dists_new = knn_new
            self.adap_nbr_sd = adap_nbr

    if self.kernel_use == 'simple':
        # X, y specific stds
        dists = dists / (adap_sd[x] + 1e-10)  # Normalize by the distance of median nearest neighbor
        W = csr_matrix((np.exp(-dists), (x, y)), shape=[self.N, self.N])

    if self.kernel_use == 'simple_adaptive':
        # X, y specific stds
        dists = dists_new / (adap_nbr[x_new] + 1e-10)  # Normalize by normalized contribution to neighborhood size.
        W = csr_matrix((np.exp(-dists), (x_new, y_new)), shape=[self.N, self.N])

    if self.kernel_use == 'decay':
        # X, y specific stds
        dists = (dists / (adap_sd[x] + 1e-10)) ** np.power(2, ((self.n_neighbors - pm[x]) / pm[x]))
        W = csr_matrix((np.exp(-dists), (x, y)), shape=[self.N, self.N])

    if self.kernel_use == 'decay_adaptive':
        # X, y specific stds
        dists = (dists_new / (adap_nbr[x_new] + 1e-10)) ** np.power(2, (((int(self.n_neighbors + (self.n_neighbors - pm.max()))) - pm[x_new]) / pm[x_new]))  # Normalize by normalized contribution to neighborhood size.
        W = csr_matrix((np.exp(-dists), (x_new, y_new)), shape=[self.N, self.N])

    # Kernel construction
    kernel = (W + W.T) / 2
    self.K = kernel

    # handle nan, zeros
    self.K.data = np.where(np.isnan(self.K.data), 1, self.K.data)
    # Diffusion through Markov chain

    D = np.ravel(self.K.sum(axis=1))
    if self.alpha > 0:
        # L_alpha
        D[D != 0] = D[D != 0] ** (-self.alpha)
        mat = csr_matrix((D, (range(self.N), range(self.N))), shape=[self.N, self.N])
        kernel = mat.dot(self.K).dot(mat)
        D = np.ravel(kernel.sum(axis=1))

    D[D != 0] = 1 / D[D != 0]

    # Setting the diffusion operator
    if self.norm:
        self.K = kernel
        self.T = csr_matrix((D, (range(self.N), range(self.N))), shape=[self.N, self.N]).dot(self.K)
    else:
        self.T = csr_matrix((D, (range(self.N), range(self.N))), shape=[self.N, self.N]).dot(self.K)

    # Guarantee symmetry
    self.T = (self.T + self.T.T) / 2
    self.T[(np.arange(self.T.shape[0]), np.arange(self.T.shape[0]))] = 0

    end = time.time()
    if self.verbose:
        print('Diffusion time = %f (sec), per sample=%f (sec), per sample adjusted for thread number=%f (sec)' %
              (end - self.start_time, float(end - self.start_time) / self.N, self.n_jobs * float(end - self.start_time) / self.N))


    return self

ind_dist_grad(self, data)

Utility function to get indices, distances and gradients from a multiscale diffusion map.

Parameters

data : Input data matrix (numpy array, pandas df, csr_matrix).

n_components : int (optional, default None) Number of components to map to prior to learning.

Returns

A tuple containing neighborhood indices, distances, gradient and a knn graph.
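
For example, assuming diff has already been fit as in the example above:

ind, dists, grad, graph = diff.ind_dist_grad(data)
# ind:   neighborhood indices in the multiscale diffusion space
# dists: the corresponding distances
# grad:  gradient estimates
# graph: a kNN graph built on the multiscale components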

Source code in topo/tpgraph/diffusion.py
def ind_dist_grad(self, data):
    """
    Utility function to get indices, distances and gradients from a multiscale diffusion map.

    Parameters
    ----------
    data :
        Input data matrix (numpy array, pandas df, csr_matrix).

    n_components: int (optional, default None)
        Number of components to map to prior to learning.

    Returns
    -------
    A tuple containing neighborhood indices, distances, gradient and a knn graph.

    """

    self.start_time = time.time()
    # Fit an optimal number of components based on the eigengap
    # Use user's  or default initial guess
    multiplier = self.N // 10e4
    # initial eigen value decomposition
    if self.transitions:
        D, V = eigs(self.T, self.n_components, tol=1e-4, maxiter=self.N)
    else:
        D, V = eigs(self.K, self.n_components, tol=1e-4, maxiter=self.N)
    D = np.real(D)
    V = np.real(V)
    inds = np.argsort(D)[::-1]
    D = D[inds]
    V = V[:, inds]
    # Normalize by the first diffusion component
    for i in range(V.shape[1]):
        V[:, i] = V[:, i] / np.linalg.norm(V[:, i])
    vals = np.array(V)
    pos = np.sum(vals > 0, axis=0)
    residual = np.sum(vals < 0, axis=0)

    if self.eigengap and len(residual) < 1:
        #expand eigendecomposition
        target = self.n_components * multiplier
        while residual < 3:
            print('Eigengap not found for determined number of components. Expanding eigendecomposition to '
                  + str(target) + ' components.')
            if self.transitions:
                D, V = eigs(self.T, target, tol=1e-4, maxiter=self.N)
            else:
                D, V = eigs(self.K, target, tol=1e-4, maxiter=self.N)
            D = np.real(D)
            V = np.real(V)
            inds = np.argsort(D)[::-1]
            D = D[inds]
            V = V[:, inds]
            # Normalize by the first diffusion component
            for i in range(V.shape[1]):
                V[:, i] = V[:, i] / np.linalg.norm(V[:, i])
            vals = np.array(V)
            residual = np.sum(vals < 0, axis=0)
            target = target * 2

        if len(residual) > 30:
            self.n_components = len(pos) + 15
            # adapted eigen value decomposition
            if self.transitions:
                D, V = eigs(self.T, self.n_components, tol=1e-4, maxiter=self.N)
            else:
                D, V = eigs(self.K, self.n_components, tol=1e-4, maxiter=self.N)
            D = np.real(D)
            V = np.real(V)
            inds = np.argsort(D)[::-1]
            D = D[inds]
            V = V[:, inds]
            # Normalize by the first diffusion component
            for i in range(V.shape[1]):
                V[:, i] = V[:, i] / np.linalg.norm(V[:, i])


    # Create the results dictionary
    self.res = {'EigenVectors': V, 'EigenValues': D}
    self.res['EigenVectors'] = pd.DataFrame(self.res['EigenVectors'])
    self.res['EigenValues'] = pd.Series(self.res['EigenValues'])

    self.res['MultiscaleComponents'], self.kn, self.scaled_eigs = multiscale.multiscale(self.res,
                                                                                        n_eigs=self.use_eigs,
                                                                                        verbose=self.verbose)


    anbrs = ann.NMSlibTransformer(n_neighbors=self.n_neighbors,
                                  metric='cosine',
                                  method='hnsw',
                                  n_jobs=self.n_jobs,
                                  M=self.M,
                                  efC=self.efC,
                                  efS=self.efS,
                                  dense=True,
                                  verbose=self.verbose
                                  ).fit(self.res['MultiscaleComponents'])

    ind, dists, grad, graph = anbrs.ind_dist_grad(self.res['MultiscaleComponents'])

    end = time.time()
    print('Diffusion time = %f (sec), per sample=%f (sec), per sample adjusted for thread number=%f (sec)' %
          (end - self.start_time, float(end - self.start_time) / self.N, self.n_jobs * float(end - self.start_time) / self.N))
    if self.plot_spectrum:
        self.spectrum_plot()

    return ind, dists, grad, graph

rescale(self, n_eigs=None)

Re-scale the multiscale procedure to a new number of components.

Parameters

self : Diffusor object.

n_eigs : int. Number of diffusion components to multiscale.

Returns

np.ndarray containing the new multiscaled basis.
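
For example, to re-weight the cached decomposition (populated by transform()) to a different number of components without recomputing the eigenvectors (30 is an arbitrary illustrative choice):

mms = diff.rescale(n_eigs=30)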

Source code in topo/tpgraph/diffusion.py
def rescale(self, n_eigs=None):
    """
    Re-scale the multiscale procedure to a new number of components.

    Parameters
    ----------
    self : Diffusor object.

    n_eigs : int. Number of diffusion components to multiscale.

    Returns
    -------

    np.ndarray containing the new multiscaled basis.

    """
    if n_eigs is None:
        n_eigs = self.n_components

    mms, self.kn, self.scaled_eigs = multiscale.multiscale(self.res,
                                                            n_eigs=n_eigs,
                                                            verbose=self.verbose)

    self.res['MultiscaleComponents'] = mms
    return mms

spectrum_plot(self, bla=None)

Plot the decay spectra.

Parameters

self : Diffusor object.

bla : Here only for autodoc's sake.

Returns

A nice plot of the diffusion spectra.
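
For example, once transform() has populated the spectrum analysis:

diff.spectrum_plot()  # eigenvalue decay, with the Kneedle 'knee' marked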

Source code in topo/tpgraph/diffusion.py
def spectrum_plot(self, bla=None):
    """
    Plot the decay spectra.

    Parameters
    ----------
    self : Diffusor object.

    bla : Here only for autodoc's sake.

    Returns
    -------

    A nice plot of the diffusion spectra.

    """

    if self.kn is None:
        msc, self.kn, self.scaled_eigs = multiscale.multiscale(self.res,
                                             n_eigs=self.use_eigs,
                                             verbose=self.verbose)

    ax1 = plt.subplot(2, 1, 1)
    ax1.set_title('Spectrum decay and \'knee\' (%i)' % int(self.kn.knee))
    ax1.plot(self.kn.x, self.kn.y, 'b', label='data')
    ax1.set_ylabel('Eigenvalues')
    ax1.set_xlabel('Eigenvectors')
    ax1.vlines(
        self.kn.knee, plt.ylim()[0], plt.ylim()[1], linestyles="--", label='Knee'
    )
    ax1.legend(loc='best')

    ax2 = plt.subplot(2, 1, 2)
    ax2.set_title('Curve analysis')
    ax2.plot(self.kn.x_normalized, self.kn.y_normalized, "b", label="normalized")
    ax2.plot(self.kn.x_difference, self.kn.y_difference, "r", label="differential")
    ax2.set_xticks(
        np.arange(self.kn.x_normalized.min(), self.kn.x_normalized.max() + 0.1, 0.1)
    )
    ax2.set_yticks(
        np.arange(self.kn.y_difference.min(), self.kn.y_normalized.max() + 0.1, 0.1)
    )

    ax2.vlines(
        self.kn.norm_knee,
        plt.ylim()[0],
        plt.ylim()[1],
        linestyles="--",
        label="Knee",
    )
    ax2.legend(loc="best")
    plt.tight_layout()
    plt.show()
    return plt

transform(self, X)

Fits the renormalized Laplacian approximating the Laplace-Beltrami Operator in a discrete eigendecomposition, then multiscales the resulting components.

Parameters


X : input data. Takes in numpy arrays and scipy csr sparse matrices. Use with sparse data for top performance. You can adjust a series of parameters that can make the process faster and more informative depending on your dataset.

Returns


Diffusor.res['MultiscaleComponents']
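
Assuming the multiscale step follows the usual weighting from the diffusion maps literature (as in Setty et al.), each retained eigenvector is rescaled by a factor summing its contribution over all diffusion times:

$$
\Psi = \left[\frac{\lambda_1}{1-\lambda_1}\,\psi_1,\ \dots,\ \frac{\lambda_m}{1-\lambda_m}\,\psi_m\right],
\qquad \text{since} \qquad
\sum_{t=1}^{\infty} \lambda_i^{\,t} = \frac{\lambda_i}{1-\lambda_i} \ \ \text{for } 0 < \lambda_i < 1.
$$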

Source code in topo/tpgraph/diffusion.py
def transform(self, X):
    """
    Fits the renormalized Laplacian approximating the Laplace-Beltrami operator
    in a discrete eigendecomposition, then multiscales the resulting components.

    Parameters
    ----------
    X :
        input data. Takes in numpy arrays and scipy csr sparse matrices.
        Use with sparse data for top performance. You can adjust a series of
        parameters that can make the process faster and more informative
        depending on your dataset.

    Returns
    -------

    ``Diffusor.res['MultiscaleComponents']``

    """
    self.start_time = time.time()
    # Fit an optimal number of components based on the eigengap
    # Use user's  or default initial guess
    # initial eigen value decomposition
    if self.transitions:
        D, V = eigs(self.T, self.n_components, tol=1e-4, maxiter=(self.N // 10))
    else:
        D, V = eigs(self.K, self.n_components, tol=1e-4, maxiter=(self.N // 10))
    D = np.real(D)
    V = np.real(V)
    inds = np.argsort(D)[::-1]
    D = D[inds]
    V = V[:, inds]
    # Normalize by the first diffusion component
    for i in range(V.shape[1]):
        V[:, i] = V[:, i] / np.linalg.norm(V[:, i])
    vals = np.array(V)
    pos = np.sum(vals > 0, axis=0)
    residual = np.sum(vals < 0, axis=0)

    if self.eigengap and len(residual) < 1:
        #expand eigendecomposition
        target = self.n_components + 30
        while residual < 3:
            print('Eigengap not found for determined number of components. Expanding eigendecomposition to '
                  + str(target) + ' components.')
            if self.transitions:
                D, V = eigs(self.T, target, tol=1e-4, maxiter=(self.N // 10))
            else:
                D, V = eigs(self.K, target, tol=1e-4, maxiter=(self.N // 10))
            D = np.real(D)
            V = np.real(V)
            inds = np.argsort(D)[::-1]
            D = D[inds]
            V = V[:, inds]
            # Normalize by the first diffusion component
            for i in range(V.shape[1]):
                V[:, i] = V[:, i] / np.linalg.norm(V[:, i])
            vals = np.array(V)
            residual = np.sum(vals < 0, axis=0)
            pos = np.sum(vals > 0, axis=0)
            target = int(target * 1.6)

    if len(residual) > 30:
        self.n_components = len(pos) + 5
        # adapted eigen value decomposition
        if self.transitions:
            D, V = eigs(self.T, self.n_components, tol=1e-4, maxiter=self.N)
        else:
            D, V = eigs(self.K, self.n_components, tol=1e-4, maxiter=self.N)
        D = np.real(D)
        V = np.real(V)
        inds = np.argsort(D)[::-1]
        D = D[inds]
        V = V[:, inds]

    if not self.cache:
        del self.K
        del self.T

    # Create the results dictionary
    self.res = {'EigenVectors': V, 'EigenValues': D}
    self.res['EigenVectors'] = pd.DataFrame(self.res['EigenVectors'])
    self.res["EigenValues"] = pd.Series(self.res["EigenValues"])

    self.res['MultiscaleComponents'], self.kn, self.scaled_eigs = multiscale.multiscale(self.res, verbose=self.verbose)

    end = time.time()
    if self.verbose:
        print('Multiscale decomposition time = %f (sec), per sample=%f (sec), per sample adjusted for thread number=%f (sec)' %
              (end - self.start_time, float(end - self.start_time) / self.N, self.n_jobs * float(end - self.start_time) / self.N))
    if self.plot_spectrum:
        self.spectrum_plot()

    return self.res['MultiscaleComponents']