
Module pyqt_fit.kde

Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>

Module implementing kernel-based estimation of density of probability.

Kernel Density Estimation Methods

class pyqt_fit.kde.KDE1D(xdata, **kwords)[source]

Perform a kernel based density estimation in 1D, possibly on a bounded domain \([L,U]\).

Parameters: xdata (ndarray) – 1D array with the data points

Any other named argument will be equivalent to setting the property after the fact. For example:

>>> xs = [1,2,3]
>>> k = KDE1D(xs, lower=0)

will be equivalent to:

>>> k = KDE1D(xs)
>>> k.lower = 0
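
The equivalence between keyword arguments and property assignment can be pictured with a small stand-in class. This is only a sketch: MiniKDE is a hypothetical name and is not the library's implementation; it assumes the constructor simply forwards each named argument to setattr, and that deleting lower resets it to minus infinity as described for the lower property.

```python
import math


class MiniKDE:
    """Hypothetical stand-in for KDE1D; illustrates keyword handling only."""

    def __init__(self, xdata, **kwords):
        self.xdata = list(xdata)
        self._lower = -math.inf
        # Each named argument is applied exactly like an attribute assignment.
        for name, value in kwords.items():
            setattr(self, name, value)

    @property
    def lower(self):
        return self._lower

    @lower.setter
    def lower(self, value):
        self._lower = float(value)

    @lower.deleter
    def lower(self):
        # Deleting the bound resets it to minus infinity.
        self._lower = -math.inf


xs = [1, 2, 3]
k1 = MiniKDE(xs, lower=0)  # bound given as a named argument
k2 = MiniKDE(xs)
k2.lower = 0               # bound set after the fact
```

Both objects end up with the same lower bound, and del k1.lower puts the bound back to minus infinity.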

The method relies on an estimator of kernel density given by:

\[f(x) \triangleq \frac{1}{hW} \sum_{i=1}^n \frac{w_i}{\lambda_i} K\left(\frac{X-x}{h\lambda_i}\right)\]

\[W = \sum_{i=1}^n w_i\]

where \(h\) is the bandwidth of the kernel (bandwidth), \(K\) is the kernel used for the density estimation (kernel), \(w_i\) are the weights of the data points (weights) and \(\lambda_i\) are the adaptation factors of the kernel width (lambdas). Inside the sum, \(X\) stands for the \(i\)-th data point. \(K\) should be a function such that:

\[\begin{split}\begin{array}{rcl} \int_\mathbb{R} K(z) dz &=& 1 \\ \int_\mathbb{R} zK(z) dz &=& 0 \\ \int_\mathbb{R} z^2K(z) dz &<& \infty \quad (\approx 1) \end{array}\end{split}\]

In other words, the kernel should integrate to 1 (i.e. be a valid probability density), have mean 0 (i.e. be centered) and have finite variance. It is even recommended that the variance be close to 1, so that the bandwidth has a uniform meaning across kernels.
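
As an illustration of the unbounded estimator, here is a minimal pure-Python sketch that evaluates the sum term by term with a standard normal kernel. The names gaussian_kernel and kde_estimate are hypothetical and are not part of pyqt_fit; the code only mirrors the formula and makes no attempt at efficiency.

```python
import math


def gaussian_kernel(z):
    """Standard normal density: integrates to 1, mean 0, variance 1."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kde_estimate(x, data, h, weights=None, lambdas=None, kernel=gaussian_kernel):
    """Evaluate f(x) = 1/(h W) * sum_i w_i/lambda_i * K((X_i - x)/(h lambda_i))."""
    n = len(data)
    weights = [1.0] * n if weights is None else list(weights)
    lambdas = [1.0] * n if lambdas is None else list(lambdas)
    total_weight = sum(weights)  # W in the formula
    acc = 0.0
    for x_i, w_i, lam_i in zip(data, weights, lambdas):
        acc += w_i / lam_i * kernel((x_i - x) / (h * lam_i))
    return acc / (h * total_weight)


data = [1.0, 2.0, 3.0]
density_at_2 = kde_estimate(2.0, data, h=0.5)

# Riemann sum over [-5, 11], wide enough to hold essentially all the mass,
# using non-trivial weights and per-point bandwidth scalings.
step = 0.01
total = step * sum(
    kde_estimate(-5.0 + j * step, data, h=0.5,
                 weights=[1, 2, 1], lambdas=[1.0, 0.5, 2.0])
    for j in range(1600)
)
```

The normalisation by \(hW\) and the \(1/\lambda_i\) factor are what keep the total mass at 1 whatever the weights and adaptation factors are.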

If the domain of the density estimation is bounded to the interval \([L,U]\) (i.e. from lower to upper), the density is then estimated with:

\[f(x) \triangleq \frac{1}{hW} \sum_{i=1}^n \frac{w_i}{\lambda_i} \hat{K}(x;X,\lambda_i h,L,U)\]

where \(\hat{K}\) is a modified kernel that depends on the exact method used.

To express the various methods, we will refer to the following functions:

\[a_0(l,u) = \int_l^u K(z) dz\]

\[a_1(l,u) = \int_l^u zK(z) dz\]

\[a_2(l,u) = \int_l^u z^2K(z) dz\]
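
To get a feel for these quantities, the following sketch approximates \(a_0\), \(a_1\) and \(a_2\) with a midpoint rule for a standard normal kernel. The function name truncated_moment and the choice of integration scheme are assumptions made for illustration; the library is free to use closed forms instead.

```python
import math


def gaussian_kernel(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def truncated_moment(k, l, u, kernel=gaussian_kernel, steps=20000):
    """Midpoint-rule approximation of a_k(l, u), the integral of z**k K(z) over [l, u]."""
    width = (u - l) / steps
    total = 0.0
    for j in range(steps):
        z = l + (j + 0.5) * width  # midpoint of the j-th sub-interval
        total += z ** k * kernel(z)
    return total * width


# Over a range that covers practically the whole real line, the three values
# approach the kernel conditions: total mass 1, mean 0, variance 1.
a0 = truncated_moment(0, -8.0, 8.0)
a1 = truncated_moment(1, -8.0, 8.0)
a2 = truncated_moment(2, -8.0, 8.0)

# On a half line, only half of the mass remains.
half = truncated_moment(0, 0.0, 8.0)
```

On a bounded domain the integration limits are finite, so \(a_0\) drops below 1; this lost mass is what the modified kernels have to account for.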

The default methods are implemented in the kde_methods module.

__call__(points, output=None)[source]

This method is an alias for KDE1D.evaluate()

bandwidth[source]

Bandwidth of the kernel. Can be set either as a fixed value or using a bandwidth calculator, that is a function of signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.

closed[source]

Returns True if the density domain is closed (i.e. lower and upper are both finite).

copy()[source]

Shallow copy of the KDE object

covariance[source]

Covariance of the gaussian kernel. Can be set either as a fixed value or using a bandwidth calculator, that is a function of signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.

evaluate(points, output=None)[source]

Evaluate the KDE on the set of points given by points.

grid(N=None, cut=None)[source]

Evaluate the density on a grid of N points spanning the whole dataset.

Returns: a tuple with the mesh on which the density is evaluated and the density itself

kernel[source]

Kernel object. Should provide the following methods:

kernel.pdf(xs)
Density of the kernel, denoted \(K(x)\)
kernel.cdf(z)
Cumulative distribution function, that is \(F^K(z) = \int_{-\infty}^z K(x) dx\)
kernel.pm1(z)
First partial moment, defined by \(\mathcal{M}^K_1(z) = \int_{-\infty}^z xK(x)dx\)
kernel.pm2(z)
Second partial moment, defined by \(\mathcal{M}^K_2(z) = \int_{-\infty}^z x^2K(x)dx\)
kernel.fft(z)
FFT of the kernel on the points of z. The points will always be provided as a grid with \(2^n\) points, representing the whole frequency range to be explored. For convenience, the second half of the points will be provided as negative values.
kernel.dct(z)
DCT of the kernel on the points of z. The points will always be provided as a grid with \(2^n\) points, representing the whole frequency range to be explored.

By default, the kernel is an instance of kernels.normal_kernel1d
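
For a standard normal kernel, the first four methods have closed forms, sketched below. GaussianKernelSketch is a hypothetical class written for illustration, not the library's kernels.normal_kernel1d; the fft and dct methods are left out on purpose. With these methods, the functions \(a_0\), \(a_1\) and \(a_2\) follow by differences, for example \(a_0(l,u) = F^K(u) - F^K(l)\).

```python
import math


class GaussianKernelSketch:
    """Hypothetical kernel object exposing pdf, cdf, pm1 and pm2 (no fft/dct)."""

    def pdf(self, z):
        # K(z): standard normal density.
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    def cdf(self, z):
        # Integral of K from -infinity to z.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def pm1(self, z):
        # Integral of x K(x) from -infinity to z; for the normal density it is -K(z).
        return -self.pdf(z)

    def pm2(self, z):
        # Integral of x^2 K(x) from -infinity to z; equals cdf(z) - z K(z).
        return self.cdf(z) - z * self.pdf(z)


kernel = GaussianKernelSketch()

# a_0, a_1, a_2 on the interval [l, u] obtained as differences.
l, u = -1.0, 2.0
a0 = kernel.cdf(u) - kernel.cdf(l)
a1 = kernel.pm1(u) - kernel.pm1(l)
a2 = kernel.pm2(u) - kernel.pm2(l)
```

The two partial-moment formulas can be checked by differentiation: the derivative of \(-K(z)\) is \(zK(z)\), and the derivative of \(F^K(z) - zK(z)\) is \(z^2K(z)\).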

lambdas[source]

Scaling of the bandwidth, per data point. It can be either a single value or an array with one value per data point.

When deleted, the lambdas are reset to 1.

lower[source]

Lower bound of the density domain. When deleted, it is reset to \(-\infty\).

method[source]

Select the method to use. The available methods are in the pyqt_fit.kde_methods sub-module.

The method is an object that should provide the following:

method(kde, points, output)
Evaluate the KDE defined by the kde object on the points. If output is provided, it should have the right shape and the result should be written in it.
method.grid(kde, N, cut)
Evaluate the KDE defined by the kde object on a grid. See pyqt_fit.kde_methods.generate_grid() for a detailed explanation of how the grid is computed.
method.name
Return a user-readable name for the method
str(method)
Should return the method’s name
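
A method object satisfying this protocol could look like the sketch below. NaiveMethod is hypothetical and does not exist in pyqt_fit.kde_methods; it assumes the kde object exposes xdata, bandwidth and kernel.pdf, ignores bounds, weights and lambdas, and picks arbitrary defaults for N and cut (64 points, 3 bandwidths beyond the data).

```python
import math
from types import SimpleNamespace


class NaiveMethod:
    """Hypothetical method object: direct summation on an unbounded domain."""

    name = "naive"

    def __call__(self, kde, points, output=None):
        if output is None:
            output = [0.0] * len(points)
        n = len(kde.xdata)
        for j, x in enumerate(points):
            acc = sum(kde.kernel.pdf((x_i - x) / kde.bandwidth) for x_i in kde.xdata)
            output[j] = acc / (kde.bandwidth * n)  # result written in place
        return output

    def grid(self, kde, N=None, cut=None):
        N = 64 if N is None else N        # assumed default, not the library's
        cut = 3.0 if cut is None else cut  # assumed default, not the library's
        lo = min(kde.xdata) - cut * kde.bandwidth
        hi = max(kde.xdata) + cut * kde.bandwidth
        mesh = [lo + (hi - lo) * j / (N - 1) for j in range(N)]
        return mesh, self(kde, mesh)

    def __str__(self):
        return self.name


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Minimal stand-in for a KDE object, carrying only what NaiveMethod reads.
fake_kde = SimpleNamespace(
    xdata=[1.0, 2.0, 3.0],
    bandwidth=0.5,
    kernel=SimpleNamespace(pdf=normal_pdf),
)

method = NaiveMethod()
buffer = [0.0, 0.0]
result = method(fake_kde, [1.0, 2.0], buffer)
mesh, values = method.grid(fake_kde, N=5)
```

Passing a pre-allocated output lets a caller reuse the same buffer across evaluations, which is why the call returns the very object it was given.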

update_bandwidth()[source]

Re-compute the bandwidth if it was specified as a function.

upper[source]

Upper bound of the density domain. When deleted, it is reset to \(\infty\).

weights[source]

Weights associated with each data point. It can be either a single value, or an array with one value per data point. If a single value is provided, the weights will always be set to 1.

Estimation in a Different Domain

class pyqt_fit.kde.TransformKDE(kde, trans, inv=None, Dinv=None)[source]

Compute the Kernel Density Estimate of a dataset, transforming it first to a domain where distances are “more meaningful”.

Often, KDE is best estimated in a different domain. This object takes a KDE1D object (or one compatible), and a transformation function.

Given a random variable \(X\) of distribution \(f_X\), the random variable \(Y = g(X)\) has a distribution \(f_Y\) given by:

\[f_Y(y) = \left| \frac{1}{g'(g^{-1}(y))} \right| \cdot f_X(g^{-1}(y))\]

In our terms, \(Y\) is the random variable the user is interested in, and \(X\) is the random variable we can estimate using the KDE. In this case, \(g\) is the transform from \(X\) to \(Y\), and \(g^{-1}\) is the transform from \(Y\) to \(X\).

So to estimate the distribution on a set of points given in \(y\), we need a total of three functions:

  • Direct function: transform from the original space to the one in which the KDE is performed (i.e. \(g^{-1}: y \mapsto x\))
  • Inverse function: transform from the KDE space to the original one (i.e. \(g: x \mapsto y\))
  • Derivative of the inverse function

If the derivative is not provided, it will be estimated numerically.
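
As a worked case, take strictly positive data and estimate the density in log space: the direct function is \(g^{-1} = \log\), the inverse function is \(g = \exp\), and its derivative is \(g' = \exp\), so that \(f_Y(y) = f_X(\log y) / y\). The sketch below applies this by hand with a plain Gaussian KDE; the function names are hypothetical and the code does not use TransformKDE itself.

```python
import math


def gaussian_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Plain unweighted Gaussian KDE evaluated at a single point x."""
    return sum(gaussian_kernel((x_i - x) / h) for x_i in data) / (h * len(data))


def transformed_density(y, samples, h, trans=math.log, inv=math.exp, Dinv=math.exp):
    """Density of Y at y, estimated in the space x = trans(y).

    trans is the direct function g^-1, inv is g, and Dinv is the derivative g'.
    inv is not needed for a pointwise evaluation; it is kept to mirror the
    three functions listed above.
    """
    x = trans(y)
    transformed = [trans(s) for s in samples]
    # f_Y(y) = |1 / g'(g^-1(y))| * f_X(g^-1(y))
    return kde(x, transformed, h) / abs(Dinv(x))


samples = [0.5, 1.0, 2.0, 4.0]
h = 0.4  # bandwidth expressed in log space

# Midpoint rule on (0, 100]: the estimate has no mass below zero by construction.
step = 0.002
mass = step * sum(
    transformed_density((j + 0.5) * step, samples, h) for j in range(50000)
)
```

The Jacobian factor \(1/y\) is what keeps the total mass at 1 after the change of variable; dropping it would give a curve that no longer integrates to 1 in the original space.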

Parameters:
  • kde – KDE evaluation object
  • trans – Either a simple function, or a function object with attributes inv and Dinv to use in case they are not provided as arguments.
  • inv – Inverse of the function. If not provided, trans must have it as an attribute.
  • Dinv – Derivative of the inverse function.

Any unknown member is forwarded to the underlying KDE object.

__call__(points, output=None)[source]

Evaluate the KDE on a set of points

copy()[source]

Creates a shallow copy of the TransformKDE object

evaluate(points, output=None)[source]

Evaluate the KDE on a set of points

grid(N=None)[source]

Evaluate the KDE on a grid of N points.

The grid is regular in the transformed domain, so as to use FFT or DCT methods when applicable.

Bandwidth Estimation Methods

pyqt_fit.kde.variance_bandwidth(factor, xdata)

Returns the covariance matrix:

\[\mathcal{C} = \tau^2 cov(X)\]

where \(\tau\) is a correcting factor that depends on the method.

pyqt_fit.kde.silverman_covariance(xdata, ydata=None, model=None)

The Silverman bandwidth is defined as a variance bandwidth with the following factor, for \(n\) data points in dimension \(d\):

\[\tau = \left( n \frac{d+2}{4} \right)^\frac{-1}{d+4}\]

pyqt_fit.kde.scotts_covariance(xdata, ydata=None, model=None)

Scott's bandwidth is defined as a variance bandwidth with factor:

\[\tau = n^\frac{-1}{d+4}\]
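
The Silverman and Scott factors can be compared on a small 1D dataset with the sketch below. All function names are hypothetical and written for illustration; in particular, the use of the unbiased sample variance (division by \(n-1\)) is an assumption, since the page does not say which estimator cov(X) refers to.

```python
import math


def sample_variance(xdata):
    """Unbiased sample variance (assumed convention, division by n - 1)."""
    n = len(xdata)
    mean = sum(xdata) / n
    return sum((x - mean) ** 2 for x in xdata) / (n - 1)


def variance_bandwidth_sketch(factor, xdata):
    """1D version of C = tau^2 cov(X)."""
    return factor ** 2 * sample_variance(xdata)


def silverman_factor(n, d=1):
    """tau = (n (d + 2) / 4) ** (-1 / (d + 4))"""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))


def scott_factor(n, d=1):
    """tau = n ** (-1 / (d + 4))"""
    return n ** (-1.0 / (d + 4))


xdata = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(xdata)

cov_silverman = variance_bandwidth_sketch(silverman_factor(n), xdata)
cov_scott = variance_bandwidth_sketch(scott_factor(n), xdata)

# In 1D the covariance is a variance, so a bandwidth follows as its square root.
h_silverman = math.sqrt(cov_silverman)
h_scott = math.sqrt(cov_scott)
```

In one dimension the Silverman factor is always slightly larger than the Scott factor, because \((d+2)/4 = 3/4 < 1\); both shrink like \(n^{-1/5}\) as the sample grows.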

pyqt_fit.kde.botev_bandwidth(N=None, **kword)

Implementation of the KDE bandwidth selection method outlined in:

Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.

Based on the implementation of Daniel B. Smith, PhD.

The object is a callable returning the bandwidth for a 1D kernel.
