qp Demo

Alex Malz, Phil Marshall, Eric Charles
In this notebook we use the qp module to approximate some simple, standard, 1D PDFs using sets of quantiles, samples, and histograms, and assess their relative accuracy.
We also show how such analyses can be extended to use "composite" PDFs made up of mixtures of standard distributions.
import numpy as np
import os
import scipy.stats as sps
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
%matplotlib inline
import qp
scipy.stats module

The scipy.stats module is the standard for manipulating distributions, so it is a natural place to start for implementing 1D PDF parameterizations.
It allows you to define a wide variety of distributions and uses numpy array broadcasting for efficiency.
Here are some examples of things you can do with the scipy.stats module, using a Gaussian or Normal distribution. loc and scale are the means and standard deviations of the underlying Gaussians.
Note the distinction between passing arguments to norm and passing arguments to pdf to access multiple distributions and their PDF values at multiple points.
# evaluate a single distribution's PDF at one value
print("PDF at one point for one distribution:",
sps.norm(loc=0, scale=1).pdf(0.5))
# evaluate a single distribution's PDF at multiple values
print("PDF at three points for one distribution:",
sps.norm(loc=0, scale=1).pdf([0.5, 1., 1.5]))
# evaluate three distributions' PDFs at one shared value
print("PDF at one point for three distributions:",
sps.norm(loc=[0., 1., 2.], scale=1).pdf(0.5))
# evaluate three distributions' PDFs each at one different value
print("PDF at one different point for three distributions:",
sps.norm(loc=[0., 1., 2.], scale=1).pdf([0.5, 1., 1.5]))
# evaluate three distributions' PDFs each at four different values
# (note the change in shape of the argument)
print("PDF at four different points for three distributions:\n",
sps.norm(loc=[0., 1., 2.], scale=1).pdf([[0.5],[1.],[1.5],[2]]))
# evaluate three distributions' PDFs at each of four different values
# (note the change in shape of the argument)
print("PDF at four different points for three distributions: broadcast reversed\n",
sps.norm(loc=[[0.], [1.], [2.]], scale=1).pdf([0.5,1.,1.5,2]))
PDF at one point for one distribution: 0.3520653267642995
PDF at three points for one distribution: [0.35206533 0.24197072 0.1295176 ]
PDF at one point for three distributions: [0.35206533 0.35206533 0.1295176 ]
PDF at one different point for three distributions: [0.35206533 0.39894228 0.35206533]
PDF at four different points for three distributions:
 [[0.35206533 0.35206533 0.1295176 ]
 [0.24197072 0.39894228 0.24197072]
 [0.1295176 0.35206533 0.35206533]
 [0.05399097 0.24197072 0.39894228]]
PDF at four different points for three distributions: broadcast reversed
 [[0.35206533 0.24197072 0.1295176 0.05399097]
 [0.35206533 0.39894228 0.35206533 0.24197072]
 [0.1295176 0.24197072 0.35206533 0.39894228]]
scipy.stats classes

In the scipy.stats module, all of the distributions are sub-classes of scipy.stats.rv_continuous.
You make an object of a particular sub-type, and then 'freeze' it by passing it shape parameters.
print("This is the generic normal distribution class: ",
sps._continuous_distns.norm_gen)
ng = sps._continuous_distns.norm_gen()
print("This is an instance of the generic normal distribution class",
ng)
norm_sp = ng(loc=0, scale=1)
print("This is a frozen normal distribution, with specific paramters",
norm_sp, norm_sp.kwds)
print("The frozen object know what generic distribution it comes from",
norm_sp.dist)
This is the generic normal distribution class:  <class 'scipy.stats._continuous_distns.norm_gen'>
This is an instance of the generic normal distribution class <scipy.stats._continuous_distns.norm_gen object at 0x7f60c5e59fa0>
This is a frozen normal distribution, with specific parameters <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f607083f130> {'loc': 0, 'scale': 1}
The frozen object knows what generic distribution it comes from <scipy.stats._continuous_distns.norm_gen object at 0x7f607083f610>
scipy.stats lets you evaluate multiple properties of distributions. These include:
print("PDF = ", norm_sp.pdf(0.5))
print("CDF = ", norm_sp.cdf(0.5))
print("PPF = ", norm_sp.ppf(0.6))
print("SF = ", norm_sp.sf(0.6))
print("ISF = ", norm_sp.isf(0.5))
print("RVS = ", norm_sp.rvs())
print("stats = ", norm_sp.stats())
print("M2 = ", norm_sp.moment(2))
PDF = 0.3520653267642995
CDF = 0.6914624612740131
PPF = 0.2533471031357997
SF = 0.2742531177500736
ISF = 0.0
RVS = -0.1331918386344051
stats = (array(0.), array(1.))
M2 = 1.0
qp parameterizations and visualization functionality

The next part of this notebook shows how we can extend the functionality of scipy.stats to implement distributions that are based on parameterizations of 1D PDFs, like histograms, interpolations, splines, or mixture models.

scipy.stats

qp automatically generates classes for all of the scipy.stats.rv_continuous distributions, providing feed-through access to all scipy.stats.rv_continuous objects while adding additional attributes and methods specific to parameterization conversions.
qp.stats.keys()
odict_keys(['alpha', 'anglit', 'arcsine', 'argus', 'beta', 'betaprime', 'bradford', 'burr', 'burr12', 'cauchy', 'chi', 'chi2', 'cosine', 'crystalball', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponnorm', 'exponpow', 'exponweib', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'gamma', 'gausshyper', 'genexpon', 'genextreme', 'gengamma', 'genhalflogistic', 'genhyperbolic', 'geninvgauss', 'genlogistic', 'gennorm', 'genpareto', 'gilbrat', 'gompertz', 'gumbel_l', 'gumbel_r', 'halfcauchy', 'halfgennorm', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'kappa3', 'kappa4', 'ksone', 'kstwo', 'kstwobign', 'laplace', 'laplace_asymmetric', 'levy', 'levy_l', 'levy_stable', 'loggamma', 'logistic', 'loglaplace', 'lognorm', 'loguniform', 'lomax', 'maxwell', 'mielke', 'moyal', 'nakagami', 'ncf', 'nct', 'ncx2', 'norm', 'norminvgauss', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rayleigh', 'rdist', 'recipinvgauss', 'reciprocal', 'rice', 'semicircular', 'skewcauchy', 'skewnorm', 'studentized_range', 't', 'trapezoid', 'trapz', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'vonmises_line', 'wald', 'weibull_max', 'weibull_min', 'wrapcauchy', 'spline', 'hist', 'interp', 'interp_irregular', 'quant', 'quant_piecewise', 'mixmod', 'sparse'])
help(qp.stats.lognorm_gen)
Help on class lognorm in module qp.factory: class lognorm(qp.pdf_gen.Pdf_gen_wrap, scipy.stats._continuous_distns.lognorm_gen) | lognorm(*args, **kwargs) | | Mixin class to extend `scipy.stats.rv_continuous` with | information needed for `qp` for analytic distributions. | | Method resolution order: | lognorm | qp.pdf_gen.Pdf_gen_wrap | qp.pdf_gen.Pdf_gen | scipy.stats._continuous_distns.lognorm_gen | scipy.stats._distn_infrastructure.rv_continuous | scipy.stats._distn_infrastructure.rv_generic | builtins.object | | Methods defined here: | | freeze = _my_freeze(self, *args, **kwds) | | moment = _moment_fix(self, n, *args, **kwds) | | ---------------------------------------------------------------------- | Data and other attributes defined here: | | name = 'lognorm' | | version = 0 | | ---------------------------------------------------------------------- | Methods inherited from qp.pdf_gen.Pdf_gen_wrap: | | __init__(self, *args, **kwargs) | C'tor | | ---------------------------------------------------------------------- | Class methods inherited from qp.pdf_gen.Pdf_gen_wrap: | | add_mappings() from builtins.type | Add this classes mappings to the conversion dictionary | | get_allocation_kwds(npdf, **kwargs) from builtins.type | Return kwds necessary to create 'empty' hdf5 file with npdf entries | for iterative writeout | | ---------------------------------------------------------------------- | Class methods inherited from qp.pdf_gen.Pdf_gen: | | add_method_dicts() from builtins.type | Add empty method dicts | | create(**kwds) from builtins.type | Create and return a `scipy.stats.rv_frozen` object using the | keyword arguemntets provided | | create_gen(**kwds) from builtins.type | Create and return a `scipy.stats.rv_continuous` object using the | keyword arguemntets provided | | creation_method(method=None) from builtins.type | Return the method used to create a PDF of this type | | extraction_method(method=None) from builtins.type | Return the method used to extract data to create a PDF of this type | | plot(pdf, **kwargs) from builtins.type | Plot the pdf as a curve | | plot_native(pdf, **kwargs) from builtins.type | Plot the PDF in a way that is particular to this type of distibution | | This defaults to plotting it as a curve, but this can be overwritten | | print_method_maps(stream=<ipykernel.iostream.OutStream object at 0x7f617093ab80>) from builtins.type | Print the maps showing the methods | | reader_method(version=None) from builtins.type | Return the method used to convert data read from a file PDF of this type | | ---------------------------------------------------------------------- | Readonly properties inherited from qp.pdf_gen.Pdf_gen: | | metadata | Return the metadata for this set of PDFs | | objdata | Return the object data for this set of PDFs | | ---------------------------------------------------------------------- | Data descriptors inherited from qp.pdf_gen.Pdf_gen: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) | | ---------------------------------------------------------------------- | Methods inherited from scipy.stats._continuous_distns.lognorm_gen: | | fit = wrapper(self, *args, **kwds) | # if fit method is overridden only for MLE and doesn't specify what to do | # if method == 'mm', this decorator calls generic implementation | | ---------------------------------------------------------------------- | Methods inherited from scipy.stats._distn_infrastructure.rv_continuous: | | 
__getstate__(self) | | cdf(self, x, *args, **kwds) | Cumulative distribution function of the given RV. | | Parameters | ---------- | x : array_like | quantiles | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | cdf : ndarray | Cumulative distribution function evaluated at `x` | | expect(self, func=None, args=(), loc=0, scale=1, lb=None, ub=None, conditional=False, **kwds) | Calculate expected value of a function with respect to the | distribution by numerical integration. | | The expected value of a function ``f(x)`` with respect to a | distribution ``dist`` is defined as:: | | ub | E[f(x)] = Integral(f(x) * dist.pdf(x)), | lb | | where ``ub`` and ``lb`` are arguments and ``x`` has the ``dist.pdf(x)`` | distribution. If the bounds ``lb`` and ``ub`` correspond to the | support of the distribution, e.g. ``[-inf, inf]`` in the default | case, then the integral is the unrestricted expectation of ``f(x)``. | Also, the function ``f(x)`` may be defined such that ``f(x)`` is ``0`` | outside a finite interval in which case the expectation is | calculated within the finite range ``[lb, ub]``. | | Parameters | ---------- | func : callable, optional | Function for which integral is calculated. Takes only one argument. | The default is the identity mapping f(x) = x. | args : tuple, optional | Shape parameters of the distribution. | loc : float, optional | Location parameter (default=0). | scale : float, optional | Scale parameter (default=1). | lb, ub : scalar, optional | Lower and upper bound for integration. Default is set to the | support of the distribution. | conditional : bool, optional | If True, the integral is corrected by the conditional probability | of the integration interval. The return value is the expectation | of the function, conditional on being in the given interval. | Default is False. | | Additional keyword arguments are passed to the integration routine. | | Returns | ------- | expect : float | The calculated expected value. | | Notes | ----- | The integration behavior of this function is inherited from | `scipy.integrate.quad`. Neither this function nor | `scipy.integrate.quad` can verify whether the integral exists or is | finite. For example ``cauchy(0).mean()`` returns ``np.nan`` and | ``cauchy(0).expect()`` returns ``0.0``. | | The function is not vectorized. | | Examples | -------- | | To understand the effect of the bounds of integration consider | | >>> from scipy.stats import expon | >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0) | 0.6321205588285578 | | This is close to | | >>> expon(1).cdf(2.0) - expon(1).cdf(0.0) | 0.6321205588285577 | | If ``conditional=True`` | | >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0, conditional=True) | 1.0000000000000002 | | The slight deviation from 1 is due to numerical integration. | | fit_loc_scale(self, data, *args) | Estimate loc and scale parameters from data using 1st and 2nd moments. | | Parameters | ---------- | data : array_like | Data to fit. | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information). | | Returns | ------- | Lhat : float | Estimated location parameter for the data. | Shat : float | Estimated scale parameter for the data. 
| | isf(self, q, *args, **kwds) | Inverse survival function (inverse of `sf`) at q of the given RV. | | Parameters | ---------- | q : array_like | upper tail probability | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | x : ndarray or scalar | Quantile corresponding to the upper tail probability q. | | logcdf(self, x, *args, **kwds) | Log of the cumulative distribution function at x of the given RV. | | Parameters | ---------- | x : array_like | quantiles | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | logcdf : array_like | Log of the cumulative distribution function evaluated at x | | logpdf(self, x, *args, **kwds) | Log of the probability density function at x of the given RV. | | This uses a more numerically accurate calculation if available. | | Parameters | ---------- | x : array_like | quantiles | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | logpdf : array_like | Log of the probability density function evaluated at x | | logsf(self, x, *args, **kwds) | Log of the survival function of the given RV. | | Returns the log of the "survival function," defined as (1 - `cdf`), | evaluated at `x`. | | Parameters | ---------- | x : array_like | quantiles | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | logsf : ndarray | Log of the survival function evaluated at `x`. | | nnlf(self, theta, x) | Negative loglikelihood function. | | Notes | ----- | This is ``-sum(log pdf(x, theta), axis=0)`` where `theta` are the | parameters (including loc and scale). | | pdf(self, x, *args, **kwds) | Probability density function at x of the given RV. | | Parameters | ---------- | x : array_like | quantiles | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | pdf : ndarray | Probability density function evaluated at x | | ppf(self, q, *args, **kwds) | Percent point function (inverse of `cdf`) at q of the given RV. | | Parameters | ---------- | q : array_like | lower tail probability | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | x : array_like | quantile corresponding to the lower tail probability q. | | sf(self, x, *args, **kwds) | Survival function (1 - `cdf`) at x of the given RV. 
| | Parameters | ---------- | x : array_like | quantiles | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | sf : array_like | Survival function evaluated at x | | ---------------------------------------------------------------------- | Methods inherited from scipy.stats._distn_infrastructure.rv_generic: | | __call__(self, *args, **kwds) | Freeze the distribution for the given arguments. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution. Should include all | the non-optional arguments, may include ``loc`` and ``scale``. | | Returns | ------- | rv_frozen : rv_frozen instance | The frozen distribution. | | __setstate__(self, state) | | entropy(self, *args, **kwds) | Differential entropy of the RV. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information). | loc : array_like, optional | Location parameter (default=0). | scale : array_like, optional (continuous distributions only). | Scale parameter (default=1). | | Notes | ----- | Entropy is defined base `e`: | | >>> drv = rv_discrete(values=((0, 1), (0.5, 0.5))) | >>> np.allclose(drv.entropy(), np.log(2.0)) | True | | interval(self, alpha, *args, **kwds) | Confidence interval with equal areas around the median. | | Parameters | ---------- | alpha : array_like of float | Probability that an rv will be drawn from the returned range. | Each value should be in the range [0, 1]. | arg1, arg2, ... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information). | loc : array_like, optional | location parameter, Default is 0. | scale : array_like, optional | scale parameter, Default is 1. | | Returns | ------- | a, b : ndarray of float | end-points of range that contain ``100 * alpha %`` of the rv's | possible values. | | mean(self, *args, **kwds) | Mean of the distribution. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | mean : float | the mean of the distribution | | median(self, *args, **kwds) | Median of the distribution. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | Location parameter, Default is 0. | scale : array_like, optional | Scale parameter, Default is 1. | | Returns | ------- | median : float | The median of the distribution. | | See Also | -------- | rv_discrete.ppf | Inverse of the CDF | | rvs(self, *args, **kwds) | Random variates of given type. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information). | loc : array_like, optional | Location parameter (default=0). | scale : array_like, optional | Scale parameter (default=1). | size : int or tuple of ints, optional | Defining number of random variates (default is 1). 
| random_state : {None, int, `numpy.random.Generator`, | `numpy.random.RandomState`}, optional | | If `seed` is None (or `np.random`), the `numpy.random.RandomState` | singleton is used. | If `seed` is an int, a new ``RandomState`` instance is used, | seeded with `seed`. | If `seed` is already a ``Generator`` or ``RandomState`` instance | then that instance is used. | | Returns | ------- | rvs : ndarray or scalar | Random variates of given `size`. | | stats(self, *args, **kwds) | Some statistics of the given RV. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional (continuous RVs only) | scale parameter (default=1) | moments : str, optional | composed of letters ['mvsk'] defining which moments to compute: | 'm' = mean, | 'v' = variance, | 's' = (Fisher's) skew, | 'k' = (Fisher's) kurtosis. | (default is 'mv') | | Returns | ------- | stats : sequence | of requested moments. | | std(self, *args, **kwds) | Standard deviation of the distribution. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | std : float | standard deviation of the distribution | | support(self, *args, **kwargs) | Support of the distribution. | | Parameters | ---------- | arg1, arg2, ... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information). | loc : array_like, optional | location parameter, Default is 0. | scale : array_like, optional | scale parameter, Default is 1. | | Returns | ------- | a, b : array_like | end-points of the distribution's support. | | var(self, *args, **kwds) | Variance of the distribution. | | Parameters | ---------- | arg1, arg2, arg3,... : array_like | The shape parameter(s) for the distribution (see docstring of the | instance object for more information) | loc : array_like, optional | location parameter (default=0) | scale : array_like, optional | scale parameter (default=1) | | Returns | ------- | var : float | the variance of the distribution | | ---------------------------------------------------------------------- | Data descriptors inherited from scipy.stats._distn_infrastructure.rv_generic: | | random_state | Get or set the generator object for generating random variates. | | If `seed` is None (or `np.random`), the `numpy.random.RandomState` | singleton is used. | If `seed` is an int, a new ``RandomState`` instance is used, | seeded with `seed`. | If `seed` is already a ``Generator`` or ``RandomState`` instance then | that instance is used.
help(qp.stats.lognorm)
Help on method create in module qp.pdf_gen: create(**kwds) method of builtins.type instance Create and return a `scipy.stats.rv_frozen` object using the keyword arguemntets provided
If you have a single distribution you can plot it; the qp.plotting.plot_native function will find a nice way to represent the data used to construct the distribution.
loc1 = np.array([[0]])
scale1 = np.array([[1]])
norm_dist1 = qp.stats.norm(loc=loc1, scale=scale1)
fig, axes = qp.plotting.plot_native(norm_dist1, xlim=(-5., 5.))
# fig, axes = qp.stats.norm.plot_native(norm_dist1, xlim=(-5., 5.))
qp histogram (piecewise constant) parameterization

This represents a set of distributions made by interpolating a set of histograms with shared binning. To construct this you need to give the bin edges (shape=(N)) and the bin values (shape=(npdf, N-1)).
Note that the native visual representation is different from that of the Normal distribution.
# Convert to a histogram: compute the bin values by differencing the CDF at the bin edges
xvals = np.linspace(-5, 5, 11)
cdf = norm_dist1.cdf(xvals)
bin_vals = cdf[:,1:] - cdf[:,0:-1]
# Construct histogram PDF using the bin edges and the bin values
hist_dist = qp.hist(bins=xvals, pdfs=bin_vals)
yvals = hist_dist.pdf(xvals)
# Construct a single PDF for plotting
hist_dist1 = qp.hist(bins=xvals, pdfs=np.atleast_2d(bin_vals[0]))
fig, axes = qp.plotting.plot_native(hist_dist1, xlim=(-5., 5.))
leg = fig.legend()
What if you want to evaluate a vector of input values, where each input value is different for each PDF? In that case the shape of the array of input values needs to be compatible with the implicit shape of the PDFs; here we use an input of shape (2, 1).
xvals_x = np.array([[-1.], [1.]])
yvals_x = hist_dist.pdf(xvals_x)
print ("For an input vector of shape %s the output shape is %s" % (xvals_x.shape, yvals_x.shape))
For an input vector of shape (2, 1) the output shape is (2, 1)
qp quantile parameterization

This represents a set of distributions made by interpolating the locations at which various distributions reach a given set of quantiles. To construct this you need to give the quantile values (shape=(N)) and the location values (shape=(npdf, N)).
Note that the native visual representation is different.
# Define the quantile values to compute the locations for
quants = np.linspace(0.01, 0.99, 7)
# Compute the corresponding locations
locs = norm_dist1.ppf(quants)
# Construct the distribution using the quantile values and locations
quant_dist = qp.quant_piecewise(quants=quants, locs=locs)
quant_vals = quant_dist.pdf(xvals)
print("The input and output shapes are:", xvals.shape, quant_vals.shape)
# Construct a single PDF for plotting
quant_dist1 = qp.quant_piecewise(quants=np.atleast_1d(quants), locs=np.atleast_2d(locs[0]))
fig, axes = qp.plotting.plot_native(quant_dist1, xlim=(-5., 5.), label="quantiles")
leg = fig.legend()
The input and output shapes are: (11,) (1, 11)
print(quants)
print(quant_dist.dist.quants)
[0.01 0.17333333 0.33666667 0.5 0.66333333 0.82666667 0.99 ]
[0. 0.01 0.17333333 0.33666667 0.5 0.66333333 0.82666667 0.99 1. ]
qp interpolated parameterization

This represents a set of distributions made by interpolating a set of x and y values. To construct this you need to give the x and y values (both of shape=(npdf, N)).
Note that the native visual representation is pretty similar to the original one for the Gaussian.
# Define the x-grid locations
xvals = np.linspace(-5, 5, 11)
# Compute the corresponding y values
yvals = norm_dist1.pdf(xvals)
# Construct the PDFs using the x grid and y values
interp_dist = qp.interp(xvals=xvals, yvals=yvals)
interp_vals = interp_dist.pdf(xvals)
print("The input and output shapes are:", xvals.shape, interp_vals.shape)
# Construct a single PDF for plotting
interp_dist1 = qp.interp(xvals=xvals, yvals=np.atleast_2d(yvals[0]))
fig, axes = qp.plotting.plot_native(interp_dist1, xlim=(-5., 5.), label="interpolated")
leg = fig.legend()
The input and output shapes are: (11,) (1, 11)
qp spline parameterization constructed from a kernel density estimate (samples)

This represents a set of distributions made by producing a kernel density estimate from a set of samples.
To construct this you need to give the samples (shape=(npdf, Nsamples)).
Note again that the native visual representation is different.
# Draw 2 sets of 1000 random samples from the distribution
samples = norm_dist1.rvs(size=(2, 1000))
# Define points at which to evaluate the kernel density estimate (KDE)
xvals_kde = np.linspace(-5., 5., 51)
# Use a utility function to construct the KDE, sample it, and then construct a spline
kde_dist = qp.spline_from_samples(xvals=xvals_kde, samples=samples)
kde_vals = kde_dist.pdf(xvals_kde)
print("The input and output shapes are:", xvals.shape, kde_vals.shape)
# Construct a single PDF for plotting
kde_dist1 = qp.spline_from_samples(xvals=xvals_kde, samples=np.atleast_2d(samples[0]))
fig, axes = qp.plotting.plot_native(kde_dist1, xlim=(-5., 5.), label="kde")
leg = fig.legend()
The input and output shapes are: (51,) (2, 51)
qp spline parameterization

This represents a set of distributions made by building a set of splines. Though the parameterization is defined by the spline knots, you can construct it from x and y values (both of shape=(npdf, N)).
Note that the native visual representation is pretty similar to the original one for the Gaussian.
Note also that the spline knots are stored.
# To make a spline you need the spline knots; you can get those from the x and y values
splx, sply, spln = qp.spline_gen.build_normed_splines(np.expand_dims(xvals,0), yvals)
spline_dist_orig = qp.spline(splx=splx, sply=sply, spln=spln)
# Or we can do these two steps together using one function
spline_dist = qp.spline_from_xy(xvals=np.expand_dims(xvals,0), yvals=yvals)
spline_vals = spline_dist.pdf(xvals)
print("The input and output shapes are:", xvals.shape, spline_vals.shape)
print("Spline knots", spline_dist.dist.splx, spline_dist.dist.sply, spline_dist.dist.spln)
# Construct a single PDF for plotting
spline_dist1 = qp.spline_from_xy(xvals=np.atleast_2d(xvals), yvals=np.atleast_2d(yvals))
print(spline_dist1.dist.splx.shape)
fig, axes = qp.plotting.plot_native(spline_dist1, xlim=(-5., 5.), label="spline")
leg = fig.legend()
The input and output shapes are: (11,) (1, 11)
Spline knots [[-5. -5. -5. -5. -3. -2. -1. 0. 1. 2. 3. 5. 5. 5. 5.]] [[ 1.48609459e-06 1.68880388e-03 -5.05824770e-03 2.11574348e-02 2.37684221e-01 4.79319775e-01 2.37684221e-01 2.11574348e-02 -5.05824770e-03 1.68880388e-03 1.48609459e-06 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]] [[3]]
(1, 15)
You can visually compare the representations by plotting them all on the same figure.
fig, axes = qp.plotting.plot_native(norm_dist1, xlim=(-5., 5.), label="norm")
qp.plotting.plot_native(hist_dist1, axes=axes)
qp.plotting.plot_native(quant_dist1, axes=axes)
qp.plotting.plot_native(interp_dist1, axes=axes, label="interp")
# qp.plotting.plot_native(kde_dist1, axes=axes)
# qp.plotting.plot_native(spline_dist1, axes=axes, label="spline")
leg = fig.legend()
qp.Ensemble Class

This is the basic element of qp - an object representing a set of probability density functions. This class is stored in the module ensemble.py.
To create a qp.Ensemble you need to specify the class used to represent the PDFs, and provide the data for the specific set of PDFs.
qp no longer distinguishes between distributions and ensembles thereof -- a single distribution is just a special case of an ensemble with only one member, which takes advantage of computational efficiencies in scipy.
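Schematically, construction takes the generator class plus a dictionary of per-PDF parameter arrays; here is a minimal sketch (the concrete 100-member example appears a few cells below):
# Minimal sketch of Ensemble construction: pass the generator class and a dict
# of per-PDF parameter arrays (here two Gaussians with different means)
ens_demo = qp.Ensemble(qp.stats.norm,
                       data=dict(loc=np.array([[0.], [1.]]),
                                 scale=np.array([[1.], [1.]])))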
The shape of the array returned by a call to the pdf function of a distribution depends on the shapes of the parameters and the evaluation points.
For distributions that take multiple input arrays, qp uses the convention that the rows are the individual distributions and the columns are the values of the parameters defining the distributions under a known parameterization.
# qp.stats.norm is a trivial extension of `scipy.stats.norm_gen` that also keeps track of the number of PDFs it represents.
loc = np.array([[0],[1]])
scale = np.array([[1],[1]])
norm_dist = qp.stats.norm(loc=loc, scale=scale)
xvals = np.linspace(-5, 5, 51)
yvals = norm_dist.pdf(xvals)
print("This object represents %i pdfs" % norm_dist.npdf)
print("The input and output shapes are:", xvals.shape, yvals.shape)
This object represents 2 pdfs
The input and output shapes are: (51,) (2, 51)
print ("For an input vector of shape %s the output shape is %s" % (xvals.shape, yvals.shape))
For an input vector of shape (51,) the output shape is (2, 51)
# In this case we return an array where the rows are the evaluation points and the columns are the different PDFs
vector_pdf = qp.stats.norm(loc=[0., 1., 2], scale=1.)
vector_pdf.pdf([[0.], [0.5]])
array([[0.39894228, 0.24197072, 0.05399097],
       [0.35206533, 0.35206533, 0.1295176 ]])
# This is the same, except we use `numpy.expand_dims` to shape the input array of evaluation points
vector_pdf = qp.stats.norm(loc=[0., 1., 2], scale=1.)
vector_pdf.pdf(np.expand_dims(np.array([0., 0.5]), -1))
array([[0.39894228, 0.24197072, 0.05399097],
       [0.35206533, 0.35206533, 0.1295176 ]])
# In this case we return an array where the rows are the PDFs and the columns are the evaluation points
vector_pdf = qp.stats.norm(loc=[[0.], [1.], [2]], scale=1.)
vector_pdf.pdf([0., 0.5])
array([[0.39894228, 0.35206533],
       [0.24197072, 0.35206533],
       [0.05399097, 0.1295176 ]])
# This is the same, except we use `numpy.expand_dims` to shape the input array of pdf parameters
vector_pdf = qp.stats.norm(loc=np.expand_dims([0., 1., 2], -1), scale=1.)
vector_pdf.pdf([0., 0.5])
array([[0.39894228, 0.35206533],
       [0.24197072, 0.35206533],
       [0.05399097, 0.1295176 ]])
Here we will create 100 Gaussians with means distributed between -1 and 1, and widths distributed between 0.9 and 1.1.
locs = 2* (np.random.uniform(size=(100,1))-0.5)
scales = 1 + 0.2*(np.random.uniform(size=(100,1))-0.5)
ens_n = qp.Ensemble(qp.stats.norm, data=dict(loc=locs, scale=scales))
All of the methods of the distributions (pdf, cdf, etc.) work the same way for an ensemble as for the underlying classes.
To isolate a single distribution in the ensemble, use the square brackets operator [].
vals_n = ens_n.pdf(xvals)
print("The shapes are: ", xvals.shape, vals_n.shape)
fig, axes = qp.plotting.plot_native(ens_n[15], xlim=(-5.,5.))
The shapes are: (51,) (100, 51)
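For instance, the cumulative distribution and percent point functions should broadcast over the ensemble in the same way; here is a minimal sketch, assuming the scipy.stats-style method names (cdf, ppf) carry over to the ensemble unchanged:
# Sketch: scipy.stats-style methods evaluated on the whole ensemble
# (assumes cdf and ppf broadcast over the members the same way pdf does)
cdf_vals = ens_n.cdf(xvals)   # expected shape (100, 51), one row per PDF
medians = ens_n.ppf(0.5)      # median location of each member
print("CDF grid shape:", cdf_vals.shape)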
The qp.Ensemble.convert_to function lets you convert ensembles to other representations. To do this you call it on the original ensemble, providing the class you want to convert to and some keyword arguments that specify how the conversion should be done. Here are some examples.
bins = np.linspace(-5, 5, 11)
quants = np.linspace(0.01, 0.99, 7)
print("Making hist")
ens_h = ens_n.convert_to(qp.hist_gen, bins=bins)
print("Making interp")
ens_i = ens_n.convert_to(qp.interp_gen, xvals=bins)
print("Making spline")
ens_s = ens_n.convert_to(qp.spline_gen, xvals=bins, method="xy")
#print("Making spline from samples")
#ens_s = ens_n.convert_to(qp.spline_gen, xvals=bins, samples=1000, method="samples")
print("Making quants")
ens_q = ens_n.convert_to(qp.quant_piecewise_gen, quants=quants)
print("Making mixmod")
ens_m = ens_n.convert_to(qp.mixmod_gen, samples=1000, ncomps=3)
#print("Making flexcode")
#ens_f = ens_n.convert_to(qp.flex_gen, grid=bins, basis_system='cosine')
Making hist
Making interp
Making spline
Making quants
Making mixmod
The qp.convert function works in more or less the same way, but with slightly different syntax: you can use the name of the class instead of the class object.
print("Making hist")
ens_h2 = qp.convert(ens_n, "hist", bins=bins)
print("Making interp")
ens_i2 = qp.convert(ens_n, "interp", xvals=bins)
print("Making spline")
ens_s2 = qp.convert(ens_n, "spline", xvals=bins, method="xy")
print("Making quants")
ens_q2 = qp.convert(ens_n, "quant", quants=quants)
print("Making mixmod")
ens_m2 = qp.convert(ens_n, "mixmod", samples=1000, ncomps=3)
Making hist
Making interp
Making spline
Making quants
Making mixmod
qp supports quantitative comparisons between different distributions, across parametrizations.
Let's visualize one PDF from the ensemble in its original and other representations. The solid black line shows the true PDF evaluated between the bounds. The green rug plot shows the locations of the 1000 samples we took. The vertical dotted blue lines show the percentiles we asked for, and the horizontal dotted red lines show the 10 equally spaced bins we asked for. Note that the quantiles refer to the probability distribution between the bounds, because we are not able to integrate numerically over an infinite range. Interpolations of each parametrization are given as dashed lines in their corresponding colors. Note that the interpolations of the quantile and histogram parametrizations are so close to each other that the difference is almost imperceptible!
fig, axes = qp.plotting.plot_native(ens_n[15], xlim=(-5.,5.))
qp.plotting.plot_native(ens_h[15], axes=axes)
qp.plotting.plot_native(ens_q[15], axes=axes, label='quantile')
qp.plotting.plot_native(ens_i[15], axes=axes, label='interp')
# qp.plotting.plot_native(ens_s[15], axes=axes, label='spline')
qp.plotting.plot_native(ens_m[15], axes=axes, label='mixmod')
#qp.qp_plot_native(ens_f[15], axes=axes, label='flex')
leg = fig.legend()
We can also evaluate the PDFs on an evenly spaced grid and cache those values with the gridded method.
grid = np.linspace(-3., 3., 100)
gridded = ens_n.pdf(grid)
cached_gridded = ens_n.gridded(grid)[1]
check = gridded - cached_gridded
print(check.min(), check.max())
0.0 0.0
symm_lims = np.array([-1., 1.])
all_lims = [symm_lims, 2.*symm_lims, 3.*symm_lims]
Next, let's compare the different parametrizations to the truth using the Kullback-Leibler divergence (KLD). The KLD is a measure of how close two probability distributions are to one another -- a smaller value indicates closer agreement. It is measured in units of information (bits or nats, depending on the base of the logarithm) and quantifies the information lost in going from the second distribution to the first. The KLD calculator here takes in a shared grid upon which to evaluate the true distribution and the interpolated approximation of that distribution, and returns the KLD of the approximation relative to the truth, which is not in general the same as the KLD of the truth relative to the approximation. Below, we'll calculate the KLD of the approximation relative to the truth over different ranges, showing that it increases as the range grows to include areas where the true and interpolated distributions diverge.
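For reference, the standard definition of the KLD of an approximation $\hat{P}$ relative to the truth $P$, restricted to an evaluation range $[a, b]$, is

$$ \mathrm{KLD}(P \,\|\, \hat{P}) = \int_{a}^{b} P(x) \, \log\frac{P(x)}{\hat{P}(x)} \, dx , $$

which makes the asymmetry explicit: exchanging $P$ and $\hat{P}$ generally changes the value. The logarithm base sets the units (base 2 gives bits, natural log gives nats); the exact convention used inside qp.metrics.calculate_kld is not shown here.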
# Compute the KLD of the spline approximation relative to the truth (one value per PDF in the ensemble)
klds = qp.metrics.calculate_kld(ens_n, ens_s, limits=symm_lims)
print(klds)
[0.00328146 0.00054128 0.00336019 0.00313608 0.00309157 0.00439593 0.00351473 0.00401957 0.00307561 0.00071542 0.00399572 0.00071199 0.00020325 0.00094967 0.00241168 0.00375948 0.00220818 0.0030863 0.00087861 0.00405013 0.00232632 0.00215408 0.00367854 0.00158009 0.00425325 0.00375375 0.00304016 0.00358496 0.0018729 0.00274755 0.00328138 0.00191869 0.00110029 0.00266608 0.00426591 0.0032709 0.00099753 0.00148295 0.004339 0.00128665 0.0034808 0.00202899 0.00317622 0.00062753 0.00030643 0.00220329 0.00193671 0.0026318 0.00346929 0.00288634 0.00097727 0.00379207 0.00123326 0.00201153 0.00345742 0.00339039 0.00340414 0.00334835 0.00324942 0.00234062 0.00291409 0.00192987 0.00339078 0.00449794 0.0016507 0.00100137 0.00282526 0.00433243 0.0037402 0.00143214 0.00249251 0.00273985 0.00383852 0.00032115 0.00387173 0.00048201 0.00359615 0.00135482 0.00343967 0.00069775 0.00178008 0.00276994 0.0044026 0.0042397 0.00361834 0.00195354 0.00372586 0.00249882 0.00290852 0.00372111 0.001671 0.001206 0.00316984 0.00365273 0.00424861 0.00378746 0.00405003 0.00343775 0.00375975 0.00236313]
# Loop over all the other ensemble types
ensembles = [ens_n, ens_h, ens_i, ens_s, ens_q, ens_m]
for ensemble in ensembles[1:]:
    D = []
    for lims in all_lims:
        klds = qp.metrics.calculate_kld(ens_n, ensemble, limits=lims)
        D.append("%.2e +- %.2e" % (klds.mean(), klds.std()))
    print(ensemble.gen_class.name + ' approximation: KLD over 1, 2, 3, sigma ranges = ' + str(D))
hist approximation: KLD over 1, 2, 3, sigma ranges = ['1.16e-02 +- 3.49e-03', '2.76e-02 +- 5.08e-03', '3.72e-02 +- 5.09e-03']
interp approximation: KLD over 1, 2, 3, sigma ranges = ['3.16e-02 +- 1.09e-02', '2.33e-02 +- 2.57e-03', '1.01e-02 +- 1.84e-03']
spline approximation: KLD over 1, 2, 3, sigma ranges = ['2.68e-03 +- 1.17e-03', '7.57e-04 +- 2.08e-03', '1.06e-03 +- 2.69e-04']
quant_piecewise approximation: KLD over 1, 2, 3, sigma ranges = ['4.72e-02 +- 1.21e-02', '1.60e-01 +- 7.84e-02', '3.76e-01 +- 7.41e-02']
mixmod approximation: KLD over 1, 2, 3, sigma ranges = ['4.96e-03 +- 1.52e-02', '6.73e-03 +- 7.95e-03', '4.07e-03 +- 3.79e-03']
The progression of KLD values should follow that of the root mean square error (RMSE), another measure of how close two functions are to one another. The RMSE also increases as the range grows to include areas where the true and interpolated distributions diverge. Unlike the KLD, the RMSE is symmetric: it measures a distance between the two distributions rather than the divergence of one from the other.
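For reference, the RMSE between the truth $P$ and an approximation $\hat{P}$ over a range $[a, b]$ is conventionally

$$ \mathrm{RMSE}(P, \hat{P}) = \sqrt{\frac{1}{b - a} \int_{a}^{b} \left[ P(x) - \hat{P}(x) \right]^{2} \, dx } , $$

which is manifestly symmetric under exchange of $P$ and $\hat{P}$. Whether qp.metrics.calculate_rmse evaluates this continuous form or a discrete mean over grid points is an implementation detail not spelled out here.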
for ensemble in ensembles[1:]:
    D = []
    for lims in all_lims:
        rmses = qp.metrics.calculate_rmse(ens_n, ensemble, limits=lims)
        D.append("%.2e +- %.2e" % (rmses.mean(), rmses.std()))
    print(ensemble.gen_class.name + ' approximation: RMSE over 1, 2, 3, sigma ranges = ' + str(D))
hist approximation: RMSE over 1, 2, 3, sigma ranges = ['4.98e-02 +- 6.10e-03', '5.02e-02 +- 5.34e-03', '4.36e-02 +- 3.81e-03']
interp approximation: RMSE over 1, 2, 3, sigma ranges = ['2.35e-02 +- 3.98e-03', '1.89e-02 +- 2.86e-03', '1.65e-02 +- 2.40e-03']
spline approximation: RMSE over 1, 2, 3, sigma ranges = ['4.09e-03 +- 1.76e-03', '3.40e-03 +- 1.26e-03', '2.88e-03 +- 1.02e-03']
quant_piecewise approximation: RMSE over 1, 2, 3, sigma ranges = ['4.57e-02 +- 5.57e-03', '5.28e-02 +- 3.77e-03', '5.02e-02 +- 2.62e-03']
mixmod approximation: RMSE over 1, 2, 3, sigma ranges = ['2.33e-02 +- 8.14e-03', '2.04e-02 +- 5.86e-03', '1.73e-02 +- 4.71e-03']
Both the KLD and RMSE metrics suggest that the quantile approximation does better in the high-density region than when the tails are included. We might expect the answer to the question of which approximation to use to depend on the application, and on whether the tails need to be captured or not.
You can store and retrieve ensembles from disk using the qp.Ensemble.write_to and qp.Ensemble.read_from methods.
These work in two steps: first they convert the Ensemble data to astropy.table objects, and then they write the tables. This means you can store the data in any format supported by astropy.
tabs = ens_n.build_tables()
print(tabs.keys())
print()
print("Meta Data")
print(tabs['meta'])
print()
print("Object Data")
print(tabs['data'])
dict_keys(['meta', 'data']) Meta Data {'pdf_name': ['norm'], 'pdf_version': [0]} Object Data {'loc': array([[ 0.21890181], [-0.93534652], [ 0.07903113], [ 0.28176773], [-0.24968255], [-0.26335094], [-0.42948521], [-0.4387297 ], [-0.57217189], [-0.92455534], [ 0.34858641], [-0.91406855], [-0.98414987], [-0.86060038], [-0.69508709], [-0.38516899], [ 0.75300729], [-0.27768773], [ 0.89066815], [ 0.07691531], [-0.60425626], [ 0.70749606], [ 0.12587212], [-0.82361134], [-0.11546958], [ 0.49798422], [-0.26760305], [ 0.20776273], [ 0.6291673 ], [-0.59647901], [-0.00848918], [-0.76643467], [-0.79806762], [-0.4022234 ], [-0.08351926], [-0.0671515 ], [-0.91445451], [ 0.71709338], [-0.17404626], [ 0.85493452], [-0.50612038], [ 0.58551973], [ 0.12310108], [ 0.89065211], [ 0.98553825], [-0.54523168], [-0.64930249], [-0.6904524 ], [ 0.10365389], [ 0.48165105], [-0.81445327], [ 0.50516354], [-0.84387341], [-0.7536708 ], [-0.02983834], [ 0.52625015], [ 0.0251668 ], [-0.53713988], [-0.19856034], [ 0.64554989], [ 0.5297811 ], [-0.69668053], [-0.29724201], [-0.07584431], [ 0.67563132], [-0.89806725], [-0.5085701 ], [-0.00230427], [ 0.24542697], [-0.85116672], [ 0.7220289 ], [ 0.4094468 ], [ 0.36862184], [ 0.95756083], [-0.34470474], [-0.92832706], [-0.33563417], [ 0.8647574 ], [ 0.36856708], [-0.9321885 ], [-0.80126137], [ 0.36804869], [ 0.06692571], [ 0.2286091 ], [ 0.29082508], [ 0.7435764 ], [ 0.12353467], [-0.70184026], [-0.5470406 ], [-0.1198184 ], [-0.80795649], [-0.8460751 ], [ 0.34980147], [ 0.08250539], [-0.18080125], [-0.02415185], [ 0.29852082], [ 0.16449945], [-0.12600999], [-0.71552244]]), 'scale': array([[1.05628552], [1.02699736], [1.06502927], [1.06092663], [1.07481621], [0.90373225], [0.96129187], [0.90162672], [0.94565748], [0.98378847], [0.93155093], [1.012214 ], [1.04801547], [1.03997191], [0.94610651], [0.94807385], [0.92175172], [1.06866137], [1.00618853], [0.97491666], [1.03004701], [0.97416184], [1.01915825], [0.94269106], [0.94427332], [0.90704054], [1.0774135 ], [1.01872588], [1.08736909], [0.97311218], [1.07810052], [0.95297913], [1.08745088], [1.0863323 ], [0.94525947], [1.07747709], [0.90734377], [1.08832187], [0.92557912], [0.95820683], [0.9323448 ], [1.09059329], [1.08510689], [1.08773728], [0.96730724], [1.08696857], [1.06120851], [0.92081073], [1.04875741], [1.02002314], [1.09783552], [0.90057407], [0.99066761], [0.95133835], [1.05478905], [0.93307612], [1.06184084], [0.93248613], [1.06442614], [0.99735603], [0.9907961 ], [1.02248052], [1.0229267 ], [0.91247745], [1.0917688 ], [0.94950303], [1.01468626], [0.93893304], [0.99063856], [0.93046741], [0.91248718], [1.07302298], [0.94408806], [1.06640794], [0.94751646], [1.06756101], [0.98475575], [0.92380544], [0.99388315], [0.9683748 ], [0.93527174], [1.08436163], [0.92703754], [0.92987075], [0.99505897], [0.97200232], [1.01325771], [0.9285457 ], [0.98174109], [1.01429324], [0.94676408], [0.99403992], [1.03596655], [1.0266698 ], [0.93673041], [1.01194906], [0.93848628], [1.04539621], [1.00855411], [0.93516073]])}
Here is a round-trip test showing that we get the same results before and after a write/read cycle.
suffix_list = ['_n', '_h', '_i', '_s', '_q', '_m']
filetypes = ['fits', 'hf5']
for ens, suffix in zip(ensembles, suffix_list):
    for ft in filetypes:
        outfile = "test%s.%s" % (suffix, ft)
        metafile = "test%s_meta.%s" % (suffix, ft)
        pdf_1 = ens.pdf(bins)
        ens.write_to(outfile)
        ens_r = qp.read(outfile)
        pdf_2 = ens_r.pdf(bins)
        check = pdf_1 - pdf_2
        print(suffix, ft, check.min(), check.max())
        os.unlink(outfile)
        try:
            os.unlink(metafile)
        except Exception:
            pass
_n fits 0.0 0.0
_n hf5 0.0 0.0
_h fits -1.1102230246251565e-16 5.551115123125783e-17
_h hf5 -1.1102230246251565e-16 5.551115123125783e-17
_i fits -1.1102230246251565e-16 5.551115123125783e-17
_i hf5 -1.1102230246251565e-16 5.551115123125783e-17
_s fits 0.0 0.0
_s hf5 0.0 0.0
_q fits 0.0 0.0
_q hf5 0.0 0.0
_m fits 0.0 0.0
_m hf5 0.0 0.0
Finally, we can compute the moments of each approximation and compare them to the moments of the true distribution.
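For reference, the $n$-th moment of a PDF $P$ restricted to limits $[a, b]$ is conventionally

$$ m_{n} = \int_{a}^{b} x^{n} \, P(x) \, dx , $$

so $m_{0}$ is the probability mass inside the limits (close to 1 here), $m_{1}$ is approximately the mean, and $m_{2}$ is approximately the mean square. Whether qp.metrics.calculate_moment renormalizes by the enclosed mass is not spelled out here; the values below are simply what the function returns.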
which_moments = range(3)
all_moments = []
for ens in ensembles:
    moments = []
    for n in which_moments:
        moms = qp.metrics.calculate_moment(ens, n, limits=(-3, 3))
        moments.append("%.2e +- %.2e" % (moms.mean(), moms.std()))
    all_moments.append(moments)
print('moments: '+str(which_moments))
for ens, mom in zip(ensembles, all_moments):
    print(ens.gen_class.name+': '+str(mom))
moments: range(0, 3)
norm: ['9.92e-01 +- 6.50e-03', '-1.02e-01 +- 5.32e-01', '1.23e+00 +- 2.43e-01']
hist: ['9.93e-01 +- 6.63e-03', '-1.03e-01 +- 5.31e-01', '1.39e+00 +- 2.36e-01']
interp: ['9.87e-01 +- 8.52e-03', '-9.91e-02 +- 5.18e-01', '1.34e+00 +- 2.16e-01']
spline: ['9.91e-01 +- 6.62e-03', '-1.02e-01 +- 5.33e-01', '1.23e+00 +- 2.48e-01']
quant_piecewise: ['9.90e-01 +- 1.63e-02', '-9.93e-02 +- 5.20e-01', '1.44e+00 +- 1.94e-01']
mixmod: ['9.93e-01 +- 6.73e-03', '-1.05e-01 +- 5.37e-01', '1.26e+00 +- 2.52e-01']