gvar.dataset - Random Data Sets

Introduction

gvar.dataset contains several tools for collecting and analyzing random samples from arbitrary distributions. The random samples are represented by lists of numbers or arrays, where each number/array is a new sample from the underlying distribution. For example, six samples from a one-dimensional gaussian distribution, 1±1, might look like

>>> random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]

while six samples from a two-dimensional distribution, [1±1, 2±1], might be

>>> random_arrays = [[ 0.494, 2.734], [ 0.172, 1.400], [ 1.571, 1.304],
...                  [ 1.532, 1.510], [ 0.669, 0.873], [ 1.242, 2.188]]

Samples from more complicated multidimensional distributions are represented by dictionaries whose values are lists of numbers or arrays: for example,

>>> random_dict = dict(n=random_numbers, a=random_arrays)

where list elements random_dict['n'][i] and random_dict['a'][i] are part of the same multidimensional sample for every i; that is, the lists for the different keys in the dictionary are synchronized with each other.

With large samples, we typically want to estimate the mean value of the underlying distribution. This is done using gvar.dataset.avg_data(): for example,

>>> print(avg_data(random_numbers))
1.31(45)

indicates that 1.31(45) is our best guess, based only upon the samples in random_numbers, for the mean of the distribution from which those samples were drawn. Similarly

>>> print(avg_data(random_arrays))
[0.95(22) 1.67(25)]

indicates that the means for the two-dimensional distribution behind random_arrays are [0.95(22), 1.67(25)]. avg_data() can also be applied to a dictionary whose values are lists of numbers/arrays: for example,

>>> print(avg_data(random_dict))
{'a': array([0.95(22), 1.67(25)], dtype=object),'n': 1.31(45)}

Class gvar.dataset.Dataset can be used to assemble dictionaries containing random samples. For example, imagine that the random samples above were originally written into a file, as they were generated:

# file: datafile
n 1.739
a [ 0.494, 2.734]
n 2.682
a [ 0.172, 1.400]
n 2.493
a [ 1.571, 1.304]
n -0.460
a [ 1.532, 1.510]
n 0.603
a [ 0.669, 0.873]
n 0.800
a [ 1.242, 2.188]

Here each line is a different random sample, either from the one-dimensional distribution (labeled n) or from the two-dimensional distribution (labeled a). Assuming the file is called datafile, this data can be read into a dictionary, essentially identical to the data dictionary above, using:

>>> data = Dataset("datafile")
>>> print(data['a'])
[array([ 0.494, 2.734]), array([ 0.172, 1.400]), array([ 1.571, 1.304]) ... ]
>>> print(avg_data(data['n']))
1.31(45)

The brackets and commas can be omitted in the input file for one-dimensional arrays: for example, datafile (above) could equivalently be written

# file: datafile
n 1.739
a 0.494 2.734
n 2.682
a 0.172 1.400
...

Other data formats are also easy to use. For example, a data file written using yaml would look like

# file: datafile
---
n: 1.739
a: [ 0.494, 2.734]
---
n: 2.682
a: [ 0.172, 1.400]
.
.
.

and could be read into a gvar.dataset.Dataset using:

import yaml

data = Dataset()
with open("datafile", "r") as dfile:
    for d in yaml.safe_load_all(dfile):     # iterate over yaml records
        data.append(d)                      # d is a dictionary

Finally, note that data can be binned, into bins of size binsize, using gvar.dataset.bin_data(). For example, gvar.dataset.bin_data(data, binsize=3) replaces every three samples in data by the average of those samples. This creates a dataset that is 1/3 the size of the original but has the same mean. Binning is useful for making large datasets more manageable, and also for removing sample-to-sample correlations. Over-binning, however, erases statistical information.
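
For example, a minimal sketch using the random_numbers samples above (the bin averages in the comments are rounded):

binned = bin_data(random_numbers, binsize=3)
# binned holds two samples, the averages of samples 0-2 and 3-5,
# roughly [2.305, 0.314]; avg_data(binned) has the same mean as
# avg_data(random_numbers), 1.31, but its error is estimated from
# only two (binned) samples.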

Class gvar.dataset.Dataset can also be used to build a dataset sample by sample in code: for example,

>>> data = Dataset()
>>> data.append(n=1.739, a=[ 0.494, 2.734])
>>> data.append(n=2.682, a=[ 0.172, 1.400])
...

creates the same dataset as above.

Functions

The functions defined in the module are:

gvar.dataset.avg_data(data, median=False, spread=False, bstrap=False, noerror=False)

Average random data to estimate mean.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays: for example,

>>> random_numbers = [1.60, 0.99, 1.28, 1.30, 0.54, 2.15]
>>> random_arrays = [[12.2,121.3],[13.4,149.2],[11.7,135.3],
...                  [7.2,64.6],[15.2,69.0],[8.3,108.3]]
>>> random_dict = dict(n=random_numbers,a=random_arrays)

where in each case there are six random numbers/arrays. avg_data estimates the means of the distributions from which the random numbers/arrays are drawn, together with the uncertainties in those estimates. The results are returned as a gvar.GVar or an array of gvar.GVars, or a dictionary of gvar.GVars and/or arrays of gvar.GVars:

>>> print(avg_data(random_numbers))
1.31(20)
>>> print(avg_data(random_arrays))
[11.3(1.1) 108(13)]
>>> print(avg_data(random_dict))
{'a': array([11.3(1.1), 108(13)], dtype=object),'n': 1.31(20)}

The arrays in random_arrays are one dimensional; in general, they can have any shape.

avg_data(data) also estimates any correlations between different quantities in data. When data is a dictionary, it does this by assuming that the lists of random numbers/arrays for the different data[k]s are synchronized, with the first element in one list corresponding to the first elements in all other lists, and so on. If some lists are shorter than others, the longer lists are truncated to the same length as the shortest list (discarding data samples).
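
These correlations can be examined with gvar.evalcorr from the main gvar module; the following is a sketch using random_dict from above (the numerical values depend on the data):

import gvar as gv

avg = avg_data(random_dict)                 # correlated GVars
corr = gv.evalcorr([avg['n'], avg['a'][0], avg['a'][1]])
# corr is a 3x3 correlation matrix; the off-diagonal entries are generally
# nonzero because the sample lists for 'n' and 'a' are synchronized.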

There are four optional arguments. If argument spread=True, each standard deviation in the results refers to the spread in the data, not the uncertainty in the estimate of the mean. The former is sqrt(N) larger, where N is the number of random numbers (or arrays) being averaged:

>>> print(avg_data(random_numbers,spread=True))
1.31(50)
>>> print(avg_data(random_numbers))
1.31(20)
>>> print((0.50 / 0.20) ** 2)   # should be (about) 6
6.25

This is useful, for example, when averaging bootstrap data. The default value is spread=False.

The second option is triggered by setting median=True. This replaces the means in the results by medians, while the standard deviations are approximated by the half-width of the interval, centered around the median, that contains 68% of the data. These estimates are more robust than the mean and standard deviation when averaging over small amounts of data; in particular, they are unaffected by extreme outliers in the data. The default is median=False.

The third option is triggered by setting bstrap=True. This is shorthand for setting median=True and spread=True, and overrides any explicit setting for these keyword arguments. This is the typical choice for analyzing bootstrap data — hence its name. The default value is bstrap=False.

The final option is to omit the error estimates on the averages, which is triggered by setting noerror=True. Just the mean values are returned. The default value is noerror=False.
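
A quick sketch of these options applied to random_numbers above (the results depend on the data):

avg_data(random_numbers, spread=True)     # error = spread of the samples
avg_data(random_numbers, median=True)     # median +- half-width of 68% interval
avg_data(random_numbers, bstrap=True)     # same as median=True, spread=True
avg_data(random_numbers, noerror=True)    # mean value only, no error estimate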

gvar.dataset.autocorr(data, ncorr=None)

Compute autocorrelation in random data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays.

When data is a list of random numbers, autocorr(data) returns an array where autocorr(data)[i] is the correlation between elements in data that are separated by distance i in the list: for example,

>>> print(autocorr([2,-2,2,-2,2,-2]))
[ 1. -1.  1. -1.  1. -1.]

shows perfect correlation between elements separated by an even interval in the list, and perfect anticorrelation between elements separated by an odd interval.

autocorr(data) returns a list of arrays of autocorrelation coefficients when data is a list of random arrays. Again autocorr(data)[i] gives the autocorrelations for data elements separated by distance i in the list. Similarly autocorr(data) returns a dictionary when data is a dictionary.

autocorr(data) uses FFTs to compute the autocorrelations; the cost of computing the autocorrelations should grow roughly linearly with the number of random samples in data (up to logarithms).
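
One common use, sketched here (and not part of the function's interface), is to check how quickly the autocorrelations fall off before choosing a bin size for gvar.dataset.bin_data():

import numpy as np

ac = autocorr(data)                 # data is a list of random numbers
# index of the first autocorrelation below 0.1 (np.argmax returns 0 if none is)
cutoff = int(np.argmax(np.asarray(ac) < 0.1))
binned = bin_data(data, binsize=max(cutoff, 1))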

gvar.dataset.bin_data(data, binsize=2)

Bin random data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bin_data(data,binsize) replaces consecutive groups of binsize numbers/arrays by the average of those numbers/arrays. The result is a new data list (or dictionary) with 1/binsize times as much random data: for example,

>>> print(bin_data([1,2,3,4,5,6,7],binsize=2))
[1.5, 3.5, 5.5]
>>> print(bin_data(dict(s=[1,2,3,4,5],v=[[1,2],[3,4],[5,6],[7,8]]),binsize=2))
{'s': [1.5, 3.5], 'v': [array([ 2.,  3.]), array([ 6.,  7.])]}

Data is dropped at the end if there is insufficient data to form complete bins. Binning is used to make calculations faster and to reduce measurement-to-measurement correlations, if they exist. Over-binning erases useful information.

gvar.dataset.bootstrap_iter(data, n=None)

Create iterator that returns bootstrap copies of data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bootstrap_iter(data,n) is an iterator that returns n bootstrap copies of data. The random numbers/arrays in a bootstrap copy are drawn at random (with repetition allowed) from among the samples in data: for example,

>>> data = [1.1, 2.3, 0.5, 1.9]
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
[ 1.1  1.1  0.5  1.9]
>>> print(next(data_iter))
[ 0.5  2.3  1.9  0.5]
    
>>> data = dict(a=[1,2,3,4],b=[1,2,3,4])
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
{'a': array([3, 3, 1, 2]), 'b': array([3, 3, 1, 2])}
>>> print(next(data_iter))
{'a': array([1, 3, 3, 2]), 'b': array([1, 3, 3, 2])}
    
>>> data = [[1,2],[3,4],[5,6],[7,8]]
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
[[ 7.  8.]
 [ 1.  2.]
 [ 1.  2.]
 [ 7.  8.]]
>>> print(next(data_iter))
[[ 3.  4.]
 [ 7.  8.]
 [ 3.  4.]
 [ 1.  2.]]

The distribution of bootstrap copies is an approximation to the distribution from which data was drawn. Consequently means, variances and correlations for bootstrap copies should be similar to those in data. Analyzing variations from bootstrap copy to copy is often useful when dealing with non-gaussian behavior or complicated correlations between different quantities.

Parameter n specifies the maximum number of copies; there is no maximum if n is None.
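
A typical pattern, sketched here assuming data is a list of random numbers as described above, computes a statistic for each bootstrap copy and then averages the copies with bstrap=True:

import numpy as np

bs_means = []
for bs_copy in bootstrap_iter(data, n=100):   # 100 bootstrap copies
    bs_means.append(np.mean(bs_copy))         # statistic computed per copy
print(avg_data(bs_means, bstrap=True))        # median +- spread over the copies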

Classes

gvar.dataset.Dataset is used to assemble random samples from multidimensional distributions:

class gvar.dataset.Dataset

Dictionary for collecting random data.

This dictionary class simplifies the collection of random data. The random data are stored in a dictionary, with each piece of random data being a number or an array of numbers. For example, consider a situation where there are four random values for a scalar s and four random values for a vector v. These can be collected as follows:

>>> data = Dataset()
>>> data.append(s=1.1,v=[12.2,20.6])
>>> data.append(s=0.8,v=[14.1,19.2])
>>> data.append(s=0.95,v=[10.3,19.7])
>>> data.append(s=0.91,v=[8.2,21.0])
>>> print(data['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]
>>> print(data['v'])       # 4 random vector-values of v
[array([ 12.2,  20.6]), array([ 14.1,  19.2]), array([ 10.3,  19.7]), array([  8.2,  21. ])]

The argument to data.append() could be a dictionary: for example, dd = dict(s=1.1,v=[12.2,20.6]); data.append(dd) is equivalent to the first append statement above. This is useful, for example, if the data comes from a function (that returns a dictionary).

One can also append data key-by-key: for example, data.append('s',1.1); data.append('v',[12.2,20.6]) is equivalent to the first append in the example above. One could also achieve this with, for example, data['s'].append(1.1); data['v'].append([12.2,20.6]), since each dictionary value is a list, but gvar.dataset.Dataset's append checks for consistency between the new data and data already collected and so is preferable.

Use extend in place of append to add data in batches: for example,

>>> data = Dataset()
>>> data.extend(s=[1.1,0.8],v=[[12.2,20.6],[14.1,19.2]])
>>> data.extend(s=[0.95,0.91],v=[[10.3,19.7],[8.2,21.0]])
>>> print(data['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]

gives the same dataset as the first example above.

A Dataset can also be created from a file where every line is a new random sample. The data in the first example above could have been stored in a file with the following content:

# file: datafile
s 1.1
v [12.2,20.6]
s 0.8
v [14.1,19.2]
s 0.95
v [10.3,19.7]
s 0.91
v [8.2,21.0]

Lines that begin with # are ignored. Assuming the file is called datafile, we create a dataset identical to that above using the code:

>>> data = Dataset('datafile')
>>> print(data['s'])
[ 1.1, 0.8, 0.95, 0.91]

Data can be binned as it is read in, which can be useful if the data set is huge. To bin the data contained in file datafile into bins of size 2 we use:

>>> data = Dataset('datafile',binsize=2)
>>> print(data['s'])
[0.95, 0.93]

Finally, the keys read from a data file can be restricted to those listed in keyword keys and/or to those matched (or partially matched) by regular expression grep, if one or the other of these is specified: for example,

>>> data = Dataset('datafile')
>>> print([k for k in data])
['s', 'v']
>>> data = Dataset('datafile',keys=['v'])
>>> print([k for k in data])
['v']
>>> data = Dataset('datafile',grep='[^v]')
>>> print([k for k in data])
['s']
>>> data = Dataset('datafile',keys=['v'],grep='[^v]')
>>> print([k for k in data])
[]

The main attributes and methods are:

samplesize

Smallest number of samples for any key.
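
A minimal sketch (with hypothetical data) showing how samplesize tracks the shortest list:

data = Dataset()
data.append(s=1.1, v=[12.2, 20.6])
data.append(s=0.8)                 # no 'v' sample in this record
print(data.samplesize)             # 1 -- only one sample collected for 'v'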

append(*args, **kargs)

Append data to dataset.

There are three equivalent ways of adding data to a dataset data: for example, each of

data.append(n=1.739,a=[0.494,2.734])        # method 1

data.append('n',1.739)                      # method 2
data.append('a',[0.494,2.734])

dd = dict(n=1.739,a=[0.494,2.734])          # method 3
data.append(dd)

adds one new random number (or array) to data['n'] (or data['a']).

extend(*args, **kargs)

Add batched data to dataset.

There are three equivalent ways of adding batched data, containing multiple samples for each quantity, to a dataset data: for example, each of

data.extend(n=[1.739,2.682],
            a=[[0.494,2.734],[ 0.172, 1.400]])  # method 1

data.extend('n',[1.739,2.682])                  # method 2
data.extend('a',[[0.494,2.734],[ 0.172, 1.400]])

dd = dict(n=[1.739,2.682],
            a=[[0.494,2.734],[ 0.172, 1.400]])  # method 3
data.extend(dd)

adds two new random numbers (or arrays) to data['n'] (or data['a']).

This method can be used to merge two datasets, whether or not they share keys: for example,

data = Dataset("file1")
data_extra = Dataset("file2")
data.extend(data_extra)   # data now contains all of data_extra

grep(rexp)

Create new dataset containing items whose keys match rexp.

Returns a new gvar.dataset.Dataset containing only the items self[k] whose keys k match regular expression rexp (a string) according to Python module re:

>>> a = Dataset()
>>> a.append(xx=1.,xy=[10.,100.])
>>> a.append(xx=2.,xy=[20.,200.])
>>> print(a.grep('y'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x|y'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('[^y][^x]'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}

Items are retained even if rexp matches only part of the item’s key.

slice(sl)

Create new dataset with self[k] -> self[k][sl].

Parameter sl is a slice object that is applied to every item in the dataset to produce a new gvar.Dataset. Setting sl = slice(0,None,2), for example, discards every other sample for each quantity in the dataset. Setting sl = slice(100,None) discards the first 100 samples for each quantity.
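
For example, a sketch assuming data is a Dataset with more than 100 samples per key:

burned_in = data.slice(slice(100, None))     # drop the first 100 samples
thinned = data.slice(slice(0, None, 2))      # keep every other sample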

arrayzip(template)

Merge lists of random data according to template.

template is an array of keys in the dataset, where the shapes of self[k] are the same for all keys k in template. self.arrayzip(template) merges the lists of random numbers/arrays associated with these keys to create a new list of (merged) random arrays whose layout is specified by template: for example,

>>> d = Dataset()
>>> d.append(a=1,b=10)  
>>> d.append(a=2,b=20)
>>> d.append(a=3,b=30)
>>> print(d)            # three random samples each for a and b
{'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]}
>>> # merge into list of 2-vectors:
>>> print(d.arrayzip(['a','b']))
[[  1.  10.]
 [  2.  20.]
 [  3.  30.]]
>>> # merge into list of (symmetric) 2x2 matrices: 
>>> print(d.arrayzip([['b','a'],['a','b']])) 
[[[ 10.   1.]
  [  1.  10.]]
  
 [[ 20.   2.]
  [  2.  20.]]
  
 [[ 30.   3.]
  [  3.  30.]]]

The number of samples in each merged result is the same as the number of samples for each key (here 3). The keys used in this example represent scalar quantities; in general, they could be either scalars or arrays (of any shape, so long as all have the same shape).

trim()

Create new dataset where all entries have same sample size.

toarray()

Copy self but with self[k] as numpy arrays.
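
A minimal sketch combining the two (assuming data is a Dataset like those above):

data = data.trim()        # every data[k] now has data.samplesize samples
data = data.toarray()     # each data[k] is copied into a numpy array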
