gvar.dataset - Random Data Sets¶
Introduction¶
gvar.dataset contains several tools for collecting and analyzing random samples from arbitrary distributions. The random samples are represented by lists of numbers or arrays, where each number/array is a new sample from the underlying distribution. For example, six samples from a one-dimensional gaussian distribution, 1±1, might look like
>>> random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]
while six samples from a two-dimensional distribution, [1±1, 2±1], might be
>>> random_arrays = [[ 0.494, 2.734], [ 0.172, 1.400], [ 1.571, 1.304],
... [ 1.532, 1.510], [ 0.669, 0.873], [ 1.242, 2.188]]
Samples from more complicated multidimensional distributions are represented by dictionaries whose values are lists of numbers or arrays: for example,
>>> random_dict = dict(n=random_numbers, a=random_arrays)
where list elements random_dict['n'][i] and random_dict['a'][i] are part of the same multidimensional sample for every i; that is, the lists for different keys in the dictionary are synchronized with each other.
With large samples, we typically want to estimate the mean value of the underlying distribution. This is done using gvar.dataset.avg_data(): for example,
>>> print(avg_data(random_numbers))
1.31(45)
indicates that 1.31(45) is our best guess, based only upon the samples in random_numbers, for the mean of the distribution from which those samples were drawn. Similarly
>>> print(avg_data(random_arrays))
[0.95(22) 1.67(25)]
indicates that the means for the two-dimensional distribution behind random_arrays are [0.95(22), 1.67(25)]. avg_data() can also be applied to a dictionary whose values are lists of numbers/arrays: for example,
>>> print(avg_data(random_dict))
{'a': array([0.95(22), 1.67(25)], dtype=object), 'n': 1.31(45)}
Class gvar.dataset.Dataset can be used to assemble dictionaries containing random samples. For example, imagine that the random samples above were originally written into a file, as they were generated:
# file: datafile
n 1.739
a [ 0.494, 2.734]
n 2.682
a [ 0.172, 1.400]
n 2.493
a [ 1.571, 1.304]
n -0.460
a [ 1.532, 1.510]
n 0.603
a [ 0.669, 0.873]
n 0.800
a [ 1.242, 2.188]
Here each line is a different random sample, either from the one-dimensional distribution (labeled n) or from the two-dimensional distribution (labeled a). Assuming the file is called datafile, this data can be read into a dictionary, essentially identical to the random_dict dictionary above, using:
>>> data = Dataset("datafile")
>>> print(data['a'])
[array([ 0.494, 2.734]), array([ 0.172, 1.400]), array([ 1.571, 1.304]) ... ]
>>> print(avg_data(data['n']))
1.31(45)
The brackets and commas can be omitted in the input file for one-dimensional arrays: for example, datafile (above) could equivalently be written
# file: datafile
n 1.739
a 0.494 2.734
n 2.682
a 0.172 1.400
...
Other data formats may also be easy to use. For example, a data file written using yaml would look like
# file: datafile
---
n: 1.739
a: [ 0.494, 2.734]
---
n: 2.682
a: [ 0.172, 1.400]
...
and could be read into a gvar.dataset.Dataset using:
import yaml
from gvar.dataset import Dataset

data = Dataset()
with open("datafile", "r") as dfile:
    for d in yaml.safe_load_all(dfile):  # iterate over yaml records
        data.append(d)                   # d is a dictionary
Finally note that data can be binned, into bins of size binsize, using gvar.dataset.bin_data(). For example, gvar.dataset.bin_data(data, binsize=3) replaces every three samples in data by the average of those samples. This creates a dataset that is 1/3 the size of the original but has the same mean. Binning is useful for making large datasets more manageable, and also for removing sample-to-sample correlations. Over-binning, however, erases statistical information.
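For illustration, here is a minimal sketch using the random_numbers list from above (the bin averages in the comments are computed by hand):

from gvar.dataset import avg_data, bin_data

binned = bin_data(random_numbers, binsize=3)
# binned contains two bin averages:
#   (1.739 + 2.682 + 2.493)/3 = 2.3047 and (-0.460 + 0.603 + 0.800)/3 = 0.3143
# their mean, 1.3095, equals the mean of all six original samples, so
# avg_data(binned) has the same central value as avg_data(random_numbers)
print(avg_data(binned))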
Class gvar.dataset.Dataset can also be used to build a dataset sample by sample in code: for example,
>>> a = Dataset()
>>> a.append(n=1.739, a=[ 0.494, 2.734])
>>> a.append(n=2.682, a=[ 0.172, 1.400])
...
creates the same dataset as above.
Functions¶
The functions defined in the module are:
- gvar.dataset.avg_data(data, spread=False, median=False, bstrap=False, noerror=False, warn=True)¶

Average random data to estimate mean.

data is a list of random numbers, a list of random arrays, or a dictionary of lists of random numbers and/or arrays: for example,

>>> random_numbers = [1.60, 0.99, 1.28, 1.30, 0.54, 2.15]
>>> random_arrays = [[12.2,121.3],[13.4,149.2],[11.7,135.3],
...                  [7.2,64.6],[15.2,69.0],[8.3,108.3]]
>>> random_dict = dict(n=random_numbers,a=random_arrays)

where in each case there are six random numbers/arrays.
avg_data estimates the means of the distributions from which the random numbers/arrays are drawn, together with the uncertainties in those estimates. The results are returned as a gvar.GVar or an array of gvar.GVars, or a dictionary of gvar.GVars and/or arrays of gvar.GVars:

>>> print(avg_data(random_numbers))
1.31(20)
>>> print(avg_data(random_arrays))
[11.3(1.1) 108(13)]
>>> print(avg_data(random_dict))
{'a': array([11.3(1.1), 108(13)], dtype=object), 'n': 1.31(20)}
The arrays in random_arrays are one dimensional; in general, they can have any shape. avg_data(data) also estimates any correlations between different quantities in data. When data is a dictionary, it does this by assuming that the lists of random numbers/arrays for the different data[k]s are synchronized, with the first element in one list corresponding to the first elements in all other lists, and so on. If some lists are shorter than others, the longer lists are truncated to the same length as the shortest list (discarding data samples).

There are five optional arguments. If argument spread=True, each standard deviation in the results refers to the spread in the data, not the uncertainty in the estimate of the mean. The former is sqrt(N) larger, where N is the number of random numbers (or arrays) being averaged:

>>> print(avg_data(random_numbers,spread=True))
1.31(50)
>>> print(avg_data(random_numbers))
1.31(20)
>>> print((0.50 / 0.20) ** 2)   # should be (about) 6
6.25
This is useful, for example, when averaging bootstrap data. The default value is spread=False.

The second option is triggered by setting median=True. This replaces the means in the results by medians, while the standard deviations are approximated by the half-width of the interval, centered around the median, that contains 68% of the data. These estimates are more robust than the mean and standard deviation when averaging over small amounts of data; in particular, they are unaffected by extreme outliers in the data. The default is median=False.

The third option is triggered by setting bstrap=True. This is shorthand for setting median=True and spread=True, and overrides any explicit setting for these keyword arguments. This is the typical choice for analyzing bootstrap data, hence its name. The default value is bstrap=False.

The fourth option is to omit the error estimates on the averages, which is triggered by setting noerror=True. Just the mean values are returned. The default value is noerror=False.

The final option, warn, determines whether or not a warning is issued when different components of a dictionary data set have different sample sizes.
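To illustrate the options, here is a minimal sketch using random_numbers from above (the comments restate the behavior documented in this entry):

from gvar.dataset import avg_data

mean_only = avg_data(random_numbers, noerror=True)  # just the mean, no error estimate
robust = avg_data(random_numbers, median=True)      # median, error from 68% half-width
bs = avg_data(random_numbers, bstrap=True)          # same as median=True, spread=True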
- gvar.dataset.autocorr(data)¶

Compute autocorrelation in random data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays.

When data is a list of random numbers, autocorr(data) returns an array where autocorr(data)[i] is the correlation between elements in data that are separated by distance i in the list: for example,

>>> print(autocorr([2,-2,2,-2,2,-2]))
[ 1. -1.  1. -1.  1. -1.]
shows perfect correlation between elements separated by an even interval in the list, and perfect anticorrelation between elements separated by an odd interval.

autocorr(data) returns a list of arrays of autocorrelation coefficients when data is a list of random arrays. Again autocorr(data)[i] gives the autocorrelations for data elements separated by distance i in the list. Similarly autocorr(data) returns a dictionary when data is a dictionary.

autocorr(data) uses FFTs to compute the autocorrelations; the cost of computing the autocorrelations should grow roughly linearly with the number of random samples in data (up to logarithms).
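One common use of the autocorrelations is choosing a binsize for bin_data() that is large enough to wash out sample-to-sample correlations. A minimal sketch, where samples is a hypothetical list of random numbers and the 0.1 threshold is an arbitrary illustrative choice:

from gvar.dataset import autocorr, bin_data

corr = autocorr(samples)
# first separation at which correlations have (roughly) died away;
# fall back to binsize=1 if the threshold is never reached
binsize = next((i for i, c in enumerate(corr) if abs(c) < 0.1), 1)
binned = bin_data(samples, binsize=binsize)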
- gvar.dataset.bin_data(data, binsize=2)¶

Bin random data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bin_data(data,binsize) replaces consecutive groups of binsize numbers/arrays by the average of those numbers/arrays. The result is a new data list (or dictionary) with 1/binsize times as much random data: for example,

>>> print(bin_data([1,2,3,4,5,6,7],binsize=2))
[1.5, 3.5, 5.5]
>>> print(bin_data(dict(s=[1,2,3,4,5],v=[[1,2],[3,4],[5,6],[7,8]]),binsize=2))
{'s': [1.5, 3.5], 'v': [array([ 2.,  3.]), array([ 6.,  7.])]}
Data is dropped at the end if there is insufficient data to form complete bins. Binning is used to make calculations faster and to reduce measurement-to-measurement correlations, if they exist. Over-binning erases useful information.
- gvar.dataset.bootstrap_iter(data, n=None)¶

Create iterator that returns bootstrap copies of data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bootstrap_iter(data,n) is an iterator that returns n bootstrap copies of data. The random numbers/arrays in a bootstrap copy are drawn at random (with repetition allowed) from among the samples in data: for example,

>>> data = [1.1, 2.3, 0.5, 1.9]
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
[ 1.1  1.1  0.5  1.9]
>>> print(next(data_iter))
[ 0.5  2.3  1.9  0.5]

>>> data = dict(a=[1,2,3,4],b=[1,2,3,4])
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
{'a': array([3, 3, 1, 2]), 'b': array([3, 3, 1, 2])}
>>> print(next(data_iter))
{'a': array([1, 3, 3, 2]), 'b': array([1, 3, 3, 2])}

>>> data = [[1,2],[3,4],[5,6],[7,8]]
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
[[ 7.  8.]
 [ 1.  2.]
 [ 1.  2.]
 [ 7.  8.]]
>>> print(next(data_iter))
[[ 3.  4.]
 [ 7.  8.]
 [ 3.  4.]
 [ 1.  2.]]
The distribution of bootstrap copies is an approximation to the distribution from which data was drawn. Consequently means, variances and correlations for bootstrap copies should be similar to those in data. Analyzing variations from bootstrap copy to copy is often useful when dealing with non-gaussian behavior or complicated correlations between different quantities.

Parameter n specifies the maximum number of copies; there is no maximum if n is None.
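A typical bootstrap analysis feeds the copies back into avg_data(..., bstrap=True). A minimal sketch, using the list data = [1.1, 2.3, 0.5, 1.9] from above (100 copies is an arbitrary illustrative choice):

from gvar.dataset import avg_data, bootstrap_iter

# each bootstrap copy is a numpy array here; record the mean of each copy
# (any derived quantity could be used in place of the mean)
copies = [bs.mean() for bs in bootstrap_iter(data, n=100)]
print(avg_data(copies, bstrap=True))  # median, with error = 68% spread of the copies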
Classes¶
gvar.dataset.Dataset is used to assemble random samples from multidimensional distributions:
- class gvar.dataset.Dataset¶

Dictionary for collecting random data.

This dictionary class simplifies the collection of random data. The random data are stored in a dictionary, with each piece of random data being a number or a numpy array of numbers. For example, consider a situation where there are four random values for a scalar s and four random values for a vector v. These can be collected as follows:

>>> data = Dataset()
>>> data.append(s=1.1,v=[12.2,20.6])
>>> data.append(s=0.8,v=[14.1,19.2])
>>> data.append(s=0.95,v=[10.3,19.7])
>>> data.append(s=0.91,v=[8.2,21.0])
>>> print(data['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]
>>> print(data['v'])       # 4 random vector-values of v
[array([ 12.2,  20.6]), array([ 14.1,  19.2]), array([ 10.3,  19.7]), array([  8.2,  21. ])]
The argument to data.append() could be a dictionary: for example, dd = dict(s=1.1,v=[12.2,20.6]); data.append(dd) is equivalent to the first append statement above. This is useful, for example, if the data comes from a function (that returns a dictionary).

One can also append data key-by-key: for example, data.append('s',1.1); data.append('v',[12.2,20.6]) is equivalent to the first append in the example above. One could also achieve this with, for example, data['s'].append(1.1); data['v'].append([12.2,20.6]), since each dictionary value is a list, but gvar.Dataset's append checks for consistency between the new data and data already collected and so is preferable.
Use extend in place of append to add data in batches: for example,

>>> data = Dataset()
>>> data.extend(s=[1.1,0.8],v=[[12.2,20.6],[14.1,19.2]])
>>> data.extend(s=[0.95,0.91],v=[[10.3,19.7],[8.2,21.0]])
>>> print(data['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]

gives the same dataset as the first example above.
A Dataset can also be created from a file where every line is a new random sample. The data in the first example above could have been stored in a file with the following content:

# file: datafile
s 1.1
v [12.2,20.6]
s 0.8
v [14.1,19.2]
s 0.95
v [10.3,19.7]
s 0.91
v [8.2,21.0]

Lines that begin with # are ignored. Assuming the file is called datafile, we create a dataset identical to that above using the code:

>>> data = Dataset('datafile')
>>> print(data['s'])
[ 1.1, 0.8, 0.95, 0.91]
Data can be binned while reading it in, which might be useful if the data set is huge. To bin the data contained in file datafile in bins of binsize 2 we use:

>>> data = Dataset('datafile',binsize=2)
>>> print(data['s'])
[0.95, 0.93]
The keys read from a data file are restricted to those listed in keyword keys and those that are matched (or partially matched) by regular expression grep, if one or the other of these is specified: for example,

>>> data = Dataset('datafile')
>>> print([k for k in data])
['s', 'v']
>>> data = Dataset('datafile',keys=['v'])
>>> print([k for k in data])
['v']
>>> data = Dataset('datafile',grep='[^v]')
>>> print([k for k in data])
['s']
>>> data = Dataset('datafile',keys=['v'],grep='[^v]')
>>> print([k for k in data])
[]
Datasets can also be constructed from dictionaries, other Datasets, or lists of key-data tuples. For example,

>>> data = Dataset('datafile')
>>> data_binned = Dataset(data, binsize=2)
>>> data_v = Dataset(data, keys=['v'])

reads data from file 'datafile' into Dataset data, and then creates a new Dataset with the data binned (data_binned), and another that contains only the data with key 'v' (data_v).

The main attributes and methods are:
- samplesize¶

Smallest number of samples for any key.
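For example, a minimal sketch where d is a hypothetical dataset with unequal sample counts:

d = Dataset()
d.append('s', 1.1)
d.append('s', 0.8)
d.append('v', [12.2, 20.6])
print(d.samplesize)  # 1: key 'v' has only one sample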
- append(*args, **kargs)¶

Append data to dataset.

There are three equivalent ways of adding data to a dataset data: for example, each of

data.append(n=1.739,a=[0.494,2.734])    # method 1

data.append('n',1.739)                  # method 2
data.append('a',[0.494,2.734])

dd = dict(n=1.739,a=[0.494,2.734])      # method 3
data.append(dd)

adds one new random number to data['n'], and a new vector to data['a'].
- extend(*args, **kargs)¶

Add batched data to dataset.

There are three equivalent ways of adding batched data, containing multiple samples for each quantity, to a dataset data: for example, each of

data.extend(n=[1.739,2.682], a=[[0.494,2.734],[0.172,1.400]])    # method 1

data.extend('n',[1.739,2.682])                                   # method 2
data.extend('a',[[0.494,2.734],[0.172,1.400]])

dd = dict(n=[1.739,2.682], a=[[0.494,2.734],[0.172,1.400]])      # method 3
data.extend(dd)

adds two new random numbers to data['n'], and two new random vectors to data['a'].

This method can be used to merge two datasets, whether or not they share keys: for example,

data = Dataset("file1")
data_extra = Dataset("file2")
data.extend(data_extra)   # data now contains all of data_extra
- grep(rexp)¶

Create new dataset containing items whose keys match rexp.

Returns a new gvar.dataset.Dataset containing only the items self[k] whose keys k match regular expression rexp (a string) according to Python module re:

>>> a = Dataset()
>>> a.append(xx=1.,xy=[10.,100.])
>>> a.append(xx=2.,xy=[20.,200.])
>>> print(a.grep('y'))
{'xy': [array([ 10., 100.]), array([ 20., 200.])]}
>>> print(a.grep('x'))
{'xx': [1.0, 2.0], 'xy': [array([ 10., 100.]), array([ 20., 200.])]}
>>> print(a.grep('x|y'))
{'xx': [1.0, 2.0], 'xy': [array([ 10., 100.]), array([ 20., 200.])]}
>>> print(a.grep('[^y][^x]'))
{'xy': [array([ 10., 100.]), array([ 20., 200.])]}

Items are retained even if rexp matches only part of the item's key.
- slice(sl)¶

Create new dataset with self[k] -> self[k][sl].

Parameter sl is a slice object that is applied to every item in the dataset to produce a new gvar.Dataset. Setting sl = slice(0,None,2), for example, discards every other sample for each quantity in the dataset. Setting sl = slice(100,None) discards the first 100 samples for each quantity.

If parameter sl is a tuple of slice objects, these are applied to successive indices of self[k]. An exception is raised if the number of slice objects exceeds the number of dimensions for any self[k].
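For example, a minimal sketch that keeps every other sample of the four-sample s/v dataset data built above:

thinned = data.slice(slice(0, None, 2))
print(thinned['s'])  # [1.1, 0.95]: samples 0 and 2 of data['s']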
- arrayzip(template)¶

Merge lists of random data according to template.

template is an array of keys in the dataset, where the shapes of self[k] are the same for all keys k in template. self.arrayzip(template) merges the lists of random numbers/arrays associated with these keys to create a new list of (merged) random arrays whose layout is specified by template: for example,

>>> d = Dataset()
>>> d.append(a=1,b=10)
>>> d.append(a=2,b=20)
>>> d.append(a=3,b=30)
>>> print(d)               # three random samples each for a and b
{'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]}
>>> # merge into list of 2-vectors:
>>> print(d.arrayzip(['a','b']))
[[  1.  10.]
 [  2.  20.]
 [  3.  30.]]
>>> # merge into list of (symmetric) 2x2 matrices:
>>> print(d.arrayzip([['b','a'],['a','b']]))
[[[ 10.   1.]
  [  1.  10.]]

 [[ 20.   2.]
  [  2.  20.]]

 [[ 30.   3.]
  [  3.  30.]]]

The number of samples in each merged result is the same as the number of samples for each key (here 3). The keys used in this example represent scalar quantities; in general, they could be either scalars or arrays (of any shape, so long as all have the same shape).
- trim()¶

Create new dataset where all entries have same sample size.
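For example, continuing the hypothetical dataset d from samplesize above:

d_trimmed = d.trim()
print(len(d_trimmed['s']))  # 1: 's' is truncated to d.samplesize samples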
- toarray()¶

Create dictionary d where d[k]=numpy.array(self[k]) for all k.
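For example, for the four-sample s/v dataset data built above:

arr = data.toarray()
print(arr['v'].shape)  # (4, 2): 4 samples of a 2-vector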