Spreadsheet-style functions for NumPy ndarray with structured dtype or recarray objects:
aggregate, aggregate_in, pivot, addrecords, addcols, deletecols, renamecol, replace, colstack, rowstack, nullvalue
Note that these functions are also wrapped as methods of the tabular tabarray object, which is a subclass of the numpy ndarray.
See Also:
Aggregate a ndarray with structured dtype (or recarray) on columns for given functions.
Aggregate a numpy recarray (or tabular tabarray) on a set of specified factors, using specified aggregation functions.
Intuitively, this function will aggregate the dataset X on a set of columns, whose names are listed in On, so that the resulting aggregate data set has one record for each unique tuples of values in those columns.
The more factors listed in On argument, the “finer” is the aggregation, the fewer factors, the “coarser” the aggregation. For example, if:
On = ['A','B']
the resulting data set will have one record for each unique value of pairs (a,b) in:
X[['A','B']]
The AggFuncDict argument specifies how to aggregate the factors _not_ listed in On, e.g. the so-called Off columns. For example, if
On = [‘A’,’B’]
and C is some other column, then:
AggFuncDict['C']
is the function that will be used to reduce to a single value the (potentially multiple) values in the C column corresponding to unique values in the A, B columns. For instance, if:
AggFuncDict['C'] = numpy.mean
then the result will be that the values in the C column corresponding to a single A, B value will be averaged.
If an Off column is _not_ provided as a key in AggFuncDict, a default aggregator function will be used: the sum function for numerical columns, concatenation for string columns.
Implemented by the tabarray method tabular.tab.tabarray.aggregate().
Parameters
X : numpy ndarray with structured dtype or recarray
The data set to aggregate.On : list of strings, optional
List of column names in X.AggFuncDict : dictionary, optional
Dictionary where
- keys are some (all) column names of X that are NOT in On
- values are functions that can be applied to lists or numpy arrays.
This specifies how to aggregate the factors _not_ listed in On, e.g. the so-called Off columns.
AggFunc : function, optional
Function that can be applied to lists or numpy arrays, specifying how to aggregate factors not listed in either On or the keys of AggFuncDict, e.g. a “default” aggregation function for the Off columns not explicitly listed in AggFuncDict.returnsort : Boolean, optional
If returnsort == True, then return a list of indices describing how X was sorted as a result of aggregation. Default value is False.
Returns
agg : numpy ndarray with structured dtype
Aggregated data set.index_array : numpy ndarray (int, 1D)
Returned only if returnsort == True. List of indices describing how X was sorted as a result of aggregation.
See also:
Aggregate a ndarray with structured dtype or recarray and include original data in the result.
Take aggregate of data set on specified columns, then add the resulting rows back into data set to make a composite object containing both original non-aggregate data rows as well as the aggregate rows.
First read comments for tabular.spreadsheet.aggregate().
This function returns a numpy ndarray, with the number of rows equaling:
len(Data) + len(A)
where A is the the result of:
Data.aggregate(On,AggFuncDict)
A represents the aggregate rows; the other rows were the original data rows.
This function supports _multiple_ aggregation, meaning that one can first aggregate on one set of factors, then repeat aggregation on the result for another set of factors, without the results of the first aggregation interfering the second. To achieve this, the method adds two new columns:
Implemented by the tabarray method tabular.tab.tabarray.aggregate_in().
Parameters
Data : numpy ndarray with structured dtype or recarray
The data set to aggregate in.On : list of strings, optional
List of column names in X.AggFuncDict : dictionary, optional
Dictionary where
- keys are some (all) column names of X that are NOT in On
- values are functions that can be applied to lists or numpy arrays.
This specifies how to aggregate the factors _not_ listed in On, e.g. the so-called Off columns.
AggFunc : function, optional
Function that can be applied to lists or numpy arrays, specifying how to aggregate factors not listed in either On or the keys of AggFuncDict, e.g. a “default” aggregation function for the Off columns not explicitly listed in AggFuncDict.interspersed : boolean, optional
- If True, aggregate rows are interleaved with the data of which they are aggregates.
- If False, all aggregate rows placed at the end of the array.
Returns
agg : numpy ndarray with structured dtype
Composite aggregated data set plus original data set.
See also:
Implements pivoting on numpy ndarrays (with structured dtype) or recarrays.
See http://en.wikipedia.org/wiki/Pivot_table for information about pivot tables.
Returns X pivoted on (a,b) with a as the row axis and b values as the column axis.
So-called “nontrivial columns relative to b” in X are added as color-grouped sets of columns, and “trivial columns relative to b” are also retained as cross-grouped sets of columns if they are listed in Keep argument.
Note that a column c in X is “trivial relative to b” if for all rows i, X[c][i] can be determined from X[b][i], e.g the elements in X[c] are in many-to-any correspondence with the values in X[b].
The function will raise an exception if the list of pairs of value in X[[a,b]] is not the product of the individual columns values, e.g.:
X[[a,b]] == set(X[a]) x set(X[b])
in some ordering.
Implemented by the tabarray method tabular.tab.tabarray.pivot()
Parameters
X : numpy ndarray with structured dtype or recarray
The data set to pivot.a : string
Column name in X.b : string
Another column name in X.Keep : list of strings, optional
List of other columns names in X.NullVals : optional
Dictionary mapping column names in X other than a or b to appropriate null values for their types.
If None, then the null values defined by the nullvalue function are used, see tabular.spreadsheet.nullvalue().
prefix : string, optional
Prefix to add to coloring keys corresponding to cross-grouped “trivial columns relative to b”. Default value is an underscore, ‘_’.
Returns
ptable : numpy ndarray with structured dtype
The resulting pivot table.coloring : dictionary
Dictionary whose keys are strings and corresponding values are lists of column names (e.g. strings).
There are two groups of keys:
So-called “nontrivial columns relative to b” in X. These correspond to columns in:
set(`X.dtype.names`) - set([a, b])
Cross-grouped “trivial columns relative to b”. The prefix is used to distinguish these.
The coloring parameter is used by the the tabarray pivot method, tabular.tab.tabarray.pivot().
See tabular.tab.tabarray.__new__() for more information about coloring.
Append one or more records to the end of a numpy recarray or ndarray .
Can take a single record, void or tuple, or a list of records, voids or tuples.
Implemented by the tabarray method tabular.tab.tabarray.addrecords().
Parameters
X : numpy ndarray with structured dtype or recarray
The array to add records to.new : record, void or tuple, or list of them
Record(s) to add to X.
Returns
out : numpy ndarray with structured dtype
New numpy array made up of X plus the new records.
See also: tabular.spreadsheet.rowstack()
Add one or more columns to a numpy ndarray.
Technical dependency of tabular.spreadsheet.aggregate_in().
Implemented by the tabarray method tabular.tab.tabarray.addcols().
Parameters
X : numpy ndarray with structured dtype or recarray
The recarray to add columns to.cols : numpy ndarray, or list of arrays of columns
Column(s) to add.names: list of strings, optional
Names of the new columns. Only applicable when cols is a list of arrays.
Returns
out : numpy ndarray with structured dtype
New numpy array made up of X plus the new columns.
See also: tabular.spreadsheet.colstack()
Delete columns from a numpy ndarry or recarray.
Can take a string giving a column name or comma-separated list of column names, or a list of string column names.
Implemented by the tabarray method tabular.tab.tabarray.deletecols().
Parameters
X : numpy recarray or ndarray with structured dtype
The numpy array from which to delete columns.cols : string or list of strings
Name or list of names of columns in X. This can be a string giving a column name or comma-separated list of column names, or a list of string column names.
Returns
out : numpy ndarray with structured dtype
New numpy ndarray with structured dtype given by X, excluding the columns named in cols.
Rename column of a numpy ndarray with structured dtype, in-place.
Implemented by the tabarray method tabular.tab.tabarray.renamecol().
Parameters
X : numpy ndarray with structured dtype
The numpy array for which a column is to be renamed.old : string
Old column name, e.g. a name in X.dtype.names.new : string
New column name to replace old.
Replace value old with new everywhere it appears in-place.
Implemented by the tabarray method tabular.tab.tabarray.replace().
Parameters
X : numpy ndarray with structured dtype
Numpy array for which in-place replacement of old with new is to be done.old : string
new : string
strict : boolean, optional
- If strict = True, replace only exact occurences of old.
- If strict = False, assume old and new are strings and replace all occurences of substrings (e.g. like str.replace())
cols : list of strings, optional
Names of columns to make replacements in; if None, make replacements everywhere.rows : list of booleans or integers, optional
Rows to make replacements in; if None, make replacements everywhere.
Note: This function does in-place replacements. Thus there are issues handling data types here when replacement dtype is larger than original dtype. This can be resolved later by making a new array when necessary ...
Horizontally stack a sequence of numpy ndarrays with structured dtypes
Analog of numpy.hstack for recarrays.
Implemented by the tabarray method tabular.tab.tabarray.colstack() which uses tabular.tabarray.tab_colstack().
Parameters
seq : sequence of numpy ndarray with structured dtype
List, tuple, etc. of numpy recarrays to stack vertically.mode : string in [‘first’,’drop’,’abort’,’rename’]
Denotes how to proceed if when multiple recarrays share the same column name:
- if mode == first, take the column from the first recarray in seq containing the shared column name.
- elif mode == abort, raise an error when the recarrays to stack share column names; this is the default mode.
- elif mode == drop, drop any column that shares its name with any other column among the sequence of recarrays.
- elif mode == rename, for any set of all columns sharing the same name, rename all columns by appending an underscore, ‘_’, followed by an integer, starting with ‘0’ and incrementing by 1 for each subsequent column.
Returns
out : numpy ndarray with structured dtype
Result of horizontally stacking the arrays in seq.
See also: numpy.hstack.
Vertically stack a sequence of numpy ndarrays with structured dtype
Analog of numpy.vstack
Implemented by the tabarray method tabular.tab.tabarray.rowstack() which uses tabular.tabarray.tab_rowstack().
Parameters
seq : sequence of numpy recarrays
List, tuple, etc. of numpy recarrays to stack vertically.mode : string in [‘nulls’, ‘commons’, ‘abort’]
Denotes how to proceed if the recarrays have different dtypes, e.g. different sets of named columns.
- if mode == nulls, the resulting set of columns is determined by the union of the dtypes of all recarrays to be stacked, and missing data is filled with null values as defined by tabular.spreadsheet.nullvalue(); this is the default mode.
- elif mode == commons, the resulting set of columns is determined by the intersection of the dtypes of all recarrays to be stacked, e.g. common columns.
- elif mode == abort, raise an error when the recarrays to stack have different dtypes.
Returns
out : numpy ndarray with structured dtype
Result of vertically stacking the arrays in seq.
See also: numpy.vstack.
Combine two or more numpy ndarray with structured dtype on common key column(s).
Merge a list (or dictionary) of numpy ndarray with structured dtype, given by L, on key columns listed in keycols.
This function is actually a wrapper for tabular.spreadsheet.strictjoin().
The strictjoin function has a few restrictions, and this join function will try to ensure that they are satisfied:
An error will be thrown if these conditions are not met.
If you don’t provide a value of keycols, the algorithm will attempt to infer which columns should be used by trying to find the largest set of common column names that contain unique values in each array and have the same data type. An error will be thrown if no such inference can be made.
Renaming of overlapping columns
If the non-keycol column names of the arrays overlap, join will by default attempt to rename the columns by using a simple convention:
- If L is a list, it will append the number in the list to the key associated with the array.
- If L is a dictionary, the algorithm will append the string representation of the key associated with an array to the overlapping columns from that array.
You can override the default renaming scheme using the renamer parameter.
Nullvalues for keycolumn differences
If there are regions of the keycolumns that are not overlapping between merged arrays, join will fill in the relevant entries with null values chosen by default:
- ‘0’ for integer columns
- ‘0.0’ for float columns
- the empty character (‘’) for string columns.
Parameters
L : list or dictionary
Numpy recarrays to merge. If L is a dictionary, the keys name each numpy recarray, and the corresponding values are the actual numpy recarrays.keycols : list of strings
List of the names of the key columns along which to do the merging.nullvals : function, optional
A function that returns a null value for a numpy format descriptor string, e.g. '<i4' or '|S5'.
See the default function for further documentation:
tabular.spreadsheet.DEFAULT_NULLVALUEFORMAT()renamer : function, optional
A function for renaming overlapping non-key column names among the numpy recarrays to merge.
See the default function for further documentation:
returnrenaming : Boolean, optional
Whether to return the result of the renamer function.
See the default function for further documentation:
Names: list of strings:
If L is a list, than names for elements of L can be specified with Names (without losing the ordering as you would if you did it with a dictionary).
len(L) must equal len(Names)
Returns
result : numpy ndarray with structured dtype
Result of the join, e.g. the result of merging the input numpy arrays defined in L on the key columns listed in keycols.renaming : dictionary of dictionaries, optional
The result returned by the renamer function. Returned only if returnrenaming == True.
See the default function for further documentation:
See Also:
Combine two or more numpy ndarray with structured dtypes on common key column(s).
Merge a list (or dictionary) of numpy arrays, given by L, on key columns listed in keycols.
The strictjoin assumes the following restrictions:
An error will be thrown if these conditions are not met.
For a wrapper that attempts to meet these restrictions, see tabular.spreadsheet.join().
If you don’t provide a value of keycols, the algorithm will attempt to infer which columns should be used by trying to find the largest set of common column names that contain unique values in each array and have the same data type. An error will be thrown if no such inference can be made.
Renaming of overlapping columns
If the non-keycol column names of the arrays overlap, join will by default attempt to rename the columns by using a simple convention:
- If L is a list, it will append the number in the list to the key associated with the array.
- If L is a dictionary, the algorithm will append the string representation of the key associated with an array to the overlapping columns from that array.
You can override the default renaming scheme using the renamer parameter.
Nullvalues for keycolumn differences
If there are regions of the keycolumns that are not overlapping between merged arrays, join will fill in the relevant entries with null values chosen by default:
- ‘0’ for integer columns
- ‘0.0’ for float columns
- the empty character (‘’) for string columns.
Parameters
L : list or dictionary
Numpy recarrays to merge. If L is a dictionary, the keys name each numpy recarray, and the corresponding values are the actual numpy recarrays.keycols : list of strings
List of the names of the key columns along which to do the merging.nullvals : function, optional
A function that returns a null value for a numpy format descriptor string, e.g. '<i4' or '|S5'.
See the default function for further documentation:
tabular.spreadsheet.DEFAULT_NULLVALUEFORMAT()renaming : dictionary of dictionaries, optional
Dictionary mapping each input numpy recarray to a dictionary mapping each original column name to its new name following the convention above.
For example, the result returned by:
Returns
result : numpy ndarray with structured dtype
Result of the join, e.g. the result of merging the input numpy arrays defined in L on the key columns listed in keycols.
See Also:
Renames overlapping column names of numpy ndarrays with structured dtypes
Rename the columns by using a simple convention:
Default renamer function used by tabular.spreadsheet.join()
Parameters
L : list or dictionary
Numpy recarrays with columns to be renamed.
Returns
D : dictionary of dictionaries
Dictionary mapping each input numpy recarray to a dictionary mapping each original column name to its new name following the convention above.