Import tabular like this:
>>> import tabular as tb
The class tabular.tabarray.tabarray is a data structure designed to hold and manipulate tabular data. tabarray objects are always created like this:
X = tb.tabarray(some data...)
Calling tb.tabarray always returns an instance of the tabular.tabarray.tabarray class, and such tabarray objects can be created with a variety of types of input data.
The basic idea is that your input data will often be a native Python object (like a list of tuples) or a text file (like a CSV). By putting this data into a tabarray object, you’ll get a representation of the data that has powerful row and column selection, filtering, analysis, and manipulation properties with a very simple syntax.
Row data: For instance, suppose your data is a Python list of tuples, each element of which represents a row:
>>> Recs = [('North', 'Service', 44.497368897699261, 85937),
('North', 'Manufacturing', 88.752549661587452, 305),
('North', 'Education', 46.304921839816174, 90568),
('North', 'Healthcare', 12.428147250824761, 52493),
('South', 'Service', 7.3632321953949864, 85280),
('South', 'Manufacturing', 2.35086845930782, 95578),
('South', 'Education', 53.272407154429423, 11823),
('South', 'Healthcare', 56.68070233253686, 32669),
('East', 'Service', 5.9745703176360525, 81417),
('East', 'Manufacturing', 93.840942619447858, 26229),
('East', 'Education', 78.067752261015244, 62454),
('East', 'Healthcare', 19.703894654017773, 63346),
('West', 'Service', 32.015715993438917, 72732),
('West', 'Manufacturing', 10.914876835171071, 23455),
('West', 'Education', 41.714531295305711, 85876),
('West', 'Healthcare', 81.458571751676629, 43660)]
Put the data into a tabular array with column names ‘Region’, ‘Sector’, ‘Amount’, ‘Population’, using the “records” and “names” keyword arguments:
>>> X = tb.tabarray(records = Recs, names=['Region','Sector','Amount','Population'])
>>> X
tabarray([('North', 'Service', 44.497368897699261, 85937),
('North', 'Manufacturing', 88.752549661587452, 305),
('North', 'Education', 46.304921839816174, 90568),
('North', 'Healthcare', 12.428147250824761, 52493),
. . .
('West', 'Manufacturing', 10.914876835171071, 23455),
('West', 'Education', 41.714531295305711, 85876),
('West', 'Healthcare', 81.458571751676629, 43660)],
dtype=[('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8'), ('Population', '<i4')])
Column data: If your data is a list of columns, like this:
>>> Cols = [('North', 'North', 'North', 'North', 'South', 'South', 'South', 'South',
'East', 'East', 'East', 'East', 'West', 'West', 'West', 'West'),
('Service', 'Manufacturing', 'Education', 'Healthcare', 'Service',
'Manufacturing', 'Education', 'Healthcare', 'Service', 'Manufacturing',
'Education', 'Healthcare', 'Service', 'Manufacturing', 'Education', 'Healthcare'),
(81.749690102163314, 74.023848861551798, 32.515259350801131, 49.606447425664648,
11.748647385553801, 37.258888460591535, 34.364104643975423, 7.0963777523329608,
75.359680733433933, 46.357633526256734, 68.819758428065498, 67.493509469400962,
19.816892327675639, 55.942915632272758, 18.294195106203382, 55.55884722919324),
(99896, 83758, 9808, 60760, 60957, 99498, 74334, 66600, 62984, 5519,
96958, 7764, 2987, 88673, 7026, 32438)]
Then use the “columns” keyword argument to create a tabular array:
>>> X = tb.tabarray(columns = Cols, names=['Region','Sector','Amount','Population'])
>>> X
tabarray([('North', 'Service', 44.497368897699261, 85937),
('South', 'Service', 7.3632321953949864, 85280),
('East', 'Service', 5.9745703176360525, 81417),
('West', 'Service', 32.015715993438917, 72732)],
dtype=[('Region', '|S5'), ('Sector', '|S7'), ('Amount', '<f8'), ('Population', '<i4')])
Tabular arrays can also be created from NumPy arrays by using the “array” keyword argument, and a variety of other ways (see Creating Tabular Arrays for more details.)
Length and data type
The length of a tabular array is its number of records:
>>> len(X) #number of records
16
Tabarrays have associated dtype objects, containing the column names and the data type of each column:
>>> X.dtype
dtype([('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8'), ('Population', '<i4')])
>>> len(X.dtype) #number of columns
4
>>> X.dtype.names #column names
('Region', 'Sector', 'Amount', 'Population')
Columns: Individual columns or groups of columns can be selected:
>>> X['Amount'] #select 'Amount' column
array([ 44.4973689 , 88.75254966, 46.30492184, 12.42814725,
7.3632322 , 2.35086846, 53.27240715, 56.68070233,
5.97457032, 93.84094262, 78.06775226, 19.70389465,
32.01571599, 10.91487684, 41.7145313 , 81.45857175])
>>> X[['Amount','Population']] #select both 'Amount' and 'Population' columns
tabarray([(44.497368897699261, 85937), (88.752549661587452, 305), (46.304921839816174, 90568),
. . .
(32.015715993438917, 72732), (10.914876835171071, 23455),
(41.714531295305711, 85876), (81.458571751676629, 43660)],
dtype=[('Amount', '<f8'), ('Population', '<i4')])
Rows: Individual rows or sets of rows can be selected:
>>> X[0] #select row 0
('North', 'Service', 44.497368897699261, 85937)
>>> X[4:9] #select rows 4 through 8
[('South', 'Service', 7.3632321953949864, 85280)
('South', 'Manufacturing', 2.35086845930782, 95578)
('South', 'Education', 53.272407154429423, 11823)
('South', 'Healthcare', 56.68070233253686, 32669)
('East', 'Service', 5.9745703176360525, 81417)]
>>> X[[1,5,9,14]] #select rows 1,5,9,and 14
[('North', 'Manufacturing', 88.752549661587452, 305)
('South', 'Manufacturing', 2.35086845930782, 95578)
('East', 'Manufacturing', 93.840942619447858, 26229)
('West', 'Education', 41.714531295305711, 85876)]
Comparisons: Rich, row-by-row comparisons can be made for columns, returning boolean arrays.
>>> X['Sector'] == 'Manufacturing' #compare to a scalar constant value
array([False, True, False, False, False, True, False, False, False,
True, False, False, False, True, False, False], dtype=bool)
>>> X['Amount'] > .0001*X['Population'] #compare one column to another array of the same length
array([ True, True, True, True, False, False, True, True, False,
True, True, True, True, True, True, True], dtype=bool)
Filtering: The resulting boolean arrays can be used to select rows:
>>> print X[X['Sector'] == 'Manufacturing'] #get rows where "Sector" entry is "Manufacturing"
[('North', 'Manufacturing', 88.752549661587452, 305)
('South', 'Manufacturing', 2.35086845930782, 95578)
('East', 'Manufacturing', 93.840942619447858, 26229)
('West', 'Manufacturing', 10.914876835171071, 23455)]
>>> print X[X['Amount'] > 50] #get rows where "Amount" entry exceeds 50
[('North', 'Manufacturing', 88.752549661587452, 305)
('South', 'Education', 53.272407154429423, 11823)
('South', 'Healthcare', 56.68070233253686, 32669)
('East', 'Manufacturing', 93.840942619447858, 26229)
('East', 'Education', 78.067752261015244, 62454)
('West', 'Healthcare', 81.458571751676629, 43660)]
Logical connectives ‘AND’ (&) and ‘OR’ (|) can be used to combine boolean arrays for more sophisticated filtering:
>>> X[(X['Amount'] > 50) & (X['Sector'] == 'Manufacturing')] #select rows where "Amount" > 50 and "Sector" = "Manufacturing"
[('North', 'Manufacturing', 88.752549661587452, 305)
('East', 'Manufacturing', 93.840942619447858, 26229)]
>>> X[((X['Amount'] > 50) & (X['Sector'] == 'Manufacturing')) | (X['Population'] > 90000)]
[('North', 'Manufacturing', 88.752549661587452, 305)
('North', 'Education', 46.304921839816174, 90568)
('South', 'Manufacturing', 2.35086845930782, 95578)
('East', 'Manufacturing', 93.840942619447858, 26229)]
Reassignment: The contents of a tabarray can be changed by assignment, both for individual values:
>>> print X['Amount'][0]
44.497
>>> X['Amount'][0] = 51.004
>>> print X['Amount']
51.004
as well as whole columns:
>>> X['Amount'] = X['Amount'] + 10
>>> X['Amount']
array([ 61.004 , 98.75254966, 56.30492184, 22.42814725,
17.3632322 , 12.35086846, 63.27240715, 66.68070233,
15.97457032, 103.84094262, 88.06775226, 29.70389465,
42.01571599, 20.91487684, 51.7145313 , 91.45857175])
Row and column selection features can be combined:
>>> X['Amount'][X['Sector'] == 'Manufacturing'] #get "Amount" entries for rows where "Sector" entry is "Manufacturing"
array([ 98.75254966, 12.35086846, 103.84094262, 20.91487684])
and also combined with assignments:
>>> X['Amount'][X['Sector'] == 'Manufacturing'] = X['Amount'][X['Sector'] == 'Manufacturing'] - 10 #add 10 to all values selected above
>>> X['Amount']
array([ 61.004 , 88.75254966, 56.30492184, 22.42814725,
17.3632322 , 2.35086846, 63.27240715, 66.68070233,
15.97457032, 93.84094262, 88.06775226, 29.70389465,
42.01571599, 10.91487684, 51.7145313 , 91.45857175])
For more information about selection and filtering, see Manipulating Tabular Arrays: Indexing and Filtering.
Numerical columns can be “extracted” into arrays on which NumPy mathematical operations can be performed:
>>> X[['Amount','Population']].extract()
array([[ 6.10040000e+01, 8.59370000e+04],
[ 8.87525497e+01, 3.05000000e+02],
[ 5.63049218e+01, 9.05680000e+04],
[ 2.24281473e+01, 5.24930000e+04],
[ 1.73632322e+01, 8.52800000e+04],
[ 2.35086846e+00, 9.55780000e+04],
[ 6.32724072e+01, 1.18230000e+04],
[ 6.66807023e+01, 3.26690000e+04],
[ 1.59745703e+01, 8.14170000e+04],
[ 9.38409426e+01, 2.62290000e+04],
[ 8.80677523e+01, 6.24540000e+04],
[ 2.97038947e+01, 6.33460000e+04],
[ 4.20157160e+01, 7.27320000e+04],
[ 1.09148768e+01, 2.34550000e+04],
[ 5.17145313e+01, 8.58760000e+04],
[ 9.14585718e+01, 4.36600000e+04]])
>>> print 3*X[['Amount','Population']].extract().transpose() + 2
[[ 1.85012000e+02 2.68257649e+02 1.70914766e+02 6.92844418e+01
5.40896966e+01 9.05260538e+00 1.91817221e+02 2.02042107e+02
4.99237110e+01 2.83522828e+02 2.66203257e+02 9.11116840e+01
1.28047148e+02 3.47446305e+01 1.57143594e+02 2.76375715e+02]
[ 2.57813000e+05 9.17000000e+02 2.71706000e+05 1.57481000e+05
2.55842000e+05 2.86736000e+05 3.54710000e+04 9.80090000e+04
2.44253000e+05 7.86890000e+04 1.87364000e+05 1.90040000e+05
2.18198000e+05 7.03670000e+04 2.57630000e+05 1.30982000e+05]]
There are a huge number of mathematical operations provided by NumPy (see Numpy Functions for a partial listing).
Tabarrays support a variety of “spreadsheet-style” operations, including row/column manipulations, replacements, aggregation, pivoting and joins.
Row & Column Operations:
>>> Y = X.deletecols(['Population']) #create new array with "Population" column deleted
>>> Y.dtype.names
('Region', 'Sector', 'Amount')
>>> Y = X.addrecords(('West','Surfing',10,10000)) #create new array with additional row
>>> print Y[-2:]
[('West', 'Healthcare', 91.458571751676629, 43660)
('West', 'Surfing', 10.0, 10000)]
>>> Importance = numpy.random.randint(10,size=(16,))
>>> Y = X.addcols([Importance],names=['Importance']) #create new array with new "Importance" column
>>> Y.dtype.names
('Region', 'Sector', 'Amount', 'Population', 'Importance')
>>> Y['Importance']
array([3, 9, 1, 3, 4, 8, 0, 9, 1, 7, 6, 6, 7, 4, 3, 6])
The “rowstack” method concatenates arrays with identical column names and data types along the “vertical” (row) axis:
>>> Y = X.rowstack([X,X,X]) #create new array by concatenating X to itself 3 times
>>> len(Y)
64
>>> Y[0]==Y[16]==Y[32]==Y[48]
True
The “colstack” method concatenates arrays with disjoint column names but identical lengths along the “horizontal” (column) axis:
>>> X[['Region','Sector']].colstack([X[['Amount']]])
tabarray([('North', 'Service', 11.568675158203657),
('North', 'Manufacturing', 78.486819271428658),
('North', 'Education', 57.60705040663634),
('North', 'Healthcare', 41.550376687594138),
. . .
('West', 'Manufacturing', 54.932402758880414),
('West', 'Education', 68.727826108077394),
('West', 'Healthcare', 81.726623074753874)],
dtype=[('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8')])
For more information, see Manipulating Tabular Arrays: Row and Column Operations.
sort: The “sort” method provides fast in-place spreadsheet style multi-column sorting. (Manipulating Tabular Arrays: Sorting.)
>>> X.sort(order=['Region','Sector'])
>>> X
[('East', 'Education', 23.097478581186881, 78073)
('East', 'Healthcare', 10.52554515403321, 6751)
('East', 'Manufacturing', 87.72136064848614, 53744)
('East', 'Service', 54.896677290366689, 32747)
('North', 'Education', 57.60705040663634, 53327)
('North', 'Healthcare', 41.550376687594138, 7896)
('North', 'Manufacturing', 78.486819271428658, 39440)
('North', 'Service', 11.568675158203657, 80672)
('South', 'Education', 31.402298595542256, 39531)
('South', 'Healthcare', 79.898317595005963, 70297)
('South', 'Manufacturing', 84.012379396498261, 77165)
('South', 'Service', 62.1631456237237, 70377)
('West', 'Education', 68.727826108077394, 30589)
('West', 'Healthcare', 81.726623074753874, 63049)
('West', 'Manufacturing', 54.932402758880414, 84733)
('West', 'Service', 55.281648099477955, 98990)]
replace: The “replace” method does fast in-place replacements. (Replace.)
>>> X.replace('Education','Taxes')
>>> print X
[('North', 'Service', 20.67236621081404, 54249)
('North', 'Manufacturing', 32.000394550879761, 24839)
('North', 'Taxes', 5.8365701185767671, 97360)
. . .
('West', 'Service', 85.240242206812582, 88514)
('West', 'Manufacturing', 51.78347036293308, 37708)
('West', 'Taxes', 76.319284382925062, 97475)
('West', 'Healthcare', 24.449584044579531, 53317)]
aggregate: The “aggregate” method collapses a data set on unique values in columns specified by the “On” argument, using specified aggregation functions along the other columns (default is to sum).
>>> X.aggregate(On=['Region'], AggFuncDict={'Sector':lambda x : 'AllSectors'}) #create array containing sums of 'Amount' and 'Population' in each region
[('East', 'AllSectors', 278.57970406902308, 323710)
('North', 'AllSectors', 113.09165272461399, 194045)
('South', 'AllSectors', 329.83407710678244, 182472)
('West', 'AllSectors', 237.79258099725024, 277014)]
See Manipulating Tabular Arrays: Aggregation for more information.
pivot: The “pivot” method creates a pivot table relative to the specified column name arguments:
>>> Y = X.pivot('Region','Sector')
>>> print Y.dtype.names
('Region', 'Education_Amount', 'Healthcare_Amount', 'Manufacturing_Amount', 'Service_Amount', 'Education_Population', 'Healthcare_Population', 'Manufacturing_Population', 'Service_Population')
>>> Y[['Region','Healthcare_Amount','Healthcare_Population']]
[('East', 67.330421474002065, 84503) ('North', 54.582321844343419, 17597)
('South', 96.879266479772284, 78169) ('West', 24.449584044579531, 53317)]
See Manipulating Tabular Arrays: Pivot for more information.
join: The “join” method connects two or more tabarrays that share common “key” column(s):
>>> X1 = X[['Region','Sector','Amount']]
>>> X2 = X[['Region','Sector','Population']]
>>> X1.join(X2)
Inferring keycols to be: ['Region', 'Sector']
tabarray([('East', 'Education', 27.300913534209194, 57645),
('East', 'Healthcare', 79.315579226927042, 78858),
('East', 'Manufacturing', 11.550256098748601, 55331),
('East', 'Service', 79.895071543085521, 47919),
('North', 'Education', 74.546045755282648, 79467),
('North', 'Healthcare', 45.973679749219876, 15366),
('North', 'Manufacturing', 43.03577699412331, 46687),
('North', 'Service', 74.066922573923179, 41929),
('South', 'Education', 44.298421981780493, 8654),
('South', 'Healthcare', 2.5495475747373764, 9317),
('South', 'Manufacturing', 96.868543671291746, 72482),
('South', 'Service', 33.599182645022822, 25850),
('West', 'Education', 11.391185565781647, 62572),
('West', 'Healthcare', 56.204438615569586, 51704),
('West', 'Manufacturing', 67.868100963799833, 81922),
('West', 'Service', 48.381200406027581, 21577)],
dtype=[('Region', '|S5'), ('Sector', '|S13'), ('Amount', '<f8'), ('Population', '<i4')])
See Manipulating Tabular Arrays: Join for more information.
Tabular arrays can be saved to various file formats, like Comma-Separated-Value (CSV):
>>> X.saveSV('SampleData.csv')
This writes out the text file ‘SampleData.csv’ with content:
#types=str,str,float,int
#names=Region,Sector,Amount,Population
North,Service,61.003999999999998,85937
North,Manufacturing,88.752549661587452,305
North,Education,56.304921839816174,90568
North,Healthcare,22.428147250824761,52493
South,Service,17.363232195394986,85280
South,Manufacturing,2.3508684593078204,95578
South,Education,63.272407154429423,11823
South,Healthcare,66.680702332536868,32669
East,Service,15.974570317636053,81417
East,Manufacturing,93.840942619447858,26229
East,Education,88.067752261015244,62454
East,Healthcare,29.703894654017773,63346
West,Service,42.015715993438917,72732
West,Manufacturing,10.914876835171071,23455
West,Education,51.714531295305711,85876
West,Healthcare,91.458571751676629,43660
Conversely, tabular arrays can be loaded in from files:
>>> Y = tb.tabarray(SV='SampleData.csv')
>>> all(Y == X)
True
See Input and Output for more information, including on other formats.