biom-format Table objects¶
The biom-format project provides rich Table objects to support use of the BIOM file format. The objects encapsulate matrix data (such as OTU counts) and abstract the interaction away from the programmer. This provides the immediate benefit of the programmer not having to worry about what the underlying data object is, and in turn allows for different data representations to be supported. Currently, biom-format supports a dense object built off of numpy.array (NumPy) and a sparse object built off of Python dictionaries.
biom-format table_factory method¶
Generally, construction of a Table subclass will be through the table_factory method. This method facilitates any necessary data conversions and supports a wide variety of input data types.
- biom.table.table_factory(data, sample_ids, observation_ids, sample_metadata=None, observation_metadata=None, table_id=None, constructor=<class 'biom.table.SparseOTUTable'>, **kwargs)¶
Construct a table
Attempts to make ‘data’ sane with respect to the constructor type through various means of juggling. Data can be:
- numpy.array
- list of numpy.array vectors
- SparseObj representation
- dict representation
- list of SparseObj representation vectors
- list of lists of sparse values [[row, col, value], ...]
- list of lists of dense values [[value, value, ...], ...]
Example usage to create a SparseOTUTable object:
    from biom.table import table_factory, SparseOTUTable
    from numpy import array

    sample_ids = ['s1','s2','s3','s4']
    sample_md = [{'pH':4.2,'country':'Peru'},
                 {'pH':5.2,'country':'Peru'},
                 {'pH':5.0,'country':'Peru'},
                 {'pH':4.9,'country':'Peru'}]

    observation_ids = ['o1','o2','o3']
    observation_md = [{'domain':'Archaea'},
                      {'domain':'Bacteria'},
                      {'domain':'Bacteria'}]

    data = array([[1,2,3,4],
                  [-1,6,7,8],
                  [9,10,11,12]])

    t = table_factory(data, sample_ids, observation_ids, sample_md,
                      observation_md, constructor=SparseOTUTable)
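The two list-of-lists input forms listed above describe the same kind of matrix in different ways. The following is a minimal sketch of that relationship in plain Python; `sparse_triples_to_dense` is a hypothetical helper written for illustration and is not part of biom-format. It assumes rows are observations and columns are samples, as in the example above.

```python
def sparse_triples_to_dense(triples, n_rows, n_cols):
    """Expand [[row, col, value], ...] into a dense list of row lists."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in triples:
        dense[row][col] = value
    return dense


# 3 observations (rows) x 4 samples (columns); zero entries are omitted
triples = [[0, 0, 1], [0, 3, 4], [1, 1, 6], [2, 2, 11]]
dense = sparse_triples_to_dense(triples, n_rows=3, n_cols=4)
```

Either `triples` or `dense` corresponds to one of the accepted list-of-lists forms; which one is cheaper depends on how many entries are zero.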
Description of available Table objects¶
There are multiple objects available, but some of them are unofficial abstract base classes (for historical reasons, they do not use the abc module). In practice, the objects used should be the derived Tables such as SparseOTUTable or DenseGeneTable.
Abstract base classes¶
Abstract base classes establish standard interfaces for subclassed types and provide common functionality for derived types.
Table¶
Table is a container object and an abstract base class that provides a common and required API for subclassed objects. Through the use of private interfaces, it is possible to create public methods that operate on the underlying datatype without having to implement each method in each subclass. For instance, Table.iterSampleData will return a generator that always yields numpy.array vectors for each sample regardless of how the table data is actually stored. This functionality results from derived classes implementing private interfaces, such as Table._conv_to_np.
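The private-interface pattern described above can be sketched in a few lines. This is an illustration only: `MiniTable`, `DictBacked`, `ListBacked`, and the `_column` hook are hypothetical names, not biom-format classes, and plain lists stand in for numpy.array vectors.

```python
class MiniTable(object):
    """Hypothetical stand-in for a base class; not biom.table.Table."""

    def __init__(self, data, n_obs, n_samples):
        self._data = data
        self.shape = (n_obs, n_samples)

    def _column(self, col):
        # Private hook: each storage-specific subclass must provide this.
        raise NotImplementedError

    def iter_sample_data(self):
        # Public method written once, in terms of the private hook.
        for col in range(self.shape[1]):
            yield self._column(col)


class DictBacked(MiniTable):
    """Stores only nonzero values in a {(row, col): value} dict."""

    def _column(self, col):
        return [self._data.get((row, col), 0) for row in range(self.shape[0])]


class ListBacked(MiniTable):
    """Stores every value in a list of row lists."""

    def _column(self, col):
        return [row[col] for row in self._data]


sparse = DictBacked({(0, 0): 5, (1, 1): 7}, 2, 2)
dense = ListBacked([[5, 0], [0, 7]], 2, 2)
```

Both objects yield the same per-sample vectors from `iter_sample_data`, even though only the private hook differs between them.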
- class biom.table.Table(Data, SampleIds, ObservationIds, SampleMetadata=None, ObservationMetadata=None, TableId=None, **kwargs)¶
Abstract base class representing a Table.
Once created, a Table object is immutable except for its sample/observation metadata, which can be modified in place via addSampleMetadata and addObservationMetadata.
- Code to simulate immutability taken from here:
- http://en.wikipedia.org/wiki/Immutable_object
- addObservationMetadata(md)¶
Take a dict of metadata and add it to an observation.
md should be of the form {observation_id:{dict_of_metadata}}
- addSampleMetadata(md)¶
Take a dict of metadata and add it to a sample.
md should be of the form {sample_id:{dict_of_metadata}}
- binObservationsByMetadata(f, constructor=None)¶
Yields tables by metadata
f is given the observation metadata by row and must return what “bin” the observation is part of.
constructor: the type of binned tables to create, e.g. SparseTaxonTable. If None, the binned tables will be the same type as this table.
- binSamplesByMetadata(f, constructor=None)¶
Yields tables by metadata
f is given the sample metadata by row and must return what “bin” the sample is part of.
constructor: the type of binned tables to create, e.g. SparseTaxonTable. If None, the binned tables will be the same type as this table.
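The binning step can be pictured as grouping IDs by whatever value f returns for each metadata entry. The sketch below uses plain Python and a hypothetical helper, `bin_ids_by_metadata`; it groups IDs only and does not build tables, so it is an approximation of the idea rather than the library's behavior.

```python
from collections import defaultdict


def bin_ids_by_metadata(ids, metadata, f):
    """Group ids by the bin that f assigns to each metadata dict."""
    bins = defaultdict(list)
    for id_, md in zip(ids, metadata):
        bins[f(md)].append(id_)
    return dict(bins)


sample_ids = ['s1', 's2', 's3', 's4']
sample_md = [{'pH': 4.2}, {'pH': 5.2}, {'pH': 5.0}, {'pH': 4.9}]


def ph_bin(md):
    # The bin label can be any hashable value.
    return 'below_5' if md['pH'] < 5.0 else 'at_or_above_5'


bins = bin_ids_by_metadata(sample_ids, sample_md, ph_bin)
```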
- collapseObservationsByMetadata(metadata_f, reduce_f=<built-in function add>, norm=True, min_group_size=2, include_collapsed_metadata=True, constructor=None, one_to_many=False, one_to_many_mode='add', one_to_many_md_key='Path', strict=False)¶
Collapse observations in a table by observation metadata
Bin observations by metadata then collapse each bin into a single observation.
If include_collapsed_metadata is True, metadata for the collapsed observations are retained and can be referred to by the ObservationId from each observation within the bin.
constructor: the type of collapsed table to create, e.g. SparseTaxonTable. If None, the collapsed table will be the same type as this table.
The remainder is only relevant to setting one_to_many to True.
If one_to_many is True, allow observations to fall into multiple bins if the metadata describe a one-many relationship. Supplied functions must allow for iteration support over the metadata key and must return a tuple of (path, bin) as to describe both the path in the hierarchy represented and the specific bin being collapsed into. The uniqueness of the bin is _not_ based on the path but by the name of the bin.
The metadata value for the corresponding collapsed row may include more (or less) information about the collapsed data. For example, if collapsing “KEGG Pathways”, and there are observations that span three pathways A, B, and C, such that observation 1 spans A and B, observation 2 spans B and C and observation 3 spans A and C, the resulting table will contain three collapsed observations:
- A, containing original observation 1 and 3
- B, containing original observation 1 and 2
- C, containing original observation 2 and 3
If an observation maps to the same bin multiple times, it will be counted multiple times.
There are two supported modes for handling one-to-many relationships via one_to_many_mode: add and divide. add will add the observation counts to each bin that the observation maps to, which may increase the total number of counts in the output table. divide will divide an observation’s counts by the number of metadata that the observation has before adding the counts to each bin. This will not increase the total number of counts in the output table.
If one_to_many_md_key is specified, that becomes the metadata key that describes the collapsed path. If a value is not specified, then it defaults to ‘Path’.
If strict is specified, then all metadata pathways operated on must be indexable by metadata_f.
one_to_many and norm are not supported together.
one_to_many and reduce_f are not supported together.
one_to_many and min_group_size are not supported together.
A final note on space consumption. At present, the one_to_many functionality requires a temporary dense matrix representation. This approach was chosen because true support appears to require rapid __setitem__ functionality on the SparseObj, and at the time of implementation, CSMat was O(N) in the number of nonzero elements. This is a workaround until either a better __setitem__ implementation is in place on CSMat or a hybrid solution that allows for multiple SparseObj types is used.
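The difference between the add and divide modes is easiest to see with numbers. The following sketch reuses the three-pathway example above for a single sample; `collapse_one_to_many` is a hypothetical function operating on plain dicts, and the counts are made up for illustration.

```python
def collapse_one_to_many(counts, bins_of, mode='add'):
    """Collapse per-observation counts into bins for one sample."""
    collapsed = {}
    for obs_id, count in counts.items():
        targets = bins_of[obs_id]
        if mode == 'divide':
            # Split the count evenly across the bins it maps to.
            share = count / float(len(targets))
        else:
            # 'add': every bin receives the full count.
            share = count
        for bin_name in targets:
            collapsed[bin_name] = collapsed.get(bin_name, 0) + share
    return collapsed


counts = {'obs1': 2, 'obs2': 4, 'obs3': 6}
bins_of = {'obs1': ['A', 'B'], 'obs2': ['B', 'C'], 'obs3': ['A', 'C']}

added = collapse_one_to_many(counts, bins_of, mode='add')
divided = collapse_one_to_many(counts, bins_of, mode='divide')
```

With these inputs, add doubles the total (12 becomes 24) because each observation is counted in two bins, while divide preserves the total of 12.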
- collapseSamplesByMetadata(metadata_f, reduce_f=<built-in function add>, norm=True, min_group_size=2, include_collapsed_metadata=True, constructor=None, one_to_many=False, one_to_many_mode='add', one_to_many_md_key='Path', strict=False)¶
Collapse samples in a table by sample metadata
Bin samples by metadata then collapse each bin into a single sample.
If include_collapsed_metadata is True, metadata for the collapsed samples are retained and can be referred to by the SampleId from each sample within the bin.
constructor: the type of collapsed table to create, e.g. SparseTaxonTable. If None, the collapsed table will be the same type as this table.
The remainder is only relevant to setting one_to_many to True.
If one_to_many is True, allow samples to collapse into multiple bins if the metadata describe a one-many relationship. Supplied functions must allow for iteration support over the metadata key and must return a tuple of (path, bin) as to describe both the path in the hierarchy represented and the specific bin being collapsed into. The uniqueness of the bin is _not_ based on the path but by the name of the bin.
The metadata value for the corresponding collapsed column may include more (or less) information about the collapsed data. For example, if collapsing “FOO”, and there are samples that span three associations A, B, and C, such that sample 1 spans A and B, sample 2 spans B and C and sample 3 spans A and C, the resulting table will contain three collapsed samples:
- A, containing original sample 1 and 3
- B, containing original sample 1 and 2
- C, containing original sample 2 and 3
If a sample maps to the same bin multiple times, it will be counted multiple times.
There are two supported modes for handling one-to-many relationships via one_to_many_mode: add and divide. add will add the sample counts to each bin that the sample maps to, which may increase the total number of counts in the output table. divide will divide a sample’s counts by the number of metadata that the sample has before adding the counts to each bin. This will not increase the total number of counts in the output table.
If one_to_many_md_key is specified, that becomes the metadata key that describes the collapsed path. If a value is not specified, then it defaults to ‘Path’.
If strict is specified, then all metadata pathways operated on must be indexable by metadata_f.
one_to_many and norm are not supported together.
one_to_many and reduce_f are not supported together.
one_to_many and min_group_size are not supported together.
A final note on space consumption. At present, the one_to_many functionality requires a temporary dense matrix representation. This approach was chosen because true support appears to require rapid __setitem__ functionality on the SparseObj, and at the time of implementation, CSMat was O(N) in the number of nonzero elements. This is a workaround until either a better __setitem__ implementation is in place on CSMat or a hybrid solution that allows for multiple SparseObj types is used.
- copy()¶
Returns a copy of the table
- delimitedSelf(delim='\t', header_key=None, header_value=None, metadata_formatter=<type 'str'>, observation_column_name='#OTU ID')¶
Return self as a string in a delimited form
Default str output for the Table is just row/col ids and table data without any metadata
Including observation metadata in output: If header_key is not None, the observation metadata with that name will be included in the delimited output. If header_value is also not None, the observation metadata will use the provided header_value as the observation metadata name (i.e., the column header) in the delimited output.
metadata_formatter: a function which takes a metadata entry and returns a formatted version that should be written to file
observation_column_name: the name of the first column in the output table, corresponding to the observation IDs. For example, the default will look something like:
#OTU ID    Sample1    Sample2
OTU1       10         2
OTU2       4          8
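As a rough picture of the layout, the sketch below builds a delimited string from IDs and values. It assumes a tab delimiter and omits metadata; `delimited` is a hypothetical helper, and the real method may emit additional header lines that are not shown here.

```python
def delimited(sample_ids, observation_ids, rows, delim='\t',
              observation_column_name='#OTU ID'):
    """Return a delimited string with one line per observation."""
    lines = [delim.join([observation_column_name] + list(sample_ids))]
    for obs_id, row in zip(observation_ids, rows):
        lines.append(delim.join([obs_id] + [str(value) for value in row]))
    return '\n'.join(lines)


text = delimited(['Sample1', 'Sample2'], ['OTU1', 'OTU2'], [[10, 2], [4, 8]])
```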
- descriptiveEquality(other)¶
For use in testing, describe how the tables are not equal
- filterObservations(f, invert=False)¶
Filter observations from self based on f
f must accept three variables, the observation values, observation ID and observation metadata. The function must only return True or False, where True indicates that an observation should be retained.
invert: if invert == True, a return value of True from f indicates that an observation should be discarded
- filterSamples(f, invert=False)¶
Filter samples from self based on f
f must accept three variables, the sample values, sample ID and sample metadata. The function must only return True or False, where True indicates that a sample should be retained.
invert: if invert == True, a return value of True from f indicates that a sample should be discarded
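The contract for f and invert can be sketched as follows. This is plain Python over (values, id, metadata) tuples; `filter_samples` is a hypothetical helper that returns the retained IDs rather than a table.

```python
def filter_samples(samples, f, invert=False):
    """Return ids of samples retained by f (or discarded, if inverted)."""
    kept = []
    for values, id_, md in samples:
        # With invert=False keep when f is True; with invert=True keep when False.
        if bool(f(values, id_, md)) != invert:
            kept.append(id_)
    return kept


samples = [([1, -1, 9], 's1', {'pH': 4.2}),
           ([2, 6, 10], 's2', {'pH': 5.2}),
           ([3, 7, 11], 's3', {'pH': 5.0}),
           ([4, 8, 12], 's4', {'pH': 4.9})]


def ph_at_least_5(values, id_, md):
    return md['pH'] >= 5.0


retained = filter_samples(samples, ph_at_least_5)
discarded_view = filter_samples(samples, ph_at_least_5, invert=True)
```

The same signature applies to observation filtering, with observation values, IDs, and metadata in place of the sample ones.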
- getBiomFormatJsonString(generated_by, direct_io=None)¶
Returns a JSON string representing the table in BIOM format.
generated_by: a string describing the software used to build the table
If direct_io is not None, the final output is written directly to direct_io during processing.
- getBiomFormatObject(generated_by)¶
Returns a dictionary representing the table in BIOM format.
This dictionary can then be easily converted into a JSON string for serialization.
generated_by: a string describing the software used to build the table
TODO: This method may be very inefficient in terms of memory usage, so it needs to be tested with several large tables to determine if optimizations are necessary or not (i.e. subclassing JSONEncoder, using generators, etc...).
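Converting such a dictionary to a JSON string needs only the standard library. In the sketch below, `table_dict` is a hand-written, heavily abbreviated stand-in for the dictionary the method returns; the keys shown are an assumption about a small subset of the BIOM fields and do not form a complete BIOM document.

```python
import json

# Abbreviated, hand-written stand-in; not the output of getBiomFormatObject.
table_dict = {
    'id': None,
    'type': 'OTU table',
    'generated_by': 'example script',
    'matrix_type': 'sparse',
    'shape': [2, 2],
    'data': [[0, 0, 10], [1, 1, 8]],
}

json_string = json.dumps(table_dict)
round_tripped = json.loads(json_string)
```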
- getBiomFormatPrettyPrint(generated_by)¶
Returns a ‘pretty print’ format of a BIOM file
generated_by: a string describing the software used to build the table
WARNING: This method displays data values in a columnar format and can be misleading.
- getObservationIndex(obs_id)¶
Returns the observation index for observation obs_id
- getSampleIndex(samp_id)¶
Returns the sample index for sample samp_id
- getTableDensity()¶
Defined by subclass
- getValueByIds(obs_id, samp_id)¶
Return value in the matrix corresponding to (obs_id, samp_id)
- isEmpty()¶
Returns True if the table is empty
- iterObservationData()¶
Yields observation values
- iterObservations(conv_to_np=True)¶
Yields (observation_value, observation_id, observation_metadata)
NOTE: will return None in observation_metadata positions if self.ObservationMetadata is set to None
- iterSampleData()¶
Yields sample values
- iterSamples(conv_to_np=True)¶
Yields (sample_value, sample_id, sample_metadata)
NOTE: will return None in sample_metadata positions if self.SampleMetadata is set to None
- merge(other, Sample='union', Observation='union', sample_metadata_f=<function prefer_self>, observation_metadata_f=<function prefer_self>)¶
Merge two tables together
The axes, samples and observations, can be controlled independently. Both can either work on union or intersection.
sample_metadata_f and observation_metadata_f define how to merge metadata between tables. The default is to keep the metadata associated with self if self has metadata; otherwise the metadata from other is used. These functions are given both metadata dicts and must return a single metadata dict.
NOTE: There is an implicit type conversion to float. Tables using strings as the type are not supported. No check is currently in place.
NOTE: The return type is always that of self
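How the union and intersection options affect one axis can be sketched with ID lists alone. Both helpers below are hypothetical: `merged_ids` makes an assumption about ordering (IDs from self first) that the library may not share, and `prefer_self_md` only mirrors the default metadata behavior described above.

```python
def merged_ids(self_ids, other_ids, how='union'):
    """Combine two id lists along one axis."""
    if how == 'intersection':
        other = set(other_ids)
        return [id_ for id_ in self_ids if id_ in other]
    # 'union': keep self's ids, then append ids only found in other.
    seen = set(self_ids)
    return list(self_ids) + [id_ for id_ in other_ids if id_ not in seen]


def prefer_self_md(self_md, other_md):
    """Keep self's metadata when present, otherwise fall back to other's."""
    return self_md if self_md is not None else other_md


union_ids = merged_ids(['s1', 's2'], ['s2', 's3'], how='union')
shared_ids = merged_ids(['s1', 's2'], ['s2', 's3'], how='intersection')
```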
- nonzero()¶
Returns locations of nonzero elements within the data matrix
The values returned are (observation_id, sample_id)
- nonzeroCounts(axis, binary=False)¶
Get nonzero summaries about an axis
axis : either ‘sample’, ‘observation’, or ‘whole’
binary : sum of nonzero entries, or summing the values of the entries
Returns a numpy array in index order to the axis
- normObservationByMetadata(obs_metadata_id)¶
Return new table with vals divided by obs_metadata_id
- normObservationBySample()¶
Return new table with vals as relative abundances within each sample
- normSampleByObservation()¶
Return new table with vals as relative abundances within each obs
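Relative abundance within each sample means dividing every value by its sample's total. The sketch below shows that arithmetic on a list of observation rows; `relative_abundance_within_samples` is a hypothetical helper and assumes every sample has a nonzero total.

```python
def relative_abundance_within_samples(rows):
    """Divide each value by the total of its column (sample)."""
    n_cols = len(rows[0])
    totals = [sum(row[col] for row in rows) for col in range(n_cols)]
    return [[row[col] / float(totals[col]) for col in range(n_cols)]
            for row in rows]


# 2 observations (rows) x 2 samples (columns)
rows = [[1, 2],
        [3, 6]]
normalized = relative_abundance_within_samples(rows)
```

Normalizing within each observation is the same calculation applied to row totals instead of column totals.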
- observationData(id_)¶
Return samples associated with observation id id_
- observationExists(id_)¶
Returns True if observation id_ exists, False otherwise
- reduce(f, axis)¶
Reduce over axis with f
axis can be either sample or observation
- sampleData(id_)¶
Return observations associated with sample id id_
- sampleExists(id_)¶
Returns True if sample id_ exists, False otherwise
- sortByObservationId(sort_f=<function natsort>)¶
Return a table where observations are sorted by sort_f
sort_f must take a single parameter: the list of observation ids
- sortBySampleId(sort_f=<function natsort>)¶
Return a table where samples are sorted by sort_f
sort_f must take a single parameter: the list of sample ids
- sortObservationOrder(obs_order)¶
Return a new table with observations in obs_order
- sortSampleOrder(sample_order)¶
Return a new table with samples in sample_order
- sum(axis='whole')¶
Returns the sum by axis
axis can be:
whole : whole matrix sum
sample : return a vector with a sum for each sample
observation : return a vector with a sum for each observation
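The three axis options can be sketched on a list of observation rows, where rows are observations and columns are samples. `table_sum` is a hypothetical helper that returns plain lists rather than numpy vectors.

```python
def table_sum(rows, axis='whole'):
    """Sum a list of observation rows over the requested axis."""
    if axis == 'whole':
        return sum(sum(row) for row in rows)
    if axis == 'sample':
        # One total per column.
        return [sum(col) for col in zip(*rows)]
    if axis == 'observation':
        # One total per row.
        return [sum(row) for row in rows]
    raise ValueError("axis must be 'whole', 'sample', or 'observation'")


data = [[1, 2, 3, 4],
        [-1, 6, 7, 8],
        [9, 10, 11, 12]]

whole = table_sum(data)
per_sample = table_sum(data, 'sample')
per_observation = table_sum(data, 'observation')
```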
- transformObservations(f)¶
Iterate over observations, applying a function f to each value
f must take three values: an observation value (int or float), an observation id, and an observation metadata entry, and return a single value (int or float) that replaces the provided observation value
- transformSamples(f)¶
Iterate over samples, applying a function f to each value
f must take three values: a sample value (int or float), a sample id, and a sample metadata entry, and return a single value (int or float) that replaces the provided sample value
- transpose()¶
Return a new table that is the transpose of this table.
The returned table will be an entirely new table, including copies of the (transposed) data, sample/observation IDs and metadata.
OTUTable¶
The OTUTable base class provides functionality specific to OTU tables. Currently, it only provides a static private member variable that describes its BIOM type. This object was stubbed out in case future methods are developed that do not make sense in the context of, say, an MG-RAST metagenomic abundance table. It is advised to always use an object that subclasses OTUTable if the analysis is on OTU data.
- class biom.table.OTUTable¶
OTU table abstract class
PathwayTable¶
A table type to represent gene pathways.
- class biom.table.PathwayTable¶
Pathway table abstract class
FunctionTable¶
A table type to represent gene functions.
- class biom.table.FunctionTable¶
Function table abstract class
OrthologTable¶
A table type to represent gene orthologs.
- class biom.table.OrthologTable¶
Ortholog table abstract class
Container classes¶
The container classes implement required private member variable interfaces as defined by the Table abstract base class. Specifically, these objects define the ways in which data is moved into and out of the contained data object. These are fully functional and usable objects; however, they do not implement table-type-specific functionality.
SparseTable¶
The subclass SparseTable can be derived for use with table data. This object implements all of the required private interfaces specified by the Table base class. The object contains a _data private member variable that is an instance of the current sparse backend. It is advised to use derived objects of SparseTable if the data being operated on is sparse.
DenseTable¶
The DenseTable object fulfills all private member methods stubbed out by the Table base class. The dense table contains a private member variable that is an instance of numpy.array. The array object is a matrix that contains all values including zeros. It is advised to use this table only if the number of samples and observations is reasonable. Unfortunately, it is hard to define “reasonable” precisely in this context. However, if either the number of observations or the number of samples exceeds 1000, it would probably be a good idea to rely on a SparseTable.
Table type objects¶
The table type objects define variables and methods specific to a table type. These inherit from a Container Class and a table type base class, and are therefore instantiable. Generally you’ll instantiate tables with biom.table.table_factory, and one of these will be passed as the constructor argument.
DensePathwayTable¶
- class biom.table.DensePathwayTable(*args, **kwargs)¶
Instantiatable dense pathway table
SparsePathwayTable¶
- class biom.table.SparsePathwayTable(*args, **kwargs)¶
Instantiatable sparse pathway table
DenseFunctionTable¶
- class biom.table.DenseFunctionTable(*args, **kwargs)¶
Instantiatable dense function table
SparseFunctionTable¶
- class biom.table.SparseFunctionTable(*args, **kwargs)¶
Instantiatable sparse function table
DenseOrthologTable¶
- class biom.table.DenseOrthologTable(*args, **kwargs)¶
Instantiatable dense ortholog table
SparseOrthologTable¶
- class biom.table.SparseOrthologTable(*args, **kwargs)¶
Instantiatable sparse ortholog table
SparseGeneTable¶
- class biom.table.SparseGeneTable(*args, **kwargs)¶
Instantiatable sparse gene table