biom-format.org

biom-format Table objects



The biom-format project provides rich Table objects to support use of the BIOM file format. The objects encapsulate matrix data (such as OTU counts) and abstract the interaction away from the programmer. This provides the immediate benefit of the programmer not having to worry about what the underlying data object is, and in turn allows for different data representations to be supported. Currently, biom-format supports a dense object built off of numpy.array (NumPy) and a sparse object built off of Python dictionaries.

biom-format table_factory method

Generally, construction of a Table subclass will be through the table_factory method. This method facilitates any necessary data conversions and supports a wide variety of input data types.

biom.table.table_factory(data, sample_ids, observation_ids, sample_metadata=None, observation_metadata=None, table_id=None, constructor=<class 'biom.table.SparseOTUTable'>, **kwargs)

Construct a table

Attempts to make ‘data’ sane with respect to the constructor type through various means of juggling. Data can be:

  • numpy.array
  • list of numpy.array vectors
  • SparseObj representation
  • dict representation
  • list of SparseObj representation vectors
  • list of lists of sparse values [[row, col, value], ...]
  • list of lists of dense values [[value, value, ...], ...]
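The last two list forms describe the same matrix in different ways. The following is a minimal sketch in plain Python, not biom-format code: the helper names dense_to_triples and triples_to_dense are hypothetical and exist only to show how the [[row, col, value], ...] form relates to the [[value, value, ...], ...] form.

```python
# Illustration only: plain-Python stand-ins, not part of biom-format.

def dense_to_triples(dense):
    """Turn [[value, ...], ...] rows into [[row, col, value], ...] for nonzero cells."""
    return [[r, c, v]
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]


def triples_to_dense(triples, n_rows, n_cols):
    """Rebuild the dense rows; cells that are not listed are zero."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        dense[r][c] = v
    return dense


dense = [[0, 5, 0],
         [2, 0, 0]]
triples = dense_to_triples(dense)
print(triples)  # [[0, 1, 5], [1, 0, 2]]
```

The sparse form only lists nonzero cells, so the matrix shape has to be known separately in order to rebuild the dense rows.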

Example usage to create a SparseOTUTable object:

from biom.table import table_factory, SparseOTUTable
from numpy import array

sample_ids = ['s1','s2','s3','s4']
sample_md = [{'pH':4.2,'country':'Peru'},
             {'pH':5.2,'country':'Peru'},
             {'pH':5.0,'country':'Peru'},
             {'pH':4.9,'country':'Peru'}]

observation_ids = ['o1','o2','o3']
observation_md = [{'domain':'Archaea'},
                  {'domain':'Bacteria'},
                  {'domain':'Bacteria'}]

data = array([[1,2,3,4],
              [-1,6,7,8],
              [9,10,11,12]])

t = table_factory(data,
                  sample_ids,
                  observation_ids,
                  sample_md,
                  observation_md,
                  constructor=SparseOTUTable)

Description of available Table objects

There are multiple objects available, but some of them are unofficial abstract base classes (for historical reasons, they do not use the abc module). In practice, the objects used should be derived Tables such as SparseOTUTable or DenseGeneTable.

Abstract base classes

Abstract base classes establish standard interfaces for subclassed types and provide common functionality for derived types.

Table

Table is a container object and an abstract base class that provides a common and required API for subclassed objects. Through the use of private interfaces, it is possible to create public methods that operate on the underlying datatype without having to implement each method in each subclass. For instance, Table.iterSampleData will return a generator that always yields numpy.array vectors for each sample regardless of how the table data is actually stored. This functionality results from derived classes implementing private interfaces, such as Table._conv_to_np.
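The pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: the classes MiniTable, MiniDenseTable and MiniSparseTable and the hook _sample_column are hypothetical, and plain lists stand in for numpy.array vectors.

```python
# Hypothetical classes for illustration; they are not part of biom-format.

class MiniTable(object):
    """Base class: the public method relies on a private hook."""

    def __init__(self, data, n_samples):
        self._data = data
        self._n_samples = n_samples

    def _sample_column(self, index):
        # Private interface; each storage backend supplies its own version.
        raise NotImplementedError

    def iter_sample_data(self):
        # Written once here, works for every backend.
        for index in range(self._n_samples):
            yield self._sample_column(index)


class MiniDenseTable(MiniTable):
    """Stores rows of observations as a list of lists."""

    def _sample_column(self, index):
        return [row[index] for row in self._data]


class MiniSparseTable(MiniTable):
    """Stores only nonzero cells in a {(row, col): value} dict."""

    def __init__(self, data, n_observations, n_samples):
        MiniTable.__init__(self, data, n_samples)
        self._n_observations = n_observations

    def _sample_column(self, index):
        return [self._data.get((row, index), 0)
                for row in range(self._n_observations)]


dense = MiniDenseTable([[1, 0], [0, 3]], n_samples=2)
sparse = MiniSparseTable({(0, 0): 1, (1, 1): 3}, n_observations=2, n_samples=2)
print(list(dense.iter_sample_data()))   # [[1, 0], [0, 3]]
print(list(sparse.iter_sample_data()))  # [[1, 0], [0, 3]]
```

Both backends yield identical vectors through the same public method, which is the benefit the private-interface design aims for.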

class biom.table.Table(Data, SampleIds, ObservationIds, SampleMetadata=None, ObservationMetadata=None, TableId=None, **kwargs)

Abstract base class representing a Table.

Once created, a Table object is immutable except for its sample/observation metadata, which can be modified in place via addSampleMetadata and addObservationMetadata.

Code to simulate immutability taken from here:
http://en.wikipedia.org/wiki/Immutable_object

addObservationMetadata(md)

Take a dict of metadata and add it to an observation.

md should be of the form {observation_id:{dict_of_metadata}}

addSampleMetadata(md)

Take a dict of metadata and add it to a sample.

md should be of the form {sample_id:{dict_of_metadata}}

binObservationsByMetadata(f, constructor=None)

Yields tables by metadata

f is given the observation metadata by row and must return what “bin” the observation is part of.

constructor: the type of binned tables to create, e.g. SparseTaxonTable. If None, the binned tables will be the same type as this table.
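As a rough sketch of what a binning function looks like, the example below groups observation ids by a metadata field. It runs on plain Python lists rather than on a Table, and the function name bin_by_domain and the sample data are assumptions made for illustration; only the shape of f (metadata in, bin out) follows the description above.

```python
from collections import defaultdict

# Hypothetical stand-in data; this does not call biom-format.
observation_ids = ['o1', 'o2', 'o3']
observation_md = [{'domain': 'Archaea'},
                  {'domain': 'Bacteria'},
                  {'domain': 'Bacteria'}]


def bin_by_domain(md):
    """Receive one observation's metadata dict and return its bin."""
    return md['domain']


bins = defaultdict(list)
for obs_id, md in zip(observation_ids, observation_md):
    bins[bin_by_domain(md)].append(obs_id)

print(sorted(bins.items()))  # [('Archaea', ['o1']), ('Bacteria', ['o2', 'o3'])]
```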

binSamplesByMetadata(f, constructor=None)

Yields tables by metadata

f is given the sample metadata by row and must return what “bin” the sample is part of.

constructor: the type of binned tables to create, e.g. SparseTaxonTable. If None, the binned tables will be the same type as this table.

collapseObservationsByMetadata(metadata_f, reduce_f=<built-in function add>, norm=True, min_group_size=2, include_collapsed_metadata=True, constructor=None, one_to_many=False, one_to_many_mode='add', one_to_many_md_key='Path', strict=False)

Collapse observations in a table by observation metadata

Bin observations by metadata then collapse each bin into a single observation.

If include_collapsed_metadata is True, metadata for the collapsed observations are retained and can be referred to by the ObservationId from each observation within the bin.

constructor: the type of collapsed table to create, e.g. SparseTaxonTable. If None, the collapsed table will be the same type as this table.

The remainder is only relevant to setting one_to_many to True.

If one_to_many is True, observations may fall into multiple bins when the metadata describe a one-to-many relationship. Supplied functions must support iteration over the metadata key and must return a tuple of (path, bin), describing both the path in the hierarchy and the specific bin being collapsed into. The uniqueness of a bin is _not_ based on its path but on the name of the bin.

The metadata value for the corresponding collapsed row may include more (or less) information about the collapsed data. For example, if collapsing “KEGG Pathways”, and there are observations that span three pathways A, B, and C, such that observation 1 spans A and B, observation 2 spans B and C and observation 3 spans A and C, the resulting table will contain three collapsed observations:

  • A, containing original observation 1 and 3
  • B, containing original observation 1 and 2
  • C, containing original observation 2 and 3

If an observation maps to the same bin multiple times, it will be counted multiple times.

There are two supported modes for handling one-to-many relationships via one_to_many_mode: add and divide. add will add the observation counts to each bin that the observation maps to, which may increase the total number of counts in the output table. divide will divide an observation’s counts by the number of metadata that the observation has before adding the counts to each bin. This will not increase the total number of counts in the output table.
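The difference between the two modes can be worked through by hand for the three-pathway example above. The sketch below is plain Python for a single sample, not the library's implementation; the counts (10, 6 and 4) and the helper name collapse are made up for illustration.

```python
# Plain-Python illustration of 'add' versus 'divide'; not biom-format code.
counts = {'obs1': 10, 'obs2': 6, 'obs3': 4}   # hypothetical counts in one sample
pathways = {'obs1': ['A', 'B'],
            'obs2': ['B', 'C'],
            'obs3': ['A', 'C']}


def collapse(counts, pathways, mode):
    collapsed = {}
    for obs_id, bins in pathways.items():
        value = counts[obs_id]
        if mode == 'divide':
            # Split the count evenly across the bins the observation maps to.
            value = value / float(len(bins))
        for bin_name in bins:
            collapsed[bin_name] = collapsed.get(bin_name, 0) + value
    return collapsed


added = collapse(counts, pathways, 'add')
divided = collapse(counts, pathways, 'divide')
print(sorted(added.items()))    # [('A', 14), ('B', 16), ('C', 10)]
print(sorted(divided.items()))  # [('A', 7.0), ('B', 8.0), ('C', 5.0)]
```

With add the collapsed total is 40, twice the input total of 20, because every observation is counted once per pathway. With divide the collapsed total stays at 20.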

If one_to_many_md_key is specified, that becomes the metadata key that describes the collapsed path. If a value is not specified, then it defaults to ‘Path’.

If strict is specified, then all metadata pathways operated on must be indexable by metadata_f.

one_to_many and norm are not supported together.

one_to_many and reduce_f are not supported together.

one_to_many and min_group_size are not supported together.

A final note on space consumption: at present, the one_to_many functionality requires a temporary dense matrix representation. This approach was chosen because true sparse support appeared to require a fast __setitem__ on the SparseObj, and at the time of implementation CSMat was O(N) in the number of nonzero elements. It is a workaround until either CSMat gains a better __setitem__ implementation or a hybrid solution that allows for multiple SparseObj types is adopted.

collapseSamplesByMetadata(metadata_f, reduce_f=<built-in function add>, norm=True, min_group_size=2, include_collapsed_metadata=True, constructor=None, one_to_many=False, one_to_many_mode='add', one_to_many_md_key='Path', strict=False)

Collapse samples in a table by sample metadata

Bin samples by metadata then collapse each bin into a single sample.

If include_collapsed_metadata is True, metadata for the collapsed samples are retained and can be referred to by the SampleId from each sample within the bin.

constructor: the type of collapsed table to create, e.g. SparseTaxonTable. If None, the collapsed table will be the same type as this table.

The remainder is only relevant to setting one_to_many to True.

If one_to_many is True, samples may collapse into multiple bins when the metadata describe a one-to-many relationship. Supplied functions must support iteration over the metadata key and must return a tuple of (path, bin), describing both the path in the hierarchy and the specific bin being collapsed into. The uniqueness of a bin is _not_ based on its path but on the name of the bin.

The metadata value for the corresponding collapsed column may include more (or less) information about the collapsed data. For example, if collapsing “FOO”, and there are samples that span three associations A, B, and C, such that sample 1 spans A and B, sample 2 spans B and C and sample 3 spans A and C, the resulting table will contain three collapsed samples:

  • A, containing original sample 1 and 3
  • B, containing original sample 1 and 2
  • C, containing original sample 2 and 3

If a sample maps to the same bin multiple times, it will be counted multiple times.

There are two supported modes for handling one-to-many relationships via one_to_many_mode: add and divide. add will add the sample counts to each bin that the sample maps to, which may increase the total number of counts in the output table. divide will divide a sample’s counts by the number of metadata that the sample has before adding the counts to each bin. This will not increase the total number of counts in the output table.

If one_to_many_md_key is specified, that becomes the metadata key that describes the collapsed path. If a value is not specified, then it defaults to ‘Path’.

If strict is specified, then all metadata pathways operated on must be indexable by metadata_f.

one_to_many and norm are not supported together.

one_to_many and reduce_f are not supported together.

one_to_many and min_group_size are not supported together.

A final note on space consumption: at present, the one_to_many functionality requires a temporary dense matrix representation. This approach was chosen because true sparse support appeared to require a fast __setitem__ on the SparseObj, and at the time of implementation CSMat was O(N) in the number of nonzero elements. It is a workaround until either CSMat gains a better __setitem__ implementation or a hybrid solution that allows for multiple SparseObj types is adopted.

copy()

Returns a copy of the table

delimitedSelf(delim='\t', header_key=None, header_value=None, metadata_formatter=<type 'str'>, observation_column_name='#OTU ID')

Return self as a string in a delimited form

Default str output for the Table is just row/col ids and table data without any metadata

Including observation metadata in output: If header_key is not None, the observation metadata with that name will be included in the delimited output. If header_value is also not None, the observation metadata will use the provided header_value as the observation metadata name (i.e., the column header) in the delimited output.

metadata_formatter: a function which takes a metadata entry and returns a formatted version that should be written to file
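A formatter is just a one-argument function that returns a string. The sketch below shows one possible formatter that joins list-like metadata (for example a taxonomy lineage) with '; '. The function name format_taxonomy and the sample values are assumptions for illustration; the default formatter shown in the signature is simply str.

```python
def format_taxonomy(md_entry):
    """Join list-like metadata with '; '; fall back to str() for anything else."""
    if isinstance(md_entry, (list, tuple)):
        return '; '.join(str(item) for item in md_entry)
    return str(md_entry)


print(format_taxonomy(['k__Bacteria', 'p__Firmicutes']))  # k__Bacteria; p__Firmicutes
print(format_taxonomy(4.2))                               # 4.2
```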

observation_column_name: the name of the first column in the output table, corresponding to the observation IDs. For example, the default will look something like:

#OTU ID    Sample1    Sample2
OTU1       10         2
OTU2       4          8

descriptiveEquality(other)

For use in testing, describe how the tables are not equal

filterObservations(f, invert=False)

Filter observations from self based on f

f must accept three variables, the observation values, observation ID and observation metadata. The function must only return True or False, where True indicates that an observation should be retained.

invert: if invert == True, a return value of True from f indicates that an observation should be discarded
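To make the calling convention concrete, here is a small sketch of a filter function and of how invert flips its meaning. It operates on a plain list of (values, id, metadata) tuples rather than on a Table; the names is_abundant_bacterium and apply_filter, the threshold of 10 and the data are hypothetical.

```python
# Hypothetical stand-in rows: (values, id, metadata), matching the arguments f receives.
observations = [([1, 2, 3, 4], 'o1', {'domain': 'Archaea'}),
                ([0, 0, 1, 0], 'o2', {'domain': 'Bacteria'}),
                ([9, 10, 11, 12], 'o3', {'domain': 'Bacteria'})]


def is_abundant_bacterium(values, obs_id, md):
    """Return True to keep the observation, False to drop it."""
    return md['domain'] == 'Bacteria' and sum(values) >= 10


def apply_filter(rows, f, invert=False):
    # With invert=True, a True result from f means "discard".
    return [obs_id for values, obs_id, md in rows
            if bool(f(values, obs_id, md)) != invert]


print(apply_filter(observations, is_abundant_bacterium))               # ['o3']
print(apply_filter(observations, is_abundant_bacterium, invert=True))  # ['o1', 'o2']
```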

filterSamples(f, invert=False)

Filter samples from self based on f

f must accept three variables, the sample values, sample ID and sample metadata. The function must only return True or False, where True indicates that a sample should be retained.

invert: if invert == True, a return value of True from f indicates that a sample should be discarded

getBiomFormatJsonString(generated_by, direct_io=None)

Returns a JSON string representing the table in BIOM format.

generated_by: a string describing the software used to build the table

If direct_io is not None, the final output is written directly to direct_io during processing.

getBiomFormatObject(generated_by)

Returns a dictionary representing the table in BIOM format.

This dictionary can then be easily converted into a JSON string for serialization.

generated_by: a string describing the software used to build the table

TODO: This method may be very inefficient in terms of memory usage, so it needs to be tested with several large tables to determine if optimizations are necessary or not (i.e. subclassing JSONEncoder, using generators, etc...).

getBiomFormatPrettyPrint(generated_by)

Returns a ‘pretty print’ format of a BIOM file

generated_by: a string describing the software used to build the table

WARNING: This method displays data values in a columnar format and can be misleading.

getObservationIndex(obs_id)

Returns the observation index for observation obs_id

getSampleIndex(samp_id)

Returns the sample index for sample samp_id

getTableDensity()

Defined by subclass

getValueByIds(obs_id, samp_id)

Return value in the matrix corresponding to (obs_id, samp_id)

isEmpty()

Returns True if the table is empty

iterObservationData()

Yields observation values

iterObservations(conv_to_np=True)

Yields (observation_value, observation_id, observation_metadata)

NOTE: will return None in observation_metadata positions if self.ObservationMetadata is set to None

iterSampleData()

Yields sample values

iterSamples(conv_to_np=True)

Yields (sample_value, sample_id, sample_metadata)

NOTE: will return None in sample_metadata positions if self.SampleMetadata is set to None

merge(other, Sample='union', Observation='union', sample_metadata_f=prefer_self, observation_metadata_f=prefer_self)

Merge two tables together

The axes, samples and observations, can be controlled independently. Both can either work on union or intersection.

sample_metadata_f and observation_metadata_f define how to merge metadata between tables. The default is to keep the metadata associated with self if self has metadata, and otherwise to take the metadata from other. These functions are given both metadata dicts and must return a single metadata dict.
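A metadata merge function takes two metadata dicts and returns one. The sketch below mimics the default behaviour described above and adds one alternative policy. Both functions are illustrative stand-ins written for this example, assuming missing metadata is represented as None; they are not the library's own prefer_self.

```python
def prefer_self_sketch(self_md, other_md):
    """Keep self's metadata when present, otherwise fall back to other's."""
    return self_md if self_md is not None else other_md


def union_md_sketch(self_md, other_md):
    """Alternative policy: combine both dicts, with self's values winning on conflicts."""
    merged = dict(other_md or {})
    merged.update(self_md or {})
    return merged


print(prefer_self_sketch({'pH': 4.2}, {'pH': 5.0, 'country': 'Peru'}))  # {'pH': 4.2}
print(prefer_self_sketch(None, {'country': 'Peru'}))                    # {'country': 'Peru'}
```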

NOTE: There is an implicit type conversion to float. Tables using strings as the type are not supported. No check is currently in place.

NOTE: The return type is always that of self

nonzero()

Returns locations of nonzero elements within the data matrix

The values returned are (observation_id, sample_id)

nonzeroCounts(axis, binary=False)

Get nonzero summaries about an axis

axis : either ‘sample’, ‘observation’, or ‘whole’

binary : sum of nonzero entries, or summing the values of the entries

Returns a numpy array in index order to the axis

normObservationByMetadata(obs_metadata_id)

Return new table with vals divided by obs_metadata_id

normObservationBySample()

Return new table with vals as relative abundances within each sample
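Relative abundance within a sample means dividing each value by that sample's total. A minimal plain-Python sketch of the arithmetic follows, assuming rows are observations and columns are samples. The helper name is hypothetical, and returning 0.0 for an all-zero sample is a choice made for this example, not documented library behaviour.

```python
# Rows are observations, columns are samples; not biom-format code.
data = [[1, 3],
        [3, 1],
        [4, 4]]


def relative_abundance_by_sample(rows):
    n_samples = len(rows[0])
    # Total count per sample (per column).
    totals = [sum(row[j] for row in rows) for j in range(n_samples)]
    return [[row[j] / float(totals[j]) if totals[j] else 0.0
             for j in range(n_samples)]
            for row in rows]


normalized = relative_abundance_by_sample(data)
print(normalized)  # [[0.125, 0.375], [0.375, 0.125], [0.5, 0.5]]
```

After normalization, the values in each sample column sum to 1.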

normSampleByObservation()

Return new table with vals as relative abundances within each obs

observationData(id_)

Return samples associated with observation id id_

observationExists(id_)

Returns True if observation id_ exists, False otherwise

reduce(f, axis)

Reduce over axis with f

axis can be either sample or observation

sampleData(id_)

Return observations associated with sample id id_

sampleExists(id_)

Returns True if sample id_ exists, False otherwise

sortByObservationId(sort_f=natsort)

Return a table where observations are sorted by sort_f

sort_f must take a single parameter: the list of observation ids
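A custom sort function only needs to accept the list of ids and, presumably, return them in the desired order. The sketch below is one way to write a natural sort, so that 'o10' comes after 'o2'. It is a stand-in written for illustration, not the natsort function referenced in the signature.

```python
import re


def natural_sort_sketch(ids):
    """Sort ids so that embedded integers compare numerically."""
    def key(item):
        # re.split with a capturing group alternates text and digit chunks,
        # so strings and ints always line up position by position.
        return [int(part) if part.isdigit() else part
                for part in re.split(r'(\d+)', item)]
    return sorted(ids, key=key)


print(natural_sort_sketch(['o10', 'o2', 'o1']))  # ['o1', 'o2', 'o10']
```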

sortBySampleId(sort_f=natsort)

Return a table where samples are sorted by sort_f

sort_f must take a single parameter: the list of sample ids

sortObservationOrder(obs_order)

Return a new table with observations in obs_order

sortSampleOrder(sample_order)

Return a new table with samples in sample_order

sum(axis='whole')

Returns the sum by axis

axis can be:

whole : whole matrix sum

sample : return a vector with a sum for each sample

observation : return a vector with a sum for each observation
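The three axis options can be illustrated on a small matrix. This is a plain-Python sketch, assuming rows are observations and columns are samples; the helper table_sum is hypothetical and returns lists where the library would return vectors.

```python
# Rows are observations, columns are samples; not biom-format code.
data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12]]


def table_sum(rows, axis='whole'):
    if axis == 'whole':
        return sum(sum(row) for row in rows)
    if axis == 'sample':
        # One total per column.
        return [sum(column) for column in zip(*rows)]
    if axis == 'observation':
        # One total per row.
        return [sum(row) for row in rows]
    raise ValueError("unknown axis: %r" % axis)


print(table_sum(data))                 # 78
print(table_sum(data, 'sample'))       # [15, 18, 21, 24]
print(table_sum(data, 'observation'))  # [10, 26, 42]
```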

transformObservations(f)

Iterate over observations, applying a function f to each value

f must take three values: an observation value (int or float), an observation id, and an observation metadata entry, and return a single value (int or float) that replaces the provided observation value

transformSamples(f)

Iterate over samples, applying a function f to each value

f must take three values: a sample value (int or float), a sample id, and a sample metadata entry, and return a single value (int or float) that replaces the provided sample value

transpose()

Return a new table that is the transpose of this table.

The returned table will be an entirely new table, including copies of the (transposed) data, sample/observation IDs and metadata.

OTUTable

The OTUTable base class provides functionality specific to OTU tables. Currently, it only provides a static private member variable that describes its BIOM type. This object was stubbed out in case future methods are developed that do not make sense within the context of, say, an MG-RAST metagenomic abundance table. It is advised to always use an object that subclasses OTUTable if the analysis is on OTU data.

class biom.table.OTUTable

OTU table abstract class

PathwayTable

A table type to represent gene pathways.

class biom.table.PathwayTable

Pathway table abstract class

FunctionTable

A table type to represent gene functions.

class biom.table.FunctionTable

Function table abstract class

OrthologTable

A table type to represent gene orthologs.

class biom.table.OrthologTable

Ortholog table abstract class

GeneTable

A table type to represent genes.

class biom.table.GeneTable

Gene table abstract class

MetaboliteTable

A table type to represent metabolite profiles.

class biom.table.MetaboliteTable

Metabolite table abstract class

TaxonTable

A table type to represent taxonomies.

class biom.table.TaxonTable

Taxon table abstract class

Container classes

The container classes implement required private member variable interfaces as defined by the Table abstract base class. Specifically, these objects define the ways in which data is moved into and out of the contained data object. These are fully functional and usable objects; however, they do not implement table-type-specific functionality.

SparseTable

The SparseTable subclass can itself be subclassed for use with table data. This object implements all of the required private interfaces specified by the Table base class. The object contains a _data private member variable that is an instance of the current sparse backend. It is advised to use objects derived from SparseTable if the data being operated on is sparse.

class biom.table.SparseTable(*args, **kwargs)

getTableDensity()

Returns the fraction of nonzero elements in the table.
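Density here is the number of nonzero cells divided by the total number of cells. The following plain-Python sketch shows the calculation on a list of rows. The helper name table_density is hypothetical, and returning 0.0 for an empty matrix is an assumption made for this example.

```python
def table_density(rows):
    """Fraction of cells that are nonzero; 0.0 for an empty matrix."""
    n_cells = sum(len(row) for row in rows)
    if n_cells == 0:
        return 0.0
    n_nonzero = sum(1 for row in rows for value in row if value != 0)
    return n_nonzero / float(n_cells)


print(table_density([[0, 5, 0, 0],
                     [2, 0, 0, 0]]))  # 0.25
```

A low density like this one is the case where a sparse representation saves the most memory.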

DenseTable

The DenseTable object fulfills all private member methods stubbed out by the Table base class. The dense table contains a private member variable that is an instance of numpy.array. The array object is a matrix that contains all values, including zeros. It is advised to use this table only if the number of samples and observations is reasonable. Unfortunately, it is hard to define "reasonable" in this context. However, if either the number of observations or the number of samples exceeds 1000, it is probably a good idea to rely on a SparseTable instead.

class biom.table.DenseTable(*args, **kwargs)

getTableDensity()

Returns the fraction of nonzero elements in the table.

Table type objects

The table type objects define variables and methods specific to a table type. These inherit from a Container Class and a table type base class, and are therefore instantiable. Generally you’ll instantiate tables with biom.table.table_factory, and one of these will be passed as the constructor argument.

DenseOTUTable

class biom.table.DenseOTUTable(*args, **kwargs)

Instantiatable dense OTU table

SparseOTUTable

class biom.table.SparseOTUTable(*args, **kwargs)

Instantiatable sparse OTU table

DensePathwayTable

class biom.table.DensePathwayTable(*args, **kwargs)

Instantiatable dense pathway table

SparsePathwayTable

class biom.table.SparsePathwayTable(*args, **kwargs)

Instantiatable sparse pathway table

DenseFunctionTable

class biom.table.DenseFunctionTable(*args, **kwargs)

Instantiatable dense function table

SparseFunctionTable

class biom.table.SparseFunctionTable(*args, **kwargs)

Instantiatable sparse function table

DenseOrthologTable

class biom.table.DenseOrthologTable(*args, **kwargs)

Instantiatable dense ortholog table

SparseOrthologTable

class biom.table.SparseOrthologTable(*args, **kwargs)

Instantiatable sparse ortholog table

DenseGeneTable

class biom.table.DenseGeneTable(*args, **kwargs)

Instantiatable dense gene table

SparseGeneTable

class biom.table.SparseGeneTable(*args, **kwargs)

Instantiatable sparse gene table

DenseMetaboliteTable

class biom.table.DenseMetaboliteTable(*args, **kwargs)

Instantiatable dense metabolite table

SparseMetaboliteTable

class biom.table.SparseMetaboliteTable(*args, **kwargs)

Instantiatable sparse metabolite table
