
data Package

Provides specific functionality for data–driven activities, such as data subset generation, categorizers based on data properties, selectors, and so on.

Null Module

Provides functionality for ‘null’ data–driven activities. Null activities are neutral with respect to their arguments and simply pass them through unchanged.

class kdvs.fw.impl.data.Null.NullCategorizer(ID='NullCategorizer', null_category='__all__')

Bases: kdvs.fw.Categorizer.Categorizer

Null categorizer that uses a single virtual ‘category’: it passes each data subset through without checking it and marks it with that category. Useful as a placeholder and at the top of a categorizer hierarchy, where the more specialized work is done by more specialized categorizers further down.

Parameters :

ID : string

identifier of this categorizer; can be skipped to use default name

null_category : string

name of single virtual category that null categorizer marks all subsets with; can be skipped to use default name

class kdvs.fw.impl.data.Null.NullOrderer

Bases: kdvs.fw.Categorizer.Orderer

Null orderer that simply returns the given iterable without actually ordering it. Useful as a placeholder and at lower levels of a categorizer hierarchy, where the ordering has already been performed upstream.

build(iterable)

‘Build’ the order by simply keeping the input iterable as is.

order()

Return the order by simply returning the input iterable.
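
As an illustration, a minimal usage sketch of both null classes is given below; the concept identifiers (‘PKC-A’, ‘PKC-B’, ‘PKC-C’) are invented for the example, and only the constructors and the build()/order() calls documented above are used.

    from kdvs.fw.impl.data.Null import NullCategorizer, NullOrderer

    # NullCategorizer marks every subset with the same virtual category
    # ('__all__' by default); no inspection of the subset takes place.
    categorizer = NullCategorizer()

    # NullOrderer keeps whatever order the input iterable already has.
    orderer = NullOrderer()
    orderer.build(['PKC-A', 'PKC-B', 'PKC-C'])  # hypothetical concept IDs
    print(orderer.order())                      # same iterable, unchanged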

PKDrivenData Module

Provides specialized functionality for data–driven activities. This includes a concrete producer of data subsets that follows the philosophy of carving smaller subsets out of the primary data set according to prior knowledge. The subset producer also actively categorizes the produced subsets with the given categorizer(s).

class kdvs.fw.impl.data.PKDrivenData.PKDrivenDataManager

Bases: object

Base class for data–driven subset producers. A concrete subclass needs to implement the getSubset() method, which returns a single DataSet instance for a single specified prior knowledge concept; the implementation shall create the subset according to prior knowledge information. It may also re–implement the categorizeSubset() method if necessary.

By default, the constructor does nothing.

getSubset(*args, **kwargs)
static categorizeSubset(subset_inst, subsetCategorizer)

Categorize the input DataSet using the categories from the specified categorizer.

Parameters :

subset_inst : DataSet

instance of a subset to be categorized

subsetCategorizer : Categorizer

instance of Categorizer to be used

Returns :

category : string

category that has been associated by categorizer to input data subset
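
For illustration, a minimal sketch of calling the static helper is shown below; subset_ds is assumed to be a DataSet instance produced elsewhere (e.g. by the database–backed manager described next), and NullCategorizer is used only to keep the example self-contained.

    from kdvs.fw.impl.data.Null import NullCategorizer
    from kdvs.fw.impl.data.PKDrivenData import PKDrivenDataManager

    def categorize(subset_ds):
        # Ask the categorizer which category this subset falls into; for
        # NullCategorizer the answer is always the virtual '__all__' category.
        return PKDrivenDataManager.categorizeSubset(subset_ds, NullCategorizer())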

class kdvs.fw.impl.data.PKDrivenData.PKDrivenDBDataManager(main_dtable, pkcidmap_inst)

Bases: kdvs.fw.impl.data.PKDrivenData.PKDrivenDataManager

Concrete implementation of data–driven subset producer that creates overlapping DataSet instances based on prior knowledge information.

Parameters :

main_dtable : DBTable

database table that holds primary non–partitioned input data set with all measurements; overlapping subsets will be created based on it

pkcidmap_inst : PKCIDMap

concrete instance of fully constructed PKCIDMap that contains mapping between individual measurements and prior knowledge concepts; overlapping subsets will be created based on that mapping

getSubset(pkcID, forSamples='*', get_ssinfo=True, get_dataset=True)

Generate a data subset for the specified prior knowledge concept and wrap it into a DataSet instance if requested. Optionally, it can generate only the information needed to create the subset manually, and not the subset itself; this may be useful e.g. if the data come from a remote source that offers no complete control over querying.

Parameters :

pkcID : string

identifier of prior knowledge concept for which the data subset will be generated

forSamples : iterable/string

samples that will be used to generate the data subset; by default, prior knowledge is associated with individual measurements and treats all samples equally; this may be changed by specifying the individual samples to focus on (as a tuple of strings) or by specifying the string ‘*’ to consider all samples; ‘*’ by default

get_ssinfo : boolean

if True, generate runtime information about the data subset and return it; True by default

get_dataset : boolean

if True, generate an instance of DataSet that wraps the data subset and return it; True by default

Returns :

ssinfo : dict/None

runtime information as a dictionary of the following elements; can be None if the ‘get_ssinfo’ parameter was False

  • ‘dtable’ – DBTable instance of the primary input data set
  • ‘rows’ – row IDs for the subset (typically, measurement IDs)
  • ‘cols’ – column IDs for the subset (typically, sample names)
  • ‘pkcID’ – prior knowledge concept ID used to generate the subset

subset_ds : DataSet/None

DataSet instance that holds the numerical information of the subset; can be None if ‘get_dataset’ parameter was False

Raises :

Error :

if forSamples parameter value was incorrectly specified
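
The following sketch assumes that a DBTable with the primary data (dtable) and a fully constructed PKCIDMap (pkcidmap) already exist, and it reads the return value as the (ssinfo, subset_ds) pair listed above; treat it as an illustrative sketch rather than canonical usage.

    from kdvs.fw.impl.data.PKDrivenData import PKDrivenDBDataManager

    def subset_for_concept(dtable, pkcidmap, pkc_id, samples='*'):
        manager = PKDrivenDBDataManager(dtable, pkcidmap)
        # Request both the runtime information and the wrapped DataSet.
        ssinfo, subset_ds = manager.getSubset(pkc_id, forSamples=samples,
                                              get_ssinfo=True, get_dataset=True)
        # ssinfo['rows'] / ssinfo['cols'] identify the measurements and samples
        # that ended up in the subset; subset_ds holds the numerical data.
        return ssinfo, subset_ds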

class kdvs.fw.impl.data.PKDrivenData.PKDrivenDBSubsetHierarchy(pkdm_inst, samples_iter)

Bases: kdvs.fw.SubsetHierarchy.SubsetHierarchy

Concrete implementation of the SubsetHierarchy class that generates proper data subsets for a given symbol (i.e. prior knowledge concept). This implementation uses the data–driven subset producer as a subset generator.

Parameters :

pkdm_inst : PKDrivenDBDataManager

concrete instance of data–driven subset producer

samples_iter : iterable of string

iterable of sample names that will be used during subset generation; typically, it contains all sample names as read from primary input data set

build(categorizers_list, categorizers_inst_dict, initial_symbols)

Build categories hierarchy and symboltree.

Parameters :

categorizers_list : iterable of string

iterable of identifiers of Categorizer instances, starting from root of the tree

categorizers_inst_dict : dict

dictionary of Categorizer instances, identified by specified identifiers

initial_symbols : iterable of string

initial pool of symbols to be ‘partitioned’ with nested categorizers into symboltree; typically, contains all prior knowledge concepts (from single domain or all domains if necessary)

Raises :

Error :

if requested Categorizer instance is not found in the instances dictionary

obtainDatasetFromSymbol(symbol)

Obtain data subset for specific symbol (i.e. prior knowledge concept) and return it.

Parameters :

symbol : string

symbol to return data subset for

Returns :

pkcDS : DataSet

data subset for specific symbol
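
A minimal sketch of building a one–level hierarchy is shown below; pkdm, samples and all_pkc_ids are assumed to be prepared beforehand (a PKDrivenDBDataManager instance, an iterable of sample names, and the initial pool of prior knowledge concept IDs, respectively), and a single NullCategorizer is used at the root.

    from kdvs.fw.impl.data.Null import NullCategorizer
    from kdvs.fw.impl.data.PKDrivenData import PKDrivenDBSubsetHierarchy

    def build_flat_hierarchy(pkdm, samples, all_pkc_ids):
        hierarchy = PKDrivenDBSubsetHierarchy(pkdm, samples)
        # One null categorizer at the root places every symbol in one category.
        hierarchy.build(
            categorizers_list=['NullCategorizer'],
            categorizers_inst_dict={'NullCategorizer': NullCategorizer()},
            initial_symbols=all_pkc_ids,
        )
        return hierarchy

    # Individual subsets can then be obtained per symbol, e.g.:
    #   pkc_ds = hierarchy.obtainDatasetFromSymbol(some_pkc_id)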

SubsetSize Module

Provides specific functionality for activities connected to size of the data subsets. For instance, data subsets can be categorized based on their size.

class kdvs.fw.impl.data.SubsetSize.SubsetSizeCategorizer(size_threshold, ID='SubsetSizeCategorizer', size_lesser_category='<=', size_greater_category='>')

Bases: kdvs.fw.Categorizer.Categorizer

Categorizer that checks the size of the data subset (in terms of the number of associated variables; also ‘rows’ in KDVS internal implementation terminology), and classifies it into one of two categories: ‘lesser than’ (if size <= threshold) or ‘greater than’ (if size > threshold).

Parameters :

size_threshold : integer

size threshold that categorized data subset will be checked against

ID : string

identifier of this categorizer; can be skipped to use default value; ‘SubsetSizeCategorizer’ by default

size_lesser_category : string

identifier of ‘lesser than’ category; can be skipped to use default value; ‘<=’ by default

size_greater_category : string

identifier of ‘greater than’ category; can be skipped to use default value; ‘>’ by default

ROW_SIZE_LESSER = '<='
ROW_SIZE_GREATER = '>'
getThreshold()

Return size threshold for that categorizer as an integer.
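
As an illustrative sketch, a subset can be classified by size through the static categorizeSubset() helper documented earlier; subset_ds is assumed to be an existing DataSet, and the threshold value of 30 is arbitrary.

    from kdvs.fw.impl.data.SubsetSize import SubsetSizeCategorizer
    from kdvs.fw.impl.data.PKDrivenData import PKDrivenDataManager

    def size_category(subset_ds, threshold=30):
        size_cat = SubsetSizeCategorizer(threshold)
        # getThreshold() simply reports the configured threshold.
        assert size_cat.getThreshold() == threshold
        # Returns '<=' or '>' when the default category names are used.
        return PKDrivenDataManager.categorizeSubset(subset_ds, size_cat)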

class kdvs.fw.impl.data.SubsetSize.SubsetSizeOrderer(descending=True)

Bases: kdvs.fw.Categorizer.Orderer

Concrete Orderer that is closely associated with data–driven activities. It is used to change the ordering of elements according to the sizes of the associated data subsets. For instance, one may expect data subsets to be processed starting from the largest ones and going progressively towards smaller ones. The build() method of this class accepts an iterable of tuples (pkcID, size), where ‘pkcID’ is the PKC (prior knowledge concept) ID, and ‘size’ is the size of the associated data subset (in terms of the number of associated variables; also ‘rows’ in KDVS internal implementation terminology). NOTE: this class must be given an already sorted iterable as input, where the sort is done according to descending size of the associated data subsets.

Parameters :

descending : boolean

if True, the order is descending wrt subset size; if False, the order is ascending wrt subset size; True by default

build(pkc2ss)

Build the appropriate order from the given iterable of tuples (pkcID, size). This method expects an iterable that is already sorted in descending order (i.e. starting from the largest data subsets).

Parameters :

pkc2ss : iterable of (string, integer)

iterable of tuples (pkcID, size), sorted in descending order wrt subset size (i.e. starting from largest)

order()

Return specific order for this orderer. NOTE: the order is the iterable of pkcIDs alone; the size information is omitted, but the original iterable can still be accessed as self.pkc2ss.

Returns :

order : iterable of string

ordered iterable of PKC IDs
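
A minimal sketch of the orderer is shown below; the (pkcID, size) pairs are invented for the example and are already sorted by descending subset size, as build() requires.

    from kdvs.fw.impl.data.SubsetSize import SubsetSizeOrderer

    # Hypothetical (pkcID, size) pairs, largest subset first.
    pkc2ss = [('PKC-A', 120), ('PKC-B', 45), ('PKC-C', 7)]

    orderer = SubsetSizeOrderer(descending=True)
    orderer.build(pkc2ss)
    print(list(orderer.order()))   # PKC IDs only; the sizes are dropped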
