fw Package

Provides abstract framework classes that expose KDVS API, as well as some low–level concrete functionalities.

Annotation Module

Provides abstract functionality for handling annotations.

kdvs.fw.Annotation.PKC2EM_TMPL = <kdvs.fw.DBTable.DBTemplate object at 0x42fd150>

Database table template for storing mapping between prior knowledge concepts and measurements (PKC->M). It defines the name ‘pkc2em’ and columns ‘pkc_id’, ‘em_id’, ‘pkc_data’. The ID column ‘pkc_id’ is also indexed. This general table utilizes multifield ‘pkc_data’ of the following format:

  • ‘feature1=value1,...,featureN=valueN’,

where ‘feature’ is a property specific to the PK source, and ‘value’ is the value of this property for the specified PKC. Typical features of PK may be: PKC name, PKC description, higher level grouping of PKCs (e.g. into domains), etc. NOTE: it is advisable to create and use more specific tables tailored to individual PK sources, due to the limited querying power of this table and the potentially high computational cost of parsing the multifield.
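For illustration, a minimal sketch of parsing the multifield; it assumes the ‘feature=value’ pairs are joined with the default MULTIFIELD_SEP separator (defined below), and the helper name parse_pkc_data is hypothetical:

    from kdvs.fw.Annotation import MULTIFIELD_SEP

    def parse_pkc_data(pkc_data):
        # split a 'pkc_data' multifield into a feature->value dictionary;
        # no validation of the PK source format is performed
        features = {}
        for pair in pkc_data.split(MULTIFIELD_SEP):
            if not pair:
                continue
            feature, _, value = pair.partition('=')
            features[feature] = value
        return features

    # parse_pkc_data('name=apoptotic process;domain=BP')
    # -> {'name': 'apoptotic process', 'domain': 'BP'}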

kdvs.fw.Annotation.EM2ANNOTATION_TMPL = <kdvs.fw.DBTable.DBTemplate object at 0x42f5f90>

Database table template that provides annotations for measurements. This template is tailored specifically for Gene Ontology. It defines the name ‘em2annotation’ and columns ‘em_id’, ‘gene_symbol’, ‘repr_id’, ‘gb_acc’, ‘entrez_gene_id’, ‘ensembl_id’, ‘refseq_id’. The ID column ‘em_id’ is also indexed. This table contains a selected set of annotations for a given measurement (i.e. a probeset for microarrays, a gene for RNA–Seq, etc.). The selection is arbitrary and may not reflect all needs of the user; in that case it is advisable to use a different, more specific table.

kdvs.fw.Annotation.MULTIFIELD_SEP = ';'

Default separator for the multifield ‘pkc_data’ used in generic annotation database template.

kdvs.fw.Annotation.get_em2annotation(em2annotation_dt)

Obtain a dictionary mapping measurements to annotations, read from the specified DBTable instance.

Parameters :

em2annotation_dt : DBTable

wrapped content of ‘em2annotation’ database table

Returns :

em2a : collections.defaultdict

mapping between measurements and annotations
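A usage sketch, assuming the ‘em2annotation’ table has already been created and filled under a configured DBManager instance dbm (the database key ‘ANNO_DB’ is illustrative):

    from kdvs.fw.Annotation import EM2ANNOTATION_TMPL, get_em2annotation
    from kdvs.fw.DBTable import DBTable

    em2a_dt = DBTable.fromTemplate(dbm, 'ANNO_DB', EM2ANNOTATION_TMPL)
    em2a_dt.create()
    # ... fill the table via load() with a suitable generator ...
    em2a = get_em2annotation(em2a_dt)  # measurement ID -> annotations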

App Module

Provides abstract functionality of applications built on KDVS API. Each concrete application class must be derived from App class.

class kdvs.fw.App.App

Bases: object

Abstract KDVS application.

By default, the constructor calls ‘self.prepareEnv’.

See also

prepareEnv

prepareEnv()

Must be implemented in subclass. The implementation MUST assign a fully configured concrete ExecutionEnvironment instance to self.env in order for the application to be runnable from within KDVS in the normal way. However, if one wants greater control over application behavior, the ‘run’ method must be re–implemented as well. See ‘run’ for more details.

appProlog()

By default it does nothing.

appEpilog()

By default it does nothing.

prepare()

By default it does nothing.

run()

By default it performs the following sequence of calls: self.appProlog, self.prepare, self.env.execute, self.appEpilog.
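A minimal sketch of a concrete application; SomeExecutionEnvironment stands for any fully configured concrete ExecutionEnvironment subclass (hypothetical name):

    from kdvs.fw.App import App

    class MyApp(App):
        def prepareEnv(self):
            # MUST assign a fully configured ExecutionEnvironment instance
            self.env = SomeExecutionEnvironment()

        def appProlog(self):
            print('starting')  # first step of the default run() sequence

        def appEpilog(self):
            print('done')      # last step of the default run() sequence

    app = MyApp()  # the constructor calls prepareEnv()
    app.run()      # appProlog -> prepare -> env.execute -> appEpilog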

class kdvs.fw.App.AppProfile(expected_cfg={}, actual_cfg={})

Bases: kdvs.core.util.Configurable

Abstract class for application profiles. A concrete KDVS application uses a specialized profile for configuration; it reads the profile and verifies that all configuration elements are present and valid. See the profile for the ‘experiment’ application in the ‘example_experiment’ directory for a complete example.

Parameters :

expected_cfg : dict

expected configuration; empty dictionary by default

actual_cfg : dict

actual supplied configuration; empty dictionary by default

Raises :

Error :

if any element is missing from expected configuration

Error :

if expected type of any element is wrong

Categorizer Module

Provides base functionality for categorizers and orderers. Categorizers can divide data subsets into useful categories that can be nested and resemble a tree. This could be useful for assigning selected statistical techniques to specific data subsets only. Orderers control the order in which data subsets are being processed; each category can have its own orderer.

kdvs.fw.Categorizer.NOTCATEGORIZED = <NotCategorized>

Informs KDVS that data subset could not be categorized, for whatever reason. Used in concrete derivations of Categorizer.

kdvs.fw.Categorizer.UC = 'C[%s]->c[%s]'

Standard way to present category stemming from categorizer, as follows: C[“categorizer_name”]->c[“category_name”].

class kdvs.fw.Categorizer.Categorizer(IDstr, categorizeFuncTable)

Bases: object

Base class for categorizers. A Categorizer must be supplied with a dictionary of functions that categorize given subsets. Each function accepts a DataSet instance and outputs, as a string, either the chosen category name or NOTCATEGORIZED. The dictionary maps category names to categorization functions. One must be careful to assign only one category to a single data subset; otherwise, the subset will be permanently NOTCATEGORIZED without warning.

Parameters :

IDstr : string

identifier for the categorizer; usage of only alphanumerical characters is preferred; descriptive identifiers are preferred due to heavy usage in KDVS logs

categorizeFuncTable : dict(string->callable)

function table for the categorizer; keys are category names that this categorizer will use; values are callables that return that exact category name or NOTCATEGORIZED

Raises :

Error :

if identifier is not a string

Error :

if function table is not a dictionary or it is empty

Error :

if any key in function table is not a string
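A minimal sketch of a concrete categorizer; subset_size is a hypothetical helper returning the number of variables in a DataSet, used here only for illustration:

    from kdvs.fw.Categorizer import Categorizer, NOTCATEGORIZED

    def _small(ds):
        return 'small' if subset_size(ds) <= 10 else NOTCATEGORIZED

    def _large(ds):
        return 'large' if subset_size(ds) > 10 else NOTCATEGORIZED

    size_categorizer = Categorizer('size', {'small': _small, 'large': _large})
    # size_categorizer.categorize(dataset_inst) -> 'small', 'large',
    # or NOTCATEGORIZED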

categories()

Returns all categories that this categorizer handles. Essentially, returns keys from categorization function table.

categorize(dataset_inst)

Categorizes given data subset by running all categorization functions on it, collecting the categories, and checking for their uniqueness. If exactly one category is recognized, it is returned. If not, NOTCATEGORIZED is returned.

Parameters :

dataset_inst : DataSet

instance of data subset to be categorized

Returns :

category : string

category this data subset falls under; can be NOTCATEGORIZED in the following cases:

  • more than one function returns a valid category
  • all functions return NOTCATEGORIZED

uniquifyCategory(category)

Given category name, makes it unique by binding it to categorizer name. It uses format specified in global variable UC.

Parameters :

category : string

category name to uniquify; typically one of the keys from function table

Returns :

uniquified_category : string

uniquified category; format is taken from global variable UC

static deuniquifyCategory(uniquified_category)

Reverse the effect of ‘uniquifying’ the category name. Returns tuple (categorizer_name, category_name).

Parameters :

uniquified_category : string

uniquified category

Returns :

uniquified_components : tuple

parsed elements as tuple: (categorizer_name, category_name), or (None, None) if parsing was not successful

See also

uniquifyCategory, UC
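Continuing the categorizer sketch above, the round trip looks as follows:

    size_categorizer.uniquifyCategory('small')
    # -> 'C[size]->c[small]'
    Categorizer.deuniquifyCategory('C[size]->c[small]')
    # -> ('size', 'small')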

class kdvs.fw.Categorizer.Orderer

Bases: object

Base class for orderers. In general, orderer accepts an iterable of data subset IDs, reorders it as it sees fit, and presents it through its API.

build(*args, **kwargs)

Must be implemented in subclass. The implementation MUST assign reordered iterable to self.ordering.

order()

Returns the ordering built by this orderer.

Returns :

ordering : iterable

properly ordered iterable of data subset IDs

Raises :

Error :

if ordering has not been built yet
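A minimal sketch of a concrete orderer that sorts data subset IDs alphabetically:

    from kdvs.fw.Categorizer import Orderer

    class SortedOrderer(Orderer):
        def build(self, ss_ids):
            # the implementation MUST assign the reordered iterable here
            self.ordering = sorted(ss_ids)

    o = SortedOrderer()
    o.build(['PKC3', 'PKC1', 'PKC2'])
    o.order()  # -> ['PKC1', 'PKC2', 'PKC3']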

DBResult Module

Provides unified wrapper over results from queries performed on underlying database tables controlled by KDVS.

class kdvs.fw.DBResult.DBResult(dbtable, cursor, rowbufsize=1000)

Bases: object

Wrapper class over query results. Before using this class, a correct SQL query must be issued and a concrete Cursor instance must be obtained. Typically, all of this is performed inside DBTable, but the class is exposed for those who want greater control over querying. Essentially, this class wraps the Cursor instance and controls the fetching process.

Parameters :

dbtable : DBTable

database table to be queried

cursor : Cursor

cursor used to query the specified table

rowbufsize : integer

size of internal buffer used during results fetching; 1000 by default

See also

PEP 249

get()

Generator that yields fetched results one by one. Underlying fetching is buffered. Results are returned as–is; the parsing is left to the user.

Returns :

result_row : iterable

single row with query result; the format of the row depends on the underlying database table (obvious), but also on the database provider (not so obvious)

Raises :

Error :

if any error prevented the result row from being obtained; NOTE: essentially, it watches for OperationalError specific to the database provider

See also

Cursor.fetchmany, OperationalError

getAll(as_dict=False, dict_on_rows=True)

Returns all fetched results at once, wrapped in desired structure (list or dictionary). NOTE: depending on the query itself, it may consume a lot of memory.

Parameters :

as_dict : boolean

specify if the results are to be wrapped in dictionary instead of a list; False by default

dict_on_rows : boolean

valid if previous argument is True; specify if dictionary should be keyed by content of the first column (effectively creating dictionary on rows), instead of column name

Returns :

result : list/dict

query results wrapped either in list or in dictionary; dictionary can be keyed by column name (dictionary on columns) or content of the first column (dictionary on rows); e.g. if database has columns

  • ‘ID’, ‘A’, ‘B’, ‘C’

and the result comprises two tuples

  • (‘ID_111’, 1.0, 2.0, 3.0)
  • (‘ID_222’, 4.0, 5.0, 6.0),

the ‘dictionary on columns’ will contain:

  • {‘ID’ : [‘ID_111’, ‘ID_222’], ‘A’ : [1.0, 4.0], ‘B’ : [2.0, 5.0], ‘C’ : [3.0, 6.0]}

and the ‘dictionary on rows’ will contain:

  • {‘ID_111’ : [1.0, 2.0, 3.0], ‘ID_222’ : [4.0, 5.0, 6.0]}

Raises :

Error :

if any error prevented the result row from being obtained; NOTE: essentially, it watches for OperationalError specific to the database provider

See also

Cursor.fetchall, OperationalError
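A usage sketch, assuming ‘table’ is an already created and filled DBTable instance; process is a placeholder for user code:

    from kdvs.fw.DBResult import DBResult

    cursor = table.get(columns=('ID', 'A'))  # issue the query
    res = DBResult(table, cursor)
    try:
        for row in res.get():  # buffered fetching, row by row
            process(row)
    finally:
        res.close()  # always free the underlying cursor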

close()

Closes the wrapped Cursor instance and frees all allocated resources. Shall always be called when the DBResult object is no longer needed.

DBShelve Module

Provides simple wrapper over database table to act like a dictionary that can hold any Python object, essentially a shelve with database backend.

kdvs.fw.DBShelve.DBSHELVE_TMPL = <kdvs.fw.DBTable.DBTemplate object at 0x67bb190>

Instance of default DBTemplate used to construct underlying database table that serves under DBShelve. It defines the name ‘shelve’ and columns ‘key’, ‘value’. The ID column ‘key’ is indexed.

class kdvs.fw.DBShelve.DBShelve(dbm, db_key, protocol=None)

Bases: _abcoll.MutableMapping

Class that exposes dictionary behavior of database table that can hold any Python object. By default, it governs database table created according to DBTemplate template DBSHELVE_TMPL.

Parameters :

dbm : DBManager

instance of DBManager that will control the database table

db_key : string

identifier of the database table that will be used by DBManager

protocol : integer/None

pickling protocol; if None, then the highest one will be used

See also

pickle
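A usage sketch, assuming dbm is a configured DBManager instance; the database key ‘SHELVE_DB’ is illustrative:

    from kdvs.fw.DBShelve import DBShelve

    shelf = DBShelve(dbm, 'SHELVE_DB')
    shelf['model'] = {'alpha': 0.5, 'iters': 100}  # any picklable object
    shelf['model']['alpha']  # -> 0.5
    len(shelf)               # standard MutableMapping API
    shelf.close()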

keys()

See also

dict.keys

values()

See also

dict.values

update(items=(), **kwds)

See also

dict.update

clear()

See also

dict.clear

close()

See also

shelve.Shelf.close

view()

See also

dict.viewitems

DBTable Module

Provides low–level functionality for the management of database tables under the KDVS DB manager. Also provides a simple wrapper over templated database tables.

class kdvs.fw.DBTable.DBTable(dbm, db_key, columns, name=None, id_col=None)

Bases: object

Low–level wrapper over a database table managed by the KDVS DB manager. KDVS uses database tables to manage query–intensive information, such as the robust generation of data subsets from the single main input data set. The wrapper encapsulates basic functionality incl. table creation, table filling from a specific generator function, querying with conditions over columns and rows (in the case where the first column holds row IDs), generation of an associated numpy.ndarray object (if possible), as well as basic counting routines.

Parameters :

dbm : DBManager

an instance of DB manager that is managing this table

db_key : string

internal ID of the table used by DB manager instance; it is NOT the name of physical database table in underlying RDBMS; typically, user of DB manager refers to the table by this ID and not by its physical name

columns : list/tuple of strings

column names for the table

name : string/None

physical name of database table in underlying RDBMS; if None, the name is generated semi–randomly; NOTE: ordinary user of DB manager shall refer to the table with ‘db_key’ ID

id_col : string/None

designates specific column to be “ID column”; if None, the first column is designated as ID column

Raises :

Error :

if DBManager instance is not present

Error :

if list/tuple with column names is not present

Error :

if ID column name is not one of the existing columns

create(indexed_columns='*', debug=False)

Physically create the table in underlying RDBMS; the creation is deferred until this call. The table is created empty.

Parameters :

indexed_columns : list/tuple/’*’

list/tuple of column names to be indexed by underlying RDBMS; if string ‘*’ is specified, all columns will be indexed; ‘*’ by default

debug : boolean

provides debug mode for table creation; if True, collect all SQL statements produced by underlying RDBMS and return them as list of strings; if False, return None

Returns :

statements : list of strings/None

RDBMS SQL statements issued during table creation, if debug mode is requested; or None otherwise

Raises :

Error :

if table creation or indexing was interrupted with an error; essentially, reraise OperationalError from underlying RDBMS

load(content=emptyGenerator(), debug=False)

Fill the already created table with some data, coming from specified generator callable.

Parameters :

content : generator callable

generator that furnishes the data; this method DOES NOT check the correctness of the furnished data, which is left to the user; by default, an empty generator is used

debug : boolean

provides debug mode for table filling; if True, collect all SQL statements produced by underlying RDBMS and return them as list of strings; if False, return None

Returns :

statements : list of strings/None

RDBMS SQL statements issued during table filling, if debug mode is requested; or None otherwise

Raises :

Error :

if table filling was interrupted with an error; essentially, reraise OperationalError from underlying RDBMS

get(columns='*', rows='*', filter_clause=None, debug=False)

Perform query from the table under specified conditions and return corresponding Cursor instance; the Cursor may be used immediately in straightforward manner or may be wrapped in DBResult instance.

Parameters :

columns : list/tuple/’*’

list of column names that the querying will be performed on; if string ‘*’ is specified instead, all columns will be queried; ‘*’ by default

rows : list/tuple/’*’

list of rows (i.e. list of values from the designated ID column) that the querying will be performed for; if string ‘*’ is specified instead, all rows (i.e. the whole content of the ID column) will be queried; ‘*’ by default

filter_clause : string/None

additional filtering conditions stated in the form of correct SQL WHERE clause suitable for underlying RDBMS; if None, no additional filtering is added; None by default

debug : boolean

provides debug mode for table querying; if True, collect all SQL statements produced by underlying RDBMS and return them as list of strings; if False, return None; False by default; NOTE: for this method, debug mode DOES NOT perform any physical querying, it just produces the underlying SQL statements and returns them

Returns :

cs/statements : Cursor/list of strings

if debug mode was not requested: proper Cursor instance that may be used immediately or wrapped into DBResult object; if debug mode was requested: RDBMS SQL statements issued during table querying

Raises :

Error :

if list/tuple of columns/rows was specified incorrectly

Error :

if specified list of columns/rows is empty

Error :

if table querying was interrupted with an error; essentially, reraise OperationalError from underlying RDBMS

See also

PEP 249
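A sketch of the typical table lifecycle, assuming dbm is a configured DBManager instance; the table content below is illustrative:

    from kdvs.fw.DBTable import DBTable

    table = DBTable(dbm, 'DATA_DB', ('ID', 'A', 'B'))
    table.create(indexed_columns=('ID',))  # physical creation happens here

    def rows():
        # generator that furnishes the rows; correctness is not checked
        yield ('ID_111', '1.0', '2.0')
        yield ('ID_222', '4.0', '5.0')

    table.load(content=rows())
    cursor = table.get(columns=('A', 'B'), rows=('ID_111',))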

getAll(columns='*', rows='*', filter_clause=None, as_dict=False, dict_on_rows=False, debug=False)

Convenient wrapper that does the following: performs query under specified conditions, wraps resulting Cursor into DBResult instance, and gets ALL the results wrapped into desired data structure, as per DBResult.getAll.

Parameters :

columns : list/tuple/’*’

list of column names that the querying will be performed on; if string ‘*’ is specified instead, all columns will be queried; ‘*’ by default

rows : list/tuple/’*’

list of rows (i.e. list of values from the designated ID column) that the querying will be performed for; if string ‘*’ is specified instead, all rows (i.e. the whole content of the ID column) will be queried; ‘*’ by default

filter_clause : string/None

additional filtering conditions stated in the form of correct SQL WHERE clause suitable for underlying RDBMS; if None, no additional filtering is added; None by default

as_dict : boolean

specify if the results are to be wrapped in dictionary instead of a list; False by default

dict_on_rows : boolean

valid if previous argument is True; specify if dictionary should be keyed by content of the first column (effectively creating dictionary on rows), instead of column name

debug : boolean

if True, activates debug mode identical to one used for method ‘get’, i.e. collect all SQL statements produced by underlying RDBMS and return them as list of strings; if False, perform physical querying and return results in desired data structure as per DBResult

Returns :

results/statements : list/dict / list of strings

if debug mode was requested, return list of underlying SQL statements; if debug mode was not requested, return all results in requested data structure

getArray(columns='*', rows='*', filter_clause=None, remove_id_col=True, debug=False)

Convenient wrapper that does the following: performs query under specified conditions, and builds corresponding numpy.ndarray object that contains the queried data. Uses the numpy.loadtxt() function for building the instance of numpy.ndarray. If the resulting ndarray has shape (p,), it is reshaped into a single–row matrix of shape (1,p).

Parameters :

columns : list/tuple/’*’

list of column names that the querying will be performed on; if string ‘*’ is specified instead, all columns will be queried; ‘*’ by default

rows : list/tuple/’*’

list of rows (i.e. list of values from the designated ID column) that the querying will be performed for; if string ‘*’ is specified instead, all rows (i.e. the whole content of the ID column) will be queried; ‘*’ by default

filter_clause : string/None

additional filtering conditions stated in the form of correct SQL WHERE clause suitable for underlying RDBMS; if None, no additional filtering is added; None by default

remove_id_col : boolean

discard content of ID column if such effect is desired; True by default

debug : boolean

if True, activates debug mode identical to one used for method ‘get’, i.e. collect all SQL statements produced by underlying RDBMS and return them as list of strings; if False, perform physical querying, build corresponding numpy.ndarray object and return it; False by default

Returns :

mat/statements : numpy.ndarray/list of strings

if debug mode was not requested: numpy.ndarray object that contains queried data; if debug mode was requested: RDBMS SQL statements issued during table querying

Raises :

Error :

if list/tuple of columns/rows was specified incorrectly

Error :

if specified list of columns/rows is empty

Error :

if table querying was interrupted with an error; essentially, reraise OperationalError from underlying RDBMS

Error :

if error was encountered during building of numpy.ndarray object; refer to numpy documentation for more details about possible errors

countRows()

Counts number of rows for the table. Table must be filled to obtain count >0. Counting is performed with SQL standard function ‘count’ in underlying RDBMS.

Returns :

count : integer/None

row count for the table, or None if underlying ‘count’ returns null

Raises :

Error :

if table is not yet created

Error :

if counting was interrupted with an error; essentially, reraise OperationalError from underlying RDBMS

getIDs()

Get content of designated ID column and return it as list of values.

Returns :

IDs : list of strings

list of values from ID column, as queried by underlying RDBMS; the list preserves insertion order

Raises :

Error :

if table has not yet been created

Error :

if querying of ID column was interrupted with an error; essentially, reraise OperationalError from underlying RDBMS

isCreated()

Returns True if the table has been physically created, False otherwise.

isEmpty()

Returns True if the table is empty, False otherwise.

Raises :

Error :

if table has not yet been created

static fromTemplate(dbm, db_key, template)

Create an instance of DBTable based on specified DBTemplate instance.

Parameters :

dbm : DBManager

an instance of DB manager that is managing the requested table

db_key : string

internal ID of the table used by DB manager instance; it is NOT the name of physical database table in underlying RDBMS; typically, user of DB manager refers to the table by this ID and not by its physical name

template : DBTemplate

a DBTemplate instance that contains the specification for the new table

Returns :

dbtable : DBTable

instance of DBTable built based on the specified template; the table is not created physically until ‘create’ method is called

Raises :

Error :

if proper template was not specified

kdvs.fw.DBTable.dbtemplate_keys = ('name', 'columns', 'id_column', 'indexes')

Recognized keys used in DBTemplate wrapper object.

class kdvs.fw.DBTable.DBTemplate(in_dict)

Bases: object

The template object that contains simplified directives on how to build a database table. It is essentially a wrapper over a dictionary that contains the following elements:

  • ‘name’ – specifies the physical name of the table for underlying RDBMS,
  • ‘columns’ – non–empty list/tuple of column names of standard type (the type is taken from getTextColumnType() method of the underlying DB provider),
  • ‘id_column’ – name of the column designated to be an ID column for that table,
  • ‘indexes’ – list/tuple of column names to be indexed by underlying RDBMS, or string ‘*’ for indexing all columns.

Parameters :

in_dict : dict

dictionary containing simplified directives; the constructor checks if all required elements are present

Raises :

Error :

if list/tuple of column names is not specified and/or is empty

Error :

if input dictionary is missing any of the required elements
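A construction sketch; the table specification is illustrative and dbm stands for a configured DBManager instance:

    from kdvs.fw.DBTable import DBTable, DBTemplate

    tmpl = DBTemplate({
        'name': 'mytable',
        'columns': ('key', 'value'),
        'id_column': 'key',
        'indexes': ('key',),
    })
    # the physical table is not created until create() is called
    table = DBTable.fromTemplate(dbm, 'MY_DB', tmpl)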

DSV Module

Provides low–level wrapper over tables that hold delimiter separated values (DSV). Such tables are referred to as DSV tables.

kdvs.fw.DSV.DSV_DEFAULT_ID_COLUMN = 'ID'

Default ID column for DSV table.

class kdvs.fw.DSV.DSV(dbm, db_key, filehandle, dtname=None, delimiter=None, comment=None, header=None, make_missing_ID_column=True)

Bases: kdvs.fw.DBTable.DBTable

Create an instance of DBTable and immediately wrap it into DSV table. DSV table manages additional details such as initialization from associated DSV file and handling underlying DSV dialect.

Parameters :

dbm : DBManager

an instance of DB manager that is managing this table

db_key : string

internal ID of the table used by DB manager instance; it is NOT the name of physical database table in underlying RDBMS; typically, user of DB manager refers to the table by this ID and not by its physical name

filehandle : file–like

file handle to associated DSV file that contains the data that DSV table will hold; the file remains open but the data loading is deferred until requested

dtname : string/None

physical name of database table in underlying RDBMS; if None, the name is generated semi–randomly; NOTE: ordinary user of DB manager shall refer to the table with ‘db_key’ ID

delimiter : string/None

delimiter string of length 1 that should be used for parsing of DSV data; if None, the constructor tries to deduce delimiter by looking into first 10 lines of associated DSV file; None by default; NOTE: giving explicit delimiter instead of deducing it dynamically greatly reduces possibility of errors during parsing DSV data

comment : string/None

comment prefix used in associated DSV file, or None if comments are not used; None by default

header : list/tuple of string / None

if header is present in the form of list/tuple of strings, it will be used as list of columns for the underlying database table; if None, the constructor tries to deduce the correct header by looking into first two lines of associated DSV file; None by default; NOTE: for well formed DSV files, header should be present, so it is relatively safe to deduce it automatically

make_missing_ID_column : boolean

used in connection with the previous argument; sometimes one can encounter DSV files that contain NO first column name in the header (e.g. generated from various R functions); while they contain correct data, such files are syntactically incorrect; if the constructor detects the lack of the first column name, it proceeds according to this parameter: if True, it inserts the content of the DSV_DEFAULT_ID_COLUMN variable as the missing column name; if False, it inserts the empty string “” as the missing column name; True by default

Raises :

Error :

if proper comment string was not specified

Error :

if underlying DSV dialect of associated DSV file has not been resolved correctly

Error :

if delimiter has not been specified correctly

Error :

if header iterable has not been specified correctly

Error :

if parsing of DSV data during deducing was interrupted with an error; essentially, it reraises underlying csv.Error

See also

csv
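A sketch of loading a comma separated file into a DSV table; dbm stands for a configured DBManager instance and the file name is illustrative:

    from kdvs.fw.DSV import DSV

    fh = DSV.getHandle('data.csv')  # transparently handles compressed files
    dsv = DSV(dbm, 'CSV_DB', fh, delimiter=',', comment='#')
    dsv.create()
    dsv.loadAll()  # fill the table from the associated file
    dsv.close()    # close the associated DSV file manually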

static getHandle(file_path, *args, **kwargs)

Open specified file and return its handle. Uses fileProvider() to transparently open and read compressed files; additional arguments are passed to file provider directly.

Parameters :

file_path : string

filesystem path to specified file

args : iterable

positional arguments to pass to file provider

kwargs : dict

keyword arguments to pass to file provider

Returns :

handle : file–like

opened handle to the file

Raises :

Error :

if the file could not be accessed

getCommentSkipper(iterable)

Build instance of CommentSkipper for this DSV table using specified ‘comment’ string.

Parameters :

iterable : iterable

iterable of strings to be used by CommentSkipper

Returns :

cs : CommentSkipper

an instance of CommentSkipper

loadAll(debug=False)

Fill the DSV table with data coming from the associated DSV file. The input generator is a CommentSkipper instance that is obtained automatically. This method handles all underlying low–level activities. NOTE: the associated DSV file remains open until closed manually with the close() method.

Parameters :

debug : boolean

provides debug mode for table filling; if True, collect all SQL statements produced by underlying RDBMS and return them as list of strings; if False, return None

Returns :

statements : list of string/None

RDBMS SQL statements issued during table filling, if debug mode is requested; or None otherwise

Raises :

Error :

if underlying table has not yet been created

Error :

if data could not be loaded for whatever reason; see DBTable.load for more details

close()

Close associated DSV file.

DataSet Module

Provides unified interface for data sets processed by KDVS.

class kdvs.fw.DataSet.DataSet(input_array=None, dbtable=None, cols='*', rows='*', filter_clause=None, remove_id_col=True)

Bases: object

Wrapper object that represents data set processed by KDVS. It can wrap two types of objects:

  • an existing DBTable object that KDVS uses for data storage in relational database
  • an existing numpy.ndarray

In the case of wrapping a DBTable object, it creates an additional numpy.ndarray object, as returned by the numpy.loadtxt() family of functions. The additional ndarray object is cached within the DataSet instance and can be recached on demand; this may be useful if the content of the underlying DBTable object changes dynamically.

Parameters :

input_array : numpy.ndarray

existing numpy.ndarray object to be wrapped, or None if DBTable is to be wrapped; NOTE: when this argument is not None, the next one must be None

dbtable : DBTable

existing DBTable object to be wrapped, or None if numpy.ndarray is to be wrapped; NOTE: when this argument is not None, the previous one must be None

cols : iterable/’*’

valid for wrapping DBTable object, names of database columns to be used in extracting data set from database table, or ‘*’ if using all columns; see DBTable for more details

rows : iterable/’*’

valid for wrapping DBTable object, IDs of database rows to be used in extracting data set from database table, or ‘*’ if using all rows; see DBTable for more details

filter_clause : string/None

valid for wrapping DBTable object, SQL filter clause that allows more sophisticated selection of columns/rows of database table to be used in extracting data set from database table, or None if not used; see DBTable for more details

remove_id_col : boolean

valid for wrapping DBTable object, specifies if the content of ID column of DBTable should be included in data set extracted from database table, True by default; NOTE: should be set to False for extracting purely numerical data sets; see DBTable for more details

Raises :

Error :

if an object of the wrong type is presented for wrapping, or if no object is presented at all

recache()

Perform recaching of underlying ndarray object. Usable only when DBTable object is wrapped.
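A usage sketch; ‘table’ stands for an already created and filled DBTable instance:

    import numpy as np
    from kdvs.fw.DataSet import DataSet

    # wrap an existing ndarray ...
    ds1 = DataSet(input_array=np.array([[1.0, 2.0], [3.0, 4.0]]))

    # ... or wrap a DBTable, extracting selected rows; the ID column is
    # removed to obtain a purely numerical data set
    ds2 = DataSet(dbtable=table, rows=('ID_111',), remove_id_col=True)
    ds2.recache()  # rebuild the cached ndarray if the table has changed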

EnvOp Module

Provides root functionality for KDVS EnvOps (environment–wide operations).

IMPORTANT NOTE! In principle, an EnvOp is devised as a self–contained function that can access and modify the environment explicitly. Other modular execution blocks, such as techniques, reporters, orderers and selectors, are by default isolated from the environment as much as possible and access it only through the API. This is done to minimize the possible devastating impact of erroneous code on the whole environment, which must manage other vital things and cannot fail that easily. Therefore, EnvOps should be used only if absolutely necessary, since they introduce state that is potentially very hard to debug.

kdvs.fw.EnvOp.DEFAULT_ENVOP_PARAMETERS = ()

Default parameters for EnvOp.

class kdvs.fw.EnvOp.EnvOp(ref_parameters, **kwargs)

Bases: kdvs.core.util.Parametrizable

Encapsulates an EnvOp. An environment–wide operation is parametrizable and affects the whole execution environment. As such, it can potentially cause substantial problems if applied incorrectly. The EnvOp is called automatically during execution in callback fashion. In the ‘experiment’ application, two types of EnvOps are available: pre–EnvOp, executed BEFORE all computational jobs produced by statistical techniques for the current category, and post–EnvOp, executed AFTER all computational jobs for the current category. EnvOps are executed at the category level.

Parameters :

ref_parameters : iterable

reference parameters to be checked against during instantiation; empty tuple by default

kwargs : dict

actual parameters supplied during instantiation; they will be checked against reference ones

Raises :

Error :

if some supplied parameters are not on the reference list

perform(env)

By default does nothing. Accepts an instance of execution environment.

Job Module

Provides high–level functionality for handling of computational jobs by KDVS.

kdvs.fw.Job.NOTPRODUCED = <NotProduced>

Constant used to signal when job produced no results.

kdvs.fw.Job.JOBERROR = <JobError>

Constant used to signal when job ended with an error.

kdvs.fw.Job.DEFAULT_CALLARGS_LISTING_THR = 2

Default number of job arguments presented, used in job listings, logs, etc.

class kdvs.fw.Job.JobStatus

Bases: object

A container for constants that represent the state of the job during its lifecycle. Also provides list of those statuses.

CREATED = <CREATED>

ADDED = <ADDED>

EXECUTING = <EXECUTING>

FINISHED = <FINISHED>

statuses = (<CREATED>, <ADDED>, <EXECUTING>, <FINISHED>)

class kdvs.fw.Job.Job(call_func, call_args, additional_data={})

Bases: object

High–level wrapper over computational job that KDVS manages. Job consists of a function with arguments and possibly with some additional data. Newly created Job is in the state of CREATED, and its results are NOTPRODUCED.

Parameters :

call_func : function

function to be executed as computational job

call_args : list/tuple

positional arguments for the function to be executed as computational job

additional_data : dict

any additional data that are associated with the job

Raises :

Error :

if proper job function was not specified

Error :

if proper list/tuple of job arguments was not specified

Error :

if additional data could not be accessed (in sense of dict.update)

execute()

Execute specified job function with specified arguments and return the result. Job execution is considered successful if no exception has been raised during running of job function.

Returns :

result : object

Result of job function; jobs can also return JOBERROR if necessary

Raises :

Error :

if any exception was raised during an execution; essentially, reraise Error with the underlying details
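A minimal sketch of creating and executing a job directly:

    from kdvs.fw.Job import Job

    def add(x, y):
        return x + y

    job = Job(add, [2, 3], additional_data={'group': 'demo'})
    result = job.execute()  # -> 5; raises Error if add() raises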

class kdvs.fw.Job.JobContainer(incrementID=False)

Bases: object

An abstract container that manages jobs. Must be subclassed.

Parameters :

incrementID : boolean

if True, each new job receives a simple integer as its ID: 1, 2, 3, ...; if False, each new job receives a UUID (version 4) as its ID

See also

uuid

addJob(job, **kwargs)

Add job to the execution queue and schedule for execution. Job changes its status to ADDED.

Parameters :

job : Job

job to be added

kwargs : dict

any additional keyword arguments; used in subclasses for finer control

Returns :

jobID : string

identifier assigned to the new job

getJobCount()

Return number of jobs currently managed by this container. NOTE: this method does not differentiate between executed and not executed jobs.

hasJobs()

Return True if container manages any jobs, False otherwise.

getJob(jobID)

Return instance of Job by its jobID.

Parameters :

jobID : string

job ID

Returns :

job : Job

job with requested ID, if exists

getJobStatus(jobID)

Return status of requested job.

Parameters :

jobID : string

job ID

Returns :

status : one of JobStatus.statuses

status of job with requested ID, if exists

getJobResult(jobID)

Return results produced by requested job. May be NOTPRODUCED.

Parameters :

jobID : string

job ID

Returns :

result : object

result of job with requested ID, if exists

removeJob(jobID)

Remove requested job from this manager.

Parameters :

jobID : string

job ID

Raises :

Warn :

if the job is not yet executed and/or finished

Warn :

if the job has unrecognized status

clear()

Remove all jobs from this container.

start()

Must be implemented in subclass.

close()

Typically implemented in subclass to clean after itself. By default it does nothing.

postClose(destPath, *args)

Used by subclasses. Currently used only in ‘experiment’ application. By default it checks if given destination path exists.

getMiscData()

Return any miscellaneous data associated with this container. Typically, subclasses add some to improve job management or provide some debug information.

clearMiscData()

Remove any miscellaneous data associated with this container.

class kdvs.fw.Job.JobGroupManager(**kwargs)

Bases: object

Simple manager of groups of jobs. Can be used for finer execution control and to facilitate reporting.

Parameters :

kwargs : dict

any keyword arguments that may be used by the user for finer control (e.g. in a subclass); currently, no arguments are used

addJobIDToGroup(group_name, jobID)

Add requested job to specified job group. If group was not defined before, it will be created.

Parameters :

group_name : string

name of the group

jobID : string

job ID

addGroup(group_name, group_job_ids)

Add series of jobs to specified job group (shortcut). If group was not defined before, it will be created.

Parameters :

group_name : string

name of the group

group_job_ids : iterable of string

job IDs

remGroup(group_name)

Remove specified job group from this manager. All associated job IDs are removed as well. NOTE: physical jobs are left intact.

Parameters :

group_name : string

name of the group

clear()

Removes all job groups from this manager.

getGroupJobsIDs(group_name)

Get list of job IDs associated with specified job group name.

Parameters :

group_name : string

name of the group

Returns :

jobIDs : iterable of string

all job IDs from requested group, if exists

findGroupByJobID(jobID)

Identify job group of the requested job ID.

Parameters :

jobID : string

job ID

Returns :

group_name : string

name of the group with requested job, if exists

Raises :

Error :

if jobID is found in more than one group

getGroups()

Get list of all job group names managed by this manager.

Map Module

Provides high–level functionality for mappings constructed by KDVS.

kdvs.fw.Map.NOTMAPPED = <NotMapped>

Constant used to signal that entity is not mapped.

class kdvs.fw.Map.ChainMap(initial_dict=None)

Bases: object

This map uses dictionaries of interlinked partial single mappings to derive final mapping. For instance, for single mappings

  • {‘a’ : 1, ‘b’ : 2, ‘c’ : 3}
  • {1 : ‘baa’, 2 : ‘boo’, 3 : ‘bee’}
  • {‘baa’ : ‘x’, ‘boo’ : ‘y’, ‘bee’ : ‘z’}

the derived final mapping has the form

  • {‘a’ : ‘x’, ‘b’ : ‘y’, ‘c’ : ‘z’}

Each single partial mapping is wrapped into an instance of this class, and derivation is done with class–wide static methods. This class exposes a partial dict API and re–implements the methods __setitem__ and __getitem__. NOTE: for this map, the order of derivation, and therefore the order in which single partial mappings are processed, is important.

Parameters :

initial_dict : dict/None

dictionary that contains the partial mapping, or None if the mapping is to be constructed incrementally (this class implements a subset of dictionary methods to do so); None by default

Raises :

Error :

if mapping could not be initiated from initial dictionary

getMap()

Return partial single mapping as a dictionary.

update(map_dict, replace=True)

Update partial single mapping with all key–value pairs at once from given dictionary, with possible replacement.

Parameters :

map_dict : dict

dictionary that contains all key–value pairs to be added

replace : boolean

if True and any key from the given dictionary already exists in the partial mapping, the existing key–value pair is replaced with the given one; if False, Error is raised instead; True by default

Raises :

Error :

if replacement was not requested and any key–value pair is already present

static deriveMap(key, maps)

Derive single final value for given single key, computed across all given partial single mappings.

Parameters :

key : object

key for which the final value will be derived

maps : iterable of ChainMap

all single partial mappings, wrapped in ChainMap instance, that will be used for deriving the final value, in given order

Returns :

key, interms, value : object/NOTMAPPED, list of object, object/None

the tuple with the following elements: lookup key or NOTMAPPED if at any stage of derivation NOTMAPPED was encountered; all intermediate values encountered during derivation; final derived value or None if not found

Raises :

Error :

if iterable of partial single maps is incorrectly specified

static buildDerivedMap(keys, maps)

Build mapping of key–value pairs that come from deriving of final values for specified keys.

Parameters :

keys : iterable of object

keys for which the final values will be derived and final map will be built

maps : iterable of ChainMap

all single partial mappings, wrapped in ChainMap instance, that will be used for deriving the final values, in given order

Returns :

dmap, interms : dict, iterable of object

the tuple with the following elements: dictionary that contains derived final mapping for specified keys (may contain NOTMAPPED and None as values if some keys were not mapped correctly); all intermediate values encountered during derivation

Raises :

Error :

if iterable of partial single maps is incorrectly specified

See also

deriveMap
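A usage sketch of the derivation described above, using the example mappings from the class description:

    from kdvs.fw.Map import ChainMap

    m1 = ChainMap({'a': 1, 'b': 2, 'c': 3})
    m2 = ChainMap({1: 'baa', 2: 'boo', 3: 'bee'})
    m3 = ChainMap({'baa': 'x', 'boo': 'y', 'bee': 'z'})

    key, interms, value = ChainMap.deriveMap('a', (m1, m2, m3))
    # value -> 'x'; interms holds the intermediate values

    dmap, interms = ChainMap.buildDerivedMap(('a', 'b', 'c'), (m1, m2, m3))
    # dmap -> {'a': 'x', 'b': 'y', 'c': 'z'}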

class kdvs.fw.Map.BDMap(factory_obj, add_op_name, initial_map=None)

Bases: object

This map stores bi–directional mappings. For such a map, values can repeat. To reflect that, values in both directions (forward and backward) are binned. For instance, for given initial mapping

  • {‘a’ : 1, ‘b’ : 1, ‘c’ : 2, ‘d’ : 3}

the following forward mapping will be constructed

  • {‘a’ : [1], ‘b’ : [1], ‘c’ : [2], ‘d’ : [3]}

and the following backward mapping will be constructed as well

  • {1 : [‘a’,’b’], 2 : [‘c’], 3 : [‘d’]}

The exact underlying data structure that holds binned values (“binning container”) depends on the specific map subtype. This class exposes a partial dict API and re–implements the methods __setitem__, __getitem__, and __delitem__.

Parameters :

factory_obj : callable

callable that returns new empty data structure for binning of repeated values, e.g. list(), set(), etc.

add_op_name : string

name of the function that appends repeated value to binning container, e.g. “append” for list, “add” for set, etc.

initial_map : dict/None

initial key–value pairs to add to bi–directional mapping, or None if nothing is to be added initially (this class implements subset of dictionary methods to add key–value pairs later); None by default

clear()

Clear bi–directional mapping. The underlying forward and backward mappings will be cleared.

keyIsMissing()

Perform specific activity when given key is missing in the map during construction of bi–directional mapping. By default, it creates new binning container by calling factory_obj().

getFwdMap()

Return forward mapping as an instance of collections.defaultdict.

getBwdMap()

Return backward mapping as an instance of collections.defaultdict.

getMap()

Return forward mapping as an instance of collections.defaultdict.

dumpFwdMap()

Return forward mapping as a dictionary. NOTE: the resulting dictionary may not be suitable for printing, depending on the type of underlying binning container.

dumpBwdMap()

Return backward mapping as a dictionary. NOTE: the resulting dictionary may not be suitable for printing, depending on the type of underlying binning container.

class kdvs.fw.Map.ListBDMap(initial_map=None)

Bases: kdvs.fw.Map.BDMap

Specialized BDMap that uses lists as binning containers. Repeated values are added to binning container with append() method. NOTE: all specialized behavior incl. exceptions raised, depends on list type. Refer to documentation of list type for more details.

See also

list

class kdvs.fw.Map.SetBDMap(initial_map=None)

Bases: kdvs.fw.Map.BDMap

Specialized BDMap that uses sets as binning containers. Repeated values are added to binning container with add() method. NOTE: all specialized behavior incl. exceptions raised, depends on set type. Refer to documentation of set type for more details.

See also

set
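A usage sketch with the initial mapping from the BDMap example above:

    from kdvs.fw.Map import SetBDMap

    bd = SetBDMap({'a': 1, 'b': 1, 'c': 2, 'd': 3})
    bd.getFwdMap()  # {'a': set([1]), 'b': set([1]), 'c': set([2]), 'd': set([3])}
    bd.getBwdMap()  # {1: set(['a', 'b']), 2: set(['c']), 3: set(['d'])}
    bd['e'] = 3     # __setitem__ updates both directions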

class kdvs.fw.Map.PKCIDMap

Bases: object

Abstract bi–directional mapping (binned in sets) between prior knowledge concepts and individual measurements. The concrete implementation must implement the “build” method, where the mapping is built and the SetBDMap instance self.pkc2emid is filled with it.

build(*args, **kwargs)

class kdvs.fw.Map.PKCGeneMap

Bases: object

Abstract bi–directional mapping (binned in sets) between prior knowledge concepts and gene symbols. Used for gene expression data analysis; may not be present in all KDVS applications. The concrete implementation must implement the “build” method, where the mapping is built and the SetBDMap instance self.pkc2gene is filled with it.

build(*args, **kwargs)

class kdvs.fw.Map.GeneIDMap

Bases: object

Abstract bi–directional mapping (binned in sets) between gene symbols and individual measurements. Used for gene expression data analysis; may not be present in all KDVS applications. The concrete implementation must implement the “build” method, where the mapping is built and the SetBDMap instance self.gene2emid is filled with it.

build(*args, **kwargs)

PK Module

Provides high–level functionality for entities related to prior knowledge, such as prior knowledge concepts, prior knowledge managers, etc.

kdvs.fw.PK.PKC_DETAIL_ELEMENTS = ['conceptID', 'conceptName', 'domainID', 'description', 'additionalInfo']

Default elements of each prior knowledge concept recognized by KDVS.

class kdvs.fw.PK.PriorKnowledgeConcept(conceptid, name, domain_id=None, description=None, additionalInfo={})

Bases: object

The general representation of a prior knowledge concept. Specific details depend on the knowledge itself. For example, in gene expression data analysis, genes may be grouped into functional classes, and each class may be represented by a single prior knowledge concept. Prior knowledge concepts may additionally be grouped in domains if necessary; the concept of domain is used by the prior knowledge manager to expose a selected “subset” of knowledge, without the need to expose all of it. The concept is thinly wrapped in a dictionary.

Parameters :

conceptid : string

unique identifier of the concept across the whole knowledge

name : string

name of the concept

domain_id : string/None

unique identifier of the domain the concept is associated with, or None if the knowledge spans no domains

description : string/None

optional textual description of the concept

additionalInfo : dict

optional additional information associated with the concept; empty dictionary by default

keys()

Return all keys of the associated dictionary that holds the elements of the concept.

class kdvs.fw.PK.PKCManager

Bases: object

Abstract prior knowledge manager. The role of prior knowledge manager in KDVS is to read any specific representation of the knowledge, memorize the individual prior knowledge concepts, optionally map concepts to domains if necessary, and expose individual concepts through its API. The concrete implementation must implement the configure(), getPKC(), and dump() methods, and re–implement load() method. The manager must be configured before knowledge can be loaded. Concrete implementation may cache instances of PriorKnowledgeConcept or create them on the fly. Dump must be in serializable format, and should be human readable if possible. Mapping between concepts and domains by default is bi–directional (via SetBDMap).

configure(**kwargs)

isConfigured()

Return True if manager has been configured, False otherwise.

load(fh, **kwargs)

By default, this method raises Error if manager has not been configured yet.

getPKC(conceptID)

dump()

Report Module

Provides high–level functionality for generation of reports by KDVS.

kdvs.fw.Report.DEFAULT_REPORTER_PARAMETERS = ()

Default parameters for reporter.

class kdvs.fw.Report.Reporter(ref_parameters, **kwargs)

Bases: kdvs.core.util.Parametrizable

Abstract reporter. Reporter produces reports based on the results obtained from statistical techniques, where each subset has a single associated technique, and each computational job executes a technique on a subset. A reporter may work across a single category of results (in that case reports are “local”), or may cross the boundaries of individual categories (in that case reports are “global”). Each reporter may produce many single reports. Reporters are parametrizable, and report generation is done in the background in callback fashion after all computational jobs have been executed. Reporters are closely tied with their respective statistical techniques.

Parameters :

ref_parameters : iterable

reference parameters to be checked against during instantiation; empty tuple by default

kwargs : dict

actual parameters supplied during instantiation; they will be checked against reference ones

produce(resultsIter)

Produce reports. This method works across single category of results. By default, it does nothing. The implementation should fill self._reports with mapping

  • {file_name : [file_content_lines]}

By default, all report files will be created in the standard sublocation ‘results’. This may be changed by specifying ‘subloc1/.../sublocN/file_name’ as the file name. New sublocation paths may be constructed with the given location separator self.locsep.

Parameters :

resultsIter : iterable of Results

iterable of results obtained across single category

produceForHierarchy(subsetHierarchy, ssIndResults, currentCategorizerID, currentCategoryID)

Produce reports. This method works across the whole category tree. By default, it does nothing. The implementation should fill self._reports with mapping

  • {file_name : [file_content_lines]}

By default, all report files will be created in the standard sublocation ‘results’. This may be changed by specifying ‘subloc1/.../sublocN/file_name’ as the file name. New sublocation paths may be constructed with the given location separator self.locsep. NOTE: a reporter of this type may be requested to start work at a specific level of the category tree; the level is given by categorizer and category; in that case, it has access to the whole starting categorizer, and the entire subtree below it, and can start from the given category.

Parameters :

subsetHierarchy : SubsetHierarchy

current hierarchy of subsets that contains whole category tree

ssIndResults : dict of iterable of kdvs.fw.Stat.Results

iterables of Results obtained for all categories at once

currentCategorizerID : string

identifier of Categorizer from which the reporter will start work

currentCategoryID : string

optionally, identifier of category the reporter shall start with

initialize(storageManager, subsets_results_location, additionalData)

Initialize the reporter. Since reporter produces physical files, the concrete storage must be assigned for them. Also, it may accept any additional data necessary for its work.

Parameters :

storageManager : StorageManager

instance of storage manager that will govern the production of physical files

subsets_results_location : string

identifier of standard location used to store KDVS results

additionalData : object

any additional data used by the reporter

finalize()

Finalize reporter’s work by writing report files and clearing them.

getReports()

Get currently generated reports as dictionary

  • {‘file_name’ : [‘file_content_lines’]}

openReport(rlocation, content)

Request opening of new report in given location with specified content.

Parameters :

rlocation : string

location of new report; equivalent to ‘file_name’

content : iterable

content of new report; equivalent of ‘file_content_lines’

getAdditionalData()

Return any additional data associated with this reporter.

Stat Module

Provides high–level functionality for statistical techniques. Statistical technique accepts data subset and processes it as it sees fit. Technique shall produce Results object that is stored physically and may be used later to generate reports. Typically, each technique has its own specific Reporter associated.

class kdvs.fw.Stat.Labels(source, unused_sample_label=0)

Bases: object

Provides uniform information about labels. In supervised machine learning, where algorithms learn generalities from incoming known samples, the samples are of different types (typically two, sometimes more), and each type has an associated label. This information is present only when the statistical technique uses supervised classification; in that case, the label information shall be supplied as an additional input file and loaded into a DBTable instance. Typically, in the scenario with two classes of samples, the first class has the label ‘1’ associated, and the second class has the label ‘-1’ associated. See the ‘example_experiment’ directory for an example of a labels file.

Parameters :

source : DBTable

DBTable instance that contains label information

unused_sample_label : integer

if this label is specified in the label information, the related samples are skipped entirely from processing; 0 by default

Raises :

Error :

if source is incorrectly specified

getLabels(samples_order, as_array=False)

Return labels in samples order, as read from input label information. When primary data set is read, samples are in specific order, e.g.

  • S1, S2, ..., S40

However, label information specified in separated input file can have different sample order, e.g.

  • S32, S33, ..., S40, S1, S2, S3, ..., S31

Here it is ensured that labels are ordered according to specified sample order.

Parameters :

samples_order : iterable

iterable of sample names; returned labels will be ordered according to the order of samples

as_array : boolean

return labels as numpy.array of floats; False by default

Returns :

lbsord : iterable

labels ordered according to specified samples order; labels are returned as plain text or as numerical numpy.array if requested
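A usage sketch; labels_dt stands for a DBTable instance holding the label information, and the sample names are illustrative (getSamples is described below):

    from kdvs.fw.Stat import Labels

    labels = Labels(labels_dt)  # samples labelled 0 are skipped
    samples_order = ('S1', 'S2', 'S3')  # order from the primary data set
    y = labels.getLabels(samples_order, as_array=True)  # numpy array of labels
    used = labels.getSamples(samples_order)  # used samples, in that order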

getSamples(samples_order)

Return used samples in samples order, as read from input label information. Useful when reordering samples according to specific order, and skipping unused samples (that have ‘unused_sample_label’ associated).

Parameters :

samples_order : iterable

iterable of sample names; returned used samples will be ordered according to this order

Returns :

smpord : iterable

used samples ordered according to specified order

kdvs.fw.Stat.NOTPRESENT = <NotPresent>

Constant that represents element that has not been present among Results.

kdvs.fw.Stat.DEFAULT_RESULTS = ()

Default empty content of Results wrapper object.

kdvs.fw.Stat.RESULTS_SUBSET_ID_KEY = 'SubsetID'

Standard Results element that refers to ID of data subset processed; typically, equivalent to associated prior knowledge concept identifier.

kdvs.fw.Stat.RESULTS_PLOTS_ID_KEY = 'Plots'

Standard Results element that refers to dictionary of plots associated with the result. Plots are produced with Plot according to specification.

kdvs.fw.Stat.RESULTS_RUNTIME_KEY = 'Runtime'

Standard Results element that refers to any information available in runtime, that needs to be included with the result itself.

class kdvs.fw.Stat.Results(ssID, elements=None)

Bases: object

Wrapper for results obtained from statistical technique. Result is typically composed of various elements produced by the technique. The element can be any object of any valid Python/numpy type. Elements are referred to by their names, and Results instance works like a dictionary. If an element is a dictionary itself, it can contain nested dictionaries, so the following syntax also works:

  • Results[‘element_name’][‘subelement_name1’]...[‘subelement_nameN’]

In the documentation, this is represented as:

  • ‘element_name’->’subelement_name1’->...->’subelement_nameN’

Each valid statistical technique shall produce exactly one instance of Results for exactly one data subset. This class exposes partial dict API and implements __getitem__ and __setitem__ methods.

Parameters :

ssID : string

identifier of data subset processed; this identifier will be referred later as the content of RESULTS_SUBSET_ID_KEY element

elements : iterable of string

elements that will be present in the instance; by default, each element is NOTPRESENT

keys()

Return all element names.
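A usage sketch; element names and values are illustrative:

    from kdvs.fw.Stat import Results

    res = Results('GO:0006915', elements=('Selection', 'Stats'))
    res['Stats'] = {'errors': {'test': 0.12}}
    res['Stats']['errors']['test']  # nested access, as described above
    res['SubsetID']                 # -> 'GO:0006915' (RESULTS_SUBSET_ID_KEY)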

kdvs.fw.Stat.DEFAULT_CLASSIFICATION_RESULTS = ('Classification Error',)

Default Results element that shall always be produced by techniques that incorporate classification.

kdvs.fw.Stat.DEFAULT_SELECTION_RESULTS = ('Selection',)

Default Results element that shall always be produced by techniques that incorporate some kind of ‘selection’ (including variable selection).

kdvs.fw.Stat.DEFAULT_GLOBAL_PARAMETERS = ('global_degrees_of_freedom', 'job_importable')

Default statistical technique parameters that shall always be present.

class kdvs.fw.Stat.Technique(ref_parameters, **kwargs)

Bases: kdvs.core.util.Parametrizable

Abstract statistical technique that processes data subsets one at a time. A technique is parametrizable and is initialized during instance creation. The technique processes a data subset by creating one or more jobs to be executed by the specified job container. After the job(s) are finished, the technique produces a single Results instance. This split of functionality was introduced to ease the implementation of techniques that use cross validation extensively. A concrete implementation must implement the produceResults() method and reimplement the createJob() method. In the simplest case, a single job that wraps a single function call may be generated. More complicated implementations may require generating cross validation splits, processing them in separate jobs, and merging the partial results into a single one.

Parameters :

ref_parameters : iterable

reference parameters to be checked against during instantiation; empty tuple by default

kwargs : dict

actual parameters supplied during instantiation; they will be checked against reference ones

createJob(ssname, data, labels=None, additionalJobData={})

This method must be reimplemented as a generator that yields jobs to be executed. By default, it only checks if input data are correctly specified.

Parameters :

ssname : string

identifier of data subset being processed; typically, equivalent to associated prior knowledge concept

data : numpy.ndarray

data to be processed; could be whole data subset or its part (e.g. training or test split)

labels : numpy.ndarray/None

associated label information to be processed; used when technique incorporates supervised classification; None if not needed

additionalJobData : dict

any additional information that will be associated with each job produced; empty dictionary by default

Returns :

(jID, job) : string, Job

tuple of the following: custom job ID and Job instance to be executed

Notes

Proper order of data and labels must be ensured for the technique to work. Typically, subsets are generated according to the samples order specified within the primary data set; labels must be in the same order. Note that this is not checked during job execution.

produceResults(ssname, jobs, runtime_data)

Must be implemented in subclass. It returns single Results instance.

Parameters :

ssname : string

identifier of data subset being processed; typically, equivalent to associated prior knowledge concept

jobs : iterable of Job

executed job(s) that contain(s) raw results

runtime_data : dict

data collected at runtime that shall be included in the final Results instance

Returns :

final_results : Results

Results instance that contains final results of the technique
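
A minimal sketch of a concrete technique is shown below. It assumes that a Job (imported here from a presumed kdvs.fw.Job module) wraps a callable together with its arguments, and that an executed job exposes its raw output as the attribute ‘result’; the actual Job API may differ.

    import numpy as np

    from kdvs.fw.Stat import Technique, Results

    def _mean_variance(data):
        # toy computation standing in for a real statistical procedure
        return {'mean': float(np.mean(data)), 'var': float(np.var(data))}

    class MeanVarianceTechnique(Technique):

        def __init__(self, **kwargs):
            # no technique-specific reference parameters in this sketch
            super(MeanVarianceTechnique, self).__init__((), **kwargs)

        def createJob(self, ssname, data, labels=None, additionalJobData={}):
            from kdvs.fw.Job import Job  # assumed import path for Job
            # the simplest case: a single job wrapping a single function call
            yield ('%s_job0' % ssname, Job(_mean_variance, (data,)))

        def produceResults(self, ssname, jobs, runtime_data):
            results = Results(ssname, elements=('mean', 'var'))
            raw = jobs[0].result  # assumed attribute holding the job output
            results['mean'] = raw['mean']
            results['var'] = raw['var']
            return results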

kdvs.fw.Stat.NOTSELECTED = <NotSelected>

Constant that denotes an entity that has not been selected.

kdvs.fw.Stat.SELECTED = <Selected>

Constant that denotes an entity that has been selected.

kdvs.fw.Stat.SELECTIONERROR = <SelectionError>

Constant that denotes an error encountered during the selection process.

class kdvs.fw.Stat.Selector(parameters, **kwargs)

Bases: kdvs.core.util.Parametrizable

Abstract parametrizable wrapper for selection activity. Generally, KDVS understands ‘selection’ in a much wider context than the machine learning community. Both prior knowledge concepts and variables from data subsets can be ‘selected’. Some statistical techniques incorporate variable selection (in the machine learning sense), some do not. In order to unify the concept, KDVS introduces a ‘selection activity’ that marks specified entities as ‘properly selected’. For example, if the technique incorporates proper variable selection, the concrete Selector instance will simply recognize it and mark the selected variables as ‘properly selected’. If the technique does not involve variable selection, the concrete Selector instance may simply declare some variables as ‘properly selected’ or not, depending on the needs. If some prior knowledge concepts could be ‘selected’ in any sense, another concrete Selector can accomplish this as well. Selectors produce ‘selection markings’ that can be saved and reported later. The concrete subclass must implement the perform() method. Selectors are closely tied to techniques and reporters.

Parameters :

parameters : iterable

reference parameters to be checked against during instantiation; empty tuple by default

kwargs : dict

actual parameters supplied during instantiation; they will be checked against reference ones

perform(*args, **kwargs)

Perform selection activity. Typically, a Selector accepts a Results instance and, depending on the needs, may go through individual variables of the data subset, marking them as ‘properly selected’ or not, or may mark the whole data subset (that has an associated prior knowledge concept) as ‘selected’. In ambiguous cases, the selector can use the SELECTIONERROR constant. The associated Reporter instance shall recognize properly selected prior knowledge concepts and/or variables, and report them accordingly. This method must also return the ‘selection markings’ in a format understandable to the Reporter.
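
A minimal sketch of a subset–level selector (the parameter name ‘error_threshold’, the way parameters are accessed, and the shape of the ‘Classification Error’ element are all assumptions, not part of the documented API):

    from kdvs.fw.Stat import Selector, SELECTED, NOTSELECTED

    class ErrorThresholdSelector(Selector):

        def __init__(self, **kwargs):
            super(ErrorThresholdSelector, self).__init__(('error_threshold',), **kwargs)

        def perform(self, results_list):
            # mark each whole data subset as selected when its error is low enough
            threshold = self.parameters['error_threshold']  # assumed accessor
            markings = {}
            for results in results_list:
                err = results['Classification Error']
                markings[results['SubsetID']] = SELECTED if err <= threshold else NOTSELECTED
            return markings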

class kdvs.fw.Stat.Plot

Bases: object

Abstract wrapper for plot. Concrete implementation must implement configure(), create(), and plot() methods. When using the plotter, the following sequence of calls shall be issued: configure(), create(), plot().

configure(**kwargs)

create(**kwargs)

plot(**kwargs)
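
The documented sequence of calls, sketched with a hypothetical concrete plotter (all names and keyword arguments below are illustrative only):

    plotter = ConcretePlot()              # hypothetical Plot subclass
    plotter.configure(backend='Agg')      # assumed configuration option
    plotter.create(x=xvals, y=yvals)      # assumed data arguments
    content = plotter.plot(format='png')  # assumed to yield the plot content
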
kdvs.fw.Stat.calculateConfusionMatrix(original_labels, predicted_labels, positive_label=1, negative_label=-1)

Calculate confusion matrix for original and predicted labels. It is used when the labels reflect two classes: one class is referred to as ‘cases’ (with the positive label associated), and the second class is referred to as ‘control’ (with the negative label associated).

Parameters :

original_labels : iterable of integer

original labels as integers

predicted_labels : iterable of integer

predicted labels as integers

positive_label : integer

label that refers to ‘cases’ class; 1 by default (hence positive label)

negative_label : integer

label that refers to ‘control’ class; -1 by default (hence negative label)

Returns :

(tp, tn, fp, fn) : tuple of integer

tuple of the following: number of true positives, number of true negatives, number of false positives, number of false negatives

Raises :

Error :

if the numbers of original and predicted labels differ

Error :

if any label has a value other than the ‘positive’ or ‘negative’ one

kdvs.fw.Stat.calculateMCC(tp, tn, fp, fn)

Calculate Matthews Correlation Coefficient (MCC) for the given confusion matrix, defined as MCC = (tp*tn - fp*fn) / sqrt((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn)).

Parameters :

tp : integer

number of true positives

tn : integer

number of true negatives

fp : integer

number of false positives

fn : integer

number of false negatives

Returns :

mcc : float

value of the Matthews Correlation Coefficient
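
A small usage sketch combining both helpers (labels follow the default +1/-1 convention):

    from kdvs.fw.Stat import calculateConfusionMatrix, calculateMCC

    original = [1, 1, -1, -1, 1, -1]
    predicted = [1, -1, -1, -1, 1, 1]
    tp, tn, fp, fn = calculateConfusionMatrix(original, predicted)
    # tp=2, tn=2, fp=1, fn=1 for the labels above
    mcc = calculateMCC(tp, tn, fp, fn)
    # mcc = (2*2 - 1*1) / sqrt(3*3*3*3) = 3/9, approximately 0.333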

StorageManager Module

Provides functionality for path and subdirectory management.

kdvs.fw.StorageManager.SUBLOCATION_SEPARATOR = '/'

Standard separator used for specifying sublocations. It may differ from path separator on current platform.

class kdvs.fw.StorageManager.StorageManager(name=None, root_path=None, create_dbm=False)

Bases: object

Storage manager that operates on the file system provided by the operating system and accessible to the Python interpreter through the os module. The storage manager manages ‘locations’ that refer to subdirectories under a specified root path; manipulation of concrete directory paths is hidden from the user.

Parameters :

name : string/None

name of the current instance; it will be used to identify all managed locations; if None, the name is generated randomly (UUID4)

root_path : string/None

directory path that refers to the root of locations that will be managed by this instance; if None, default root path will be used (‘~/.kdvs/’)

create_dbm : boolean

if True, default DBManager instance will be created as well, rooted on specified root path; False by default

createLocation(location=None)

Create specified location. Location may be specified as

  • ‘loc’

or

  • ‘loc/loc1/loc2/.../locN’

In the first case, subdirectory

  • ‘loc’

will be created under the root path of the manager, with concrete path

  • ‘root/loc’

In the second case, all nested subdirectories will be created, if not created already, and the concrete path will be

  • ‘root/loc/loc1/loc2/.../locN’

In addition, all partial sublocations

  • ‘root/loc’
  • ‘root/loc/loc1’
  • ‘root/loc/loc1/loc2’
  • ...

will be managed as well under the names

  • ‘loc’
  • ‘loc/loc1’
  • ‘loc/loc1/loc2’
  • ...

Path separators may differ across platforms.

Parameters :

location : string/None

new location to create under the manager root path; if None, random location name will be used (UUID4)

See also

uuid, os.path
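
A minimal usage sketch (the root path and location names are arbitrary; getLocation() is documented below):

    from kdvs.fw.StorageManager import StorageManager

    sm = StorageManager(name='example', root_path='/tmp/kdvs_example')
    sm.createLocation('data/jobs/plots')
    # partial sublocations 'data' and 'data/jobs' are now managed as well
    plots_path = sm.getLocation('data/jobs/plots')  # physical path under the root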

getLocation(location)

Return physical directory path for given location.

Parameters :

location : string

managed location

Returns :

path : string/None

physical path associated with specified location, or None if location does not exist

removeLocation(location, leafonly=True)

Remove location from managed locations. This method considers two cases. When location is e.g.

  • ‘loc/loc1/loc2’

and leaf mode is not requested, physical subdirectory

  • ‘root/loc/loc1/loc2’

will be deleted along with all nested subdirectories, and all managed sublocations. If leaf mode is requested, only the most nested subdirectory

  • ‘root/loc/loc1/loc2’

will be deleted and

  • ‘root/loc/loc1’

will be left, along with all managed sublocations.

Parameters :

location : string

managed location to remove

leafonly : boolean

if True, leaf mode will be used during removal; True by default
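
Continuing the sketch above, leaf–only removal deletes only the most nested subdirectory:

    sm.removeLocation('data/jobs/plots', leafonly=True)
    # 'data' and 'data/jobs' are still managed; 'data/jobs/plots' is gone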

getRootLocationID()

Return identifier of root location for this manager instance.

getRootLocation()

Return physical directory path of root location for this manager instance.

SubsetHierarchy Module

Provides high–level functionality for management of a hierarchy of data subsets. Subsets can be organized hierarchically according to prior knowledge domains, or according to user–specific criteria. The hierarchy is built based on categorizers.

class kdvs.fw.SubsetHierarchy.SubsetHierarchy

Bases: object

Abstract subset hierarchy manager. It constructs and manages two entities available as global attributes: hierarchy and symboltree. Data subsets may be categorized with categorizers, and categories may be nested. Hierarchy refers to the nested categories as the dictionary of the following format:

  • {parent_category : child_categorizer_id}

where root categorizer is keyed with None (because of no parent), and categories in last categorizer are valued with None (because of no children). Symboltree refers to symbols categorized by categories as the dictionary of the following format:

  • {parent_category : {child_category1 : [associated symbols], ..., child_categoryN : [associated symbols]}}

In contrast to hierarchy, symboltree does not contain ‘None’ keys/values. Typically, symbols refer to prior knowledge concepts. Concrete implementation must implement obtainDatasetFromSymbol() method.
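
An illustrative (entirely hypothetical) example of both structures, for a two–level hierarchy with a root categorizer that splits concepts by size, nested with a categorizer that splits them by domain:

    hierarchy = {
        None: 'size_categorizer',       # root categorizer has no parent
        'small': 'domain_categorizer',  # categories of the root categorizer
        'large': 'domain_categorizer',
        'BP': None,                     # categories of the last categorizer
        'MF': None,                     # have no children
    }
    symboltree = {
        'small': {'BP': ['GO:0006355'], 'MF': ['GO:0003700']},
        'large': {'BP': ['GO:0007165'], 'MF': []},
    }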

build(categorizers_list, categorizers_inst_dict, initial_symbols)

Build categories hierarchy and symboltree.

Parameters :

categorizers_list : iterable of string

iterable of identifiers of Categorizer instances, starting from root of the tree

categorizers_inst_dict : dict

dictionary of Categorizer instances, identified by specified identifiers

initial_symbols : iterable of string

initial pool of symbols to be ‘partitioned’ with nested categorizers into symboltree; typically, contains all prior knowledge concepts (from single domain or all domains if necessary)

Raises :

Error :

if requested Categorizer instance is not found in the instances dictionary
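
A hypothetical build() call matching the illustration above (the Categorizer instances are placeholders):

    sh = ConcreteSubsetHierarchy()  # concrete subclass implementing obtainDatasetFromSymbol()
    sh.build(
        categorizers_list=['size_categorizer', 'domain_categorizer'],
        categorizers_inst_dict={
            'size_categorizer': size_categorizer_inst,
            'domain_categorizer': domain_categorizer_inst,
        },
        initial_symbols=['GO:0006355', 'GO:0003700', 'GO:0007165'],
    )
    # sh.hierarchy and sh.symboltree are now populated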

obtainDatasetFromSymbol(symbol)

Must be implemented in subclass. It shall return an instance of DataSet for given symbol.