modularml.core.data_structures

class modularml.core.data_structures.Batch(role_samples: ~typing.Dict[str, ~modularml.core.data_structures.sample_collection.SampleCollection], role_sample_weights: ~typing.Dict[str, ~modularml.core.data_structures.data.Data] = None, label: str | None = None, uuid: str = <factory>)

Bases: object

Container for a single batch of samples.

role_samples

Sample collections in the batch, each assigned to a string-based “role”. E.g., for triplet-based batches, you’d have role_samples={‘anchor’:List[Sample], ‘negative’:List[Sample], …}.

Type:

Dict[str, SampleCollection]

role_sample_weights

Weights applied to samples in this batch, keyed by the same string-based “role” names. E.g., role_sample_weights={‘anchor’:List[float], ‘negative’:…, …}. If None, all samples will have the same weight.

Type:

Dict[str, modularml.core.data_structures.data.Data]

label

Optional user-assigned label.

Type:

str, optional

uuid

A globally unique ID for this batch. Automatically assigned if not provided.

Type:

str

property available_roles: List[str]

All assigned roles in this batch.
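The role-keyed layout described above can be illustrated with a minimal stand-alone sketch. It does not import modularml: `BatchSketch` is a hypothetical stand-in, plain lists of dicts replace SampleCollection, and the UUID factory is an assumption about how an ID might be auto-assigned.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from uuid import uuid4


@dataclass
class BatchSketch:
    """Hypothetical stand-in for Batch; plain lists replace SampleCollection."""

    role_samples: Dict[str, List[dict]]
    role_sample_weights: Optional[Dict[str, List[float]]] = None
    label: Optional[str] = None
    # A fresh ID is generated per instance when none is supplied (assumed behavior).
    uuid: str = field(default_factory=lambda: str(uuid4()))

    @property
    def available_roles(self) -> List[str]:
        # Roles are simply the keys of the role_samples mapping.
        return list(self.role_samples.keys())


batch = BatchSketch(
    role_samples={
        "anchor": [{"voltage": 3.7}],
        "negative": [{"voltage": 4.1}],
    }
)
other = BatchSketch(role_samples={})
print(batch.available_roles)
```

Leaving role_sample_weights as None in the sketch mirrors the documented default, where all samples carry the same weight.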

class modularml.core.data_structures.BatchComponentSelector(role: str, sample_attribute: SampleAttribute, attribute_key: str | None = None)

Bases: object

A selector class to wrap accessing specific components of the Batch object.

role

The role within Batch to use.

Type:

str

sample_attribute

The attribute of Sample to use.

Type:

SampleAttribute

attribute_key

An optional subset of the specified Sample.sample_attribute. E.g., if Sample.features contains {‘voltage’:…, ‘current’:…}, you can access just the ‘voltage’ component using: sample_attribute=’features’, attribute_key=’voltage’

Type:

str, optional
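As a rough illustration of the role, sample_attribute, attribute_key lookup chain, the sketch below walks nested dicts. It is not the library's implementation: `select_component` is a hypothetical helper, and plain strings stand in for the SampleAttribute type.

```python
from typing import Any, Dict, List, Optional


def select_component(
    batch: Dict[str, List[Dict[str, Dict[str, Any]]]],
    role: str,
    sample_attribute: str,
    attribute_key: Optional[str] = None,
) -> List[Any]:
    """Pick one attribute (optionally one key of it) from every sample in a role."""
    selected = []
    for sample in batch[role]:
        attribute = sample[sample_attribute]
        # Without a key the whole attribute dict is returned for each sample.
        selected.append(attribute if attribute_key is None else attribute[attribute_key])
    return selected


batch = {
    "anchor": [
        {"features": {"voltage": 3.7, "current": 1.2}, "targets": {"soh": 91.0}},
        {"features": {"voltage": 3.9, "current": 1.1}, "targets": {"soh": 89.0}},
    ]
}
voltages = select_component(batch, "anchor", "features", "voltage")
all_features = select_component(batch, "anchor", "features")
```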

class modularml.core.data_structures.Data(value: Any)

Bases: object

A container to wrap any backend-specific data type.

class modularml.core.data_structures.FeatureSet(label: str, samples: List[Sample])

Bases: SampleCollection

Container for structured data.

Organizes any raw data into a standardized format.

Initialize a new FeatureSet.

Parameters:
  • label (str) – Name to assign to this FeatureSet

  • samples (List[Sample]) – List of samples

__init__(label: str, samples: List[Sample])

Initialize a new FeatureSet.

Parameters:
  • label (str) – Name to assign to this FeatureSet

  • samples (List[Sample]) – List of samples

add_subset(subset: FeatureSubset)

Adds a new FeatureSubset (view of FeatureSet.samples).

Parameters:

subset (FeatureSubset) – The subset to add.

clear_subsets() None

Remove all previously defined subsets.

filter(**conditions: Dict[str, Any | List[Any] | Callable]) FeatureSubset | None

Filter samples using conditions applied to tags, features, or targets.

Parameters:

conditions (Dict[str, Union[Any, List[Any], Callable]]) – Key-value pairs where keys correspond to any attribute of the samples’ tags, features, or targets, and values specify the filter condition. Values can be:
  • A literal value (== match)

  • A list/tuple/set/ndarray of values (membership match)

  • A callable (e.g., lambda x: x < 100)

Example: For a FeatureSet where its samples have the following attributes:

  • Sample.tags.keys() -> ‘cell_id’, ‘group_id’, ‘pulse_type’

  • Sample.features.keys() -> ‘voltage’, ‘current’

  • Sample.targets.keys() -> ‘soh’

a filter condition can be applied such that:
  • cell_id is in [1, 2, 3]

  • group_id is greater than 1, and

  • pulse_type equals ‘charge’.

``` python
>>> FeatureSet.filter(cell_id=[1,2,3], group_id=(lambda x: x > 1), pulse_type='charge')
```

Generally, filtering is applied to the attributes of Sample.tags, but it can also be useful to apply it to the Sample.targets keys. For example, we might want to filter to a specific state-of-health (soh) range:

``` python
# Assuming Sample.targets.keys() returns 'soh', ...
>>> FeatureSet.filter(soh=lambda x: (x > 85) & (x < 95))
```

This returns a FeatureSubset that contains a view of the samples in FeatureSet that have a state-of-health between 85% and 95%.

Returns:

A new subset containing samples that match all conditions. None is returned if there are no such samples.

Return type:

FeatureSubset | None
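The three condition types can be sketched without modularml as follows. This is a simplified model, not the library's code: `matches` and `filter_records` are hypothetical names, each sample is flattened to a single dict (tags, features, and targets merged), and ndarray conditions are left out to keep the sketch stdlib-only.

```python
from typing import Any, Dict, List


def matches(value: Any, condition: Any) -> bool:
    """Apply one filter condition to one value."""
    if callable(condition):
        return bool(condition(value))       # callable: keep if it returns truthy
    if isinstance(condition, (list, tuple, set)):
        return value in condition           # collection: membership match
    return value == condition               # literal: equality match


def filter_records(records: List[Dict[str, Any]], **conditions: Any) -> List[Dict[str, Any]]:
    """Keep only the records that satisfy every condition."""
    return [
        record
        for record in records
        if all(matches(record[key], cond) for key, cond in conditions.items())
    ]


records = [
    {"cell_id": 1, "group_id": 2, "pulse_type": "charge"},
    {"cell_id": 2, "group_id": 1, "pulse_type": "charge"},
    {"cell_id": 4, "group_id": 3, "pulse_type": "charge"},
    {"cell_id": 3, "group_id": 2, "pulse_type": "discharge"},
]
kept = filter_records(
    records,
    cell_id=[1, 2, 3],
    group_id=lambda x: x > 1,
    pulse_type="charge",
)
```

Only the first record passes all three conditions. The sketch returns an empty list when nothing matches, whereas the documented method returns None in that case.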

classmethod from_df(label: str, df: pandas.DataFrame, feature_cols: str | List[str], target_cols: str | List[str], groupby_cols: str | List[str] | None = None, tag_cols: str | List[str] | None = None, feature_transform: FeatureTransform | None = None, target_transform: FeatureTransform | None = None) FeatureSet

Construct a FeatureSet from a pandas DataFrame.

Parameters:
  • label (str) – Label to assign to this FeatureSet.

  • df (pd.DataFrame) – DataFrame to construct FeatureSet from.

  • feature_cols (Union[str, List[str]]) – Column name(s) in df to use as features.

  • target_cols (Union[str, List[str]]) – Column name(s) in df to use as targets.

  • groupby_cols (Union[str, List[str]], optional) – If a single feature spans multiple rows in df, groupby_cols are used to define groups where each group represents a single feature sequence. Defaults to None.

  • tag_cols (Union[str, List[str]], optional) – Column name(s) corresponding to identifying information that should be retained in the FeatureSet. Defaults to None.

  • feature_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the feature(s). Defaults to None.

  • target_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the target(s). Defaults to None.
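The role of groupby_cols can be pictured with a small stdlib sketch: rows sharing a group key are collected into one feature sequence. It does not use pandas or modularml; `group_sequences` and the row layout are hypothetical, and the sketch says nothing about how targets or tags are reduced per group.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Each row is one time step; two rows per cell form one voltage sequence.
rows = [
    {"cell_id": 1, "voltage": 3.6, "soh": 95.0},
    {"cell_id": 1, "voltage": 3.7, "soh": 95.0},
    {"cell_id": 2, "voltage": 3.5, "soh": 88.0},
    {"cell_id": 2, "voltage": 3.9, "soh": 88.0},
]


def group_sequences(
    rows: List[Dict[str, Any]], groupby_col: str, feature_col: str
) -> Dict[Any, List[Any]]:
    """Collect the feature values of rows sharing a group key, in row order."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for row in rows:
        groups[row[groupby_col]].append(row[feature_col])
    return dict(groups)


sequences = group_sequences(rows, groupby_col="cell_id", feature_col="voltage")
```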

classmethod from_dict(label: str, data: Dict[str, List[Any]], feature_keys: str | List[str], target_keys: str | List[str], tag_keys: str | List[str] | None = None) FeatureSet

Construct a FeatureSet from a dictionary of column -> list of values. Each list should be of the same length (one entry per sample).

Parameters:
  • label (str) – Name to assign to this FeatureSet

  • data (Dict[str, List[Any]]) – Input dictionary. Each key maps to a list of values.

  • feature_keys (Union[str, List[str]]) – Keys in data to be used as features.

  • target_keys (Union[str, List[str]]) – Keys in data to be used as targets.

  • tag_keys (Optional[Union[str, List[str]]]) – Keys to use as tags. Optional.

Returns:

A new FeatureSet instance.

Return type:

FeatureSet
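The column-to-sample conversion can be sketched as below, under the assumption that entry i of every list belongs to sample i. `columns_to_samples` is a hypothetical helper that returns plain dicts rather than Sample objects, and the equal-length check is this sketch's own choice.

```python
from typing import Any, Dict, List, Optional, Union

Keys = Union[str, List[str]]


def _as_list(keys: Optional[Keys]) -> List[str]:
    """Accept a single key, a list of keys, or None."""
    if keys is None:
        return []
    return [keys] if isinstance(keys, str) else list(keys)


def columns_to_samples(
    data: Dict[str, List[Any]],
    feature_keys: Keys,
    target_keys: Keys,
    tag_keys: Optional[Keys] = None,
) -> List[Dict[str, Dict[str, Any]]]:
    """Turn a dict of equal-length columns into one record per sample."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_samples = lengths.pop() if lengths else 0
    return [
        {
            "features": {k: data[k][i] for k in _as_list(feature_keys)},
            "targets": {k: data[k][i] for k in _as_list(target_keys)},
            "tags": {k: data[k][i] for k in _as_list(tag_keys)},
        }
        for i in range(n_samples)
    ]


data = {"voltage": [3.6, 3.7, 3.8], "soh": [95.0, 92.0, 90.0], "cell_id": [1, 2, 3]}
samples = columns_to_samples(data, "voltage", "soh", tag_keys="cell_id")
```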

classmethod from_pandas(label: str, df: pandas.DataFrame, feature_cols: str | List[str], target_cols: str | List[str], groupby_cols: str | List[str] | None = None, tag_cols: str | List[str] | None = None, feature_transform: FeatureTransform | None = None, target_transform: FeatureTransform | None = None) FeatureSet

Construct a FeatureSet from a pandas DataFrame.

Parameters:
  • label (str) – Label to assign to this FeatureSet.

  • df (pd.DataFrame) – DataFrame to construct FeatureSet from.

  • feature_cols (Union[str, List[str]]) – Column name(s) in df to use as features.

  • target_cols (Union[str, List[str]]) – Column name(s) in df to use as targets.

  • groupby_cols (Union[str, List[str]], optional) – If a single feature spans multiple rows in df, groupby_cols are used to define groups where each group represents a single feature sequence. Defaults to None.

  • tag_cols (Union[str, List[str]], optional) – Column name(s) corresponding to identifying information that should be retained in the FeatureSet. Defaults to None.

  • feature_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the feature(s). Defaults to None.

  • target_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the target(s). Defaults to None.

get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all features across all samples in the specified format.

get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all tags across all samples in the specified format.

Each tag will be returned as a list of values across samples. The format argument controls the output structure.
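The "one list of values per tag" layout amounts to pivoting per-sample dicts into per-key lists. The sketch below shows that pivot with plain Python lists. `collect_tags` is a hypothetical helper, and the concrete container produced by each DataFormat value is not reproduced here.

```python
from typing import Any, Dict, List


def collect_tags(samples: List[Dict[str, Dict[str, Any]]]) -> Dict[str, List[Any]]:
    """Gather each tag's values across samples, preserving sample order."""
    collected: Dict[str, List[Any]] = {}
    for sample in samples:
        for key, value in sample["tags"].items():
            collected.setdefault(key, []).append(value)
    return collected


samples = [
    {"tags": {"cell_id": 1, "pulse_type": "charge"}},
    {"tags": {"cell_id": 2, "pulse_type": "discharge"}},
]
tags = collect_tags(samples)
```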

get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all targets across all samples in the specified format.

get_config(sample_path: str | Path | None = None) Dict[str, Any]

Get the serialized configuration of this FeatureSet. Sample data is saved to a file and only the filepath is serialized.

Parameters:

sample_path (Optional[Union[str, Path]], optional) – A path to save the sample data to. If None, the save path is ‘./data/{self.label}_samples.pkl’. Defaults to None.

Returns:

The config dict

Return type:

Dict[str, Any]

get_subset(name: str) FeatureSubset

Returns the specified subset of this FeatureSet. Use FeatureSet.available_subsets to view available subset names.

Parameters:

name (str) – Subset name to return.

Returns:

A named view of this FeatureSet.

Return type:

FeatureSubset

plot_sankey()

Plot a Sankey diagram showing how sample IDs flow across nested FeatureSubsets. Subset hierarchy is determined using dot-notation (e.g., ‘train.pretrain’). Samples with multiple subset memberships will show multiple paths.
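How a dot-notation name implies a hierarchy can be sketched as follows. This only illustrates the naming convention: `subset_path` and `hierarchy_links` are hypothetical helpers, and how the diagram itself is built is not shown in this reference.

```python
from typing import List, Tuple


def subset_path(name: str) -> List[str]:
    """Expand 'train.pretrain' into its chain of ancestors, root first."""
    parts = name.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def hierarchy_links(names: List[str]) -> List[Tuple[str, str]]:
    """Return unique (parent, child) pairs implied by the subset names."""
    links = []
    for name in names:
        path = subset_path(name)
        for parent, child in zip(path, path[1:]):
            if (parent, child) not in links:
                links.append((parent, child))
    return links


links = hierarchy_links(["train", "train.pretrain", "train.finetune", "test"])
```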

pop_subset(name: str) FeatureSubset

Pops the specified subset (removed from FeatureSet and returned).

Parameters:

name (str) – Subset name to pop

Returns:

The removed subset.

Return type:

FeatureSubset

remove_subset(name: str) None

Deletes the specified subset from this FeatureSet.

Parameters:

name (str) – Subset name to remove.

save_samples(path: str | Path)

Save the sample data to the specified path.

split(splitter: BaseSplitter) List[FeatureSubset]

Split the current FeatureSet into multiple FeatureSubsets. The created splits are automatically added to `FeatureSet.subsets`, in addition to being returned.

Parameters:

splitter (BaseSplitter) – The splitting method.

Returns:

The created subsets.

Return type:

List[FeatureSubset]

split_by_condition(**conditions: Dict[str, Dict[str, Any]]) List[FeatureSubset]

Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).

Parameters:

**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as .filter() method.

Examples: The call below defines three subsets (‘low_temp’, ‘high_temp’, and ‘cell_5’). The ‘low_temp’ subset contains all samples with temperatures under 20, the ‘high_temp’ subset contains all samples with temperatures of 20 or greater, and the ‘cell_5’ subset contains all samples where cell_id is 5. Note that subsets can have overlapping samples if the split conditions are not carefully defined. A UserWarning will be raised when this happens.

``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},
    high_temp={'temperature': lambda x: x >= 20},
    cell_5={'cell_id': 5},
)
```

Returns:

The created subsets.

Return type:

List[FeatureSubset]
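The overlap behavior can be reproduced in a small stand-alone sketch. It does not call modularml: `split_by_condition` here is a hypothetical re-implementation over plain dicts, each record carries an illustrative "id" field, and only literal and callable conditions are handled.

```python
import warnings
from typing import Any, Dict, List


def matches(value: Any, condition: Any) -> bool:
    """Callable conditions are evaluated; anything else is an equality check."""
    return bool(condition(value)) if callable(condition) else value == condition


def split_by_condition(
    records: List[Dict[str, Any]], **conditions: Dict[str, Any]
) -> Dict[str, List[str]]:
    """Assign record IDs to named subsets and warn when two subsets overlap."""
    subsets = {
        name: [r["id"] for r in records if all(matches(r[k], c) for k, c in rules.items())]
        for name, rules in conditions.items()
    }
    names = list(subsets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(subsets[first]) & set(subsets[second])
            if shared:
                warnings.warn(
                    f"subsets '{first}' and '{second}' share {len(shared)} sample(s)",
                    UserWarning,
                )
    return subsets


records = [
    {"id": "a", "temperature": 15, "cell_id": 1},
    {"id": "b", "temperature": 25, "cell_id": 5},
    {"id": "c", "temperature": 30, "cell_id": 2},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subsets = split_by_condition(
        records,
        low_temp={"temperature": lambda x: x < 20},
        high_temp={"temperature": lambda x: x >= 20},
        cell_5={"cell_id": 5},
    )
```

Sample "b" lands in both ‘high_temp’ and ‘cell_5’, so the sketch emits one warning for that pair.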

split_random(ratios: Dict[str, float], seed: int = 42) List[FeatureSubset]

Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).

Parameters:
  • ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios. E.g., ratios={‘train’:0.5, ‘test’:0.5}. All values must add to exactly 1.0.

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

Returns:

The created subsets.

Return type:

List[FeatureSubset]
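A seeded ratio split can be sketched as below. This is one possible approach, not necessarily what RandomSplitter does: the function works on sample IDs only, checks the ratio sum with a floating-point tolerance, and gives any rounding remainder to the last subset.

```python
import math
import random
from typing import Dict, Iterable, List


def split_random(ids: Iterable, ratios: Dict[str, float], seed: int = 42) -> Dict[str, List]:
    """Shuffle IDs with a fixed seed and slice them according to the ratios."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps global state untouched

    subsets: Dict[str, List] = {}
    names = list(ratios)
    start = 0
    for index, name in enumerate(names):
        if index == len(names) - 1:
            end = len(shuffled)            # last subset absorbs rounding remainder
        else:
            end = start + int(round(ratios[name] * len(shuffled)))
        subsets[name] = shuffled[start:end]
        start = end
    return subsets


first = split_random(range(10), {"train": 0.8, "test": 0.2}, seed=42)
second = split_random(range(10), {"train": 0.8, "test": 0.2}, seed=42)
```

Calling the function twice with the same seed yields identical subsets, which is the reproducibility the seed parameter is meant to provide.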

to_backend(backend: str | Backend) SampleCollection

Returns a new SampleCollection with all Data objects converted to the specified backend.

class modularml.core.data_structures.FeatureSubset(label: str, parent: FeatureSet, sample_uuids: List[str])

Bases: SampleCollection

A filtered subset of samples from a parent FeatureSet. Holds weak reference to parent FeatureSet to avoid circular memory references.

get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all features across all samples in the specified format.

get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all tags across all samples in the specified format.

Each tag will be returned as a list of values across samples. The format argument controls the output structure.

get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all targets across all samples in the specified format.

is_disjoint_with(other: FeatureSubset) bool

Checks whether this subset is disjoint from another (i.e., no overlapping samples).

Parameters:

other (FeatureSubset) – Another subset to compare against.

Returns:

True if the subsets are disjoint (no shared samples), False otherwise.

Return type:

bool
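Since a FeatureSubset is constructed from a list of sample UUIDs, a disjointness check can be sketched as a set operation on those IDs. `is_disjoint` is a hypothetical helper and the UUID strings are made up for illustration.

```python
from typing import Iterable


def is_disjoint(first_uuids: Iterable[str], second_uuids: Iterable[str]) -> bool:
    """True when the two ID collections share no element."""
    return set(first_uuids).isdisjoint(second_uuids)


train_ids = ["u1", "u2", "u3"]
test_ids = ["u4", "u5"]
leaky_ids = ["u3", "u6"]

clean_split = is_disjoint(train_ids, test_ids)
leaky_split = is_disjoint(train_ids, leaky_ids)
```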

property parent: FeatureSet

Access the parent FeatureSet. Raises error if parent is no longer alive.
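The weak-reference pattern behind the parent property can be sketched with the standard weakref module. `ParentSketch` and `SubsetSketch` are hypothetical stand-ins, and raising ReferenceError is this sketch's choice, since the reference does not state which error type the library raises.

```python
import gc
import weakref


class ParentSketch:
    def __init__(self, label: str):
        self.label = label


class SubsetSketch:
    def __init__(self, label: str, parent: ParentSketch):
        self.label = label
        # Store only a weak reference so the subset does not keep the parent alive.
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self) -> ParentSketch:
        parent = self._parent_ref()
        if parent is None:
            raise ReferenceError("parent no longer exists")
        return parent


feature_set = ParentSketch("cells")
subset = SubsetSketch("train", feature_set)
label_while_alive = subset.parent.label

# Drop the only strong reference; the weak reference then resolves to None.
del feature_set
gc.collect()

try:
    subset.parent
    parent_alive = True
except ReferenceError:
    parent_alive = False
```

The design avoids a reference cycle between the parent (which holds its subsets) and each subset (which points back to its parent).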

split(splitter: BaseSplitter) List[FeatureSubset]

Split the current FeatureSubset into multiple FeatureSubsets. The created splits are automatically added to the parent `FeatureSet.subsets`, in addition to being returned.

Parameters:

splitter (BaseSplitter) – The splitting method.

Returns:

The created subsets.

Return type:

List[FeatureSubset]

split_by_condition(**conditions: Dict[str, Dict[str, Any]]) List[FeatureSubset]

Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).

Parameters:

**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as .filter() method.

Examples: The call below defines three subsets (‘low_temp’, ‘high_temp’, and ‘cell_5’). The ‘low_temp’ subset contains all samples with temperatures under 20, the ‘high_temp’ subset contains all samples with temperatures of 20 or greater, and the ‘cell_5’ subset contains all samples where cell_id is 5. Note that subsets can have overlapping samples if the split conditions are not carefully defined. A UserWarning will be raised when this happens.

``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},
    high_temp={'temperature': lambda x: x >= 20},
    cell_5={'cell_id': 5},
)
```

Returns:

The created subsets.

Return type:

List[FeatureSubset]

split_random(ratios: Dict[str, float], seed: int = 42) List[FeatureSubset]

Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).

Parameters:
  • ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios. E.g., ratios={‘train’:0.5, ‘test’:0.5}. All values must add to exactly 1.0.

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

Returns:

The created subsets.

Return type:

List[FeatureSubset]

to_backend(backend: str | Backend) SampleCollection

Returns a new SampleCollection with all Data objects converted to the specified backend.

class modularml.core.data_structures.MultiBatch(batches: Dict[str, Batch])

Bases: object

Container for batches from multiple FeatureSets, keyed by FeatureSet.label.

class modularml.core.data_structures.Sample(features: ~typing.Dict[str, ~modularml.core.data_structures.data.Data], targets: ~typing.Dict[str, ~modularml.core.data_structures.data.Data], tags: ~typing.Dict[str, ~modularml.core.data_structures.data.Data], label: str | None = None, uuid: str = <factory>)

Bases: object

Container for a single sample.

features

A set of input features. Example: {‘voltage’: np.ndarray}

Type:

Dict[str, Any]

targets

A set of target values. Example: {‘soh’: float}

Type:

Dict[str, Any]

tags

Metadata used for filtering, grouping, or tracking.

Type:

Dict[str, Any]

label

Optional user-assigned label.

Type:

str, optional

uuid

A globally unique ID for this sample. Automatically assigned if not provided.

Type:

str

class modularml.core.data_structures.SampleCollection(samples: List[Sample])

Bases: object

A lightweight container for a list of Sample instances.

samples

A list of Sample instances.

Type:

List[Sample]

get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all features across all samples in the specified format.

get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all tags across all samples in the specified format.

Each tag will be returned as a list of values across samples. The format argument controls the output structure.

get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any

Returns all targets across all samples in the specified format.

to_backend(backend: str | Backend) SampleCollection

Returns a new SampleCollection with all Data objects converted to the specified backend.

Modules

batch

data

feature_set

feature_subset

feature_transform

multi_batch

sample

sample_collection