modularml.core
- class modularml.core.Batch(role_samples: Dict[str, SampleCollection], role_sample_weights: Dict[str, Data] = None, label: str | None = None, uuid: str = <factory>)
Bases:
object
Container for a single batch of samples.
- role_samples
Sample collections in batch assigned to a string-based “role”. E.g., for triplet-based batches, you’d have _samples={‘anchor’:List[Sample], ‘negative’:List[Sample], …}.
- Type:
Dict[str, SampleCollection]
- role_sample_weights
Weights applied to the samples in this batch, keyed by the same string-based “role” names. E.g., _sample_weights={‘anchor’:List[float], ‘negative’:…, …}. If None, all samples will have the same weight.
- Type:
Dict[str, modularml.core.data_structures.data.Data]
- label
Optional user-assigned label.
- Type:
str, optional
- uuid
A globally unique ID for this batch. Automatically assigned if not provided.
- Type:
str
- property available_roles: List[str]
All assigned roles in this batch.
- class modularml.core.Data(value: Any)
Bases:
object
A container to wrap any backend-specific data type.
- class modularml.core.FeatureSet(label: str, samples: List[Sample])
Bases:
SampleCollection
Container for structured data.
Organizes any raw data into a standardized format.
Initialize a new FeatureSet.
- Parameters:
label (str) – Name to assign to this FeatureSet
samples (List[Sample]) – List of samples
- __init__(label: str, samples: List[Sample])
Initialize a new FeatureSet.
- Parameters:
label (str) – Name to assign to this FeatureSet
samples (List[Sample]) – List of samples
- add_subset(subset: FeatureSubset)
Adds a new FeatureSubset (view of FeatureSet.samples).
- Parameters:
subset (FeatureSubset) – The subset to add.
- clear_subsets() None
Remove all previously defined subsets.
- filter(**conditions: Dict[str, Any | List[Any] | Callable]) FeatureSubset | None
Filter samples using conditions applied to tags, features, or targets.
- Parameters:
conditions (Dict[str, Union[Any, List[Any], Callable]]) – Key-value pairs where keys correspond to any attribute of the samples’ tags, features, or targets, and values specify the filter condition. Values can be:
- A literal value (== match)
- A list/tuple/set/ndarray of values
- A callable (e.g., lambda x: x < 100)
Example: For a FeatureSet whose samples have the following attributes:
Sample.tags.keys() -> ‘cell_id’, ‘group_id’, ‘pulse_type’
Sample.features.keys() -> ‘voltage’, ‘current’
Sample.targets.keys() -> ‘soh’
a filter condition can be applied such that:
cell_id is in [1, 2, 3],
group_id is greater than 1, and
pulse_type equals ‘charge’.
``` python
>>> FeatureSet.filter(cell_id=[1,2,3], group_id=(lambda x: x > 1), pulse_type='charge')
```
Generally, filtering is applied to the attributes of Sample.tags, but it can also be useful to apply it to the Sample.targets keys. For example, we might want to filter to a specific state-of-health (soh) range:
``` python
# Assuming Sample.targets.keys() returns 'soh', ...
>>> FeatureSet.filter(soh=lambda x: (x > 85) & (x < 95))
```
This returns a FeatureSubset that contains a view of the samples in FeatureSet that have a state-of-health between 85% and 95%.
- Returns:
A new subset containing samples that match all conditions. None is returned if there are no such samples.
- Return type:
FeatureSubset | None
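The three kinds of condition values described above (literal, collection, callable) can be illustrated with a small standalone sketch. It does not use modularml: samples are stood in for by plain dicts, and the helper names `matches` and `filter_records` are hypothetical, chosen only for this illustration. It assumes all conditions are combined with a logical AND, as the Returns note ("match all conditions") suggests.

```python
from typing import Any, Dict, List


def matches(value: Any, condition: Any) -> bool:
    """Return True if `value` satisfies a single filter condition."""
    if callable(condition):
        # Callable condition: call it and use the truthiness of the result.
        return bool(condition(value))
    if isinstance(condition, (list, tuple, set)):
        # Collection condition: membership test.
        return value in condition
    # Literal condition: plain equality match.
    return value == condition


def filter_records(records: List[Dict[str, Any]], **conditions: Any) -> List[Dict[str, Any]]:
    """Keep only the records whose attributes satisfy every condition."""
    return [
        record
        for record in records
        if all(matches(record[key], cond) for key, cond in conditions.items())
    ]


# Plain dicts standing in for the tags of four samples.
records = [
    {"cell_id": 1, "group_id": 2, "pulse_type": "charge"},
    {"cell_id": 2, "group_id": 1, "pulse_type": "charge"},
    {"cell_id": 3, "group_id": 3, "pulse_type": "discharge"},
    {"cell_id": 4, "group_id": 5, "pulse_type": "charge"},
]

selected = filter_records(
    records,
    cell_id=[1, 2, 3],
    group_id=lambda x: x > 1,
    pulse_type="charge",
)
print(selected)
```

Only the first record passes all three conditions; the others each fail exactly one of them.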
- classmethod from_df(label: str, df: pandas.DataFrame, feature_cols: str | List[str], target_cols: str | List[str], groupby_cols: str | List[str] | None = None, tag_cols: str | List[str] | None = None, feature_transform: FeatureTransform | None = None, target_transform: FeatureTransform | None = None) FeatureSet
Construct a FeatureSet from a pandas DataFrame.
- Parameters:
label (str) – Label to assign to this FeatureSet.
df (pd.DataFrame) – DataFrame to construct FeatureSet from.
feature_cols (Union[str, List[str]]) – Column name(s) in df to use as features.
target_cols (Union[str, List[str]]) – Column name(s) in df to use as targets.
groupby_cols (Union[str, List[str]], optional) – If a single feature spans multiple rows in df, groupby_cols are used to define groups where each group represents a single feature sequence. Defaults to None.
tag_cols (Union[str, List[str]], optional) – Column name(s) corresponding to identifying information that should be retained in the FeatureSet. Defaults to None.
feature_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the feature(s). Defaults to None.
target_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the target(s). Defaults to None.
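The role of groupby_cols can be pictured with a standalone sketch in plain Python (no pandas, no modularml). Rows are plain dicts, and the helper `group_rows` is a hypothetical name used only here; the real method may organize the grouped data differently.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# Four rows of a table in which each cell's voltage curve spans two rows.
rows = [
    {"cell_id": 1, "voltage": 3.0, "soh": 98.0},
    {"cell_id": 1, "voltage": 3.2, "soh": 98.0},
    {"cell_id": 2, "voltage": 3.1, "soh": 91.0},
    {"cell_id": 2, "voltage": 3.4, "soh": 91.0},
]


def group_rows(
    rows: List[Dict[str, Any]],
    groupby_cols: List[str],
    feature_cols: List[str],
) -> Dict[Tuple[Any, ...], Dict[str, List[Any]]]:
    """Collect the feature values of each group into one sequence per group."""
    groups: Dict[Tuple[Any, ...], Dict[str, List[Any]]] = defaultdict(
        lambda: {col: [] for col in feature_cols}
    )
    for row in rows:
        # The group key is the tuple of values in the group-by columns.
        key = tuple(row[col] for col in groupby_cols)
        for col in feature_cols:
            groups[key][col].append(row[col])
    return dict(groups)


sequences = group_rows(rows, groupby_cols=["cell_id"], feature_cols=["voltage"])
print(sequences)
```

Each group key ends up with one feature sequence, which is the sense in which a group "represents a single feature sequence".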
- classmethod from_dict(label: str, data: Dict[str, List[Any]], feature_keys: str | List[str], target_keys: str | List[str], tag_keys: str | List[str] | None = None) FeatureSet
Construct a FeatureSet from a dictionary of column -> list of values. Each list should be of the same length (one entry per sample).
- Parameters:
label (str) – Name to assign to this FeatureSet
data (Dict[str, List[Any]]) – Input dictionary. Each key maps to a list of values.
feature_keys (Union[str, List[str]]) – Keys in data to be used as features.
target_keys (Union[str, List[str]]) – Keys in data to be used as targets.
tag_keys (Optional[Union[str, List[str]]]) – Keys to use as tags. Optional.
- Returns:
A new FeatureSet instance.
- Return type:
FeatureSet
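As a rough picture of the "column -> list of values" layout, the following standalone sketch turns such a dictionary into one record per sample. It does not call modularml; `as_list` and `columns_to_records` are hypothetical helpers, and the nested features/targets/tags dict is only an assumed stand-in for a Sample.

```python
from typing import Any, Dict, List, Optional, Union

Keys = Union[str, List[str]]


def as_list(keys: Optional[Keys]) -> List[str]:
    """Normalize a single key, a list of keys, or None to a list."""
    if keys is None:
        return []
    return [keys] if isinstance(keys, str) else list(keys)


def columns_to_records(
    data: Dict[str, List[Any]],
    feature_keys: Keys,
    target_keys: Keys,
    tag_keys: Optional[Keys] = None,
) -> List[Dict[str, Dict[str, Any]]]:
    """Build one record per sample from equal-length columns."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("All lists in `data` must have the same length.")
    n_samples = lengths.pop() if lengths else 0
    return [
        {
            "features": {k: data[k][i] for k in as_list(feature_keys)},
            "targets": {k: data[k][i] for k in as_list(target_keys)},
            "tags": {k: data[k][i] for k in as_list(tag_keys)},
        }
        for i in range(n_samples)
    ]


records = columns_to_records(
    {"voltage": [3.7, 3.9], "soh": [98.0, 91.0], "cell_id": [1, 2]},
    feature_keys="voltage",
    target_keys="soh",
    tag_keys="cell_id",
)
print(records)
```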
- classmethod from_pandas(label: str, df: pandas.DataFrame, feature_cols: str | List[str], target_cols: str | List[str], groupby_cols: str | List[str] | None = None, tag_cols: str | List[str] | None = None, feature_transform: FeatureTransform | None = None, target_transform: FeatureTransform | None = None) FeatureSet
Construct a FeatureSet from a pandas DataFrame.
- Parameters:
label (str) – Label to assign to this FeatureSet.
df (pd.DataFrame) – DataFrame to construct FeatureSet from.
feature_cols (Union[str, List[str]]) – Column name(s) in df to use as features.
target_cols (Union[str, List[str]]) – Column name(s) in df to use as targets.
groupby_cols (Union[str, List[str]], optional) – If a single feature spans multiple rows in df, groupby_cols are used to define groups where each group represents a single feature sequence. Defaults to None.
tag_cols (Union[str, List[str]], optional) – Column name(s) corresponding to identifying information that should be retained in the FeatureSet. Defaults to None.
feature_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the feature(s). Defaults to None.
target_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the target(s). Defaults to None.
- get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all features across all samples in the specified format.
- get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all tags across all samples in the specified format.
Each tag will be returned as a list of values across samples. The format argument controls the output structure.
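The "one list of values per tag" layout can be sketched as follows. This is a standalone illustration with plain dicts, not the library's implementation; the helper name `collate` is hypothetical, and it assumes every sample carries the same tag keys.

```python
from typing import Any, Dict, List


def collate(sample_tags: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-sample tag dicts into one list of values per tag."""
    if not sample_tags:
        return {}
    # Assumes all samples share the keys of the first sample.
    return {key: [tags[key] for tags in sample_tags] for key in sample_tags[0]}


all_tags = collate(
    [
        {"cell_id": 1, "pulse_type": "charge"},
        {"cell_id": 2, "pulse_type": "discharge"},
    ]
)
print(all_tags)
```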
- get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all targets across all samples in the specified format.
- get_config(sample_path: str | Path | None = None) Dict[str, Any]
Get the serialized configuration of this FeatureSet. Sample data is saved to a file and only the filepath is serialized.
- Parameters:
sample_path (Optional[Union[str, Path]], optional) – A path to save the sample data to. If None, the save path is ‘./data/{self.label}_samples.pkl’. Defaults to None.
- Returns:
The config dict
- Return type:
Dict[str, Any]
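The idea of writing the sample data to disk and keeping only its path in the configuration can be sketched with the standard library. This is an assumption-laden illustration: `build_config` and the config keys "label" and "sample_path" are hypothetical, and the actual contents of the config dict are not specified here. Only the default path pattern is taken from the parameter description above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


def build_config(
    label: str,
    samples: List[Any],
    sample_path: Optional[Union[str, Path]] = None,
) -> Dict[str, Any]:
    """Pickle the samples and return a small config that stores only the path."""
    if sample_path is None:
        # Default location documented above: ./data/{label}_samples.pkl
        path = Path("data") / f"{label}_samples.pkl"
    else:
        path = Path(sample_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(samples, f)
    return {"label": label, "sample_path": str(path)}


with tempfile.TemporaryDirectory() as tmp:
    config = build_config("cells", [{"voltage": 3.7}], Path(tmp) / "cells_samples.pkl")
    # The config is lightweight; the data itself is restored from the file.
    with open(config["sample_path"], "rb") as f:
        restored = pickle.load(f)

print(config["label"], restored)
```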
- get_subset(name: str) FeatureSubset
Returns the specified subset of this FeatureSet. Use FeatureSet.available_subsets to view available subset names.
- Parameters:
name (str) – Subset name to return.
- Returns:
A named view of this FeatureSet.
- Return type:
FeatureSubset
- plot_sankey()
Plot a Sankey diagram showing how sample IDs flow across nested FeatureSubsets. Subset hierarchy is determined using dot-notation (e.g., ‘train.pretrain’). Samples with multiple subset memberships will show multiple paths.
- pop_subset(name: str) FeatureSubset
Pops the specified subset (removed from FeatureSet and returned).
- Parameters:
name (str) – Subset name to pop
- Returns:
The removed subset.
- Return type:
FeatureSubset
- remove_subset(name: str) None
Deletes the specified subset from this FeatureSet.
- Parameters:
name (str) – Subset name to remove.
- save_samples(path: str | Path)
Save the sample data to the specified path.
- split(splitter: BaseSplitter) List[FeatureSubset]
Split the current FeatureSet into multiple FeatureSubsets. The created splits are automatically added to `FeatureSet.subsets`, in addition to being returned.
- Parameters:
splitter (BaseSplitter) – The splitting method.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
- split_by_condition(**conditions: Dict[str, Dict[str, Any]]) List[FeatureSubset]
Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).
- Parameters:
**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as the .filter() method.
Examples: The call below defines three subsets (‘low_temp’, ‘high_temp’, and ‘cell_5’). The ‘low_temp’ subset contains all samples with temperatures under 20, the ‘high_temp’ subset contains all samples with temperatures of 20 or above, and the ‘cell_5’ subset contains all samples where cell_id is 5. Note that subsets can have overlapping samples if the split conditions are not carefully defined. A UserWarning will be raised when this happens.
``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},
    high_temp={'temperature': lambda x: x >= 20},
    cell_5={'cell_id': 5},
)
```
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
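The overlap behaviour described above can be illustrated with a standalone sketch that uses plain dicts instead of modularml objects. The function below is a hypothetical stand-in, not the library's ConditionSplitter: it returns index lists per subset name and issues a UserWarning when any sample lands in more than one subset.

```python
import warnings
from typing import Any, Dict, List


def matches(value: Any, condition: Any) -> bool:
    """Literal, collection, or callable condition, as in the filter rules."""
    if callable(condition):
        return bool(condition(value))
    if isinstance(condition, (list, tuple, set)):
        return value in condition
    return value == condition


def split_by_condition(
    records: List[Dict[str, Any]], **conditions: Dict[str, Any]
) -> Dict[str, List[int]]:
    """Map each subset name to the indices of the records matching its rules."""
    subsets = {
        name: [
            i
            for i, record in enumerate(records)
            if all(matches(record[key], cond) for key, cond in rules.items())
        ]
        for name, rules in conditions.items()
    }
    # Track indices already assigned to an earlier subset to detect overlap.
    seen, overlap = set(), set()
    for indices in subsets.values():
        overlap |= seen & set(indices)
        seen |= set(indices)
    if overlap:
        warnings.warn(
            f"{len(overlap)} sample(s) belong to more than one subset.", UserWarning
        )
    return subsets


records = [
    {"temperature": 15, "cell_id": 5},
    {"temperature": 25, "cell_id": 5},
    {"temperature": 30, "cell_id": 7},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subsets = split_by_condition(
        records,
        low_temp={"temperature": lambda x: x < 20},
        high_temp={"temperature": lambda x: x >= 20},
        cell_5={"cell_id": 5},
    )

print(subsets)
```

Here the two cell-5 samples also fall into a temperature subset, so the warning fires once.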
- split_random(ratios: Dict[str, float], seed: int = 42) List[FeatureSubset]
Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).
- Parameters:
ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios. E.g., ratios={‘train’:0.5, ‘test’:0.5}. All values must add to exactly 1.0.
seed (int, optional) – Random seed for reproducibility. Defaults to 42.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
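A seeded, ratio-based split can be sketched with the standard library alone. This is not the library's RandomSplitter: the function below is a hypothetical illustration that works on sample indices, and two details are assumptions of this sketch rather than documented behaviour: the sum check uses a floating-point tolerance, and the last subset absorbs any rounding remainder.

```python
import math
import random
from typing import Dict, List


def split_random(n_samples: int, ratios: Dict[str, float], seed: int = 42) -> Dict[str, List[int]]:
    """Shuffle sample indices with a fixed seed and cut them by ratio."""
    # Assumption: compare with a tolerance to avoid floating-point surprises.
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("Ratios must sum to 1.0.")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible

    subsets: Dict[str, List[int]] = {}
    names = list(ratios)
    start = 0
    for position, name in enumerate(names):
        if position == len(names) - 1:
            # Assumption: the last subset takes whatever is left.
            stop = n_samples
        else:
            stop = start + int(round(ratios[name] * n_samples))
        subsets[name] = indices[start:stop]
        start = stop
    return subsets


splits = split_random(10, {"train": 0.8, "test": 0.2})
print({name: len(idx) for name, idx in splits.items()})
```

Calling the function again with the same seed yields the same assignment, which is the point of the seed parameter.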
- to_backend(backend: str | Backend) SampleCollection
Returns a new SampleCollection with all Data objects converted to the specified backend.
- class modularml.core.FeatureSubset(label: str, parent: FeatureSet, sample_uuids: List[str])
Bases:
SampleCollection
A filtered subset of samples from a parent FeatureSet. Holds a weak reference to the parent FeatureSet to avoid circular memory references.
- get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all features across all samples in the specified format.
- get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all tags across all samples in the specified format.
Each tag will be returned as a list of values across samples. The format argument controls the output structure.
- get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all targets across all samples in the specified format.
- is_disjoint_with(other: FeatureSubset) bool
Checks whether this subset is disjoint from another (i.e., no overlapping samples).
- Parameters:
other (FeatureSubset) – Another subset to compare against.
- Returns:
True if the subsets are disjoint (no shared samples), False otherwise.
- Return type:
bool
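Since a FeatureSubset is constructed from a list of sample UUIDs, a disjointness check can be pictured as a set operation on those UUIDs. The sketch below is standalone and the helper name `is_disjoint` is hypothetical; how the method is actually implemented is not stated here.

```python
from typing import Iterable


def is_disjoint(uuids_a: Iterable[str], uuids_b: Iterable[str]) -> bool:
    """True when the two UUID collections share no element."""
    return set(uuids_a).isdisjoint(uuids_b)


train_ids = ["a1", "b2", "c3"]
test_ids = ["d4", "e5"]
leaky_ids = ["c3", "f6"]

print(is_disjoint(train_ids, test_ids))
print(is_disjoint(train_ids, leaky_ids))
```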
- property parent: FeatureSet
Access the parent FeatureSet. Raises an error if the parent is no longer alive.
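The weak-reference pattern behind this property can be sketched with the standard `weakref` module. The classes `Parent` and `SubsetView` are hypothetical stand-ins for FeatureSet and FeatureSubset, and raising `ReferenceError` is an assumption of this sketch; the exception type used by the library is not specified here.

```python
import gc
import weakref


class Parent:
    """Stand-in for the owning collection."""

    def __init__(self, label: str) -> None:
        self.label = label


class SubsetView:
    """Stand-in for a view that must not keep its parent alive."""

    def __init__(self, parent: Parent) -> None:
        # A weak reference does not increase the parent's reference count.
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self) -> Parent:
        parent = self._parent_ref()  # returns None once the parent is gone
        if parent is None:
            raise ReferenceError("Parent no longer exists.")
        return parent


owner = Parent("cells")
view = SubsetView(owner)
label_while_alive = view.parent.label

# Drop the only strong reference, then force a collection pass.
del owner
gc.collect()

try:
    view.parent
    parent_lost = False
except ReferenceError:
    parent_lost = True

print(label_while_alive, parent_lost)
```

Because the view holds no strong reference, deleting the parent is enough to free it, and later access fails loudly instead of returning stale data.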
- split(splitter: BaseSplitter) List[FeatureSubset]
Split the current FeatureSubset into multiple FeatureSubsets. The created splits are automatically added to the parent `FeatureSet.subsets`, in addition to being returned.
- Parameters:
splitter (BaseSplitter) – The splitting method.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
- split_by_condition(**conditions: Dict[str, Dict[str, Any]]) List[FeatureSubset]
Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).
- Parameters:
**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as the .filter() method.
Examples: The call below defines three subsets (‘low_temp’, ‘high_temp’, and ‘cell_5’). The ‘low_temp’ subset contains all samples with temperatures under 20, the ‘high_temp’ subset contains all samples with temperatures of 20 or above, and the ‘cell_5’ subset contains all samples where cell_id is 5. Note that subsets can have overlapping samples if the split conditions are not carefully defined. A UserWarning will be raised when this happens.
``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},
    high_temp={'temperature': lambda x: x >= 20},
    cell_5={'cell_id': 5},
)
```
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
- split_random(ratios: Dict[str, float], seed: int = 42) List[FeatureSubset]
Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).
- Parameters:
ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios. E.g., ratios={‘train’:0.5, ‘test’:0.5}. All values must add to exactly 1.0.
seed (int, optional) – Random seed for reproducibility. Defaults to 42.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
- to_backend(backend: str | Backend) SampleCollection
Returns a new SampleCollection with all Data objects converted to the specified backend.
- class modularml.core.MultiBatch(batches: Dict[str, Batch])
Bases:
object
Container for batches from multiple FeatureSets, keyed by FeatureSet.label.
- class modularml.core.Sample(features: Dict[str, Data], targets: Dict[str, Data], tags: Dict[str, Data], label: str | None = None, uuid: str = <factory>)
Bases:
object
Container for a single sample.
- features
A set of input features. Example: {‘voltage’: np.ndarray}
- Type:
Dict[str, Any]
- targets
A set of target values. Example: {‘soh’: float}
- Type:
Dict[str, Any]
- tags
Metadata used for filtering, grouping, or tracking.
- Type:
Dict[str, Any]
- label
Optional user-assigned label.
- Type:
str, optional
- uuid
A globally unique ID for this sample. Automatically assigned if not provided.
- Type:
str
- class modularml.core.SampleCollection(samples: List[Sample])
Bases:
object
A lightweight container for a list of Sample instances.
- get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all features across all samples in the specified format.
- get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all tags across all samples in the specified format.
Each tag will be returned as a list of values across samples. The format argument controls the output structure.
- get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all targets across all samples in the specified format.
- to_backend(backend: str | Backend) SampleCollection
Returns a new SampleCollection with all Data objects converted to the specified backend.
- class modularml.core.StageInput(source: str, key: str | None = None)
Bases:
object
Modules