modularml.core.data_structures.feature_set
Classes
FeatureSet – Container for structured data.
- class modularml.core.data_structures.feature_set.FeatureSet(label: str, samples: List[Sample])
Bases: SampleCollection
Container for structured data.
Organizes any raw data into a standardized format.
Initialize a new FeatureSet.
- Parameters:
label (str) – Name to assign to this FeatureSet
samples (List[Sample]) – List of samples
- __init__(label: str, samples: List[Sample])
Initialize a new FeatureSet.
- Parameters:
label (str) – Name to assign to this FeatureSet
samples (List[Sample]) – List of samples
- add_subset(subset: FeatureSubset)
Adds a new FeatureSubset (view of FeatureSet.samples).
- Parameters:
subset (FeatureSubset) – The subset to add.
- clear_subsets() → None
Remove all previously defined subsets.
- filter(**conditions: Dict[str, Any | List[Any] | Callable]) → FeatureSubset | None
Filter samples using conditions applied to tags, features, or targets.
- Parameters:
conditions (Dict[str, Union[Any, List[Any], Callable]]) – Key-value pairs where keys correspond to any attribute of the samples’ tags, features, or targets, and values specify the filter condition. Values can be:
- A literal value (== match)
- A list/tuple/set/ndarray of values (membership match)
- A callable (e.g., lambda x: x < 100)
Example: For a FeatureSet whose samples have the following attributes:
Sample.tags.keys() -> 'cell_id', 'group_id', 'pulse_type'
Sample.features.keys() -> 'voltage', 'current'
Sample.targets.keys() -> 'soh'
a filter condition can be applied such that:
- cell_id is in [1, 2, 3],
- group_id is greater than 1, and
- pulse_type equals 'charge'.
``` python
>>> FeatureSet.filter(cell_id=[1,2,3], group_id=(lambda x: x > 1), pulse_type='charge')
```
Generally, filtering is applied to the attributes of Sample.tags, but it can also be useful to filter on the Sample.targets keys. For example, we might want to filter to a specific state-of-health (soh) range:
``` python
# Assuming Sample.targets.keys() returns 'soh', ...
>>> FeatureSet.filter(soh=lambda x: (x > 85) & (x < 95))
```
This returns a FeatureSubset that contains a view of the samples in FeatureSet that have a state-of-health between 85% and 95%.
- Returns:
A new subset containing samples that match all conditions. None is returned if there are no such samples.
- Return type:
FeatureSubset | None
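The three kinds of condition values described above can be illustrated with a small standalone sketch. It does not use modularml: `matches` and `filter_records` are hypothetical helpers, plain dictionaries stand in for samples, and ndarray conditions are left out. It only shows the assumed matching rules (equality, membership, callable) combined with a logical AND.

```python
from typing import Any, Dict, List


def matches(value: Any, condition: Any) -> bool:
    """Return True if `value` satisfies a single filter condition."""
    if callable(condition):
        # Callable condition, e.g. lambda x: x > 1
        return bool(condition(value))
    if isinstance(condition, (list, tuple, set)):
        # Collection condition: membership test
        return value in condition
    # Literal condition: equality test
    return value == condition


def filter_records(records: List[Dict[str, Any]], **conditions: Any) -> List[Dict[str, Any]]:
    """Keep only the records that satisfy every condition."""
    return [
        rec for rec in records
        if all(matches(rec[key], cond) for key, cond in conditions.items())
    ]


# Hypothetical records standing in for Sample.tags
records = [
    {"cell_id": 1, "group_id": 1, "pulse_type": "charge"},
    {"cell_id": 2, "group_id": 2, "pulse_type": "charge"},
    {"cell_id": 3, "group_id": 3, "pulse_type": "discharge"},
    {"cell_id": 4, "group_id": 2, "pulse_type": "charge"},
]

selected = filter_records(
    records,
    cell_id=[1, 2, 3],
    group_id=lambda x: x > 1,
    pulse_type="charge",
)
print(selected)
```

Only the second record passes all three conditions, which mirrors the "match all conditions" behavior documented for the return value.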
- classmethod from_df(label: str, df: pandas.DataFrame, feature_cols: str | List[str], target_cols: str | List[str], groupby_cols: str | List[str] | None = None, tag_cols: str | List[str] | None = None, feature_transform: FeatureTransform | None = None, target_transform: FeatureTransform | None = None) → FeatureSet
Construct a FeatureSet from a pandas DataFrame.
- Parameters:
label (str) – Label to assign to this FeatureSet.
df (pd.DataFrame) – DataFrame to construct FeatureSet from.
feature_cols (Union[str, List[str]]) – Column name(s) in df to use as features.
target_cols (Union[str, List[str]]) – Column name(s) in df to use as targets.
groupby_cols (Union[str, List[str]], optional) – If a single feature spans multiple rows in df, groupby_cols are used to define groups where each group represents a single feature sequence. Defaults to None.
tag_cols (Union[str, List[str]], optional) – Column name(s) corresponding to identifying information that should be retained in the FeatureSet. Defaults to None.
feature_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the feature(s). Defaults to None.
target_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the target(s). Defaults to None.
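The role of groupby_cols can be pictured with a small standalone sketch. It does not call modularml or pandas: a list of dictionaries stands in for the DataFrame rows, and `group_sequences` is a hypothetical helper. The assumption illustrated is that rows sharing the same groupby value are collected, in order, into one feature sequence.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical rows: each cell's voltage trace spans several rows
rows = [
    {"cell_id": 1, "voltage": 3.6, "soh": 0.95},
    {"cell_id": 1, "voltage": 3.7, "soh": 0.95},
    {"cell_id": 1, "voltage": 3.8, "soh": 0.95},
    {"cell_id": 2, "voltage": 3.5, "soh": 0.90},
    {"cell_id": 2, "voltage": 3.6, "soh": 0.90},
]


def group_sequences(rows: List[Dict[str, Any]], groupby_col: str, feature_col: str) -> Dict[Any, List[Any]]:
    """Collect the feature values of each group into one ordered sequence."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[groupby_col]].append(row[feature_col])
    return dict(sequences)


sequences = group_sequences(rows, groupby_col="cell_id", feature_col="voltage")
print(sequences)
```

Here the five rows yield two sequences, one per cell_id, rather than five single-value samples.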
- classmethod from_dict(label: str, data: Dict[str, List[Any]], feature_keys: str | List[str], target_keys: str | List[str], tag_keys: str | List[str] | None = None) → FeatureSet
Construct a FeatureSet from a dictionary of column -> list of values. Each list should be of the same length (one entry per sample).
- Parameters:
label (str) – Name to assign to this FeatureSet
data (Dict[str, List[Any]]) – Input dictionary. Each key maps to a list of values.
feature_keys (Union[str, List[str]]) – Keys in data to be used as features.
target_keys (Union[str, List[str]]) – Keys in data to be used as targets.
tag_keys (Optional[Union[str, List[str]]]) – Keys to use as tags. Optional.
- Returns:
A new FeatureSet instance.
- Return type:
FeatureSet
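The expected shape of the input dictionary (one list per column, one entry per sample) can be sketched without modularml. `columns_to_records` and the "features"/"targets"/"tags" layout of its output are assumptions made for illustration, not the library's internal representation.

```python
from typing import Any, Dict, List, Optional, Union

Keys = Union[str, List[str]]


def _as_list(keys: Optional[Keys]) -> List[str]:
    """Normalize a single key, a list of keys, or None to a list."""
    if keys is None:
        return []
    return [keys] if isinstance(keys, str) else list(keys)


def columns_to_records(
    data: Dict[str, List[Any]],
    feature_keys: Keys,
    target_keys: Keys,
    tag_keys: Optional[Keys] = None,
) -> List[Dict[str, Dict[str, Any]]]:
    """Turn a column -> values dictionary into one record per sample."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same length.")
    n_samples = lengths.pop() if lengths else 0

    roles = {
        "features": _as_list(feature_keys),
        "targets": _as_list(target_keys),
        "tags": _as_list(tag_keys),
    }
    return [
        {role: {key: data[key][i] for key in keys} for role, keys in roles.items()}
        for i in range(n_samples)
    ]


data = {"voltage": [3.7, 3.9], "soh": [0.95, 0.91], "cell_id": [1, 2]}
records = columns_to_records(data, feature_keys="voltage", target_keys="soh", tag_keys="cell_id")
print(records[0])
```

The length check reflects the requirement that every list has one entry per sample.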
- classmethod from_pandas(label: str, df: pandas.DataFrame, feature_cols: str | List[str], target_cols: str | List[str], groupby_cols: str | List[str] | None = None, tag_cols: str | List[str] | None = None, feature_transform: FeatureTransform | None = None, target_transform: FeatureTransform | None = None) → FeatureSet
Construct a FeatureSet from a pandas DataFrame.
- Parameters:
label (str) – Label to assign to this FeatureSet.
df (pd.DataFrame) – DataFrame to construct FeatureSet from.
feature_cols (Union[str, List[str]]) – Column name(s) in df to use as features.
target_cols (Union[str, List[str]]) – Column name(s) in df to use as targets.
groupby_cols (Union[str, List[str]], optional) – If a single feature spans multiple rows in df, groupby_cols are used to define groups where each group represents a single feature sequence. Defaults to None.
tag_cols (Union[str, List[str]], optional) – Column name(s) corresponding to identifying information that should be retained in the FeatureSet. Defaults to None.
feature_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the feature(s). Defaults to None.
target_transform (FeatureTransform, optional) – An optional FeatureTransform to apply to the target(s). Defaults to None.
- get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) → Any
Returns all features across all samples in the specified format.
- get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) → Any
Returns all tags across all samples in the specified format.
Each tag will be returned as a list of values across samples. The format argument controls the output structure.
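As a rough picture of "one list of values per tag", the sketch below transposes per-sample tag dictionaries into a dictionary of lists. It is plain Python with a hypothetical `collect_tags` helper; the actual container type returned by the library depends on the format argument and is not reproduced here.

```python
from typing import Any, Dict, List


def collect_tags(sample_tags: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Gather each tag's values across samples, preserving sample order."""
    collected: Dict[str, List[Any]] = {}
    for tags in sample_tags:
        for key, value in tags.items():
            collected.setdefault(key, []).append(value)
    return collected


# Hypothetical per-sample tags
sample_tags = [
    {"cell_id": 1, "pulse_type": "charge"},
    {"cell_id": 2, "pulse_type": "discharge"},
]

all_tags = collect_tags(sample_tags)
print(all_tags)
```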
- get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) → Any
Returns all targets across all samples in the specified format.
- get_config(sample_path: str | Path | None = None) → Dict[str, Any]
Get the serialized configuration of this FeatureSet. Sample data is saved to a file and only the filepath is serialized.
- Parameters:
sample_path (Optional[Union[str, Path]], optional) – A path to save the sample data to. If None, the save path is ‘./data/{self.label}_samples.pkl’. Defaults to None.
- Returns:
The config dict
- Return type:
Dict[str, Any]
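The idea of saving the samples to a file and keeping only the file path in the config can be sketched with the standard library. Everything here is an assumption for illustration: `build_config`, the config keys "label" and "sample_path", and the use of pickle are not taken from the modularml source beyond the default path pattern documented above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


def build_config(
    label: str,
    samples: List[Any],
    sample_path: Optional[Union[str, Path]] = None,
) -> Dict[str, Any]:
    """Write samples to disk and return a config that holds only the path."""
    if sample_path is None:
        # Mirrors the documented default location
        path = Path("data") / f"{label}_samples.pkl"
    else:
        path = Path(sample_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(samples, f)
    # Hypothetical config layout: the sample data itself is not embedded
    return {"label": label, "sample_path": str(path)}


samples = [{"voltage": 3.7}, {"voltage": 3.9}]
with tempfile.TemporaryDirectory() as tmp:
    config = build_config("cells", samples, Path(tmp) / "cells_samples.pkl")
    with open(config["sample_path"], "rb") as f:
        restored = pickle.load(f)

print(config["label"], restored)
```

Keeping the data out of the config keeps the config small and easy to store as text, at the cost of having to ship the sample file alongside it.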
- get_subset(name: str) → FeatureSubset
Returns the specified subset of this FeatureSet. Use FeatureSet.available_subsets to view available subset names.
- Parameters:
name (str) – Subset name to return.
- Returns:
A named view of this FeatureSet.
- Return type:
FeatureSubset
- plot_sankey()
Plot a Sankey diagram showing how sample IDs flow across nested FeatureSubsets. Subset hierarchy is determined using dot-notation (e.g., ‘train.pretrain’). Samples with multiple subset memberships will show multiple paths.
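The dot-notation hierarchy can be read as a chain of nested subset names. The sketch below, using a hypothetical `subset_ancestry` helper that is not part of modularml, shows one way such a name could be expanded into its levels.

```python
from typing import List


def subset_ancestry(name: str) -> List[str]:
    """Expand a dotted subset name into its chain of nested levels."""
    parts = name.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


print(subset_ancestry("train.pretrain"))  # ['train', 'train.pretrain']
```

Under this reading, 'train.pretrain' is a child of 'train', so a sample in 'train.pretrain' would flow through 'train' first in the diagram.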
- pop_subset(name: str) → FeatureSubset
Pops the specified subset (removed from FeatureSet and returned).
- Parameters:
name (str) – Subset name to pop
- Returns:
The removed subset.
- Return type:
FeatureSubset
- remove_subset(name: str) → None
Deletes the specified subset from this FeatureSet.
- Parameters:
name (str) – Subset name to remove.
- save_samples(path: str | Path)
Save the sample data to the specified path.
- split(splitter: BaseSplitter) → List[FeatureSubset]
Split the current FeatureSet into multiple FeatureSubsets. The created splits are automatically added to `FeatureSet.subsets`, in addition to being returned.
- Parameters:
splitter (BaseSplitter) – The splitting method.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
- split_by_condition(**conditions: Dict[str, Dict[str, Any]]) → List[FeatureSubset]
Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).
- Parameters:
**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as the .filter() method.
Examples: The call below defines three subsets (‘low_temp’, ‘high_temp’, and ‘cell_5’). The ‘low_temp’ subset contains all samples with temperatures under 20, the ‘high_temp’ subset contains all samples with temperatures of 20 or above, and the ‘cell_5’ subset contains all samples where cell_id is 5. Note that subsets can have overlapping samples if the split conditions are not carefully defined. A UserWarning will be raised when this happens.
``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},
    high_temp={'temperature': lambda x: x >= 20},
    cell_5={'cell_id': 5},
)
```
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
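The overlap behavior described for condition-based splits can be sketched in plain Python. This is not the library's ConditionSplitter: `matches` and `split_records` are hypothetical helpers, dictionaries stand in for samples, and the warning text is invented. The sketch only shows how named condition sets could produce overlapping index lists and trigger a UserWarning.

```python
import warnings
from collections import Counter
from typing import Any, Dict, List


def matches(value: Any, condition: Any) -> bool:
    """Literal -> equality, collection -> membership, callable -> call."""
    if callable(condition):
        return bool(condition(value))
    if isinstance(condition, (list, tuple, set)):
        return value in condition
    return value == condition


def split_records(records: List[Dict[str, Any]], **conditions: Dict[str, Any]) -> Dict[str, List[int]]:
    """Map each subset name to the indices of the records it selects."""
    subsets = {
        name: [
            i for i, rec in enumerate(records)
            if all(matches(rec[key], cond) for key, cond in rules.items())
        ]
        for name, rules in conditions.items()
    }
    # Warn when any record index appears in more than one subset
    counts = Counter(i for indices in subsets.values() for i in indices)
    shared = sorted(i for i, n in counts.items() if n > 1)
    if shared:
        warnings.warn(f"Samples {shared} belong to more than one subset.", UserWarning)
    return subsets


records = [
    {"temperature": 10, "cell_id": 5},
    {"temperature": 25, "cell_id": 5},
    {"temperature": 30, "cell_id": 7},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subsets = split_records(
        records,
        low_temp={"temperature": lambda x: x < 20},
        high_temp={"temperature": lambda x: x >= 20},
        cell_5={"cell_id": 5},
    )

print(subsets)
```

The two temperature subsets are disjoint, but 'cell_5' shares records with both of them, so a single warning is emitted.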
- split_random(ratios: Dict[str, float], seed: int = 42) → List[FeatureSubset]
Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).
- Parameters:
ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios, e.g., ratios={'train': 0.5, 'test': 0.5}. All values must sum to exactly 1.0.
seed (int, optional) – Random seed for reproducibility. Defaults to 42.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
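A seeded ratio-based split can be sketched with the standard library. This is not the library's RandomSplitter: `split_indices` is a hypothetical helper, it works on sample indices only, and it checks the ratio sum with a small floating-point tolerance, which is an assumption rather than documented behavior.

```python
import random
from typing import Dict, List


def split_indices(n_samples: int, ratios: Dict[str, float], seed: int = 42) -> Dict[str, List[int]]:
    """Shuffle sample indices with a fixed seed and cut them by ratio."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("Ratios must sum to 1.0.")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched

    subsets: Dict[str, List[int]] = {}
    names = list(ratios)
    start = 0
    for pos, name in enumerate(names):
        if pos == len(names) - 1:
            end = n_samples  # last subset takes the remainder
        else:
            end = min(n_samples, start + int(round(ratios[name] * n_samples)))
        subsets[name] = indices[start:end]
        start = end
    return subsets


splits = split_indices(10, {"train": 0.5, "test": 0.5}, seed=42)
print({name: len(idx) for name, idx in splits.items()})
```

Because the shuffle uses a dedicated `random.Random(seed)` instance, calling the function again with the same seed gives the same assignment.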
- to_backend(backend: str | Backend) → SampleCollection
Returns a new SampleCollection with all Data objects converted to the specified backend.