modularml.core.data_structures.feature_subset
Classes
FeatureSubset: A filtered subset of samples from a parent FeatureSet.
- class modularml.core.data_structures.feature_subset.FeatureSubset(label: str, parent: FeatureSet, sample_uuids: List[str])
Bases: SampleCollection
A filtered subset of samples from a parent FeatureSet. Holds a weak reference to the parent FeatureSet to avoid circular memory references.
- get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all features across all samples in the specified format.
- get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all tags across all samples in the specified format.
Each tag will be returned as a list of values across samples. The format argument controls the output structure.
- get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) Any
Returns all targets across all samples in the specified format.
- is_disjoint_with(other: FeatureSubset) bool
Checks whether this subset is disjoint from another (i.e., no overlapping samples).
- Parameters:
other (FeatureSubset) – Another subset to compare against.
- Returns:
True if the subsets are disjoint (no shared samples), False otherwise.
- Return type:
bool
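The disjointness check can be pictured as a set comparison over sample UUIDs. The snippet below is a minimal sketch of that idea using only the standard library. The helper `uuids_are_disjoint` and the sample UUID values are hypothetical and are not part of modularml; the library's actual implementation may differ.

```python
from typing import Iterable


def uuids_are_disjoint(a: Iterable[str], b: Iterable[str]) -> bool:
    """Return True when the two collections of sample UUIDs share no element.

    Hypothetical helper for illustration only; not a modularml function.
    """
    return set(a).isdisjoint(b)


# Made-up UUIDs standing in for the sample_uuids of three subsets.
train_uuids = ["s1", "s2", "s3"]
test_uuids = ["s4", "s5"]
val_uuids = ["s3", "s6"]  # shares "s3" with train_uuids

train_test_disjoint = uuids_are_disjoint(train_uuids, test_uuids)
train_val_disjoint = uuids_are_disjoint(train_uuids, val_uuids)
```

A check of this kind is commonly used to confirm that training and evaluation subsets do not leak samples into each other.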
- property parent: FeatureSet
Access the parent FeatureSet. Raises an error if the parent is no longer alive.
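The weak-reference pattern behind this property can be sketched with the standard `weakref` module. The `Parent` and `Child` classes below are hypothetical stand-ins, and the choice of `ReferenceError` is an assumption: the source states only that an error is raised, not which one.

```python
import gc
import weakref


class Parent:
    """Hypothetical stand-in for a parent collection that owns its children."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children = []


class Child:
    """Hypothetical stand-in for a subset that points back to its parent."""

    def __init__(self, parent: Parent) -> None:
        # A weak reference does not keep the parent alive, so the
        # parent -> child -> parent chain never forms a strong cycle.
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self) -> Parent:
        parent = self._parent_ref()  # None once the parent is collected
        if parent is None:
            # Assumption: the error type used by the real library is unknown.
            raise ReferenceError("Parent no longer exists.")
        return parent


owner = Parent("demo")
child = Child(owner)
owner.children.append(child)
name_while_alive = child.parent.name

# Drop the only strong reference to the parent, then force a collection
# so the result does not depend on reference-counting behavior.
del owner
gc.collect()

try:
    child.parent
    parent_is_gone = False
except ReferenceError:
    parent_is_gone = True
```

In practice this means a subset should be used only while something else still holds a strong reference to its parent.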
- split(splitter: BaseSplitter) List[FeatureSubset]
Split the current FeatureSubset into multiple FeatureSubsets. The created splits are automatically added to the parent `FeatureSet.subsets`, in addition to being returned.
- Parameters:
splitter (BaseSplitter) – The splitting method.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
- split_by_condition(**conditions: Dict[str, Dict[str, Any]]) List[FeatureSubset]
Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).
- Parameters:
**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as the .filter() method.
Examples: The call below defines three subsets ('low_temp', 'high_temp', and 'cell_5'). The 'low_temp' subset contains all samples with a temperature under 20, the 'high_temp' subset contains all samples with a temperature of 20 or greater, and the 'cell_5' subset contains all samples where cell_id is 5. Note that subsets can have overlapping samples if the split conditions are not carefully defined. A UserWarning is issued when this happens.
``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},
    high_temp={'temperature': lambda x: x >= 20},
    cell_5={'cell_id': 5},
)
```
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
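To make the condition format and the overlap caveat concrete, here is a standalone sketch that applies the same kind of rules to plain dictionaries. It does not use modularml. The helper names and sample records are hypothetical, and it assumes that a callable condition acts as a predicate, that any other value is compared for equality, and that multiple keys within one subset are combined with AND. The real .filter() semantics may differ.

```python
from itertools import combinations
from typing import Any, Dict, List


def matches(sample: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """Return True if the sample satisfies every condition (assumed AND)."""
    for key, cond in conditions.items():
        value = sample[key]
        # Assumption: callables are predicates, other values mean equality.
        ok = cond(value) if callable(cond) else value == cond
        if not ok:
            return False
    return True


def split_by_condition_sketch(
    samples: List[Dict[str, Any]], **conditions: Dict[str, Any]
) -> Dict[str, List[Dict[str, Any]]]:
    """Group samples by named condition sets; a sample may land in several."""
    return {
        name: [s for s in samples if matches(s, conds)]
        for name, conds in conditions.items()
    }


# Made-up records for illustration.
samples = [
    {"uuid": "a", "temperature": 15, "cell_id": 5},
    {"uuid": "b", "temperature": 25, "cell_id": 5},
    {"uuid": "c", "temperature": 20, "cell_id": 7},
]

subsets = split_by_condition_sketch(
    samples,
    low_temp={"temperature": lambda x: x < 20},
    high_temp={"temperature": lambda x: x >= 20},
    cell_5={"cell_id": 5},
)
subset_uuids = {
    name: [s["uuid"] for s in members] for name, members in subsets.items()
}

# Report every pair of subsets that shares at least one sample.
overlapping = [
    (left, right)
    for left, right in combinations(subset_uuids, 2)
    if set(subset_uuids[left]) & set(subset_uuids[right])
]
```

Here 'low_temp' and 'high_temp' are mutually exclusive, but 'cell_5' overlaps with both, which is the situation the warning is meant to flag.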
- split_random(ratios: Dict[str, float], seed: int = 42) List[FeatureSubset]
Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).
- Parameters:
ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios, e.g., ratios={'train': 0.5, 'test': 0.5}. All values must sum to exactly 1.0.
seed (int, optional) – Random seed for reproducibility. Defaults to 42.
- Returns:
The created subsets.
- Return type:
List[FeatureSubset]
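A seeded ratio split can be sketched with the standard library as follows. This is an illustration of the general technique, not the RandomSplitter implementation. The function `split_random_sketch` is hypothetical, it tolerates floating-point rounding in the ratio sum via `math.isclose`, and it assigns any rounding remainder to the last subset; the library may handle both points differently.

```python
import math
import random
from typing import Dict, List, Sequence


def split_random_sketch(
    items: Sequence[str], ratios: Dict[str, float], seed: int = 42
) -> Dict[str, List[str]]:
    """Shuffle items with a fixed seed and slice them according to ratios."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1.0")

    shuffled = list(items)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)

    result: Dict[str, List[str]] = {}
    names = list(ratios)
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # last subset takes whatever remains
        else:
            end = min(len(shuffled), start + round(ratios[name] * len(shuffled)))
        result[name] = shuffled[start:end]
        start = end
    return result


# Made-up sample UUIDs for illustration.
uuids = [f"s{i}" for i in range(10)]
parts = split_random_sketch(uuids, {"train": 0.7, "test": 0.3}, seed=42)
parts_again = split_random_sketch(uuids, {"train": 0.7, "test": 0.3}, seed=42)
```

Because every item is assigned to exactly one slice, the resulting subsets are disjoint by construction, and reusing the same seed reproduces the same assignment.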
- to_backend(backend: str | Backend) SampleCollection
Returns a new SampleCollection with all Data objects converted to the specified backend.