modularml.core.data_structures.feature_subset

Classes

FeatureSubset(label, parent, sample_uuids)

A filtered subset of samples from a parent FeatureSet.

class modularml.core.data_structures.feature_subset.FeatureSubset(label: str, parent: FeatureSet, sample_uuids: List[str])

Bases: SampleCollection

A filtered subset of samples from a parent FeatureSet. Holds a weak reference to the parent FeatureSet to avoid circular memory references.
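The weak-reference pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `ParentSet` and `ChildSubset` are hypothetical stand-ins for FeatureSet and FeatureSubset, and the choice of `ReferenceError` is an assumption, since the documentation only says that an error is raised.

```python
import gc
import weakref


class ParentSet:
    """Hypothetical stand-in for a parent FeatureSet."""

    def __init__(self, label):
        self.label = label
        self.subsets = {}  # the parent holds strong references to its children


class ChildSubset:
    """Hypothetical stand-in for a FeatureSubset."""

    def __init__(self, label, parent, sample_uuids):
        self.label = label
        self.sample_uuids = list(sample_uuids)
        # Store only a weak reference, so the child never keeps the parent alive.
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self):
        parent = self._parent_ref()  # returns None once the parent is collected
        if parent is None:
            # Assumed error type; the real class may raise something else.
            raise ReferenceError("Parent is no longer alive.")
        return parent


parent = ParentSet("cells")
child = ChildSubset("train", parent, ["a", "b"])
parent.subsets[child.label] = child

print(child.parent.label)  # works while the parent is alive

del parent    # drop the only strong reference to the parent
gc.collect()  # make collection explicit on non-refcounting interpreters

try:
    child.parent
except ReferenceError as err:
    print("error:", err)
```

Because the parent refers to the child strongly and the child refers to the parent only weakly, no reference cycle is formed, and the parent can be freed as soon as the caller drops it.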

get_all_features(format: str | DataFormat = DataFormat.DICT_NUMPY) → Any

Returns all features across all samples in the specified format.

get_all_tags(format: str | DataFormat = DataFormat.DICT_NUMPY) → Any

Returns all tags across all samples in the specified format.

Each tag will be returned as a list of values across samples. The format argument controls the output structure.

get_all_targets(format: str | DataFormat = DataFormat.DICT_NUMPY) → Any

Returns all targets across all samples in the specified format.

is_disjoint_with(other: FeatureSubset) → bool

Checks whether this subset is disjoint from another (i.e., no overlapping samples).

Parameters:

other (FeatureSubset) – Another subset to compare against.

Returns:

True if the subsets are disjoint (no shared samples), False otherwise.

Return type:

bool
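A disjointness check of this kind can be sketched with plain sets. The sketch assumes that two subsets are compared by their sample UUIDs (the `sample_uuids` constructor argument); `uuids_are_disjoint` is a hypothetical helper, not part of the library.

```python
def uuids_are_disjoint(uuids_a, uuids_b):
    """Return True when the two UUID collections share no element."""
    return set(uuids_a).isdisjoint(uuids_b)


train = ["s1", "s2", "s3"]
test = ["s4", "s5"]
val = ["s3", "s6"]

print(uuids_are_disjoint(train, test))  # True: no shared UUIDs
print(uuids_are_disjoint(train, val))   # False: "s3" is in both
```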

property parent: FeatureSet

Access the parent FeatureSet. Raises an error if the parent is no longer alive.

split(splitter: BaseSplitter) → List[FeatureSubset]

Split the current FeatureSubset into multiple FeatureSubsets. The created splits are automatically added to the parent `FeatureSet.subsets`, in addition to being returned.

Parameters:

splitter (BaseSplitter) – The splitting method.

Returns:

The created subsets.

Return type:

List[FeatureSubset]

split_by_condition(**conditions: Dict[str, Dict[str, Any]]) → List[FeatureSubset]

Convenience method to split samples using condition-based rules. This is equivalent to calling FeatureSet.split(splitter=ConditionSplitter(…)).

Parameters:

**conditions (Dict[str, Dict[str, Any]]) – Keyword arguments where each key is a subset name and each value is a dictionary of filter conditions. The filter conditions use the same format as the .filter() method.

Examples: The call below defines three subsets (‘low_temp’, ‘high_temp’, and ‘cell_5’). The ‘low_temp’ subset contains all samples with a temperature below 20, the ‘high_temp’ subset contains all samples with a temperature of 20 or above, and the ‘cell_5’ subset contains all samples where cell_id is 5. Note that subsets can share samples if the split conditions are not carefully defined; a UserWarning is issued when this happens.

``` python
FeatureSet.split_by_condition(
    low_temp={'temperature': lambda x: x < 20},   # temperature below 20
    high_temp={'temperature': lambda x: x >= 20},  # temperature of 20 or above
    cell_5={'cell_id': 5},                         # exact match on cell_id
)

```

Returns:

The created subsets.

Return type:

List[FeatureSubset]
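To make the condition format and the overlap warning concrete, here is a minimal standalone sketch. It is not the library's ConditionSplitter: samples are modeled as plain dictionaries keyed by UUID, `matches` and `split_by_condition_sketch` are hypothetical helpers, and only the two condition styles shown in the example above (a callable or a literal value) are handled.

```python
import warnings


def matches(sample, rules):
    """Check one sample (a plain dict) against {key: literal-or-callable}."""
    for key, rule in rules.items():
        value = sample[key]
        ok = rule(value) if callable(rule) else value == rule
        if not ok:
            return False
    return True


def split_by_condition_sketch(samples, **conditions):
    """Map each subset name to the UUIDs of the samples matching its rules."""
    splits = {
        name: [uuid for uuid, sample in samples.items() if matches(sample, rules)]
        for name, rules in conditions.items()
    }
    # Collect UUIDs that landed in more than one subset.
    seen, overlap = set(), set()
    for uuids in splits.values():
        overlap |= seen & set(uuids)
        seen |= set(uuids)
    if overlap:
        warnings.warn(
            f"{len(overlap)} sample(s) appear in more than one subset.",
            UserWarning,
        )
    return splits


samples = {
    "s1": {"temperature": 15, "cell_id": 5},
    "s2": {"temperature": 25, "cell_id": 5},
    "s3": {"temperature": 30, "cell_id": 7},
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    splits = split_by_condition_sketch(
        samples,
        low_temp={"temperature": lambda x: x < 20},
        high_temp={"temperature": lambda x: x >= 20},
        cell_5={"cell_id": 5},
    )

print(splits)
print([str(w.message) for w in caught])
```

In this sketch, ‘cell_5’ overlaps with both temperature subsets, so a single UserWarning is recorded.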

split_random(ratios: Dict[str, float], seed: int = 42) → List[FeatureSubset]

Convenience method to split samples randomly based on given ratios. This is equivalent to calling FeatureSet.split(splitter=RandomSplitter(…)).

Parameters:
  • ratios (Dict[str, float]) – Dictionary mapping subset names to their respective split ratios, e.g., ratios={'train': 0.5, 'test': 0.5}. All values must sum to exactly 1.0.

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

Returns:

The created subsets.

Return type:

List[FeatureSubset]
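A seeded, ratio-based split can be sketched as below. This is an illustration under stated assumptions, not the library's RandomSplitter: `random_split_sketch` is a hypothetical helper that operates on a list of UUIDs, it checks the ratio sum with a floating-point tolerance rather than exact equality, and it gives any rounding remainder to the last subset.

```python
import math
import random


def random_split_sketch(sample_uuids, ratios, seed=42):
    """Shuffle UUIDs with a fixed seed and cut them according to `ratios`."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1.0")

    shuffled = list(sample_uuids)
    random.Random(seed).shuffle(shuffled)  # local RNG; global state is untouched

    names = list(ratios)
    splits, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # the last subset takes whatever remains
        else:
            end = start + round(ratios[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


uuids = [f"s{i}" for i in range(10)]
splits = random_split_sketch(uuids, {"train": 0.8, "test": 0.2}, seed=42)
print(len(splits["train"]), len(splits["test"]))  # 8 2
```

Calling the function again with the same seed returns the same partition, which is the purpose of the `seed` argument.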

to_backend(backend: str | Backend) → SampleCollection

Returns a new SampleCollection with all Data objects converted to the specified backend.