Working with the datasets subpackage
The datasets subpackage provides robust and flexible data loading and management functionality tailored for machine learning models. This tutorial guides you through using the subpackage to handle and prepare your data efficiently.
Using the DatasetsManager class
The DatasetsManager class in the MED3pa.datasets submodule facilitates the management of the various datasets needed for model training and evaluation. This section provides a step-by-step guide to setting up and using the DatasetsManager to handle data efficiently.
Step 1: Importing the DatasetsManager
First, import the DatasetsManager from the MED3pa.datasets submodule:
from MED3pa.datasets import DatasetsManager
Step 2: Creating an Instance of DatasetsManager
Create an instance of DatasetsManager. This instance will manage all operations related to datasets:
manager = DatasetsManager()
Step 3: Loading Datasets
With the DatasetsManager, you can load the various segments of your base model datasets, such as the training, validation, reference, and testing datasets. You don’t need to load all datasets at once. Provide the path to your dataset and the name of the target column:
Loading from File
manager.set_from_file(dataset_type="training", file='./path_to_training_dataset.csv', target_column_name='target_column')
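The other segments are loaded the same way. A sketch assuming the remaining dataset_type values follow the names above and that CSV files exist at these hypothetical paths:
# Load the remaining dataset segments from their own CSV files
manager.set_from_file(dataset_type="validation", file='./path_to_validation_dataset.csv', target_column_name='target_column')
manager.set_from_file(dataset_type="reference", file='./path_to_reference_dataset.csv', target_column_name='target_column')
manager.set_from_file(dataset_type="testing", file='./path_to_testing_dataset.csv', target_column_name='target_column')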
Loading from NumPy Arrays
You can also load the datasets from NumPy arrays. In this case, specify the features and true labels, as well as the column labels as a list (excluding the target column) if they have not already been set.
import numpy as np
import pandas as pd
df = pd.read_csv('./path_to_validation_dataset.csv')
# Extract features and true labels
X_val = df.drop(columns='target_column').values
y_val = df['target_column'].values
# Example of setting data from numpy arrays
manager.set_from_data(dataset_type="validation", observations=X_val, true_labels=y_val)
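If this is the first dataset you load, the manager has no column labels yet and they must be passed explicitly; a sketch assuming the parameter is named column_labels:
# Column labels exclude the target column; `column_labels` as a parameter name is an assumption
column_labels = df.drop(columns='target_column').columns.tolist()
manager.set_from_data(dataset_type="validation", observations=X_val, true_labels=y_val, column_labels=column_labels)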
Step 4: Ensuring Feature Consistency
Upon loading the first dataset, the DatasetsManager automatically extracts and stores the names of the features. You can retrieve the list of these features using:
features = manager.get_column_labels()
Ensure that the features of subsequent datasets (e.g., validation or testing) match those of the initially loaded dataset to avoid errors and maintain data consistency.
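For example, a minimal consistency check before loading a new file, using only pandas and the get_column_labels accessor shown above:
import pandas as pd
new_df = pd.read_csv('./path_to_testing_dataset.csv')
new_features = [c for c in new_df.columns if c != 'target_column']
# Compare against the features recorded from the first loaded dataset
if new_features != manager.get_column_labels():
    raise ValueError("Feature columns do not match the previously loaded datasets")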
Step 5: Retrieving Data
Retrieve the loaded data in different formats as needed.
As NumPy Arrays
observations, labels = manager.get_dataset_by_type(dataset_type="training")
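Both values come back as NumPy arrays, so you can inspect them directly; for example:
print(observations.shape)  # (num_samples, num_features), e.g. (151, 23)
print(labels.shape)        # (num_samples,), e.g. (151,)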
As a MaskedDataset Instance
To work with the data encapsulated in a MaskedDataset instance, which offers additional functionality, retrieve it by setting return_instance to True:
training_dataset = manager.get_dataset_by_type(dataset_type="training", return_instance=True)
Step 6: Getting a Summary
You can print a summary of the DatasetsManager to see the status of the datasets:
manager.summarize()
Step 7: Saving and Resetting Datasets
You can save a specific dataset to a CSV file or reset all datasets managed by the DatasetsManager.
Save to CSV
manager.save_dataset_to_csv(dataset_type="training", file_path='./path_to_save_training_dataset.csv')
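As a quick sanity check, the saved file is a plain CSV and can be reloaded with pandas:
import pandas as pd
saved = pd.read_csv('./path_to_save_training_dataset.csv')
print(saved.shape)  # rows and columns of the saved training dataset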
Reset Datasets
manager.reset_datasets()
manager.summarize() # Verify that all datasets are reset
Summary of Outputs
When you run the summarize method, you should get an output similar to this, indicating the status and details of each dataset:
training_set: {'num_samples': 151, 'num_features': 23, 'has_pseudo_labels': False, 'has_pseudo_probabilities': False, 'has_confidence_scores': False}
validation_set: {'num_samples': 1000, 'num_features': 10, 'has_pseudo_labels': False, 'has_pseudo_probabilities': False, 'has_confidence_scores': False}
reference_set: Not set
testing_set: Not set
column_labels: ['feature_1', 'feature_2', ..., 'feature_23']
Using the MaskedDataset Class
The MaskedDataset class, a crucial component of the MED3pa.datasets submodule, facilitates nuanced data operations that are essential for custom data manipulation and model training processes. This section details common usage scenarios of the MaskedDataset.
Step 1: Importing Necessary Modules
Begin by importing the MaskedDataset and DatasetsManager, along with NumPy for additional data operations:
from MED3pa.datasets import MaskedDataset, DatasetsManager
import numpy as np
Step 2: Loading Data with DatasetsManager
Retrieve the dataset as a MaskedDataset instance:
manager = DatasetsManager()
manager.set_from_file(dataset_type="training", file='./path_to_training_dataset.csv', target_column_name='target_column')
training_dataset = manager.get_dataset_by_type(dataset_type="training", return_instance=True)
Step 3: Performing Operations on MaskedDataset
Once you have your dataset loaded as a MaskedDataset instance, you can perform various operations:
Cloning the Dataset
Create a copy of the dataset to ensure the original data remains unchanged during experimentation:
cloned_instance = training_dataset.clone()
Sampling the Dataset
Randomly sample a subset of the dataset, useful for creating training or validation splits:
sampled_instance = training_dataset.sample(N=20, seed=42)
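The seed makes the draw reproducible: calling sample again with the same arguments yields the same subset. A quick check, assuming sample returns another MaskedDataset:
same_sample = training_dataset.sample(N=20, seed=42)
print(len(sampled_instance), len(same_sample))  # both draws contain 20 samples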
Refining the Dataset
Refine the dataset based on a boolean mask, which is useful for filtering out unwanted data points:
mask = np.random.rand(len(training_dataset)) > 0.5
remaining_samples = training_dataset.refine(mask=mask)
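In practice the mask usually comes from the data itself rather than a random draw. A minimal sketch that keeps only the rows whose first feature is positive, built from the documented get_observations accessor:
observations = training_dataset.get_observations()
feature_mask = observations[:, 0] > 0  # keep rows whose first feature is positive
remaining_samples = training_dataset.refine(mask=feature_mask)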
Setting Pseudo Labels and Probabilities
Set pseudo labels and probabilities for the dataset. You only need to pass the pseudo_probabilities, along with the threshold used to derive the pseudo_labels:
pseudo_probs = np.random.rand(len(training_dataset))
training_dataset.set_pseudo_probs_labels(pseudo_probabilities=pseudo_probs, threshold=0.5)
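In a real workflow the probabilities would typically come from a trained model rather than np.random. A sketch assuming model is a hypothetical fitted binary classifier exposing the common predict_proba interface:
# `model` is a hypothetical fitted binary classifier (scikit-learn style)
probs = model.predict_proba(training_dataset.get_observations())[:, 1]
training_dataset.set_pseudo_probs_labels(pseudo_probabilities=probs, threshold=0.5)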
Getting Feature Vectors and Labels
Retrieve the feature vectors, true labels, and pseudo labels:
observations = training_dataset.get_observations()
true_labels = training_dataset.get_true_labels()
pseudo_labels = training_dataset.get_pseudo_labels()
Getting Confidence Scores
Get the confidence scores if available:
confidence_scores = training_dataset.get_confidence_scores()
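If no scores were set, the accessor presumably has nothing to return; a defensive check, assuming it yields None in that case:
if confidence_scores is None:  # assumption: unset scores come back as None
    print("No confidence scores have been set for this dataset")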
Converting to DataFrame and Saving to CSV
You can convert the dataset to a DataFrame, or save it as a .csv file using save_to_csv and providing the path. Saving writes the observations, true_labels, pseudo_labels, and pseudo_probabilities, along with the confidence_scores if they were set:
df = training_dataset.to_dataframe()
training_dataset.save_to_csv('./path_to_save_training_dataset.csv')
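The returned DataFrame is a plain pandas object, so the usual inspection tools apply:
print(df.head())            # preview the first rows, including any pseudo label columns
print(df.columns.tolist())  # list the dataset's columns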
Getting Dataset Information
Get detailed information about the dataset directly using the summarize method:
training_dataset.summarize()
When you run the summarize method, you should get an output similar to this, indicating the status and details of the dataset:
Number of samples: 151
Number of features: 23
Has pseudo labels: False
Has pseudo probabilities: False
Has confidence scores: False