ise.data package
Submodules
ise.data.dataclasses module
- class ise.data.dataclasses.EmulatorDataset(X, y, sequence_length=5, projection_length=86)[source]
Bases: Dataset
A PyTorch dataset for loading emulator data.
- Parameters:
X (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The input data.
y (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The target data.
sequence_length (int) - The length of the input sequence. Defaults to 5.
projection_length (int) - The number of timesteps in a full projection. Defaults to 86.
- X
The input data as a PyTorch tensor.
- Type:
torch.Tensor
- y
The target data as a PyTorch tensor.
- Type:
torch.Tensor
- sequence_length
The length of the input sequence.
- Type:
int
- __to_tensor(x)
Converts input data to a PyTorch tensor.
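The windowing an EmulatorDataset-style loader performs can be sketched in plain NumPy: slice the inputs into overlapping sequences of length sequence_length, one window per timestep. The front-padding convention below is an assumption, and the real class returns torch.Tensor objects rather than NumPy arrays:

```python
import numpy as np

def to_sequences(X, sequence_length=5):
    """Slice a 2-D (time, features) array into overlapping windows.

    Simplified sketch of EmulatorDataset-style windowing; the front-padding
    convention is an assumption, and the real class returns torch.Tensors.
    """
    X = np.asarray(X, dtype=np.float32)
    # Repeat the first timestep so every row gets a full-length window.
    pad = np.repeat(X[:1], sequence_length - 1, axis=0)
    Xp = np.concatenate([pad, X], axis=0)
    return np.stack([Xp[i : i + sequence_length] for i in range(len(X))])

windows = to_sequences(np.arange(12, dtype=np.float32).reshape(6, 2), sequence_length=3)
print(windows.shape)  # (6, 3, 2): one length-3 window per timestep
```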
- class ise.data.dataclasses.PyTorchDataset(X, y)[source]
Bases: Dataset
A PyTorch dataset for general data loading.
- Parameters:
X (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The input data.
y (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The target data.
- class ise.data.dataclasses.TSDataset(X, y, sequence_length=5)[source]
Bases: Dataset
A PyTorch dataset for time series data.
- Parameters:
X (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The input data.
y (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The target data.
sequence_length (int) - The length of the input sequence.
- X
The input data as a PyTorch tensor.
- Type:
torch.Tensor
- y
The target data as a PyTorch tensor.
- Type:
torch.Tensor
- sequence_length
The length of the input sequence.
- Type:
int
ise.data.feature_engineer module
- class ise.data.feature_engineer.FeatureEngineer(ice_sheet, data: DataFrame, fill_mrro_nans: bool = False, split_dataset: bool = False, train_size: float = 0.7, val_size: float = 0.15, test_size: float = 0.15, output_directory: str | None = None)[source]
Bases: object
A class for feature engineering operations on a given dataset.
- Parameters:
ice_sheet (str) - The ice sheet being modeled. Must be ‘AIS’ or ‘GIS’.
data (pd.DataFrame) - The input dataset.
fill_mrro_nans (bool, optional) - Flag indicating whether to fill missing values in the ‘mrro’ column. Defaults to False.
split_dataset (bool, optional) - Flag indicating whether to split the dataset into training, validation, and test sets. Defaults to False.
train_size (float, optional) - The proportion of the dataset to be used for training. Defaults to 0.7.
val_size (float, optional) - The proportion of the dataset to be used for validation. Defaults to 0.15.
test_size (float, optional) - The proportion of the dataset to be used for testing. Defaults to 0.15.
output_directory (str, optional) - The directory to save the split datasets. Defaults to None.
- fill_mrro_nans(method, data=None)[source]
Fills missing values in the ‘mrro’ column of the dataset.
- Parameters:
method (str) - The method to use for filling missing values.
data (pd.DataFrame, optional) - The input dataset. If not provided, the class attribute ‘data’ will be used. Defaults to None.
- Returns:
The dataset with missing values in the ‘mrro’ column filled.
- Return type:
pd.DataFrame
- split_data(data=None, train_size=None, val_size=None, test_size=None, output_directory=None, random_state=42)[source]
Splits the dataset into training, validation, and test sets.
- Parameters:
data (pd.DataFrame, optional) - The input dataset. If not provided, the class attribute ‘data’ will be used. Defaults to None.
train_size (float, optional) - The proportion of the dataset to be used for training. If not provided, the class attribute ‘train_size’ will be used. Defaults to None.
val_size (float, optional) - The proportion of the dataset to be used for validation. If not provided, the class attribute ‘val_size’ will be used. Defaults to None.
test_size (float, optional) - The proportion of the dataset to be used for testing. If not provided, the class attribute ‘test_size’ will be used. Defaults to None.
output_directory (str, optional) - The directory to save the split datasets. If not provided, the class attribute ‘output_directory’ will be used. Defaults to None.
random_state (int, optional) - The random seed for reproducibility. Defaults to 42.
- Returns:
A tuple containing the training, validation, and test sets.
- Return type:
tuple
- ise.data.feature_engineer.add_lag_variables(data: DataFrame, lag: int, verbose=True) DataFrame[source]
Adds lag variables to the input DataFrame.
- Parameters:
data (pd.DataFrame) - The input DataFrame.
lag (int) - The number of time steps to lag the variables.
- Returns:
The DataFrame with lag variables added.
- Return type:
pd.DataFrame
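The effect of lagging can be sketched with pandas groupby/shift: each lag column is the variable shifted down k rows within its own simulation, so values never leak across simulations. The helper name and the ‘{col}.lag{k}’ column naming below are illustrative, not necessarily the library's:

```python
import pandas as pd

def add_lags(data, cols, lag):
    """Append lagged copies of `cols` within each simulation (grouped by
    'id'). A sketch of what add_lag_variables does; the helper name and
    the '{col}.lag{k}' naming are illustrative, not the library's."""
    out = data.copy()
    for col in cols:
        for k in range(1, lag + 1):
            # shift(k) within each group leaves NaNs in the first k rows
            # of every simulation rather than leaking across simulations.
            out[f"{col}.lag{k}"] = out.groupby("id")[col].shift(k)
    return out

df = pd.DataFrame({"id": [0, 0, 0, 1, 1, 1], "ts": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
lagged = add_lags(df, ["ts"], lag=2)
print(list(lagged.columns))  # ['id', 'ts', 'ts.lag1', 'ts.lag2']
```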
- ise.data.feature_engineer.add_model_characteristics(data, model_char_path='./ise/utils/model_characteristics.csv', encode=True, ids_path=None) DataFrame[source]
- ise.data.feature_engineer.backfill_outliers(data, percentile=99.999)[source]
Replaces extreme y-values (above the specified percentile, or below its complement, 100 − percentile, across all y-values) with the value from the previous row.
- Parameters:
data (pd.DataFrame) - The input DataFrame.
percentile (float) - The percentile used to define upper extreme values across all y-values. Defaults to 99.999.
- ise.data.feature_engineer.drop_outliers(data: DataFrame, column: str, method: str, expression: List[tuple] | None = None, quantiles: List[float] = [0.01, 0.99])[source]
Drops entire simulations that are outliers according to the given method. Because each simulation is a full 86-row projection series, the whole series is removed rather than only the rows matching the condition. Note that the condition indicates rows to be DROPPED, not kept (e.g. (‘sle’, ‘>’, 20) drops every simulation containing sle values over 20). With the quantile method, outliers are dropped from the sle column based on the bounds given in the quantiles argument. With the explicit method, expression must be a list of (column, operator, value) tuples defining the subset, e.g. [(“sle”, “>”, 20), (“sle”, “<”, -20)].
- Parameters:
data (pd.DataFrame) - The input DataFrame.
method (str) - Method of outlier deletion; must be ‘quantile’ or ‘explicit’.
expression (list[tuple], optional) - List of (column, operator, value) tuples for the ‘explicit’ method. Defaults to None.
quantiles (list[float], optional) - List of lower and upper quantiles for the ‘quantile’ method. Defaults to [0.01, 0.99].
- Returns:
The DataFrame with outlier simulations dropped.
- Return type:
pd.DataFrame
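The series-level deletion described above can be sketched as follows: flag rows matching any expression, collect the ids of simulations containing a flagged row, and drop those simulations whole. Names here are illustrative of drop_outliers' explicit mode, not its implementation:

```python
import operator
import pandas as pd

_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def drop_simulations(data, expressions):
    """Drop every complete simulation (grouped by 'id') that contains at
    least one row matching an expression. Mirrors the spirit of
    drop_outliers' explicit mode; names here are illustrative."""
    bad = pd.Series(False, index=data.index)
    for col, op, value in expressions:
        bad |= _OPS[op](data[col], value)
    bad_ids = data.loc[bad, "id"].unique()
    # Remove the whole series, not just the offending rows.
    return data[~data["id"].isin(bad_ids)].reset_index(drop=True)

df = pd.DataFrame({"id": [0, 0, 1, 1], "sle": [1.0, 25.0, 2.0, 3.0]})
kept = drop_simulations(df, [("sle", ">", 20)])
print(kept["id"].unique().tolist())  # [1]
```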
- ise.data.feature_engineer.fill_mrro_nans(data: DataFrame, method) DataFrame[source]
Fills NaN values in the ‘mrro’ column using the given method.
- Parameters:
data (pd.DataFrame) - The input DataFrame.
method (str or int) - The method to fill NaN values. Must be one of ‘zero’, ‘mean’, ‘median’, or ‘drop’.
- Returns:
The DataFrame with NaN values filled according to the specified method.
- Return type:
pd.DataFrame
- Raises:
ValueError - If the method is not one of ‘zero’, ‘mean’, ‘median’, or ‘drop’.
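The four documented fill options amount to standard pandas operations; a minimal sketch (the helper name is illustrative, not the library's):

```python
import pandas as pd

def fill_nans(series, method):
    """Fill NaNs by one of 'zero', 'mean', 'median', or 'drop', the four
    options fill_mrro_nans documents (helper name is illustrative)."""
    if method == "zero":
        return series.fillna(0.0)
    if method == "mean":
        return series.fillna(series.mean())
    if method == "median":
        return series.fillna(series.median())
    if method == "drop":
        return series.dropna()
    raise ValueError("method must be one of 'zero', 'mean', 'median', 'drop'")

mrro = pd.Series([1.0, None, 3.0])
print(fill_nans(mrro, "mean").tolist())  # [1.0, 2.0, 3.0]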
- ise.data.feature_engineer.split_training_data(data, train_size, val_size, test_size=None, output_directory=None, random_state=42)[source]
Splits the input data into training, validation, and test sets based on the specified sizes.
- Parameters:
data (str or pandas.DataFrame) - The input data to be split. It can be either a file path (str) or a pandas DataFrame.
train_size (float) - The proportion of data to be used for training.
val_size (float) - The proportion of data to be used for validation.
test_size (float, optional) - The proportion of data to be used for testing. If not provided, the remaining data after training and validation will be used for testing. Defaults to None.
output_directory (str, optional) - The directory where the split data will be saved as CSV files. Defaults to None.
random_state (int, optional) - The random seed for shuffling the data. Defaults to 42.
- Returns:
A tuple containing the training, validation, and test sets as pandas DataFrames.
- Return type:
tuple
- Raises:
ValueError - If the length of data is not divisible by 86, indicating incomplete projections.
ValueError - If the data does not have a column named ‘id’.
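Because each projection is a contiguous 86-row series identified by ‘id’, the split must partition ids, not rows, so no series straddles two sets. A sketch of that contract, assuming the documented defaults (this is not the library's implementation):

```python
import numpy as np
import pandas as pd

def split_by_id(data, train_size=0.7, val_size=0.15, random_state=42):
    """Shuffle simulation ids and partition rows by id, so every full
    projection series lands in exactly one split. A sketch of
    split_training_data's contract, not the library's implementation."""
    ids = data["id"].unique()
    rng = np.random.default_rng(random_state)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_size)
    n_val = int(len(ids) * val_size)
    train_ids = ids[:n_train]
    val_ids = ids[n_train : n_train + n_val]
    test_ids = ids[n_train + n_val :]  # remainder goes to the test set
    part = lambda keep: data[data["id"].isin(keep)].reset_index(drop=True)
    return part(train_ids), part(val_ids), part(test_ids)

df = pd.DataFrame({"id": np.repeat(np.arange(10), 4), "x": np.random.rand(40)})
train, val, test = split_by_id(df)
print(len(train), len(val), len(test))
```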
ise.data.process module
- class ise.data.process.DatasetMerger(ice_sheet, forcings, projections, experiment_file, output_dir)[source]
Bases: object
A class for merging datasets from forcing and projection files.
- class ise.data.process.DimensionalityReducer(forcing_dir, projection_dir, output_dir, ice_sheet=None, scaling_method=None)[source]
Bases: object
- convert_forcings(forcing_files: list | None = None, pca_model_directory: str | None = None, output_dir: str | None = None, scaling_method=None)[source]
Converts atmospheric and oceanic forcing files to PCA space using pretrained PCA models.
- Parameters:
forcing_files (list, optional) - List of specific forcing files to convert. If not provided, all files in the directory will be used. Default is None.
pca_model_directory (str, optional) - Directory containing the pretrained PCA models. If not provided, the directory specified during object initialization will be used. Default is None.
output_dir (str, optional) - Directory to save the converted files. If not provided, the directory specified during object initialization will be used. Default is None.
scaling_method (str, optional) - The scaling method applied before PCA. If not provided, the method specified during object initialization will be used. Default is None.
- Returns:
0 indicating successful conversion.
- Return type:
int
- convert_projections(projection_files: list | None = None, pca_model_directory: str | None = None, output_dir: str | None = None, scaling_method=None)[source]
- generate_pca_models(num_forcing_pcs, num_projection_pcs, scaling_method='standard')[source]
Generate principal component analysis (PCA) models for atmosphere and ocean variables.
- Parameters:
num_forcing_pcs (int) - The number of principal components to retain for the forcing variables.
num_projection_pcs (int) - The number of principal components to retain for the projection variables.
scaling_method (str, optional) - The scaling method applied before PCA. Defaults to ‘standard’.
- Returns:
0 if successful.
- Return type:
int
- invert(pca_x, var_name, pca_model_directory=None, scaler_directory=None)[source]
Invert the given variable from PCA space.
- Parameters:
pca_x (array-like) - The input array containing the variables in PCA space.
var_name (str) - The name of the variable to invert.
pca_model_directory (str, optional) - Directory containing the pretrained PCA models. Defaults to None.
scaler_directory (str, optional) - Directory containing the fitted scalers. Defaults to None.
- Returns:
The inverted array.
- Return type:
array-like
- transform(x, var_name, num_pcs=None, pca_model_directory=None, scaler_directory=None, scaling_method='standard')[source]
Transform the given variable into PCA space.
- Parameters:
x (array-like) - The input array containing the variables.
var_name (str) - The name of the variable to transform.
num_pcs (int, optional) - The number of principal components to retain. Defaults to None.
pca_model_directory (str, optional) - Directory containing the pretrained PCA models. Defaults to None.
scaler_directory (str, optional) - Directory containing the fitted scalers. Defaults to None.
scaling_method (str, optional) - The scaling method applied before PCA. Defaults to ‘standard’.
- Returns:
The transformed array.
- Return type:
array-like
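The PCA transform/invert round trip these two methods wrap can be illustrated with a plain NumPy SVD. All names below are illustrative, and the scaler step the real methods load from scaler_directory is omitted:

```python
import numpy as np

# Sketch of the PCA transform/invert round trip: project mean-centered
# data onto the leading principal components, then reconstruct.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))       # stand-in for flattened forcing fields
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
k = 3                               # number of retained PCs (cf. num_pcs)
pcs = (X - mu) @ Vt[:k].T           # transform: data space -> PCA space
X_hat = pcs @ Vt[:k] + mu           # invert: PCA space -> data space
print(pcs.shape, X_hat.shape)       # (100, 3) (100, 8)
```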
- class ise.data.process.ProjectionProcessor(ice_sheet, forcings_directory, projections_directory, scalefac_path=None, densities_path=None)[source]
Bases: object
A class for processing ice sheet data.
- Attributes:
ice_sheet (str) - Ice sheet to be processed. Must be ‘AIS’ or ‘GIS’.
forcings_directory (str) - The path to the directory containing the forcings data.
projections_directory (str) - The path to the directory containing the projections data.
scalefac_path (str) - The path to the netCDF file containing scaling factors for each grid cell.
densities_path (str) - The path to the CSV file containing ice and ocean density (rhow/rhoi) data for each experiment.
- Methods:
__init__(self, ice_sheet, forcings_directory, projections_directory, scalefac_path=None, densities_path=None) - Initializes the Processor object.
process_forcings(self) - Processes the forcings data.
process_projections(self, output_directory) - Processes the projections data.
_calculate_ivaf_minus_control(self, data_directory, densities_fp, scalefac_path) - Calculates the ice volume above flotation (IVAF) for each file in the given data directory, subtracting out the control projection IVAF if applicable.
_calculate_ivaf_single_file(self, directory, densities, scalefac_model, ctrl_proj=False) - Calculates the IVAF for a single file.
- process()[source]
Processes the ISMIP6 projections by calculating IVAF for both controls and experiments, subtracting the control IVAF from each experiment, and exporting the resulting IVAF files.
- Parameters:
output_directory (str) - The directory to save the processed projections.
- Raises:
ValueError - If projections_directory or output_directory is not specified.
- Returns:
1 indicating successful processing.
- Return type:
int
- ise.data.process.combine_gris_forcings(forcing_dir)[source]
Combine GrIS forcings from multiple CMIP directories into a single NetCDF file.
- Parameters:
forcing_dir (str) - The directory containing the GrIS forcings.
- Returns:
0 indicating successful completion.
- Return type:
int
- ise.data.process.get_model_densities(zenodo_directory: str, output_path: str | None = None)[source]
Extracts values for rhoi and rhow from NetCDF files in the specified directory and returns a pandas DataFrame containing the group, model, rhoi, and rhow values for each file.
- Parameters:
zenodo_directory (str) - The path to the directory containing the NetCDF files.
output_path (str, optional) - The path to save the resulting DataFrame as a CSV file.
- Returns:
A DataFrame containing the group, model, rhoi, and rhow values for each file.
- Return type:
pandas.DataFrame
- ise.data.process.get_xarray_data(dataset_fp, var_name=None, ice_sheet='AIS', convert_and_subset=False)[source]
Retrieves data from an xarray dataset.
- Parameters:
dataset_fp (str) - The file path to the xarray dataset.
var_name (str, optional) - The name of the variable to retrieve from the dataset. Defaults to None.
ice_sheet (str, optional) - The ice sheet type. Defaults to ‘AIS’.
convert_and_subset (bool, optional) - Flag indicating whether to convert and subset the dataset. Defaults to False.
- Returns:
The retrieved data from the dataset.
- Return type:
np.ndarray or xr.Dataset
- ise.data.process.interpolate_values(data)[source]
Interpolates missing values in the x and y dimensions of the input NetCDF data using linear interpolation.
- Parameters:
data - A NetCDF file containing x and y dimensions with missing values.
- Returns:
A tuple containing the interpolated x and y arrays.
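The core operation, applied along each dimension, is 1-D linear interpolation over missing entries. A self-contained sketch (the helper name is illustrative):

```python
import numpy as np

def interp_nans(arr):
    """Linearly interpolate over NaNs in a 1-D array, the core operation
    interpolate_values applies along the x and y dimensions."""
    arr = np.asarray(arr, dtype=float)
    nans = np.isnan(arr)
    idx = np.arange(arr.size)
    # np.interp fills each NaN position from its nearest valid neighbors.
    arr[nans] = np.interp(idx[nans], idx[~nans], arr[~nans])
    return arr

filled = interp_nans([0.0, np.nan, 2.0, np.nan, 4.0])
print(filled.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0]
```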
ise.data.scaler module
- class ise.data.scaler.StandardScaler[source]
Bases: Module
A class for scaling input data using mean and standard deviation.
- mean_
The mean values of the input data.
- Type:
torch.Tensor
- scale_
The standard deviation values of the input data.
- Type:
torch.Tensor
- device
The device (CPU or GPU) on which the calculations are performed.
- Type:
torch.device
- fit(X)[source]
Computes the mean and standard deviation of the input data.
- Parameters:
X (torch.Tensor) - The input data to be scaled.
- inverse_transform(X)[source]
Reverses the scaling operation on the input data.
- Parameters:
X (torch.Tensor) - The scaled input data to be transformed back.
- Returns:
The transformed input data.
- Return type:
torch.Tensor
- Raises:
RuntimeError - If the Scaler instance is not fitted yet.
- static load(path)[source]
Loads the mean and standard deviation from a file.
- Parameters:
path (str) - The path to load the file from.
- Returns:
A Scaler instance with the loaded mean and standard deviation.
- Return type:
Scaler
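The arithmetic behind fit, inverse_transform, and the not-fitted error can be mimicked in a few lines of NumPy. This is a sketch of the scaling math only; the torch-based class additionally handles tensors and device placement:

```python
import numpy as np

class SimpleScaler:
    """NumPy mimic of the arithmetic behind the torch-based StandardScaler:
    fit stores mean_/scale_, transform standardizes, inverse_transform
    undoes it. The torch class additionally manages device placement."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self
    def transform(self, X):
        return (X - self.mean_) / self.scale_
    def inverse_transform(self, X):
        if not hasattr(self, "mean_"):
            raise RuntimeError("Scaler instance is not fitted yet.")
        return X * self.scale_ + self.mean_

X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
scaler = SimpleScaler().fit(X)
Z = scaler.transform(X)
print(np.allclose(scaler.inverse_transform(Z), X))  # True
```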