ise.data
ise.data.dataclasses
- class ise.data.dataclasses.EmulatorDataset(X, y, sequence_length=5, projection_length=86)[source]
Bases:
Dataset
A PyTorch dataset for loading emulator data, designed to handle sequence-based inputs and projections.
- Parameters:
X (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The input data.
y (pandas.DataFrame, numpy.ndarray, or torch.Tensor) - The target data.
sequence_length (int, optional) - The length of the input sequence. Default is 5.
projection_length (int or tuple, optional) - The length of the projection period. Default is 86.
- X
The input data converted to a PyTorch tensor.
- Type:
torch.Tensor
- y
The target data converted to a PyTorch tensor.
- Type:
torch.Tensor
- sequence_length
The length of the input sequence.
- Type:
int
- xdim
The number of dimensions in X.
- Type:
int
- num_projections
The number of projections in the dataset.
- Type:
int
- num_timesteps
The number of timesteps per projection.
- Type:
int
- num_features
The number of features in the dataset.
- Type:
int
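To illustrate the sequence handling an `EmulatorDataset`-style loader performs, the sketch below builds overlapping input windows of length `sequence_length` with NumPy. This is a hypothetical analogue, not the library's implementation: the zero-padding of early timesteps and the `(samples, sequence_length, features)` layout are assumptions.

```python
import numpy as np

def make_sequences(X, sequence_length=5):
    """Build overlapping input windows from a 2-D (timesteps, features) array.

    Sample i holds the `sequence_length` rows ending at row i; when fewer
    rows precede i, the window is left-padded with zeros (an assumption).
    """
    num_rows, num_features = X.shape
    windows = np.zeros((num_rows, sequence_length, num_features))
    for i in range(num_rows):
        start = max(0, i - sequence_length + 1)
        chunk = X[start : i + 1]
        windows[i, -len(chunk):] = chunk  # left-pad with zeros
    return windows

X = np.arange(12, dtype=float).reshape(6, 2)  # 6 timesteps, 2 features
w = make_sequences(X, sequence_length=3)
print(w.shape)  # (6, 3, 2)
```

Each of the 6 samples is then a ready-made input for a sequence model; converting `windows` with `torch.from_numpy` would yield the tensors the dataset exposes.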
- class ise.data.dataclasses.PyTorchDataset(X, y)[source]
Bases:
Dataset
A PyTorch dataset for general-purpose data loading.
- Parameters:
X (torch.Tensor) - The input data.
y (torch.Tensor) - The target data.
- class ise.data.dataclasses.ScenarioDataset(features, labels)[source]
Bases:
Dataset
A PyTorch dataset designed for scenario-based data loading.
- Parameters:
features (torch.Tensor) - The input features.
labels (torch.Tensor) - The target labels.
- features
The input features.
- Type:
torch.Tensor
- labels
The target labels.
- Type:
torch.Tensor
- class ise.data.dataclasses.TSDataset(X, y, sequence_length=5)[source]
Bases:
Dataset
A PyTorch dataset for handling time series data with sequence-based input.
- Parameters:
X (torch.Tensor) - The input data.
y (torch.Tensor) - The target data.
sequence_length (int, optional) - The length of the input sequence. Default is 5.
- X
The input data.
- Type:
torch.Tensor
- y
The target data.
- Type:
torch.Tensor
- sequence_length
The sequence length.
- Type:
int
ise.data.feature_engineer
- class ise.data.feature_engineer.FeatureEngineer(ice_sheet, data: DataFrame, fill_mrro_nans: bool = False, split_dataset: bool = False, train_size: float = 0.7, val_size: float = 0.15, test_size: float = 0.15, output_directory: str = None)[source]
Bases:
object
A class for performing feature engineering on a given dataset, including preprocessing, scaling, dataset splitting, and outlier handling.
- Parameters:
ice_sheet (str) - The name of the ice sheet being analyzed.
data (pd.DataFrame) - The input dataset.
fill_mrro_nans (bool, optional) - Whether to fill missing values in the ‘mrro’ column. Defaults to False.
split_dataset (bool, optional) - Whether to split the dataset into training, validation, and test sets. Defaults to False.
train_size (float, optional) - Proportion of data to use for training. Defaults to 0.7.
val_size (float, optional) - Proportion of data to use for validation. Defaults to 0.15.
test_size (float, optional) - Proportion of data to use for testing. Defaults to 0.15.
output_directory (str, optional) - Directory to save the split datasets. Defaults to None.
- data
The input dataset.
- Type:
pd.DataFrame
- train_size
Proportion of training data.
- Type:
float
- val_size
Proportion of validation data.
- Type:
float
- test_size
Proportion of testing data.
- Type:
float
- output_directory
Directory to save datasets.
- Type:
str
- scaler_X_path
Path to the saved input feature scaler.
- Type:
str
- scaler_y_path
Path to the saved target variable scaler.
- Type:
str
- scaler_X
Scaler for input features.
- Type:
scaler object
- scaler_y
Scaler for target variables.
- Type:
scaler object
- train
Training dataset.
- Type:
pd.DataFrame
- val
Validation dataset.
- Type:
pd.DataFrame
- test
Test dataset.
- Type:
pd.DataFrame
- _including_model_characteristics
Whether model characteristics have been included.
- Type:
bool
- add_lag_variables(lag, data=None)[source]
Adds lagged versions of predictor variables to the dataset.
- Parameters:
lag (int) - Number of time steps to lag the variables.
data (pd.DataFrame, optional) - The dataset. If not provided, the class attribute ‘data’ is used.
- Returns:
The modified instance with lag variables added.
- Return type:
FeatureEngineer
- add_model_characteristics(data=None, model_char_path=None, encode=True, ids_path=None)[source]
Merges model characteristic data with the dataset.
- Parameters:
data (pd.DataFrame, optional) - The dataset. If not provided, the class attribute ‘data’ is used.
model_char_path (str, optional) - Path to the model characteristics file. Defaults to the internal path.
encode (bool, optional) - Whether to one-hot encode categorical characteristics. Defaults to True.
ids_path (str, optional) - Path to an additional ID mapping file. Defaults to None.
- Returns:
The modified instance with model characteristics added.
- Return type:
FeatureEngineer
- backfill_outliers(percentile=99.999, data=None)[source]
Replaces extreme values in target variables with the previous row’s value.
- Parameters:
percentile (float, optional) - Percentile threshold for identifying outliers. Defaults to 99.999.
data (pd.DataFrame, optional) - The dataset. If not provided, the class attribute ‘data’ is used.
- Returns:
The modified instance with outliers handled.
- Return type:
FeatureEngineer
- drop_outliers(method, column, expression=None, quantiles=[0.01, 0.99], data=None)[source]
Drops simulations that are outliers based on the provided method.
- Parameters:
method (str) - Method of outlier deletion (‘quantile’ or ‘explicit’).
column (str) - Column used for detecting outliers.
expression (list[tuple], optional) - List of filtering expressions in the form [(column, operator, value)]. Defaults to None.
quantiles (list[float], optional) - Quantiles for ‘quantile’ method. Defaults to [0.01, 0.99].
data (pd.DataFrame, optional) - The dataset. If not provided, the class attribute ‘data’ is used.
- Returns:
The modified instance with outliers removed.
- Return type:
FeatureEngineer
- fill_mrro_nans(method, data=None)[source]
Fills missing values in the ‘mrro’ column.
- Parameters:
method (str) - The method used to fill missing values.
data (pd.DataFrame, optional) - The dataset. Defaults to None.
- Returns:
The dataset with missing values filled.
- Return type:
pd.DataFrame
- scale_data(X=None, y=None, method='standard', save_dir=None)[source]
Scales input (X) and target (y) variables using a specified scaling method.
- Parameters:
X (pd.DataFrame or np.ndarray, optional) - Input data. Defaults to None.
y (pd.DataFrame or np.ndarray, optional) - Target data. Defaults to None.
method (str, optional) - Scaling method (‘standard’, ‘minmax’, ‘robust’). Defaults to ‘standard’.
save_dir (str, optional) - Directory to save scalers. Defaults to None.
- Returns:
Scaled X and y values.
- Return type:
tuple
- split_data(data=None, train_size=None, val_size=None, test_size=None, output_directory=None, random_state=42)[source]
Splits the dataset into training, validation, and test sets.
- Parameters:
data (pd.DataFrame, optional) - The input dataset. Defaults to None.
train_size (float, optional) - Proportion of training data. Defaults to None.
val_size (float, optional) - Proportion of validation data. Defaults to None.
test_size (float, optional) - Proportion of testing data. Defaults to None.
output_directory (str, optional) - Directory to save split datasets. Defaults to None.
random_state (int, optional) - Random seed for reproducibility. Defaults to 42.
- Returns:
Training, validation, and test datasets as pandas DataFrames.
- Return type:
tuple
- unscale_data(X=None, y=None, scaler_X_path=None, scaler_y_path=None)[source]
Reverses the scaling transformation for input (X) and target (y) variables.
- Parameters:
X (pd.DataFrame or np.ndarray, optional) - The input data to be unscaled. Defaults to None.
y (pd.DataFrame, np.ndarray, or torch.Tensor, optional) - The target data to be unscaled. Defaults to None.
scaler_X_path (str, optional) - Path to the stored input scaler. Defaults to None.
scaler_y_path (str, optional) - Path to the stored target scaler. Defaults to None.
- Returns:
Unscaled X and y data.
- Return type:
tuple
- ise.data.feature_engineer.add_lag_variables(data: DataFrame, lag: int, verbose=True) DataFrame[source]
Adds lagged variables to the input dataset, creating time-shifted versions of the predictor variables.
- Parameters:
data (pd.DataFrame) - The dataset containing time series data.
lag (int) - The number of time steps to lag the variables.
verbose (bool, optional) - Whether to display a progress bar. Defaults to True.
- Returns:
The dataset with lagged variables added.
- Return type:
pd.DataFrame
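A minimal sketch of what lagging predictor variables looks like with pandas, assuming (hypothetically) that simulations are distinguished by an `id` column so shifted values never leak across runs; the `smb` column and the `.lag{k}` naming are illustrative, not the library's actual conventions.

```python
import pandas as pd

def add_lags(df, columns, lag, group_col="id"):
    """Append lag-1 .. lag-`lag` copies of `columns`, shifted within each
    simulation group so values from one run never bleed into another."""
    out = df.copy()
    for col in columns:
        for k in range(1, lag + 1):
            out[f"{col}.lag{k}"] = out.groupby(group_col)[col].shift(k)
    return out

df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "smb": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lags(df, columns=["smb"], lag=2)
print(lagged["smb.lag1"].tolist())  # [nan, 1.0, 2.0, nan, 10.0, 20.0]
```

The `groupby(...).shift(k)` idiom is the key design point: a plain `shift` would pull the final timesteps of one simulation into the start of the next.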
- ise.data.feature_engineer.add_model_characteristics(data, model_char_path='./ise/utils/model_characteristics.csv', encode=True, ids_path=None) DataFrame[source]
Adds model characteristics to the dataset.
- Parameters:
data (pd.DataFrame) - The input dataset.
model_char_path (str, optional) - Path to the model characteristics file. Defaults to internal path.
encode (bool, optional) - Whether to one-hot encode categorical characteristics. Defaults to True.
ids_path (str, optional) - Path to an additional ID mapping file. Defaults to None.
- Returns:
The dataset with model characteristics added.
- Return type:
pd.DataFrame
- ise.data.feature_engineer.backfill_outliers(data, percentile=99.999)[source]
Replaces extreme y-values (those above the specified percentile, or below its complement 100 − percentile, across all y-values) with the value from the previous row.
- Parameters:
data (pd.DataFrame) - The dataset containing y-values.
percentile (float, optional) - The percentile threshold to define upper extreme values. Defaults to 99.999.
- Returns:
The dataset with extreme values replaced using backfill.
- Return type:
pd.DataFrame
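A hedged sketch of this replace-with-previous-row strategy using pandas: out-of-bounds values are masked and forward-filled from the preceding row. The exact bound computation (percentile over all target columns pooled together) is an assumption, as is the `sle` column name.

```python
import numpy as np
import pandas as pd

def replace_extremes_with_previous(df, columns, percentile=99.999):
    """Mask values outside the [100 - percentile, percentile] bounds and
    fill each masked cell with the value from the previous row."""
    out = df.copy()
    values = out[columns].to_numpy()
    upper = np.percentile(values, percentile)
    lower = np.percentile(values, 100 - percentile)
    for col in columns:
        masked = out[col].mask((out[col] > upper) | (out[col] < lower))
        out[col] = masked.ffill()
    return out

df = pd.DataFrame({"sle": [0.1, 0.1, 500.0, 0.3]})
clean = replace_extremes_with_previous(df, ["sle"], percentile=99)
print(clean["sle"].tolist())  # [0.1, 0.1, 0.1, 0.3]
```

Note that a masked value in the very first row has no predecessor and would remain NaN; how the real function handles that edge case is not documented here.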
- ise.data.feature_engineer.drop_outliers(data: DataFrame, column: str, method: str, expression: List[tuple] = None, quantiles: List[float] = [0.01, 0.99])[source]
Removes outliers from the dataset based on a specified method.
- Parameters:
data (pd.DataFrame) - The dataset containing the column with potential outliers.
column (str) - The column to assess for outliers.
method (str) - The method of outlier detection (‘quantile’ or ‘explicit’).
expression (list of tuples, optional) - A list of conditions in the format [(column, operator, value)] for explicit filtering. Defaults to None.
quantiles (list of float, optional) - Quantiles for filtering when using the ‘quantile’ method. Defaults to [0.01, 0.99].
- Returns:
The dataset with outliers removed.
- Return type:
pd.DataFrame
- Raises:
AttributeError - If the method is ‘quantile’ but no quantiles are provided.
AttributeError - If the method is ‘explicit’ but no expression is provided.
ValueError - If the operator in the expression is not recognized.
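The 'quantile' strategy can be sketched as a simple bounds filter with pandas. This is a row-level simplification: per the description above, the real function drops entire simulations containing offending values, while this hypothetical helper drops only the offending rows.

```python
import pandas as pd

def drop_quantile_outliers(df, column, quantiles=(0.01, 0.99)):
    """Keep only rows whose `column` value lies within the given quantile
    bounds -- a row-level sketch of the 'quantile' strategy."""
    lower = df[column].quantile(quantiles[0])
    upper = df[column].quantile(quantiles[1])
    return df[(df[column] >= lower) & (df[column] <= upper)]

df = pd.DataFrame({"sle": range(100)})  # values 0..99
kept = drop_quantile_outliers(df, "sle", quantiles=(0.05, 0.95))
print(kept["sle"].min(), kept["sle"].max())  # 5 94
```

The 'explicit' method instead applies user-supplied `(column, operator, value)` conditions, which boils down to the same boolean-mask filtering with comparison operators resolved from the expression.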
- ise.data.feature_engineer.fill_mrro_nans(data: DataFrame, method) DataFrame[source]
Fills the NaN values in the specified columns with the given method.
- Parameters:
data (pd.DataFrame) - The input DataFrame.
method (str or int) - The method to fill NaN values. Must be one of ‘zero’, ‘mean’, ‘median’, or ‘drop’.
- Returns:
The DataFrame with NaN values filled according to the specified method.
- Return type:
pd.DataFrame
- Raises:
ValueError - If the method is not one of ‘zero’, ‘mean’, ‘median’, or ‘drop’.
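The four documented strategies and the ValueError contract can be sketched as a small dispatch function; the helper name `fill_nans` is illustrative, not the library's API.

```python
import numpy as np
import pandas as pd

def fill_nans(df, column="mrro", method="mean"):
    """Fill NaNs in `column` per the documented methods: 'zero', 'mean',
    'median', or 'drop'; any other method raises ValueError."""
    out = df.copy()
    if method == "zero":
        out[column] = out[column].fillna(0.0)
    elif method == "mean":
        out[column] = out[column].fillna(out[column].mean())
    elif method == "median":
        out[column] = out[column].fillna(out[column].median())
    elif method == "drop":
        out = out.dropna(subset=[column])
    else:
        raise ValueError(
            f"method must be 'zero', 'mean', 'median', or 'drop', got {method!r}"
        )
    return out

df = pd.DataFrame({"mrro": [1.0, np.nan, 3.0]})
print(fill_nans(df, method="mean")["mrro"].tolist())  # [1.0, 2.0, 3.0]
```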
- ise.data.feature_engineer.scale_data(data, scaler_path)[source]
Scales the provided dataset using a pre-trained scaler.
- Parameters:
data (pd.DataFrame) - The dataset to be scaled.
scaler_path (str) - Path to the saved scaler.
- Returns:
The scaled dataset.
- Return type:
pd.DataFrame
- ise.data.feature_engineer.split_training_data(data, train_size, val_size, test_size=None, output_directory=None, random_state=42)[source]
Splits the dataset into training, validation, and test sets.
- Parameters:
data (str or pd.DataFrame) - The dataset or path to the dataset to be split.
train_size (float) - Proportion of data to use for training.
val_size (float) - Proportion of data to use for validation.
test_size (float, optional) - Proportion of data to use for testing. Defaults to the remainder.
output_directory (str, optional) - Directory to save the split datasets as CSV files. Defaults to None.
random_state (int, optional) - Seed for reproducibility. Defaults to 42.
- Returns:
Training, validation, and test datasets as pandas DataFrames.
- Return type:
tuple
- Raises:
ValueError - If the dataset length is not divisible by 86, indicating incomplete projections.
ValueError - If the dataset does not contain an ‘id’ column.
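The two ValueError checks above suggest the split operates on whole simulations (86-row projections keyed by an 'id' column) rather than individual rows. A minimal sketch of that grouped split, assuming the sizes are applied to the count of unique ids:

```python
import numpy as np
import pandas as pd

def split_by_id(df, train_size=0.7, val_size=0.15, random_state=42):
    """Split whole simulations (grouped by 'id') so no 86-year projection
    is torn across the train/val/test boundary."""
    rng = np.random.default_rng(random_state)
    ids = df["id"].unique()
    rng.shuffle(ids)
    n_train = int(len(ids) * train_size)
    n_val = int(len(ids) * val_size)
    train_ids = set(ids[:n_train])
    val_ids = set(ids[n_train : n_train + n_val])
    train = df[df["id"].isin(train_ids)]
    val = df[df["id"].isin(val_ids)]
    test = df[~df["id"].isin(train_ids | val_ids)]
    return train, val, test

# 10 simulations x 86 timesteps (2015-2100) each
df = pd.DataFrame({
    "id": np.repeat(np.arange(10), 86),
    "year": np.tile(np.arange(2015, 2101), 10),
})
train, val, test = split_by_id(df)
print(len(train) // 86, len(val) // 86, len(test) // 86)  # 7 1 2
```

Splitting by id rather than by row is the design point worth noting: a naive row-level shuffle would place timesteps from the same projection in both train and test, leaking information.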
ise.data.process
- class ise.data.process.DatasetMerger(ice_sheet, forcings, projections, experiment_file, output_dir)[source]
Bases:
object
A class for merging datasets from forcing and projection files to create a unified dataset for analysis.
- Parameters:
ice_sheet (str) - The ice sheet name (‘AIS’ or ‘GrIS’).
forcings (str) - The directory path containing forcing files.
projections (str) - The directory path containing projection files.
experiment_file (str) - The file path to the experiment metadata (CSV or JSON).
output_dir (str) - The directory path to save the merged dataset.
- experiments
The experiment metadata loaded from the provided file.
- Type:
pd.DataFrame
- forcing_paths
List of file paths for forcing datasets.
- Type:
list
- projection_paths
List of file paths for projection datasets.
- Type:
list
- forcing_metadata
Metadata about forcing files, including CMIP model and pathway.
- Type:
pd.DataFrame
- merge_dataset()[source]
Merges the forcing and projection datasets into a single structured dataset.
- class ise.data.process.ProjectionProcessor(ice_sheet, forcings_directory, projections_directory, scalefac_path=None, densities_path=None)[source]
Bases:
object
A class for processing ISMIP6 projections (outputs) for ice sheet models, specifically for calculating Ice Volume Above Flotation (IVAF), handling control projections, and processing experimental projections.
- Parameters:
ice_sheet (str) - The ice sheet being analyzed (‘AIS’ or ‘GIS’).
forcings_directory (str) - Path to the directory containing forcing datasets.
projections_directory (str) - Path to the directory containing projection datasets.
scalefac_path (str, optional) - Path to the NetCDF file containing scaling factors for each grid cell. Defaults to None.
densities_path (str, optional) - Path to the CSV file containing density data for models. Defaults to None.
- forcings_directory
Path to forcing data.
- Type:
str
- projections_directory
Path to projection data.
- Type:
str
- densities_path
Path to density dataset.
- Type:
str
- scalefac_path
Path to scaling factor dataset.
- Type:
str
- ice_sheet
Ice sheet identifier (‘AIS’ or ‘GIS’).
- Type:
str
- resolution
Resolution of the dataset (5 for GIS, 8 for AIS).
- Type:
int
- process()[source]
Processes ISMIP6 projections by calculating IVAF and subtracting control projections.
- _calculate_ivaf_minus_control()[source]
Computes IVAF and subtracts control values for experimental projections.
- ise.data.process.combine_gris_forcings(forcing_dir)[source]
Combines GrIS forcings from multiple CMIP model directories into consolidated NetCDF files.
- Parameters:
forcing_dir (str) - Directory containing the GrIS forcing files.
- Returns:
0 upon successful processing.
- Return type:
int
- ise.data.process.convert_and_subset_times(dataset)[source]
Converts time variables in an xarray dataset to a uniform format and subsets time to the range 2015-2100.
- Parameters:
dataset (xarray.Dataset) - The dataset with time values to be converted and subset.
- Returns:
The dataset with standardized time format and subset to the correct time range.
- Return type:
xarray.Dataset
- Raises:
ValueError - If time values are not in a recognizable format.
- ise.data.process.get_model_densities(zenodo_directory: str, output_path: str = None)[source]
Extracts density values (rhoi and rhow) from NetCDF files in the specified directory and returns them in a pandas DataFrame.
- Parameters:
zenodo_directory (str) - Path to the directory containing the NetCDF files.
output_path (str, optional) - Path to save the extracted density values as a CSV file. Defaults to None.
- Returns:
A DataFrame containing the group, model, rhoi, and rhow values for each model run.
- Return type:
pandas.DataFrame
- ise.data.process.get_xarray_data(dataset_fp, var_name=None, ice_sheet='AIS', convert_and_subset=False)[source]
Retrieves and processes data from an xarray dataset.
- Parameters:
dataset_fp (str) - The file path to the xarray dataset.
var_name (str, optional) - The name of the variable to retrieve from the dataset. Defaults to None.
ice_sheet (str, optional) - The ice sheet type (‘AIS’ or ‘GrIS’). Defaults to ‘AIS’.
convert_and_subset (bool, optional) - If True, converts and subsets the dataset for the target time range. Defaults to False.
- Returns:
The extracted variable as a NumPy array or the entire processed dataset.
- Return type:
np.ndarray or xarray.Dataset
- ise.data.process.interpolate_values(data)[source]
Interpolates missing values in the x and y dimensions of the input dataset using linear interpolation. Ensures that first and last values are properly adjusted to maintain consistency.
- Parameters:
data (xarray.Dataset) - A dataset containing x and y dimensions with potential missing values.
- Returns:
A tuple containing the interpolated x and y arrays.
- Return type:
tuple
- ise.data.process.merge_datasets(forcings, projections, experiments_file, ice_sheet='AIS', export_directory=None)[source]
Merges forcing and projection datasets using experiment metadata.
- Parameters:
forcings (pd.DataFrame) - Forcing dataset.
projections (pd.DataFrame) - Projection dataset.
experiments_file (str or pd.DataFrame) - Path to the experiment metadata file or a DataFrame.
ice_sheet (str, optional) - The ice sheet type (‘AIS’ or ‘GrIS’). Defaults to ‘AIS’.
export_directory (str, optional) - Directory to save the merged dataset. Defaults to None.
- Returns:
The merged dataset containing forcing, projection, and metadata.
- Return type:
pandas.DataFrame
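The join logic can be sketched with plain pandas merges: experiment metadata maps each projection's experiment id to the CMIP model/pathway that selects its forcing. All column names and ids below (`exp_id`, `thermal_forcing`, `exp05`, ...) are hypothetical stand-ins, not the library's schema.

```python
import pandas as pd

forcings = pd.DataFrame({
    "model": ["CNRM-CM6-1", "UKESM1-0-LL"],
    "year": [2020, 2020],
    "thermal_forcing": [1.2, 1.5],
})
experiments = pd.DataFrame({
    "exp_id": ["exp05", "exp07"],
    "model": ["CNRM-CM6-1", "UKESM1-0-LL"],
})
projections = pd.DataFrame({
    "exp_id": ["exp05", "exp07"],
    "year": [2020, 2020],
    "sle": [0.01, 0.03],
})

# Attach metadata to each projection, then pull in the matching forcing
merged = (
    projections
    .merge(experiments, on="exp_id")
    .merge(forcings, on=["model", "year"])
)
print(merged.columns.tolist())
```

The resulting frame carries forcing inputs and projection targets side by side per (experiment, year), which is the shape an emulator training set needs.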
- ise.data.process.process_AIS_atmospheric_sectors(forcing_directory, grid_file)[source]
Processes atmospheric forcing data for AIS sectors, aggregating sector-level data.
- Parameters:
forcing_directory (str) - Directory containing atmospheric forcing data.
grid_file (str or xarray.Dataset) - Grid file defining sector boundaries.
- Returns:
DataFrame containing processed atmospheric forcing data for AIS sectors.
- Return type:
pandas.DataFrame
- ise.data.process.process_AIS_oceanic_sectors(forcing_directory, grid_file)[source]
Processes oceanic forcing data for AIS sectors, aggregating sector-level data for thermal forcing, salinity, and temperature.
- Parameters:
forcing_directory (str) - Directory containing oceanic forcing data.
grid_file (str or xarray.Dataset) - Grid file defining sector boundaries.
- Returns:
DataFrame containing processed oceanic forcing data for AIS sectors.
- Return type:
pandas.DataFrame
- ise.data.process.process_AIS_outputs(zenodo_directory, with_ctrl=False)[source]
Processes AIS model outputs by extracting Ice Volume Above Flotation (IVAF) data and computing sea-level equivalents.
- Parameters:
zenodo_directory (str) - Directory containing AIS output files.
with_ctrl (bool, optional) - If True, includes control projections. Defaults to False.
- Returns:
DataFrame containing processed AIS output data.
- Return type:
pandas.DataFrame
- ise.data.process.process_GrIS_atmospheric_sectors(forcing_directory, grid_file)[source]
Processes atmospheric forcing data for GrIS sectors, aggregating sector-level data.
- Parameters:
forcing_directory (str) - Directory containing atmospheric forcing data.
grid_file (str or xarray.Dataset) - Grid file defining sector boundaries.
- Returns:
DataFrame containing processed atmospheric forcing data for GrIS sectors.
- Return type:
pandas.DataFrame
- ise.data.process.process_GrIS_oceanic_sectors(forcing_directory, grid_file)[source]
Processes oceanic forcing data for GrIS sectors, aggregating sector-level data for thermal forcing and basin runoff.
- Parameters:
forcing_directory (str) - Directory containing oceanic forcing data.
grid_file (str or xarray.Dataset) - Grid file defining sector boundaries.
- Returns:
DataFrame containing processed oceanic forcing data for GrIS sectors.
- Return type:
pandas.DataFrame
- ise.data.process.process_GrIS_outputs(zenodo_directory)[source]
Processes GrIS model outputs by extracting Ice Volume Above Flotation (IVAF) data and computing sea-level equivalents.
- Parameters:
zenodo_directory (str) - Directory containing GrIS output files.
- Returns:
DataFrame containing processed GrIS output data.
- Return type:
pandas.DataFrame
- ise.data.process.process_sectors(ice_sheet, forcing_directory, grid_file, zenodo_directory, experiments_file, export_directory=None, overwrite=False, with_ctrl=False)[source]
Processes sector-based datasets by merging atmospheric, oceanic, and projection data for the given ice sheet.
- Parameters:
ice_sheet (str) - The ice sheet being processed (‘AIS’ or ‘GrIS’).
forcing_directory (str) - Directory containing forcing data.
grid_file (str) - Path to the grid file defining sectors.
zenodo_directory (str) - Directory containing projection data.
experiments_file (str) - Path to the experiment metadata file.
export_directory (str, optional) - Directory to save processed datasets. Defaults to None.
overwrite (bool, optional) - If True, overwrites existing datasets. Defaults to False.
with_ctrl (bool, optional) - If True, includes control projections. Defaults to False.
- Returns:
The final merged dataset.
- Return type:
pandas.DataFrame
ise.data.scaler
- class ise.data.scaler.LogScaler(epsilon=1e-08)[source]
Bases:
Module
A class for scaling input data using a logarithmic transformation, ensuring all values are positive by applying a shift.
- Parameters:
epsilon (float, optional) - A small constant to avoid log(0) errors. Defaults to 1e-8.
- epsilon
A small constant to avoid log(0) errors.
- Type:
float
- min_value
The minimum value in the dataset used for shifting.
- Type:
float
- device
The device (CPU or GPU) on which calculations are performed.
- Type:
torch.device
- fit(X)[source]
Computes the minimum value in the dataset to ensure all values remain positive during transformation.
- Parameters:
X (torch.Tensor) - The input data to be scaled.
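A NumPy analogue of this shift-then-log scheme, hedged: the exact shift formula (`x - min + epsilon`) is an assumption consistent with the documented `min_value` and `epsilon` attributes, and the real class operates on torch tensors.

```python
import numpy as np

class LogScalerSketch:
    """Shift by the fitted minimum so all values are positive, then take
    log(x - min + epsilon); inverse undoes both steps."""
    def __init__(self, epsilon=1e-8):
        self.epsilon = epsilon
        self.min_value = None

    def fit(self, X):
        self.min_value = X.min()
        return self

    def transform(self, X):
        return np.log(X - self.min_value + self.epsilon)

    def inverse_transform(self, Z):
        return np.exp(Z) + self.min_value - self.epsilon

X = np.array([-3.0, 0.0, 5.0])
scaler = LogScalerSketch().fit(X)
roundtrip = scaler.inverse_transform(scaler.transform(X))
print(np.allclose(roundtrip, X))  # True
```

The `epsilon` term is what keeps `log` finite at the fitted minimum, where `x - min` is exactly zero.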
- class ise.data.scaler.RobustScaler[source]
Bases:
Module
A class for scaling input data using the median and interquartile range (IQR), making it robust to outliers.
- median_
The median values of the input data.
- Type:
torch.Tensor
- iqr_
The interquartile range (IQR) values of the input data.
- Type:
torch.Tensor
- device
The device (CPU or GPU) on which the calculations are performed.
- Type:
torch.device
- fit(X)[source]
Computes the median and interquartile range (IQR) of the input data.
- Parameters:
X (torch.Tensor) - The input data to be scaled.
- class ise.data.scaler.StandardScaler[source]
Bases:
Module
A class for scaling input data using mean and standard deviation.
- mean_
The mean values of the input data.
- Type:
torch.Tensor
- scale_
The standard deviation values of the input data.
- Type:
torch.Tensor
- device
The device (CPU or GPU) on which the calculations are performed.
- Type:
torch.device
- fit(X)[source]
Computes the mean and standard deviation of the input data.
- Parameters:
X (torch.Tensor) - The input data to be scaled.
- inverse_transform(X)[source]
Reverses the scaling operation on the input data.
- Parameters:
X (torch.Tensor) - The scaled input data to be transformed back.
- Returns:
The transformed input data.
- Return type:
torch.Tensor
- Raises:
RuntimeError - If the Scaler instance is not fitted yet.
- static load(path)[source]
Loads the mean and standard deviation from a file.
- Parameters:
path (str) - The path to load the file from.
- Returns:
A Scaler instance with the loaded mean and standard deviation.
- Return type:
Scaler
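The fit / transform / inverse_transform contract documented above can be sketched in NumPy, including the documented RuntimeError when the scaler is used before fitting. The real class works on torch tensors and tracks a device; this stand-in keeps only the numerics.

```python
import numpy as np

class StandardScalerSketch:
    """NumPy analogue of the StandardScaler contract: z = (x - mean) / std,
    with inverse_transform recovering the original values."""
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        if self.mean_ is None:
            raise RuntimeError("Scaler instance is not fitted yet.")
        return (X - self.mean_) / self.scale_

    def inverse_transform(self, Z):
        if self.mean_ is None:
            raise RuntimeError("Scaler instance is not fitted yet.")
        return Z * self.scale_ + self.mean_

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScalerSketch().fit(X)
Z = scaler.transform(X)
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(scaler.inverse_transform(Z), X))
```

Fitting on the training split only, then reusing the stored `mean_` / `scale_` (as the `save`/`load` pair implies) on validation and test data, is what keeps the evaluation free of leakage.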