Dataset
Multiple Aspect Trajectory Tools Framework
MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining
This application offers a tool to support users in classifying multiple aspect trajectories, specifically by extracting and visualizing movelets, the parts of a trajectory that best discriminate a class. It integrates the fragmented approaches available for multiple aspect trajectories, and for multidimensional sequence classification in general, into a single web-based and Python library system. It offers both movelet visualization and classification methods.
Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)
@author: Tarlis Portela
- matdata.dataset.load_ds(dataset='mat.FoursquareNYC', prefix='', missing='-999', sample_size=1, random_num=1)
Load a dataset for training or testing from a GitHub repository.
Parameters:
- dataset : str, optional
The name of the dataset to load (default ‘mat.FoursquareNYC’).
- prefix : str, optional
The prefix to be added to the dataset file name (default ‘’).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The loaded dataset with optional sampling.
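The interplay of sample_size and random_num can be illustrated with a small, dependency-free sketch. It assumes that sampling keeps or drops whole trajectories (all rows sharing a tid), which this reference does not state, and it uses a plain list of dicts instead of a pandas.DataFrame. The helper `sample_trajectories` is hypothetical and not part of matdata.

```python
import random

def sample_trajectories(rows, tid_col="tid", sample_size=1.0, random_num=1):
    """Keep a reproducible fraction of trajectories (hypothetical helper)."""
    # Unique trajectory IDs, in first-seen order.
    tids = list(dict.fromkeys(row[tid_col] for row in rows))
    if sample_size >= 1 or not tids:
        return list(rows)  # nothing to sample
    rng = random.Random(random_num)  # fixed seed: same sample on every run
    n_keep = max(1, round(len(tids) * sample_size))
    keep = set(rng.sample(tids, n_keep))
    # Assumption: a selected trajectory keeps all of its points, in order.
    return [row for row in rows if row[tid_col] in keep]

# Toy data: 10 trajectories with 3 points each.
rows = [{"tid": t, "poi": f"p{t}_{i}"} for t in range(10) for i in range(3)]
sample = sample_trajectories(rows, sample_size=0.5, random_num=1)
```

Calling the helper again with the same random_num returns the same trajectories, which is the behaviour a random seed is meant to provide.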
- matdata.dataset.load_ds_holdout(dataset='mat.FoursquareNYC', train_size=0.7, prefix='', missing='-999', sample_size=1, random_num=1)
Load a dataset for training and testing with a holdout method from a GitHub repository.
Parameters:
- dataset : str, optional
The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’). Use the format category.DatasetName.
- train_size : float, optional
The proportion of the dataset to include in the training set (default 0.7).
- prefix : str, optional
The prefix to be added to the dataset file name (default ‘’).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- train : pandas.DataFrame
The training dataset.
- test : pandas.DataFrame
The testing dataset.
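As a rough illustration of what a holdout split with train_size=0.7 means for trajectory data, the sketch below divides trajectory IDs into two disjoint groups. It assumes the split is made per trajectory and is not stratified by class; the library may do either differently. `holdout_split` is a hypothetical helper, and rows are plain dicts rather than a pandas.DataFrame.

```python
import random

def holdout_split(rows, tid_col="tid", train_size=0.7, random_num=1):
    """Split rows into train/test by trajectory ID (hypothetical helper)."""
    tids = list(dict.fromkeys(row[tid_col] for row in rows))
    random.Random(random_num).shuffle(tids)  # seeded, so reproducible
    n_train = round(len(tids) * train_size)
    train_ids = set(tids[:n_train])
    # Every point of a trajectory lands on the same side of the split.
    train = [r for r in rows if r[tid_col] in train_ids]
    test = [r for r in rows if r[tid_col] not in train_ids]
    return train, test

# Toy data: 10 trajectories with 2 points each.
rows = [{"tid": t, "step": i} for t in range(10) for i in range(2)]
train, test = holdout_split(rows, train_size=0.7)
```

Splitting on tid rather than on individual rows avoids leaking points of one trajectory into both sets.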
- matdata.dataset.load_ds_kfold(dataset='mat.FoursquareNYC', k=5, prefix='', missing='-999', sample_size=1, random_num=1)
Load a dataset for k-fold cross-validation from a GitHub repository.
Parameters:
- dataset : str, optional
The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’).
- k : int, optional
The number of folds for cross-validation (default 5).
- prefix : str, optional
The prefix to be added to the dataset file name (default ‘’).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- ktrain : list of pandas.DataFrame
The training datasets for each fold.
- ktest : list of pandas.DataFrame
The testing datasets for each fold.
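The shape of the returned ktrain and ktest lists can be pictured with the following minimal sketch of k-fold splitting over trajectory IDs. It is not the library's implementation: it assumes folds are built per trajectory without class stratification, and `kfold_split` is a hypothetical helper working on plain dicts.

```python
import random

def kfold_split(rows, tid_col="tid", k=5, random_num=1):
    """Build k train/test pairs by trajectory ID (hypothetical helper)."""
    tids = list(dict.fromkeys(row[tid_col] for row in rows))
    random.Random(random_num).shuffle(tids)
    # Deal the shuffled IDs into k folds of near-equal size.
    folds = [tids[i::k] for i in range(k)]
    ktrain, ktest = [], []
    for fold in folds:
        held_out = set(fold)
        ktest.append([r for r in rows if r[tid_col] in held_out])
        ktrain.append([r for r in rows if r[tid_col] not in held_out])
    return ktrain, ktest

# Toy data: 10 trajectories with 2 points each.
rows = [{"tid": t, "step": i} for t in range(10) for i in range(2)]
ktrain, ktest = kfold_split(rows, k=5)
```

Entry i of ktrain pairs with entry i of ktest, and each trajectory appears in exactly one test fold.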
- matdata.dataset.prepare_ds(df, tid_col='tid', class_col=None, sample_size=1, random_num=1)
Prepare dataset for training or testing (helper function).
Parameters:
- df : pandas.DataFrame
The DataFrame containing the dataset.
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used for ordering data (default None).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The prepared dataset with optional sampling.
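The description of class_col suggests that the data is ordered by class label and trajectory ID. The sketch below shows one way such an ordering could work, assuming a sort by class first and tid second that leaves the points of each trajectory in their original sequence. The exact ordering applied by prepare_ds is not given here, and `order_rows` is a hypothetical helper.

```python
def order_rows(rows, tid_col="tid", class_col=None):
    """Order rows by (class, tid) or by tid alone (hypothetical helper)."""
    if class_col is None:
        key = lambda r: r[tid_col]
    else:
        key = lambda r: (r[class_col], r[tid_col])
    # sorted() is stable, so points within a trajectory keep their order.
    return sorted(rows, key=key)

rows = [
    {"tid": 2, "label": "b", "poi": "x"},
    {"tid": 1, "label": "b", "poi": "y"},
    {"tid": 3, "label": "a", "poi": "z"},
    {"tid": 2, "label": "b", "poi": "w"},
]
by_tid = order_rows(rows)
by_class = order_rows(rows, class_col="label")
```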
- matdata.dataset.read_ds(data_file, tid_col='tid', class_col=None, missing='-999', sample_size=1, random_num=1)
Read a dataset from a file.
Parameters:
- data_file : str
The path to the dataset file.
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The read dataset.
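To make the role of the missing placeholder concrete, here is a small standard-library sketch that reads CSV text and writes the placeholder into empty cells. This is an assumption about how the placeholder is applied; the reference only says that the value denotes missing data. `read_rows` is a hypothetical helper and does not reproduce read_ds.

```python
import csv
import io

def read_rows(text, missing="-999"):
    """Parse CSV text and mark empty cells as missing (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        # Empty strings (and absent trailing fields) become the placeholder.
        {col: (val if val not in ("", None) else missing)
         for col, val in row.items()}
        for row in reader
    ]

csv_text = "tid,label,poi,rating\n1,a,Cafe,4.5\n1,a,Park,\n2,b,,3.0\n"
rows = read_rows(csv_text)
```

Using one explicit placeholder keeps missing values distinguishable from real ones in later processing steps.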
- matdata.dataset.read_ds_5fold(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')
Read datasets for 5-fold cross-validation from files in a directory.
See Also:
read_ds_kfold : Read datasets for k-fold cross-validation.
Parameters:
- data_path : str
The path to the directory containing the dataset files.
- prefix : str, optional
The prefix of the dataset file names (default ‘specific’).
- suffix : str, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
Returns:
- 5_train : list of pandas.DataFrame
The training datasets for each fold.
- 5_test : list of pandas.DataFrame
The testing datasets for each fold.
- matdata.dataset.read_ds_holdout(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999', fold=None)
Read datasets for holdout validation from files in a directory.
Parameters:
- data_path : str
The path to the directory containing the dataset files.
- prefix : str, optional
The prefix of the dataset file names (default ‘specific’).
- suffix : str, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- fold : int or None, optional
The fold number to load for holdout validation, which selects a subdirectory of data_path (e.g., run1). If None, the files are read directly from data_path.
Returns:
- train : pandas.DataFrame
The training dataset.
- test : pandas.DataFrame
The testing dataset.
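How data_path, prefix, suffix and fold might combine into file locations can be sketched as below. The `<prefix>_train<suffix>` and `<prefix>_test<suffix>` file names and the `run<fold>` subdirectory pattern are assumptions inferred from the parameter descriptions (only the run1 example appears in this reference). `holdout_paths` is a hypothetical helper, not a matdata function.

```python
from pathlib import Path

def holdout_paths(data_path, prefix="specific", suffix=".csv", fold=None):
    """Guess the train/test file locations (hypothetical naming scheme)."""
    base = Path(data_path)
    if fold is not None:
        base = base / f"run{fold}"  # assumed layout: one folder per fold
    stem = f"{prefix}_" if prefix else ""
    return base / f"{stem}train{suffix}", base / f"{stem}test{suffix}"

train_path, test_path = holdout_paths("data/FoursquareNYC", fold=1)
flat_train, flat_test = holdout_paths("data/FoursquareNYC")
```

Checking the actual directory layout of a dataset before relying on this naming scheme is advisable.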
- matdata.dataset.read_ds_kfold(data_path, k=5, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')
Read datasets for k-fold cross-validation from files in a directory.
Parameters:
- data_path : str
The path to the directory containing the dataset files.
- k : int, optional
The number of folds for cross-validation (default 5).
- prefix : str, optional
The prefix of the dataset file names (default ‘specific’).
- suffix : str, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
Returns:
- ktrain : list of pandas.DataFrame
The training datasets for each fold.
- ktest : list of pandas.DataFrame
The testing datasets for each fold.