Pre-processing
Multiple Aspect Trajectory Tools Framework
MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining
This application supports users in the classification of multiple aspect trajectories, specifically in extracting and visualizing movelets, the parts of a trajectory that best discriminate a class. It brings the fragmented approaches available for multiple aspect trajectories, and for multidimensional sequence classification in general, into a single web-based system and Python library. It offers both movelet visualization and classification methods.
Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)
@author: Tarlis Portela
- matdata.preprocess.countClasses(data_path, folder, file='train.csv', tid_col='tid', class_col='label', markd=False)
Counts the occurrences of each class label in a dataset.
Parameters:
- data_path : str
The directory path where the dataset file is located.
- folder : str
The subfolder within the data path where the dataset file is located.
- file : str, optional (default='train.csv')
The name of the dataset file to be read.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- markd : bool, optional (default=False)
A flag indicating whether to return the class counts as a Markdown string instead of printing them.
Returns:
- pandas.DataFrame or str
If markd is False, prints the Markdown text and returns a DataFrame containing the counts of each class label in the dataset. If markd is True, returns a Markdown str with the counts of each class label in the dataset.
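As an illustration of the kind of count involved, the sketch below tallies trajectories per class label in plain Python. It is not the library's implementation: the sample rows and the helper `count_trajectories_per_class` are hypothetical, and it assumes that counts are taken per trajectory (not per point) and that each trajectory carries a single label.

```python
from collections import Counter

# Hypothetical sample: one dict per trajectory point.
rows = [
    {"tid": 1, "label": "A"},
    {"tid": 1, "label": "A"},
    {"tid": 2, "label": "B"},
    {"tid": 3, "label": "A"},
]

def count_trajectories_per_class(rows, tid_col="tid", class_col="label"):
    """Count distinct trajectories per class (assumption: one label per tid)."""
    # Collapse the points of each trajectory to a single (tid -> label) entry.
    label_by_tid = {row[tid_col]: row[class_col] for row in rows}
    return Counter(label_by_tid.values())

counts = count_trajectories_per_class(rows)
print(dict(counts))
```

Counting per trajectory rather than per point keeps long trajectories from inflating the apparent size of their class.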
- matdata.preprocess.datasetStatistics(data_path, folder, file_prefix='', tid_col='tid', class_col='label', to_file=False)
Computes statistics for a dataset, including summary statistics for each column and the class distribution, formatted as Markdown.
Parameters:
- data_path : str
The directory path where the dataset file(s) are located.
- folder : str
The subfolder within the data path where the dataset file(s) are located.
- file_prefix : str, optional (default='')
The prefix to be added to the dataset file names.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- to_file : bool or str, optional (default=False)
Where to save the statistics. If False, nothing is saved; if a str, it is used as the name of the output file.
Returns:
- str
If to_file is False, prints the Markdown and returns a str containing the computed statistics. If to_file is a str, returns the Markdown str and saves the statistics to a file with that name.
- matdata.preprocess.dfStats(df)
Computes summary statistics for each column in a DataFrame.
Parameters:
- df : pandas.DataFrame
The DataFrame for which statistics are to be computed.
Returns:
- pandas.DataFrame
A DataFrame containing summary statistics for each column, including mean, standard deviation, and variance. Columns are sorted by variance in descending order.
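The per-column summary described above can be sketched with the standard library alone. This is an illustration, not the library's code: the column data and the helper `column_stats` are hypothetical. It uses the sample variance (n - 1 denominator), which is also the pandas default for `var()`.

```python
import statistics

# Hypothetical numeric columns, stored as name -> list of values.
columns = {
    "speed": [10.0, 12.0, 14.0],
    "hour": [1.0, 1.0, 1.0],
    "temp": [20.0, 30.0, 40.0],
}

def column_stats(columns):
    """Return (name, mean, std, variance) tuples, largest variance first."""
    stats = [
        (name,
         statistics.mean(values),
         statistics.stdev(values),
         statistics.variance(values))
        for name, values in columns.items()
    ]
    # Sort on the variance field, in descending order.
    return sorted(stats, key=lambda s: s[3], reverse=True)

stats = column_stats(columns)
for name, mean, std, var in stats:
    print(f"{name}: mean={mean:.2f} std={std:.2f} var={var:.2f}")
```

Sorting by variance puts the most variable attributes first; a constant column such as `hour` here ends up last with a variance of zero.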
- matdata.preprocess.dfVariance(df)
Computes the variance for each column in a DataFrame.
Parameters:
- df : pandas.DataFrame
The DataFrame for which variance is to be computed.
Returns:
- pandas.Series
A Series containing the variance for each column in the DataFrame.
- matdata.preprocess.featuresJSON(df, version=1, deftype='nominal', defcomparator='equals', tid_col='tid', label_col='label', file=False)
Generates a JSON representation of features from a DataFrame.
Parameters:
- df : pandas.DataFrame
The DataFrame containing the dataset.
- version : int, optional (default=1)
The version number of the JSON schema (1 for MASTERMovelets format, 2 for HiPerMovelets format).
- deftype : str, optional (default='nominal')
The default type of features.
- defcomparator : str, optional (default='equals')
The default comparator for features.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- label_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- file : bool or str, optional (default=False)
Where to save the JSON representation. If False, nothing is saved; if a str, it is used as the name of the output file.
Returns:
- str
If file is False, returns a str representing the features in JSON format. If file is a str, returns the JSON str and also saves it to a file with that name.
- matdata.preprocess.joinTrainTest(dir_path, train_file='train.csv', test_file='test.csv', tid_col='tid', class_col='label', to_file=False)
Joins training and testing datasets from separate files into a single DataFrame.
Parameters:
- dir_path : str
The directory path where the training and testing files are located.
- train_file : str, optional (default='train.csv')
The name of the training file to be read.
- test_file : str, optional (default='test.csv')
The name of the testing file to be read.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- to_file : bool, optional (default=False)
A flag indicating whether to save the joined DataFrame to a file named 'joined.csv'.
Returns:
- pandas.DataFrame
A DataFrame containing the joined training and testing data. If to_file is True, the DataFrame is also saved to a file named 'joined.csv'.
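The joining step can be pictured as reading both files and concatenating their rows. The sketch below does that with the standard `csv` module on in-memory text; the file contents and the helper `join_csv` are hypothetical, and any extra processing the library may apply (for example sorting or re-indexing) is not reproduced.

```python
import csv
import io

# Hypothetical file contents standing in for train.csv and test.csv.
train_csv = "tid,label,poi\n1,A,home\n1,A,work\n"
test_csv = "tid,label,poi\n2,B,gym\n"

def join_csv(*texts):
    """Read each CSV text and return all rows as a single list of dicts."""
    joined = []
    for text in texts:
        joined.extend(csv.DictReader(io.StringIO(text)))
    return joined

joined = join_csv(train_csv, test_csv)
print(len(joined), "rows")
```

Note that `csv.DictReader` yields every value as a string, so numeric columns would need explicit conversion in a real pipeline.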
- matdata.preprocess.kfold_trainTestSplit(df, k, random_num=1, tid_col='tid', class_col='label', fileprefix='', columns_order=None, ktrain=None, ktest=None, mat_columns=None, data_path='.', outformats=[], verbose=False)
Splits a DataFrame into k folds for k-fold cross-validation, optionally organizes the columns, and saves the folds to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be split into k folds.
- k : int
The number of folds for cross-validation.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving, for example: 'specific_' or 'generic_'.
- columns_order : list of str, optional
A list of column names specifying the desired order of columns. If None, no reordering is performed.
- ktrain : list of pandas.DataFrame, optional
A list of training sets for each fold. If None, the function will split the data into training and testing sets.
- ktest : list of pandas.DataFrame, optional
A list of testing sets for each fold. If None, the function will split the data into training and testing sets.
- mat_columns : list of str, optional
A list of column names to be included in the .mat files, corresponding to columns_order.
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
- verbose : bool, optional (default=False)
A flag indicating whether to display progress messages.
Returns:
- ktrain : list of pandas.DataFrame
A list of DataFrames containing the training set for each fold.
- ktest : list of pandas.DataFrame
A list of DataFrames containing the testing set for each fold.
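To make the k-fold idea concrete, here is a small sketch that splits trajectory identifiers into k folds while keeping class proportions roughly equal in each fold. It is an assumption-laden illustration, not the library's algorithm: the helper `kfold_tids`, the round-robin assignment, and the sample data are all hypothetical.

```python
import random
from collections import defaultdict

def kfold_tids(labels_by_tid, k, seed=1):
    """Return k (train_tids, test_tids) pairs with balanced classes per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    folds = [[] for _ in range(k)]
    for tids in by_class.values():
        rng.shuffle(tids)
        # Deal the shuffled tids of each class round-robin across the folds.
        for i, tid in enumerate(tids):
            folds[i % k].append(tid)

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(t for j, fold in enumerate(folds) if j != i for t in fold)
        splits.append((train, test))
    return splits

# Hypothetical data: six trajectories, two classes.
labels_by_tid = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
splits = kfold_tids(labels_by_tid, k=3)
for train, test in splits:
    print("train:", train, "test:", test)
```

Splitting on trajectory identifiers rather than on individual rows keeps all points of a trajectory in the same fold, which avoids leaking parts of a test trajectory into the training set.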
- matdata.preprocess.klabels_stratify(df, kl=10, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.')
Stratifies a DataFrame by a specified number of class labels and splits it into training and testing sets, optionally organizes columns, and saves them to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- kl : int, optional (default=10)
The number of class labels to stratify the DataFrame.
- train_size : float, optional (default=0.7)
The proportion of the stratified dataset to include in the training set.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns before saving.
- mat_columns : list of str, optional (unused for now)
A list of column names to be included in the .mat files, if set to save.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
Returns:
- train : pandas.DataFrame
A DataFrame containing the training set.
- test : pandas.DataFrame
A DataFrame containing the testing set.
- matdata.preprocess.organizeFrame(df, columns_order=None, tid_col='tid', class_col='label', make_spatials=False)
Organizes a DataFrame by reordering columns and optionally converting spatial columns.
Parameters:
- df : pandas.DataFrame
The DataFrame to be organized.
- columns_order : list of str, optional
A list of column names specifying the desired order of columns. If None, no reordering is performed.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- make_spatials : bool, optional (default=False)
A flag indicating whether to generate both spatial representations: separate lat/lon columns and the space format, in which lat and lon are concatenated in a single column.
Returns:
- pandas.DataFrame
A DataFrame containing the organized data, with columns added as specified and spatial columns converted if requested.
- columns_order_zip : list
A list of the columns with the space column, if present.
- columns_order_csv : list
A list of the columns with the lat/lon columns, if present.
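The relationship between the two spatial representations can be sketched as follows. The column names `lat`, `lon`, and `space`, the single-space separator, and both helper functions are assumptions made for illustration; the library's actual naming and separator may differ.

```python
# Hypothetical rows with separate latitude and longitude columns.
rows = [
    {"tid": 1, "lat": 40.7, "lon": -74.0, "label": "A"},
    {"tid": 1, "lat": 40.8, "lon": -73.9, "label": "A"},
]

def add_space_column(rows, lat_col="lat", lon_col="lon", space_col="space"):
    """Return copies of the rows with lat/lon joined into one column.

    Assumption: the two values are separated by a single space.
    """
    return [{**row, space_col: f"{row[lat_col]} {row[lon_col]}"} for row in rows]

def reorder(row, columns_order):
    """Return a dict whose keys follow columns_order."""
    return {col: row[col] for col in columns_order}

with_space = add_space_column(rows)
ordered = [reorder(row, ["space", "tid", "label"]) for row in with_space]
print(ordered[0])
```

Here the reordered rows keep only the concatenated column, which mirrors the idea of one column list for the space format and another for the separate lat/lon format.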
- matdata.preprocess.readDataset(data_path, folder=None, file='train.csv', class_col='label', tid_col='tid', missing='?')
Reads a dataset file (CSV format by default, 'train.csv') and returns it as a pandas DataFrame.
Parameters:
- data_path : str
The directory path where the dataset file is located.
- folder : str, optional
The subfolder within the data path where the dataset file is located.
- file : str, optional (default='train.csv')
The name of the dataset file to be read.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- missing : str, optional (default='?')
The placeholder for missing values in the dataset.
Returns:
- pandas.DataFrame
A DataFrame containing the dataset from the specified file, with trajectory identifier, class label, and missing values handled as specified.
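One plausible reading of the missing-value placeholder is shown below: cells equal to the placeholder are treated as absent. The sketch uses the standard `csv` module on hypothetical in-memory text, and the helper `read_rows` is not part of the library; how the library itself represents missing values is not stated here.

```python
import csv
import io

# Hypothetical CSV content; '?' marks a missing value.
text = "tid,label,poi,rating\n1,A,home,?\n1,A,work,4.5\n"

def read_rows(text, missing="?"):
    """Parse CSV text into dicts, mapping the missing placeholder to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]

rows = read_rows(text)
print(rows[0])
```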
- matdata.preprocess.stratify(df, sample_size=0.5, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.')
Stratifies a DataFrame by class label and splits it into training and testing sets, optionally organizes columns, and saves them to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- sample_size : float, optional (default=0.5)
The proportion of the dataset to sample for stratification.
- train_size : float, optional (default=0.7)
The proportion of the stratified dataset to include in the training set.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns before saving.
- mat_columns : list of str, optional (unused for now)
A list of column names to be included in the .mat files, if set to save.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
Returns:
- train : pandas.DataFrame
A DataFrame containing the training set.
- test : pandas.DataFrame
A DataFrame containing the testing set.
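The sampling step controlled by sample_size can be sketched as drawing the same proportion of trajectories from every class. This is an illustration under assumptions, not the library's code: the helper `sample_tids_per_class`, the rounding rule, and the "at least one per class" guard are hypothetical.

```python
import random
from collections import defaultdict

def sample_tids_per_class(labels_by_tid, sample_size=0.5, seed=1):
    """Keep roughly sample_size of the trajectories of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    kept = []
    for tids in by_class.values():
        # Assumption: round to the nearest count, but keep at least one.
        n = max(1, round(len(tids) * sample_size))
        kept.extend(rng.sample(tids, n))
    return sorted(kept)

# Hypothetical data: four trajectories of class A, two of class B.
labels_by_tid = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}
kept = sample_tids_per_class(labels_by_tid)
print(kept)
```

Sampling within each class preserves the original class proportions in the reduced dataset, which can then be divided into training and testing sets according to train_size.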
- matdata.preprocess.trainTestSplit(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', fileprefix='', data_path='.', outformats=[], verbose=False, organize_columns=True)
Splits a DataFrame into training and testing sets, optionally organizes columns, and saves them to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be split into training and testing sets.
- train_size : float, optional (default=0.7)
The proportion of the dataset to include in the training set.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving.
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'parquet']).
- verbose : bool, optional (default=False)
A flag indicating whether to display progress messages.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns before saving.
Returns:
- train : pandas.DataFrame
A DataFrame containing the training set.
- test : pandas.DataFrame
A DataFrame containing the testing set.
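A hold-out split at the trajectory level can be sketched in a few lines. The version below is a simplified stand-in, not the library's implementation: it assumes the split is made class by class on trajectory identifiers, and the helper `split_tids` and the sample data are hypothetical.

```python
import random
from collections import defaultdict

def split_tids(labels_by_tid, train_size=0.7, seed=1):
    """Split trajectory ids into train and test lists, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    train, test = [], []
    for tids in by_class.values():
        rng.shuffle(tids)
        # Number of trajectories of this class that go to the training set.
        cut = round(len(tids) * train_size)
        train.extend(tids[:cut])
        test.extend(tids[cut:])
    return sorted(train), sorted(test)

# Hypothetical data: ten trajectories per class.
labels_by_tid = {tid: ("A" if tid <= 10 else "B") for tid in range(1, 21)}
train, test = split_tids(labels_by_tid)
print(len(train), "train /", len(test), "test")
```

Fixing the seed plays the same role as random_num above: repeated runs with the same seed yield the same partition.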