Pre-processing

Multiple Aspect Trajectory Tools Framework

MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining

This application offers a tool to support users in classifying multiple aspect trajectories, specifically by extracting and visualizing movelets, the parts of a trajectory that best discriminate a class. It integrates the fragmented approaches available for multiple aspect trajectories, and for multidimensional sequence classification in general, into a single web-based system and Python library. It offers both movelet visualization and classification methods.

Created on Dec, 2023. Copyright (C) 2023. License GPL Version 3 or superior (see LICENSE file).

@author: Tarlis Portela


matdata.preprocess.countClasses(data_path, folder, file='train.csv', tid_col='tid', class_col='label', markd=False)

Counts the occurrences of each class label in a dataset.

Parameters:

data_path : str

The directory path where the dataset file is located.

folder : str

The subfolder within the data path where the dataset file is located.

file : str, optional (default='train.csv')

The name of the dataset file to be read.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

markd : bool, optional (default=False)

A flag indicating whether to print the class counts in Markdown format.

Returns:

pandas.DataFrame or str

If markd is False, prints the class counts and returns a DataFrame containing the counts of each class label in the dataset. If markd is True, returns a Markdown str with the counts of each class label in the dataset.
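The counting idea can be illustrated without the library. The following is a minimal standard-library sketch, not the matdata implementation: it assumes that counts are taken per trajectory (unique tid) rather than per point, and the toy rows and the helper names count_classes and to_markdown are hypothetical.

```python
from collections import Counter

# Toy point-level rows as (tid, label); each trajectory has several points.
rows = [
    (1, "walk"), (1, "walk"), (1, "walk"),
    (2, "bus"), (2, "bus"),
    (3, "walk"), (3, "walk"),
]

def count_classes(rows):
    """Count trajectories (unique tids) per class label."""
    label_of = {}
    for tid, label in rows:
        label_of[tid] = label  # assumption: one label per trajectory
    return Counter(label_of.values())

def to_markdown(counts):
    """Render the counts as a small Markdown table, largest class first."""
    lines = ["| Label | # |", "|---|---|"]
    for label, n in counts.most_common():
        lines.append(f"| {label} | {n} |")
    return "\n".join(lines)

counts = count_classes(rows)
print(to_markdown(counts))
```

Counting unique trajectory identifiers instead of rows avoids inflating classes whose trajectories simply have more points.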

matdata.preprocess.datasetStatistics(data_path, folder, file_prefix='', tid_col='tid', class_col='label', to_file=False)

Computes statistics for a dataset, including summary statistics for each column and the class distribution, formatted as Markdown.

Parameters:

data_path : str

The directory path where the dataset file(s) are located.

folder : str

The subfolder within the data path where the dataset file(s) are located.

file_prefix : str, optional (default='')

The prefix to be added to the dataset file names.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

to_file : bool or str, optional (default=False)

Whether to save the statistics to a file. If a str is given, it is used as the name of the output file.

Returns:

str

If to_file is False, prints the Markdown and returns a str containing the computed statistics. If to_file is a str, returns the Markdown str and also saves the statistics to the file named by to_file.

matdata.preprocess.dfStats(df)

Computes summary statistics for each column in a DataFrame.

Parameters:

df : pandas.DataFrame

The DataFrame for which statistics are to be computed.

Returns:

pandas.DataFrame

A DataFrame containing summary statistics for each column, including mean, standard deviation, and variance. Columns are sorted by variance in descending order.
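As an illustration of the summary described above, the sketch below computes mean, standard deviation, and variance per column and sorts by variance in descending order. It uses only the standard library instead of pandas, and it assumes sample statistics (n - 1 denominator); the source does not say which estimator dfStats uses. The toy table and the name column_stats are hypothetical.

```python
import statistics

# Toy table: column name -> numeric values (hypothetical data).
table = {
    "speed": [1.0, 2.0, 3.0, 4.0],
    "hour": [10.0, 10.0, 12.0, 12.0],
    "flag": [1.0, 1.0, 1.0, 1.0],
}

def column_stats(table):
    """Return (column, mean, std, variance) rows, highest variance first."""
    stats = []
    for name, values in table.items():
        stats.append((
            name,
            statistics.mean(values),
            statistics.stdev(values),     # sample standard deviation
            statistics.variance(values),  # sample variance (n - 1 denominator)
        ))
    return sorted(stats, key=lambda row: row[3], reverse=True)

for name, mean, std, var in column_stats(table):
    print(f"{name:6s} mean={mean:.2f} std={std:.2f} var={var:.2f}")
```

Sorting by variance puts constant columns such as flag at the bottom, which makes attributes that carry little information easy to spot.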

matdata.preprocess.dfVariance(df)

Computes the variance for each column in a DataFrame.

Parameters:

df : pandas.DataFrame

The DataFrame for which variance is to be computed.

Returns:

pandas.Series

A Series containing the variance for each column in the DataFrame.

matdata.preprocess.featuresJSON(df, version=1, deftype='nominal', defcomparator='equals', tid_col='tid', label_col='label', file=False)

Generates a JSON representation of features from a DataFrame.

Parameters:

df : pandas.DataFrame

The DataFrame containing the dataset.

version : int, optional (default=1)

The version number of the JSON schema (1 for MASTERMovelets format, 2 for HiPerMovelets format).

deftype : str, optional (default='nominal')

The default type of features.

defcomparator : str, optional (default='equals')

The default comparator for features.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

label_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

file : bool or str, optional (default=False)

Whether to save the JSON representation to a file. If a str is given, it is used as the name of the output file.

Returns:

str

If file is False, returns a str representing the features in JSON format. If file is a str, returns the JSON str and also saves the JSON representation to the file named by file.

matdata.preprocess.joinTrainTest(dir_path, train_file='train.csv', test_file='test.csv', tid_col='tid', class_col='label', to_file=False)

Joins training and testing datasets from separate files into a single DataFrame.

Parameters:

dir_path : str

The directory path where the training and testing files are located.

train_file : str, optional (default='train.csv')

The name of the training file to be read.

test_file : str, optional (default='test.csv')

The name of the testing file to be read.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

to_file : bool, optional (default=False)

A flag indicating whether to save the joined DataFrame to a file named 'joined.csv'.

Returns:

pandas.DataFrame

A DataFrame containing the joined training and testing data. If to_file is True, the DataFrame is also saved to a file named 'joined.csv'.
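The join can be pictured as a concatenation of the two point tables. The sketch below is a standard-library illustration rather than the library's code: the CSV contents are hypothetical, and ordering the result by class and then trajectory identifier is an assumption made here for readability, not a documented behaviour of joinTrainTest.

```python
import csv
import io

# Hypothetical file contents standing in for train.csv and test.csv.
train_csv = "tid,label,poi\n1,A,home\n1,A,work\n2,B,gym\n"
test_csv = "tid,label,poi\n3,A,park\n4,B,bar\n"

def read_rows(text):
    """Parse CSV text into a list of dicts, one per trajectory point."""
    return list(csv.DictReader(io.StringIO(text)))

def join_train_test(train_rows, test_rows, tid_col="tid", class_col="label"):
    """Concatenate both sets and order rows by class, then trajectory id."""
    joined = train_rows + test_rows
    # sorted() is stable, so the point order inside each trajectory is kept.
    return sorted(joined, key=lambda r: (r[class_col], int(r[tid_col])))

joined = join_train_test(read_rows(train_csv), read_rows(test_csv))
print([(r["tid"], r["label"]) for r in joined])
```

A stable sort matters here because the order of points within a trajectory carries meaning and must survive the join.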

matdata.preprocess.kfold_trainTestSplit(df, k, random_num=1, tid_col='tid', class_col='label', fileprefix='', columns_order=None, ktrain=None, ktest=None, mat_columns=None, data_path='.', outformats=[], verbose=False)

Splits a DataFrame into k folds for k-fold cross-validation, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be split into k folds.

k : int

The number of folds for cross-validation.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving, for example: 'specific_' or 'generic_'.

columns_order : list of str, optional

A list of column names specifying the desired order of columns. If None, no reordering is performed.

ktrain : list of pandas.DataFrame, optional

A list of training sets for each fold. If None, the function will split the data into training and testing sets.

ktest : list of pandas.DataFrame, optional

A list of testing sets for each fold. If None, the function will split the data into training and testing sets.

mat_columns : list of str, optional

A list of column names to be included in the .mat files, corresponding to columns_order.

data_path : str, optional (default='.')

The directory path where the output files will be saved.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).

verbose : bool, optional (default=False)

Returns:

ktrain : list of pandas.DataFrame

A list of DataFrames containing the training set of each fold.

ktest : list of pandas.DataFrame

A list of DataFrames containing the testing set of each fold.
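For trajectory data, folds are usually built over trajectory identifiers so that all points of one trajectory land in the same fold. The sketch below illustrates that idea with the standard library only. It assumes a class-stratified, round-robin assignment, which may differ from what kfold_trainTestSplit does internally; kfold_tids, train_test_pairs, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def kfold_tids(labels_by_tid, k, seed=1):
    """Assign trajectory ids to k folds, class by class (stratified)."""
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    rng = random.Random(seed)  # fixed seed gives reproducible folds
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        tids = by_class[label]
        rng.shuffle(tids)
        for i, tid in enumerate(tids):
            folds[i % k].append(tid)  # deal tids round-robin into the folds
    return folds

def train_test_pairs(folds):
    """For each fold: test is that fold, train is every other fold."""
    pairs = []
    for i, test in enumerate(folds):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        pairs.append((train, test))
    return pairs

# Hypothetical data: 6 trajectories of class A and 6 of class B.
labels_by_tid = {tid: ("A" if tid <= 6 else "B") for tid in range(1, 13)}
folds = kfold_tids(labels_by_tid, k=3)
for train, test in train_test_pairs(folds):
    print(len(train), len(test))
```

Each returned pair corresponds to one entry of the ktrain and ktest lists, expressed here as identifier lists instead of DataFrames.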

matdata.preprocess.klabels_stratify(df, kl=10, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.')

Stratifies a DataFrame by a specified number of class labels and splits it into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be stratified and split into training and testing sets.

kl : int, optional (default=10)

The number of class labels to stratify the DataFrame.

train_size : float, optional (default=0.7)

The proportion of the stratified dataset to include in the training set.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns before saving.

mat_columns : list of str, optional (unused for now)

A list of column names to be included in the .mat files, if set to save.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).

data_path : str, optional (default='.')

The directory path where the output files will be saved.

Returns:

train : pandas.DataFrame

A DataFrame containing the training set.

test : pandas.DataFrame

A DataFrame containing the testing set.

matdata.preprocess.organizeFrame(df, columns_order=None, tid_col='tid', class_col='label', make_spatials=False)

Organizes a DataFrame by reordering columns and optionally converting spatial columns.

Parameters:

df : pandas.DataFrame

The DataFrame to be organized.

columns_order : list of str, optional

A list of column names specifying the desired order of columns. If None, no reordering is performed.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

make_spatials : bool, optional (default=False)

A flag indicating whether to provide spatial columns in both formats: separate lat and lon columns, and the space format, in which lat and lon are concatenated in one column.

Returns:

pandas.DataFrame

A DataFrame containing the organized data, with columns added as specified and spatial columns converted if requested.

columns_order_zip : list

A list of the columns including the space column, if present.

columns_order_csv : list

A list of the columns including the lat/lon columns, if present.
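The two spatial layouts mentioned above can be converted into one another. The following is a rough sketch on plain dicts, assuming the space value is the latitude and longitude joined by a single space; that separator is an assumption, and split_space and join_space are hypothetical helper names, not part of matdata.

```python
def split_space(row, space_col="space"):
    """Turn one 'lat lon' column into separate lat and lon columns."""
    out = {k: v for k, v in row.items() if k != space_col}
    lat, lon = row[space_col].split(" ")  # assumed separator: one space
    out["lat"], out["lon"] = float(lat), float(lon)
    return out

def join_space(row, space_col="space"):
    """Concatenate the lat and lon columns into a single space column."""
    out = {k: v for k, v in row.items() if k not in ("lat", "lon")}
    out[space_col] = f"{row['lat']} {row['lon']}"
    return out

row = {"tid": 1, "space": "-27.5 -48.5", "label": "A"}
wide = split_space(row)
print(wide)
print(join_space(wide))
```

Keeping both layouts available is convenient when different downstream tools expect different column conventions, which is why two column lists are returned alongside the DataFrame.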

matdata.preprocess.readDataset(data_path, folder=None, file='train.csv', class_col='label', tid_col='tid', missing='?')

Reads a dataset file (by default the CSV file 'train.csv') and returns it as a pandas DataFrame.

Parameters:

data_path : str

The directory path where the dataset file is located.

folder : str, optional

The subfolder within the data path where the dataset file is located.

file : str, optional (default='train.csv')

The name of the dataset file to be read.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

missing : str, optional (default='?')

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the dataset from the specified file, with trajectory identifier, class label, and missing values handled as specified.
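The role of the missing placeholder can be shown with a small standard-library sketch. It is not the readDataset implementation: it parses hypothetical CSV text from memory and assumes that placeholder cells are simply mapped to None, while the library may represent missing values differently.

```python
import csv
import io

# Hypothetical CSV content; '?' marks a missing value.
text = "tid,label,poi,rating\n1,A,home,?\n1,A,work,4.5\n2,B,?,3.0\n"

def read_dataset(text, missing="?"):
    """Parse CSV text and replace the missing-value placeholder with None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (None if v == missing else v) for k, v in row.items()})
    return rows

rows = read_dataset(text)
print(rows[0])
```

Normalizing the placeholder at read time means later steps do not need to know which symbol the original file used for missing data.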

matdata.preprocess.stratify(df, sample_size=0.5, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.')

Stratifies a DataFrame by class label and splits it into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be stratified and split into training and testing sets.

sample_size : float, optional (default=0.5)

The proportion of the dataset to sample for stratification.

train_size : float, optional (default=0.7)

The proportion of the stratified dataset to include in the training set.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns before saving.

mat_columns : list of str, optional (unused for now)

A list of column names to be included in the .mat files, if set to save.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).

data_path : str, optional (default='.')

The directory path where the output files will be saved.

Returns:

train : pandas.DataFrame

A DataFrame containing the training set.

test : pandas.DataFrame

A DataFrame containing the testing set.
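The sampling step controlled by sample_size can be sketched as drawing the same fraction of trajectories from every class, so the reduced dataset keeps the original class proportions. This is a standard-library illustration under that assumption, not the library's code; stratified_sample and the toy labels are hypothetical, and keeping at least one trajectory per class is a choice made for this sketch.

```python
import random
from collections import defaultdict

def stratified_sample(labels_by_tid, sample_size=0.5, seed=1):
    """Keep the same fraction of trajectories from every class."""
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    rng = random.Random(seed)  # fixed seed gives a reproducible sample
    kept = []
    for label in sorted(by_class):
        tids = by_class[label]
        n = max(1, int(len(tids) * sample_size))  # at least one per class
        kept.extend(rng.sample(tids, n))
    return sorted(kept)

# Hypothetical data: 8 trajectories of class A and 4 of class B.
labels_by_tid = {tid: ("A" if tid <= 8 else "B") for tid in range(1, 13)}
sample = stratified_sample(labels_by_tid, sample_size=0.5)
print(len(sample))
```

The sampled identifiers would then be divided into training and testing sets according to train_size.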

matdata.preprocess.trainTestSplit(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', fileprefix='', data_path='.', outformats=[], verbose=False, organize_columns=True)

Splits a DataFrame into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be split into training and testing sets.

train_size : float, optional (default=0.7)

The proportion of the dataset to include in the training set.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving.

data_path : str, optional (default='.')

The directory path where the output files will be saved.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'parquet']).

verbose : bool, optional (default=False)

A flag indicating whether to display progress messages.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns before saving.

Returns:

train : pandas.DataFrame

A DataFrame containing the training set.

test : pandas.DataFrame

A DataFrame containing the testing set.
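A hold-out split of trajectory data is typically made over trajectory identifiers, class by class, so that no trajectory is divided between the two sets. The sketch below shows this with the standard library. It assumes a class-stratified split, which the source does not state explicitly for trainTestSplit; train_test_tids and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def train_test_tids(labels_by_tid, train_size=0.7, seed=1):
    """Split trajectory ids into train and test, keeping class proportions."""
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    rng = random.Random(seed)  # fixed seed gives a reproducible split
    train, test = [], []
    for label in sorted(by_class):
        tids = by_class[label]
        rng.shuffle(tids)
        cut = round(len(tids) * train_size)  # trajectories that go to train
        train.extend(tids[:cut])
        test.extend(tids[cut:])
    return train, test

# Hypothetical data: 10 trajectories of class A and 10 of class B.
labels_by_tid = {tid: ("A" if tid <= 10 else "B") for tid in range(1, 21)}
train, test = train_test_tids(labels_by_tid)
print(len(train), len(test))
```

Every point of the dataset would then be routed to the set that contains its trajectory identifier.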