MAT-data: Data Handler for Multiple Aspect Trajectory Data Mining [MAT-Tools Framework]¶

Sample code, as a Python notebook, for using mat-data as a Python library.

This package offers tools to support the user in preprocessing multiple aspect trajectory data and in generating synthetic datasets. It integrates into a unified framework for multiple aspect trajectories and, more generally, for multidimensional sequence data mining methods.

Created on Dec, 2023. Copyright (C) 2023, License GPL Version 3 or later (see LICENSE file)

In [ ]:
!pip install mat-data
#!pip install --upgrade mat-data

1. Reading local data¶

Sample code for reading a trajectory dataset from local files.

The simplest way to read data is to load a CSV file with pandas, such as:

In [8]:
import pandas as pd

data_path = 'matdata/assets/sample'

pd.read_csv(data_path + '/Foursquare_Sample.csv')
Out[8]:
tid lat_lon date_time time rating price weather day root_type type
0 126 40.8331652006224 -73.9418603427692 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Residence Home (private)
1 126 40.8340978041072 -73.9452672225881 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Food Deli / Bodega
2 126 40.8331652006224 -73.9418603427692 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Residence Home (private)
3 126 40.7646959283254 -73.8851974964414 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Food Fried Chicken Joint
4 126 40.7660790376824 -73.8835287094116 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Travel & Transport Bus Station
... ... ... ... ... ... ... ... ... ... ...
66957 29563 40.7047332789043 -73.9877378940582 2012-08-10 17:17:37 1037 -1.0 -1 Clouds Friday College & University General College & University
66958 29563 40.6951627360199 -73.9954478691072 2012-08-10 20:10:59 1210 8.0 2 Clouds Friday Food Thai Restaurant
66959 29563 40.6978026652822 -73.9941451630314 2012-08-11 08:01:20 481 6.9 -1 Clouds Saturday Outdoors & Recreation Gym
66960 29563 40.6946728967503 -73.9940820360805 2012-08-11 13:39:39 819 7.0 1 Clouds Saturday Food Coffee Shop
66961 29563 40.6978026652822 -73.9941451630314 2012-08-12 07:56:26 476 6.9 -1 Clouds Sunday Outdoors & Recreation Gym

66962 rows × 10 columns

mat-data provides modules to handle dataset reading in standard ways:

a) Read a dataset locally:

This example uses a .csv file; however, this function can read the .csv, .parquet, .zip, .ts, and .xes file formats.

In [9]:
from matdata.dataset import *
In [10]:
df = read_ds('matdata/assets/sample/Foursquare_Sample.csv')
df.head()
Out[10]:
lat_lon date_time time rating price weather day root_type type tid
0 40.8331652006224 -73.9418603427692 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Residence Home (private) 126
1 40.8340978041072 -73.9452672225881 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Food Deli / Bodega 126
2 40.8331652006224 -73.9418603427692 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Residence Home (private) 126
3 40.7646959283254 -73.8851974964414 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Food Fried Chicken Joint 126
4 40.7660790376824 -73.8835287094116 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Travel & Transport Bus Station 126

Optionally, you can use the standardized reading, which renames the given columns to the 'tid'/'label' nomenclature and sorts the trajectories:

In [11]:
df = read_ds('matdata/assets/sample/Foursquare_Sample.csv', tid_col='tid', class_col='root_type')
df.head()
Out[11]:
lat_lon date_time time rating price weather day type tid label
0 40.8331652006224 -73.9418603427692 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 126 Residence
1 40.8340978041072 -73.9452672225881 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 126 Food
2 40.8331652006224 -73.9418603427692 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 126 Residence
3 40.7646959283254 -73.8851974964414 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 126 Food
4 40.7660790376824 -73.8835287094116 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 126 Travel & Transport
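
Conceptually, this standardized reading amounts to renaming the class column to 'label', sorting by trajectory id, and moving the identifier columns to the end. A simplified pandas sketch of that behavior (illustrative only, not the library's exact implementation):

```python
import pandas as pd

# Toy data (illustrative only; real data comes from read_ds)
df = pd.DataFrame({
    'tid': [2, 2, 1, 1],
    'poi': ['Gym', 'Home', 'Deli', 'Home'],
    'root_type': ['Outdoors', 'Residence', 'Food', 'Residence'],
})

# Rename the chosen class column to the standard 'label' name
df = df.rename(columns={'root_type': 'label'})

# Sort points so each trajectory's records are grouped together
df = df.sort_values('tid', kind='stable').reset_index(drop=True)

# Move 'tid' and 'label' to the end, as in the standardized output
cols = [c for c in df.columns if c not in ('tid', 'label')] + ['tid', 'label']
df = df[cols]
```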

2. Loading Repository Data¶

This module loads data from the public Git repository mat-analysis datasets (v2_0).

Check the GitHub repository to see available datasets.

To use helpers for data loading, import from package matdata.dataset:

In [12]:
from matdata.dataset import *
a) First, you can load datasets by specifying the category (parent folder) and dataset name (subfolder):
In [13]:
# dataset='mat.FoursquareNYC' ## => default

df = load_ds(sample_size=0.25)
df
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
  0%|          | 0/193 [00:00<?, ?it/s]
Out[13]:
lat lon day hour poi category price rating weather tid label
0 40.834098 -73.945267 Monday 13 21580 Food 1 8.2 Clouds 127 6
1 40.567196 -73.882576 Monday 19 2392 Travel & Transport -999 -999.0 Clouds 127 6
2 40.689913 -73.981504 Monday 23 35589 Travel & Transport -999 -999.0 Clouds 127 6
3 40.708588 -73.991032 Monday 23 18603 Travel & Transport -999 -999.0 Clouds 127 6
4 40.833165 -73.941860 Tuesday 14 36348 Residence -999 -999.0 Clear 127 6
... ... ... ... ... ... ... ... ... ... ... ...
22118 40.704273 -73.986759 Saturday 10 25461 Outdoors & Recreation -999 8.6 Clear 29551 1070
22119 40.704733 -73.987738 Saturday 10 1805 College & University -999 -999.0 Clear 29551 1070
22120 40.717353 -73.960392 Saturday 13 32523 Food 1 9.3 Clear 29551 1070
22121 40.697721 -73.993020 Sunday 2 36212 College & University -999 -999.0 Clear 29551 1070
22122 40.697803 -73.994145 Sunday 8 16452 Outdoors & Recreation -999 6.9 Clear 29551 1070

16435 rows × 11 columns

b) Second, you can load the available hold-out split (70/30 by default):
In [14]:
df_train, df_test = load_ds_holdout()

print(df_train.shape, df_test.shape)

df_train
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
  0%|          | 0/193 [00:00<?, ?it/s]
(46785, 11) (20177, 11)
Out[14]:
lat lon day hour poi category price rating weather tid label
0 40.834098 -73.945267 Monday 13 21580 Food 1 8.2 Clouds 127 6
1 40.567196 -73.882576 Monday 19 2392 Travel & Transport -999 -999.0 Clouds 127 6
2 40.689913 -73.981504 Monday 23 35589 Travel & Transport -999 -999.0 Clouds 127 6
3 40.708588 -73.991032 Monday 23 18603 Travel & Transport -999 -999.0 Clouds 127 6
4 40.833165 -73.941860 Tuesday 14 36348 Residence -999 -999.0 Clear 127 6
... ... ... ... ... ... ... ... ... ... ... ...
22148 40.705953 -73.996568 Saturday 19 35308 Outdoors & Recreation -999 9.6 Clear 29556 1070
22149 40.697721 -73.993020 Saturday 23 36212 College & University -999 -999.0 Clear 29556 1070
22150 40.697884 -73.992805 Sunday 15 38090 Shop & Service -999 5.2 Clouds 29556 1070
22151 40.698291 -73.996632 Sunday 18 33538 Outdoors & Recreation -999 9.6 Clouds 29556 1070
22152 40.692421 -73.994002 Sunday 18 29212 Professional & Other Places -999 -999.0 Clouds 29556 1070

46785 rows × 11 columns

Or, you can hold-out split with another proportion (50%, for instance):

In [15]:
df_train, df_test = load_ds_holdout(train_size=0.5)

# The split is class-balanced, thus train and test number of trajectories may not be exactly proportional.
print(df_train.shape, df_test.shape) 

df_train
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
  0%|          | 0/193 [00:00<?, ?it/s]
(33773, 11) (33189, 11)
Out[15]:
lat lon day hour poi category price rating weather tid label
0 40.834098 -73.945267 Monday 13 21580 Food 1 8.2 Clouds 127 6
1 40.567196 -73.882576 Monday 19 2392 Travel & Transport -999 -999.0 Clouds 127 6
2 40.689913 -73.981504 Monday 23 35589 Travel & Transport -999 -999.0 Clouds 127 6
3 40.708588 -73.991032 Monday 23 18603 Travel & Transport -999 -999.0 Clouds 127 6
4 40.833165 -73.941860 Tuesday 14 36348 Residence -999 -999.0 Clear 127 6
... ... ... ... ... ... ... ... ... ... ... ...
22131 40.704733 -73.987738 Thursday 18 1805 College & University -999 -999.0 Clear 29554 1070
22132 40.704273 -73.986759 Thursday 19 25461 Outdoors & Recreation -999 8.6 Clear 29554 1070
22133 40.697803 -73.994145 Friday 10 16452 Outdoors & Recreation -999 6.9 Clear 29554 1070
22134 40.695163 -73.995448 Friday 20 1944 Food 2 8.0 Clear 29554 1070
22135 40.694673 -73.994082 Saturday 13 16201 Food 1 7.0 Clear 29554 1070

33773 rows × 11 columns
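
The class balance mentioned in the comment above can be sketched by splitting trajectory ids per class, so each label keeps roughly the same train proportion. A minimal illustration of this assumed behavior (not the library's actual implementation):

```python
import pandas as pd

# Toy dataset: one row per trajectory for simplicity (illustrative only)
df = pd.DataFrame({
    'tid':   list(range(1, 11)),
    'label': ['A'] * 6 + ['B'] * 4,
})

def holdout_per_class(df, train_size=0.7):
    """Split trajectory ids class by class, so each label keeps ~train_size in train."""
    train_tids = []
    for _, group in df.groupby('label'):
        tids = sorted(group['tid'].unique())
        n_train = round(len(tids) * train_size)  # per-class rounding
        train_tids += tids[:n_train]
    train = df[df['tid'].isin(train_tids)]
    test = df[~df['tid'].isin(train_tids)]
    return train, test

train, test = holdout_per_class(df, train_size=0.7)
print(len(train), len(test))
```

Because the rounding happens per class, the overall split may deviate slightly from the requested proportion, which is why the shapes above are not exactly 50/50.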

c) Or, you can load the available k-fold split datasets (default k=5):
In [16]:
df_train, df_test = load_ds_kfold()

for k in range(len(df_train)):
    print('Shape train/test:', df_train[k].shape, df_test[k].shape)
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
Spliting Data:   0%|          | 0/193 [00:00<?, ?it/s]
Shape train/test: (51130, 11) (15832, 11)
Shape train/test: (52678, 11) (14284, 11)
Shape train/test: (53938, 11) (13024, 11)
Shape train/test: (54856, 11) (12106, 11)
Shape train/test: (55246, 11) (11716, 11)
d) You can load a different dataset from the repository:
In [17]:
# Use the format: 'category.DatasetName'
dataset='raw.Animals'

df = load_ds(dataset)
df
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/raw/Animals/
Out[17]:
time lat lon tid label
0 0.00 50.1066 3.79665 1 D
1 4.39 50.1045 3.79455 1 D
2 7.90 50.1111 3.79845 1 D
3 9.62 50.1072 3.79845 1 D
4 15.09 50.1132 3.79965 1 D
... ... ... ... ... ...
4517 258.88 50.1696 3.76215 97 C
4518 260.85 50.1693 3.76185 97 C
4519 262.80 50.1693 3.76245 97 C
4520 264.69 50.1687 3.76455 97 C
4521 266.90 50.1687 3.76455 97 C

14990 rows × 5 columns

e) To get the full list of available datasets and categories:
In [18]:
rd = repository_datasets()

print('Multiple Aspect Trajectory datasets:', rd['mat'])

rd
Multiple Aspect Trajectory datasets: ['Brightkite', 'FoursquareGlobal', 'FoursquareNYC', 'Gowalla', 'Weeplaces']
Out[18]:
{'log': ['BPI2011', 'BPI2012', 'BPI2015', 'BPI2017', 'BPI2018', 'BPI2019'],
 'mat': ['Brightkite',
  'FoursquareGlobal',
  'FoursquareNYC',
  'Gowalla',
  'Weeplaces'],
 'mts': ['ActivityRecognition',
  'ArticularyWordRecognition',
  'AtrialFibrillation',
  'AustralianSignLanguage',
  'BasicMotions',
  'CharacterTrajectories',
  'Cricket',
  'DuckDuckGeese',
  'ERing',
  'EigenWorms',
  'Epilepsy',
  'EthanolConcentration',
  'FaceDetection',
  'FaciesRocks',
  'FingerMovements',
  'GECCOWater',
  'GrammaticalFacialExpression',
  'HandMovementDirection',
  'Handwriting',
  'Heartbeat',
  'InsectWingbeat',
  'JapaneseVowels',
  'LSST',
  'Libras',
  'MotorImagery',
  'NATOPS',
  'PEMS-SF',
  'PenDigits',
  'PhonemeSpectra',
  'RacketSports',
  'SelfRegulationSCP1',
  'SelfRegulationSCP2',
  'SpokenArabicDigits',
  'StandWalkJump',
  'UWaveGestureLibrary'],
 'raw': ['Animals', 'Geolife', 'GoTrack', 'Hurricanes', 'Vehicles'],
 'sequential': ['ClothingAlibaba', 'Promoters', 'SJGS'],
 'uts': ['ACSF1',
  'Adiac',
  'AllGestureWiimoteX',
  'AllGestureWiimoteY',
  'AllGestureWiimoteZ',
  'ArrowHead',
  'BME',
  'Beef',
  'BeetleFly',
  'BirdChicken',
  'CBF',
  'Car',
  'Chinatown',
  'ChlorineConcentration',
  'CinCECGTorso',
  'Coffee',
  'Computers',
  'CricketX',
  'CricketY',
  'CricketZ',
  'Crop',
  'DiatomSizeReduction',
  'DistalPhalanxOutlineAgeGroup',
  'DistalPhalanxOutlineCorrect',
  'DistalPhalanxTW',
  'DodgerLoopDay',
  'DodgerLoopGame',
  'DodgerLoopWeekend',
  'ECG200',
  'ECG5000',
  'ECGFiveDays',
  'EOGHorizontalSignal',
  'EOGVerticalSignal',
  'Earthquakes',
  'ElectricDevices',
  'EthanolLevel',
  'FaceAll',
  'FaceFour',
  'FacesUCR',
  'FiftyWords',
  'Fish',
  'FordA',
  'FordB',
  'FreezerRegularTrain',
  'FreezerSmallTrain',
  'Fungi',
  'GestureMidAirD1',
  'GestureMidAirD2',
  'GestureMidAirD3',
  'GesturePebbleZ1',
  'GesturePebbleZ2',
  'GunPoint',
  'GunPointAgeSpan',
  'GunPointMaleVersusFemale',
  'GunPointOldVersusYoung',
  'Ham',
  'HandOutlines',
  'Haptics',
  'Herring',
  'HouseTwenty',
  'InlineSkate',
  'InsectEPGRegularTrain',
  'InsectEPGSmallTrain',
  'InsectWingbeatSound',
  'ItalyPowerDemand',
  'LargeKitchenAppliances',
  'Lightning2',
  'Lightning7',
  'Mallat',
  'Meat',
  'MedicalImages',
  'MelbournePedestrian',
  'MiddlePhalanxOutlineAgeGroup',
  'MiddlePhalanxOutlineCorrect',
  'MiddlePhalanxTW',
  'MixedShapesRegularTrain',
  'MixedShapesSmallTrain',
  'MoteStrain',
  'NonInvasiveFetalECGThorax1',
  'NonInvasiveFetalECGThorax2',
  'OSULeaf',
  'OliveOil',
  'PLAID',
  'PhalangesOutlinesCorrect',
  'Phoneme',
  'PickupGestureWiimoteZ',
  'PigAirwayPressure',
  'PigArtPressure',
  'PigCVP',
  'Plane',
  'PowerCons',
  'ProximalPhalanxOutlineAgeGroup',
  'ProximalPhalanxOutlineCorrect',
  'ProximalPhalanxTW',
  'RefrigerationDevices',
  'Rock',
  'ScreenType',
  'SemgHandGenderCh2',
  'SemgHandMovementCh2',
  'SemgHandSubjectCh2',
  'ShakeGestureWiimoteZ',
  'ShapeletSim',
  'ShapesAll',
  'SmallKitchenAppliances',
  'SmoothSubspace',
  'SonyAIBORobotSurface1',
  'SonyAIBORobotSurface2',
  'StarLightCurves',
  'Strawberry',
  'SwedishLeaf',
  'Symbols',
  'SyntheticControl',
  'ToeSegmentation1',
  'ToeSegmentation2',
  'Trace',
  'TwoLeadECG',
  'TwoPatterns',
  'UMD',
  'UWaveGestureLibraryAll',
  'UWaveGestureLibraryX',
  'UWaveGestureLibraryY',
  'UWaveGestureLibraryZ',
  'Wafer',
  'Wine',
  'WordSynonyms',
  'Worms',
  'WormsTwoClass',
  'Yoga']}

3. Pre-processing data¶

To use helpers for data pre-processing, import from package matdata.preprocess:

In [19]:
from matdata.preprocess import *

The preprocess module provides some functions to work with data:

Basic functions:

  • readDataset: loads datasets as a pandas DataFrame (from .csv, .parquet, .zip, .ts or .xes)
  • organizeFrame: standardizes the data columns of the DataFrame

Train and Test split functions:

  • trainTestSplit: splits a dataset (pandas DataFrame) into train/test (70/30 by default)
  • kfold_trainTestSplit: splits a dataset (pandas DataFrame) into k-fold train/test (5 folds of 80/20 each by default)
  • stratify: extracts trajectories from the dataset, respecting class balance, to create a subset of the data (useful when smaller datasets are needed)
  • klabels_stratify: k-labels stratification (randomly selects k labels from the dataset)
  • joinTrainTest: joins the separated train and test files into one DataFrame

Statistical functions:

  • printFeaturesJSON: prints a default JSON descriptor file for Movelets methods (version 1 or 2)
  • countClasses: calculates class statistics from a dataset DataFrame
  • dfVariance: calculates a variance rank from a dataset DataFrame
  • dfStats: calculates attribute statistics ordered by variance from a dataset DataFrame
  • datasetStatistics: generates dataset statistics from a DataFrame in Markdown text

Type reading functions:

  • csv2df: reads a .csv dataset into a DataFrame
  • parquet2df: reads a .parquet dataset into a DataFrame
  • zip2df: reads a .zip dataset into a DataFrame (zip containing trajectory csv files)
  • ts2df: reads a .ts dataset into a DataFrame (Time Series data format)
  • xes2df: reads a .xes dataset into a DataFrame (event log / event stream file)
  • mat2df: TODO reads a .mat dataset into a DataFrame (multiple aspect trajectory specific file format)

File conversion functions:

  • zip2csv: converts .zip files and saves to .csv files
  • df2zip: converts DataFrame and saves to .zip files
  • any2ts: converts .zip or .csv files and saves to .ts files
  • xes2csv: reads .xes files and converts to DataFrame
  • convertDataset: default format conversions. Reads the dataset files and saves them in .csv and .zip formats; also does the k-fold split if not present
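As an illustration of the statistical helpers, per-class counts in the spirit of countClasses can be approximated in plain pandas (a hypothetical stand-in for the library call, not its actual code):

```python
import pandas as pd

# Toy dataset (illustrative); in practice df comes from readDataset/prepare_ds
df = pd.DataFrame({
    'tid':   [1, 1, 2, 2, 2, 3],
    'label': ['Food', 'Food', 'Residence', 'Residence', 'Residence', 'Food'],
})

# Number of trajectories and number of points per class
stats = df.groupby('label').agg(
    trajectories=('tid', 'nunique'),  # distinct trajectory ids
    points=('tid', 'size'),           # total rows (trajectory points)
)
print(stats)
```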
a) Basic data reading and organization:
In [20]:
data_path = 'matdata/assets/sample'

df = readDataset(data_path, file='Foursquare_Sample.csv')
df.head()
Out[20]:
tid lat_lon date_time time rating price weather day root_type type
0 126 40.8331652006224 -73.9418603427692 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Residence Home (private)
1 126 40.8340978041072 -73.9452672225881 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Food Deli / Bodega
2 126 40.8331652006224 -73.9418603427692 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Residence Home (private)
3 126 40.7646959283254 -73.8851974964414 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Food Fried Chicken Joint
4 126 40.7660790376824 -73.8835287094116 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Travel & Transport Bus Station
In [21]:
df, space_cols, ll_cols = organizeFrame(df, make_spatials=True)

print('Columns with space: ', space_cols)
print('Columns with lat/lon: ', ll_cols)
df.head()
Columns with space:  ['space', 'date_time', 'time', 'rating', 'price', 'weather', 'day', 'root_type', 'type', 'tid']
Columns with lat/lon:  ['date_time', 'time', 'rating', 'price', 'weather', 'day', 'root_type', 'type', 'lat', 'lon', 'tid']
Out[21]:
tid space date_time time rating price weather day root_type type lat lon
0 126 40.8331652006224 -73.9418603427692 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Residence Home (private) 40.833165 -73.941860
1 126 40.8340978041072 -73.9452672225881 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Food Deli / Bodega 40.834098 -73.945267
2 126 40.8331652006224 -73.9418603427692 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Residence Home (private) 40.833165 -73.941860
3 126 40.7646959283254 -73.8851974964414 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Food Fried Chicken Joint 40.764696 -73.885197
4 126 40.7660790376824 -73.8835287094116 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Travel & Transport Bus Station 40.766079 -73.883529

Note: For a better standard, we recommend using the prepare_ds function from the dataset module for classification, as it lets you indicate the class column:

In [22]:
from matdata.dataset import prepare_ds

df = prepare_ds(df, class_col='root_type') # 'root_type' is then renamed 'label'
df
Out[22]:
date_time time rating price weather day type lat lon tid label
0 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 40.833165 -73.941860 126 Residence
1 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 40.834098 -73.945267 126 Food
2 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 40.833165 -73.941860 126 Residence
3 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 40.764696 -73.885197 126 Food
4 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 40.766079 -73.883529 126 Travel & Transport
... ... ... ... ... ... ... ... ... ... ... ...
66957 2012-08-10 17:17:37 1037 -1.0 -1 Clouds Friday General College & University 40.704733 -73.987738 29563 College & University
66958 2012-08-10 20:10:59 1210 8.0 2 Clouds Friday Thai Restaurant 40.695163 -73.995448 29563 Food
66959 2012-08-11 08:01:20 481 6.9 -1 Clouds Saturday Gym 40.697803 -73.994145 29563 Outdoors & Recreation
66960 2012-08-11 13:39:39 819 7.0 1 Clouds Saturday Coffee Shop 40.694673 -73.994082 29563 Food
66961 2012-08-12 07:56:26 476 6.9 -1 Clouds Sunday Gym 40.697803 -73.994145 29563 Outdoors & Recreation

66962 rows × 11 columns

b) Train and test split:

To hold-out split a dataset into train and test (70/30 by default):

In [23]:
train, test = trainTestSplit(df, random_num=1)
train.head()
  0%|          | 0/9 [00:00<?, ?it/s]
Out[23]:
date_time time rating price weather day type lat lon tid label
0 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 40.833165 -73.941860 126 Residence
1 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 40.834098 -73.945267 126 Food
2 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 40.833165 -73.941860 126 Residence
3 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 40.764696 -73.885197 126 Food
4 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 40.766079 -73.883529 126 Travel & Transport

To save the split, indicate the output format(s) and the data path:

In [24]:
trainTestSplit(df, data_path=data_path, outformats=['csv', 'parquet'])

# Reading:
df = readDataset(data_path, file='train.parquet')
df.head()
  0%|          | 0/9 [00:00<?, ?it/s]
Writing - CSV |TRAIN - 
Writing - CSV |TEST - 
Writing - Parquet |TRAIN - 
Writing - Parquet |TEST - 
Out[24]:
date_time time rating price weather day type lat lon tid label
0 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 40.833165 -73.941860 126 Residence
1 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 40.834098 -73.945267 126 Food
2 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 40.833165 -73.941860 126 Residence
3 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 40.764696 -73.885197 126 Food
4 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 40.766079 -73.883529 126 Travel & Transport

To k-fold split a dataset into train and test:

In [25]:
train, test = kfold_trainTestSplit(df, k=3)

for k in range(len(train)):
    print('Shape train/test:', train[k].shape, test[k].shape)
Spliting Data:   0%|          | 0/10 [00:00<?, ?it/s]
Shape train/test: (174636, 11) (86969, 11)
Shape train/test: (175715, 11) (85890, 11)
Shape train/test: (172859, 11) (88746, 11)
c) Stratifying the data (example to get 50% of the dataset):
In [26]:
train, test = stratify(df, sample_size=0.5)

print('Shape train/test:', train.shape, test.shape)
train.head()
  0%|          | 0/9 [00:00<?, ?it/s]
  0%|          | 0/9 [00:00<?, ?it/s]
Shape train/test: (16432, 11) (7106, 11)
Out[26]:
date_time time rating price weather day type lat lon tid label
0 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 40.833165 -73.941860 126 Residence
1 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 40.834098 -73.945267 126 Food
2 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 40.833165 -73.941860 126 Residence
3 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 40.764696 -73.885197 126 Food
4 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 40.766079 -73.883529 126 Travel & Transport

k-labels stratifying the data (example selecting 5 labels from the dataset):

In [27]:
train, test = klabels_stratify(df, kl=5)

print('Shape train/test:', train.shape, test.shape)


print('Labels before:', df.label.unique())
print('Labels after:', train.label.unique())
  0%|          | 0/5 [00:00<?, ?it/s]
Shape train/test: (25824, 11) (11704, 11)
Labels before: ['Residence' 'Food' 'Travel & Transport' 'Professional & Other Places'
 'Shop & Service' 'Outdoors & Recreation' 'College & University'
 'Arts & Entertainment' 'Nightlife Spot' 'Event']
Labels after: ['Residence' 'Food' 'Travel & Transport' 'Professional & Other Places'
 'Shop & Service']
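The k-labels stratification above can be sketched as randomly picking kl labels and keeping only their trajectories. A simplified illustration of that idea (klabels_stratify additionally splits the result into train/test):

```python
import random
import pandas as pd

# Toy dataset: one trajectory per label (illustrative only)
df = pd.DataFrame({
    'tid':   [1, 2, 3, 4, 5, 6],
    'label': ['A', 'B', 'C', 'D', 'E', 'F'],
})

def klabels_subset(df, kl, seed=42):
    """Randomly pick kl labels and keep only their trajectories."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(df['label'].unique()), kl)
    return df[df['label'].isin(chosen)].reset_index(drop=True)

sub = klabels_subset(df, kl=3)
print(sub['label'].nunique())  # 3 labels remain
```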
d) Joining train and test files:
In [28]:
df = joinTrainTest(data_path, train_file="train.csv", test_file="test.csv", to_file=True) # Saves 'joined.csv' file

df.head()
Joining train and test data from... matdata/assets/sample
Saving joined dataset as: matdata/assets/sample/joined.csv
Done.
 --------------------------------------------------------------------------------
Out[28]:
date_time time rating price weather day type lat lon tid label
0 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 40.833165 -73.941860 126 Residence
1 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 40.834098 -73.945267 126 Food
2 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 40.833165 -73.941860 126 Residence
3 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 40.764696 -73.885197 126 Food
4 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 40.766079 -73.883529 126 Travel & Transport

Note: We standardized all repository datasets by creating a data.parquet file, with np.NaN for missing values. For example:

In [29]:
from matdata.preprocess import *
from matdata.dataset import prepare_ds
import numpy as np

data_path = 'matdata/assets/sample'

df.replace('?', np.NaN, inplace=True)

df = prepare_ds(df)

df2parquet(df, data_path, 'data')
Saving dataset as: matdata/assets/sample/data.parquet
Done.
 --------------------------------------------------------------------------------
Out[29]:
date_time time rating price weather day type lat lon label tid
0 2012-11-12 05:17:18 317 -1.0 -1 Clear Monday Home (private) 40.833165 -73.941860 Residence 126
1 2012-11-12 23:24:55 1404 8.2 1 Clouds Monday Deli / Bodega 40.834098 -73.945267 Food 126
2 2012-11-13 00:00:07 0 -1.0 -1 Clouds Tuesday Home (private) 40.833165 -73.941860 Residence 126
3 2012-11-15 17:49:01 1069 6.6 3 Clear Thursday Fried Chicken Joint 40.764696 -73.885197 Food 126
4 2012-11-15 18:40:16 1120 -1.0 -1 Clear Thursday Bus Station 40.766079 -73.883529 Travel & Transport 126
... ... ... ... ... ... ... ... ... ... ... ...
20080 2012-08-10 17:17:37 1037 -1.0 -1 Clouds Friday General College & University 40.704733 -73.987738 College & University 29563
20081 2012-08-10 20:10:59 1210 8.0 2 Clouds Friday Thai Restaurant 40.695163 -73.995448 Food 29563
20082 2012-08-11 08:01:20 481 6.9 -1 Clouds Saturday Gym 40.697803 -73.994145 Outdoors & Recreation 29563
20083 2012-08-11 13:39:39 819 7.0 1 Clouds Saturday Coffee Shop 40.694673 -73.994082 Food 29563
20084 2012-08-12 07:56:26 476 6.9 -1 Clouds Sunday Gym 40.697803 -73.994145 Outdoors & Recreation 29563

66962 rows × 11 columns

4. Synthetic Data Generation¶


In [1]:
from matdata.generator import *
  • scalerSamplerGenerator: generates trajectory datasets based on real data over scale intervals
  • samplerGenerator: generates a trajectory dataset based on real data
  • scalerRandomGenerator: generates trajectory datasets based on random data over scale intervals
  • randomGenerator: generates a trajectory dataset based on random data
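As an illustration of what these generators produce, here is a minimal numpy/pandas sketch (not the library's implementation) that builds N random trajectories of M points each, with C class labels assigned round-robin:

```python
import numpy as np
import pandas as pd

def random_trajectories(N=10, M=50, C=3, seed=0):
    """Build N random trajectories of M points; labels C1..C<C> round-robin."""
    rng = np.random.default_rng(seed)
    rows = []
    for tid in range(1, N + 1):
        label = f'C{(tid - 1) % C + 1}'
        for _ in range(M):
            rows.append({
                'tid': tid,
                'lat': rng.uniform(40.5, 40.9),      # NYC-like bounding box
                'lon': rng.uniform(-74.1, -73.7),
                'time': int(rng.integers(0, 1440)),  # minute of the day
                'label': label,
            })
    return pd.DataFrame(rows)

df = random_trajectories(N=10, M=50, C=3)
print(df.shape)  # (500, 5)
```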
a) To generate a sample dataset (default config):
In [2]:
samplerGenerator()
Out[2]:
tid space time day rating price weather root_type type label
0 1 40.7429692382286 -73.8827669620514 447 Monday 6.1 -1 Clouds Shop & Service Supermarket C1
1 1 40.8610785519472 -73.9301037076315 486 Monday -1.0 -1 Clouds Residence Home (private) C1
2 1 40.5774555812085 -73.9812469482422 653 Monday -1.0 -1 Clear Travel & Transport Metro Station C1
3 1 40.8114381522955 -74.0677964687347 701 Monday 6.8 -1 Rain Arts & Entertainment Stadium C1
4 1 40.7811844364976 -73.9732030930065 758 Monday 9.4 -1 Clear Arts & Entertainment Science Museum C1
... ... ... ... ... ... ... ... ... ... ...
495 10 40.7638885711179 -74.0233821808352 1385 Saturday -1.0 -1 Clouds Travel & Transport Border Crossing C1
496 10 40.6891441345215 -73.9303207397461 379 Sunday -1.0 1 Clouds Food Deli / Bodega C1
497 10 40.7704703500000 -74.0281831400000 547 Sunday -1.0 -1 Clear Professional & Other Places Post Office C1
498 10 40.7774773922087 -73.8146725755643 783 Sunday -1.0 -1 Clear Shop & Service Candy Store C1
499 10 40.8331652006224 -73.9418603427692 1033 Sunday -1.0 -1 Clouds Residence Home (private) C1

500 rows × 10 columns

To specify the synthetic dataset parameters:

In [3]:
N=10 # Number of trajectories
M=50 # Number of points by trajectory
C=3  # Number of classes (C1 to Cn)
samplerGenerator(N, M, C)
Out[3]:
tid space time day rating price weather root_type type label
0 1 40.7429692382286 -73.8827669620514 447 Monday 6.1 -1 Clouds Shop & Service Supermarket C1
1 1 40.8610785519472 -73.9301037076315 486 Monday -1.0 -1 Clouds Residence Home (private) C1
2 1 40.5774555812085 -73.9812469482422 653 Monday -1.0 -1 Clear Travel & Transport Metro Station C1
3 1 40.8114381522955 -74.0677964687347 701 Monday 6.8 -1 Rain Arts & Entertainment Stadium C1
4 1 40.7811844364976 -73.9732030930065 758 Monday 9.4 -1 Clear Arts & Entertainment Science Museum C1
... ... ... ... ... ... ... ... ... ... ...
495 11 40.7638885711179 -74.0233821808352 1385 Saturday -1.0 -1 Clouds Travel & Transport Border Crossing C3
496 11 40.6891441345215 -73.9303207397461 379 Sunday -1.0 1 Clouds Food Deli / Bodega C3
497 11 40.7704703500000 -74.0281831400000 547 Sunday -1.0 -1 Clear Professional & Other Places Post Office C3
498 11 40.7774773922087 -73.8146725755643 783 Sunday -1.0 -1 Clear Shop & Service Candy Store C3
499 11 40.8331652006224 -73.9418603427692 1033 Sunday -1.0 -1 Clouds Residence Home (private) C3

500 rows × 10 columns

b) To generate a set of sample datasets:

Creates and saves dataset files (including the Movelets JSON descriptor file). Generates sample datasets on an increasing log scale for each parameter, using the middle value for the other configurations.

In [4]:
data_path = 'matdata/assets/sample/samples'

Ns=[100, 3]   # Min. number of trajectories: 100, 3 scales (by log increment)  
Ms=[10,  3]   # Min. number of points: 10, 3 scales (by log increment)
Ls=[8,   3]   # Min. number of attributes: 8, 3 scales (by log increment)
Cs=[2,   3]   # Min. number of labels: 2, 3 scales (by log increment)

scalerSamplerGenerator(Ns, Ms, Ls, Cs, save_to=data_path)
  0%|          | 0/12 [00:00<?, ?it/s]
N :: fix. value: 	 200 	scale:	 [100, 200, 400]
M :: fix. value: 	 20 	scale:	 [10, 20, 40]
L :: fix. value: 	 8 	scale:	 [8, 16, 32]
C :: fix. value: 	 4 	scale:	 [2, 4, 8]
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
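The printed scales double from each minimum value; the log-increment scale can be reproduced with a one-liner (a sketch of the apparent behavior, inferred from the output above):

```python
def log_scale(vmin, n_scales):
    """Doubling scale starting at vmin, matching the scales printed above."""
    return [vmin * 2 ** i for i in range(n_scales)]

print(log_scale(100, 3))  # [100, 200, 400]
print(log_scale(8, 3))    # [8, 16, 32]
```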
c) To generate a random dataset (default config):
In [5]:
randomGenerator()
Out[5]:
tid a1_space a2_time a3_n1 a4_n2 a5_nominal a6_day a7_weather a8_category a9_space a10_time label
0 1 134.5 848.28 114 949 976.98 PX Tuesday Clouds Outdoors & Recreation 101.04 778.52 1114 C1
1 1 764.54 255.32 985 -656 630.77 HK Saturday Clouds Shop & Service 328.42 509.78 83 C1
2 1 495.93 449.94 746 344 695.05 JM Saturday Rain Travel & Transport 665.91 179.75 1073 C1
3 1 652.24 789.51 1167 -442 450.85 ZO Wednesday Unknown Nightlife Spot 149.71 141.68 185 C1
4 1 93.95 28.38 1135 327 523.90 CU Monday Clouds Professional & Other Places 866.41 305.93 522 C1
... ... ... ... ... ... ... ... ... ... ... ... ...
495 10 647.96 300.5 1205 561 819.12 ACR Saturday Rain Arts & Entertainment 515.96 955.66 634 C10
496 10 779.26 933.61 485 121 745.32 AIZ Tuesday Rain Outdoors & Recreation 377.83 263.13 435 C10
497 10 398.3 545.19 406 -248 180.89 YH Monday Fog Event 458.27 665.46 874 C10
498 10 133.52 605.02 62 -485 890.42 DI Thursday Clouds Residence 263.93 944.8 1141 C10
499 10 387.49 106.12 435 837 398.13 ACM Tuesday Rain Residence 916.46 785.13 1198 C10

500 rows × 12 columns

To specify the synthetic random dataset parameters:

In [6]:
N=10 # Number of trajectories
M=50 # Number of points by trajectory
L=10 # Number of attributes
C=3  # Number of classes (C1 to Cn)
randomGenerator(N, M, L, C)
Out[6]:
tid a1_space a2_time a3_n1 a4_n2 a5_nominal a6_day a7_weather a8_category a9_space a10_time label
0 1 134.5 848.28 114 949 976.98 PX Tuesday Clouds Outdoors & Recreation 101.04 778.52 1114 C1
1 1 764.54 255.32 985 -656 630.77 HK Saturday Clouds Shop & Service 328.42 509.78 83 C1
2 1 495.93 449.94 746 344 695.05 JM Saturday Rain Travel & Transport 665.91 179.75 1073 C1
3 1 652.24 789.51 1167 -442 450.85 ZO Wednesday Unknown Nightlife Spot 149.71 141.68 185 C1
4 1 93.95 28.38 1135 327 523.90 CU Monday Clouds Professional & Other Places 866.41 305.93 522 C1
... ... ... ... ... ... ... ... ... ... ... ... ...
495 11 647.96 300.5 1205 561 819.12 ACR Saturday Rain Arts & Entertainment 515.96 955.66 634 C3
496 11 779.26 933.61 485 121 745.32 AIZ Tuesday Rain Outdoors & Recreation 377.83 263.13 435 C3
497 11 398.3 545.19 406 -248 180.89 YH Monday Fog Event 458.27 665.46 874 C3
498 11 133.52 605.02 62 -485 890.42 DI Thursday Clouds Residence 263.93 944.8 1141 C3
499 11 387.49 106.12 435 837 398.13 ACM Tuesday Rain Residence 916.46 785.13 1198 C3

500 rows × 12 columns

d) To generate a set of random datasets:

Creates and saves dataset files (including the Movelets JSON descriptor file). Generates random datasets on an increasing log scale for each parameter, using the middle value for the other configurations.

In [7]:
data_path = 'matdata/assets/sample/random'

Ns=[100, 3]   # Min. number of trajectories: 100, 3 scales (by log increment)  
Ms=[10,  3]   # Min. number of points: 10, 3 scales (by log increment)
Ls=[8,   3]   # Min. number of attributes: 8, 3 scales (by log increment)
Cs=[2,   3]   # Min. number of labels: 2, 3 scales (by log increment)

scalerRandomGenerator(Ns, Ms, Ls, Cs, save_to=data_path)
  0%|          | 0/12 [00:00<?, ?it/s]
N :: fix. value: 	 200 	scale:	 [100, 200, 400]
M :: fix. value: 	 20 	scale:	 [10, 20, 40]
L :: fix. value: 	 8 	scale:	 [8, 16, 32]
C :: fix. value: 	 4 	scale:	 [2, 4, 8]
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |
Writing - CSV |

# By Tarlis Portela (2023)