Sample code in a Python notebook to use mat-data as a Python library.
This package offers tools to support the user in preprocessing multiple aspect trajectory data and in generating synthetic datasets. It integrates into a unified framework for multiple aspect trajectories and, in general, for multidimensional sequence data mining methods.
Created in December 2023. Copyright (C) 2023. License: GPL Version 3 or later (see LICENSE file).
!pip install mat-data
#!pip install --upgrade mat-data
Sample code for reading trajectory datasets from local files
The easiest way to read data is to load a CSV file with pandas, for example:
import pandas as pd
data_path = 'matdata/assets/sample'
pd.read_csv(data_path + '/Foursquare_Sample.csv')
|  | tid | lat_lon | date_time | time | rating | price | weather | day | root_type | type |
---|---|---|---|---|---|---|---|---|---|---|
0 | 126 | 40.8331652006224 -73.9418603427692 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Residence | Home (private) |
1 | 126 | 40.8340978041072 -73.9452672225881 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Food | Deli / Bodega |
2 | 126 | 40.8331652006224 -73.9418603427692 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Residence | Home (private) |
3 | 126 | 40.7646959283254 -73.8851974964414 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Food | Fried Chicken Joint |
4 | 126 | 40.7660790376824 -73.8835287094116 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Travel & Transport | Bus Station |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
66957 | 29563 | 40.7047332789043 -73.9877378940582 | 2012-08-10 17:17:37 | 1037 | -1.0 | -1 | Clouds | Friday | College & University | General College & University |
66958 | 29563 | 40.6951627360199 -73.9954478691072 | 2012-08-10 20:10:59 | 1210 | 8.0 | 2 | Clouds | Friday | Food | Thai Restaurant |
66959 | 29563 | 40.6978026652822 -73.9941451630314 | 2012-08-11 08:01:20 | 481 | 6.9 | -1 | Clouds | Saturday | Outdoors & Recreation | Gym |
66960 | 29563 | 40.6946728967503 -73.9940820360805 | 2012-08-11 13:39:39 | 819 | 7.0 | 1 | Clouds | Saturday | Food | Coffee Shop |
66961 | 29563 | 40.6978026652822 -73.9941451630314 | 2012-08-12 07:56:26 | 476 | 6.9 | -1 | Clouds | Sunday | Outdoors & Recreation | Gym |
66962 rows × 10 columns
mat-data provides modules to handle dataset reading in standard ways:
a) Read a dataset locally:
This example uses a .csv file; however, read_ds can read .csv, .parquet, .zip, .ts, and .xes file formats.
from matdata.dataset import *
df = read_ds('matdata/assets/sample/Foursquare_Sample.csv')
df.head()
|  | lat_lon | date_time | time | rating | price | weather | day | root_type | type | tid |
---|---|---|---|---|---|---|---|---|---|---|
0 | 40.8331652006224 -73.9418603427692 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Residence | Home (private) | 126 |
1 | 40.8340978041072 -73.9452672225881 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Food | Deli / Bodega | 126 |
2 | 40.8331652006224 -73.9418603427692 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Residence | Home (private) | 126 |
3 | 40.7646959283254 -73.8851974964414 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Food | Fried Chicken Joint | 126 |
4 | 40.7660790376824 -73.8835287094116 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Travel & Transport | Bus Station | 126 |
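read_ds handles the other supported formats with the same call; for instance, assuming a Parquet copy of the sample exists (one is created as data.parquet in the preprocess section below):

df = read_ds('matdata/assets/sample/data.parquet')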
Optionally, you can use the standardized reading, which applies the 'tid'/'label' nomenclature (renaming columns) and sorts the trajectories:
df = read_ds('matdata/assets/sample/Foursquare_Sample.csv', tid_col='tid', class_col='root_type')
df.head()
|  | lat_lon | date_time | time | rating | price | weather | day | type | tid | label |
---|---|---|---|---|---|---|---|---|---|---|
0 | 40.8331652006224 -73.9418603427692 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 126 | Residence |
1 | 40.8340978041072 -73.9452672225881 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 126 | Food |
2 | 40.8331652006224 -73.9418603427692 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 126 | Residence |
3 | 40.7646959283254 -73.8851974964414 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 126 | Food |
4 | 40.7660790376824 -73.8835287094116 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 126 | Travel & Transport |
This module loads data from the public Git repository: mat-analysis datasets (v2_0).
Check the GitHub repository to see available datasets.
To use helpers for data loading, import from the package matdata.dataset:
from matdata.dataset import *
a) First, you can load datasets by specifying the category (parent folder) and dataset name (subfolder):
# dataset='mat.FoursquareNYC' ## => default
df = load_ds(sample_size=0.25)
df
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
0%| | 0/193 [00:00<?, ?it/s]
|  | lat | lon | day | hour | poi | category | price | rating | weather | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.834098 | -73.945267 | Monday | 13 | 21580 | Food | 1 | 8.2 | Clouds | 127 | 6 |
1 | 40.567196 | -73.882576 | Monday | 19 | 2392 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
2 | 40.689913 | -73.981504 | Monday | 23 | 35589 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
3 | 40.708588 | -73.991032 | Monday | 23 | 18603 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
4 | 40.833165 | -73.941860 | Tuesday | 14 | 36348 | Residence | -999 | -999.0 | Clear | 127 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
22118 | 40.704273 | -73.986759 | Saturday | 10 | 25461 | Outdoors & Recreation | -999 | 8.6 | Clear | 29551 | 1070 |
22119 | 40.704733 | -73.987738 | Saturday | 10 | 1805 | College & University | -999 | -999.0 | Clear | 29551 | 1070 |
22120 | 40.717353 | -73.960392 | Saturday | 13 | 32523 | Food | 1 | 9.3 | Clear | 29551 | 1070 |
22121 | 40.697721 | -73.993020 | Sunday | 2 | 36212 | College & University | -999 | -999.0 | Clear | 29551 | 1070 |
22122 | 40.697803 | -73.994145 | Sunday | 8 | 16452 | Outdoors & Recreation | -999 | 6.9 | Clear | 29551 | 1070 |
16435 rows × 11 columns
b) Second, you can load the available 70/30 hold-out split (the default):
df_train, df_test = load_ds_holdout()
print(df_train.shape, df_test.shape)
df_train
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
0%| | 0/193 [00:00<?, ?it/s]
(46785, 11) (20177, 11)
|  | lat | lon | day | hour | poi | category | price | rating | weather | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.834098 | -73.945267 | Monday | 13 | 21580 | Food | 1 | 8.2 | Clouds | 127 | 6 |
1 | 40.567196 | -73.882576 | Monday | 19 | 2392 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
2 | 40.689913 | -73.981504 | Monday | 23 | 35589 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
3 | 40.708588 | -73.991032 | Monday | 23 | 18603 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
4 | 40.833165 | -73.941860 | Tuesday | 14 | 36348 | Residence | -999 | -999.0 | Clear | 127 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
22148 | 40.705953 | -73.996568 | Saturday | 19 | 35308 | Outdoors & Recreation | -999 | 9.6 | Clear | 29556 | 1070 |
22149 | 40.697721 | -73.993020 | Saturday | 23 | 36212 | College & University | -999 | -999.0 | Clear | 29556 | 1070 |
22150 | 40.697884 | -73.992805 | Sunday | 15 | 38090 | Shop & Service | -999 | 5.2 | Clouds | 29556 | 1070 |
22151 | 40.698291 | -73.996632 | Sunday | 18 | 33538 | Outdoors & Recreation | -999 | 9.6 | Clouds | 29556 | 1070 |
22152 | 40.692421 | -73.994002 | Sunday | 18 | 29212 | Professional & Other Places | -999 | -999.0 | Clouds | 29556 | 1070 |
46785 rows × 11 columns
Or, you can hold-out split with another proportion (50%, for instance):
df_train, df_test = load_ds_holdout(train_size=0.5)
# The split is class-balanced, so the numbers of train and test trajectories may not be exactly proportional.
print(df_train.shape, df_test.shape)
df_train
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
0%| | 0/193 [00:00<?, ?it/s]
(33773, 11) (33189, 11)
|  | lat | lon | day | hour | poi | category | price | rating | weather | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.834098 | -73.945267 | Monday | 13 | 21580 | Food | 1 | 8.2 | Clouds | 127 | 6 |
1 | 40.567196 | -73.882576 | Monday | 19 | 2392 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
2 | 40.689913 | -73.981504 | Monday | 23 | 35589 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
3 | 40.708588 | -73.991032 | Monday | 23 | 18603 | Travel & Transport | -999 | -999.0 | Clouds | 127 | 6 |
4 | 40.833165 | -73.941860 | Tuesday | 14 | 36348 | Residence | -999 | -999.0 | Clear | 127 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
22131 | 40.704733 | -73.987738 | Thursday | 18 | 1805 | College & University | -999 | -999.0 | Clear | 29554 | 1070 |
22132 | 40.704273 | -73.986759 | Thursday | 19 | 25461 | Outdoors & Recreation | -999 | 8.6 | Clear | 29554 | 1070 |
22133 | 40.697803 | -73.994145 | Friday | 10 | 16452 | Outdoors & Recreation | -999 | 6.9 | Clear | 29554 | 1070 |
22134 | 40.695163 | -73.995448 | Friday | 20 | 1944 | Food | 2 | 8.0 | Clear | 29554 | 1070 |
22135 | 40.694673 | -73.994082 | Saturday | 13 | 16201 | Food | 1 | 7.0 | Clear | 29554 | 1070 |
33773 rows × 11 columns
c) Or, you can load the available k-fold split datasets (default k=5):
df_train, df_test = load_ds_kfold()
for k in range(len(df_train)):
print('Shape train/test:', df_train[k].shape, df_test[k].shape)
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/mat/FoursquareNYC/
Spliting Data: 0%| | 0/193 [00:00<?, ?it/s]
Shape train/test: (51130, 11) (15832, 11)
Shape train/test: (52678, 11) (14284, 11)
Shape train/test: (53938, 11) (13024, 11)
Shape train/test: (54856, 11) (12106, 11)
Shape train/test: (55246, 11) (11716, 11)
d) You can load a different dataset from the repository:
# Use the format: 'category.DatasetName'
dataset='raw.Animals'
df = load_ds(dataset)
df
Loading dataset file: https://github.com/mat-analysis/datasets/raw/main/raw/Animals/
|  | time | lat | lon | tid | label |
---|---|---|---|---|---|
0 | 0.00 | 50.1066 | 3.79665 | 1 | D |
1 | 4.39 | 50.1045 | 3.79455 | 1 | D |
2 | 7.90 | 50.1111 | 3.79845 | 1 | D |
3 | 9.62 | 50.1072 | 3.79845 | 1 | D |
4 | 15.09 | 50.1132 | 3.79965 | 1 | D |
... | ... | ... | ... | ... | ... |
4517 | 258.88 | 50.1696 | 3.76215 | 97 | C |
4518 | 260.85 | 50.1693 | 3.76185 | 97 | C |
4519 | 262.80 | 50.1693 | 3.76245 | 97 | C |
4520 | 264.69 | 50.1687 | 3.76455 | 97 | C |
4521 | 266.90 | 50.1687 | 3.76455 | 97 | C |
14990 rows × 5 columns
e) To get a full list of available repositories and categories:
rd = repository_datasets()
print('Multiple Aspect Trajectory datasets:', rd['mat'])
rd
Multiple Aspect Trajectory datasets: ['Brightkite', 'FoursquareGlobal', 'FoursquareNYC', 'Gowalla', 'Weeplaces']
{'log': ['BPI2011', 'BPI2012', 'BPI2015', 'BPI2017', 'BPI2018', 'BPI2019'], 'mat': ['Brightkite', 'FoursquareGlobal', 'FoursquareNYC', 'Gowalla', 'Weeplaces'], 'mts': ['ActivityRecognition', 'ArticularyWordRecognition', 'AtrialFibrillation', 'AustralianSignLanguage', 'BasicMotions', 'CharacterTrajectories', 'Cricket', 'DuckDuckGeese', 'ERing', 'EigenWorms', 'Epilepsy', 'EthanolConcentration', 'FaceDetection', 'FaciesRocks', 'FingerMovements', 'GECCOWater', 'GrammaticalFacialExpression', 'HandMovementDirection', 'Handwriting', 'Heartbeat', 'InsectWingbeat', 'JapaneseVowels', 'LSST', 'Libras', 'MotorImagery', 'NATOPS', 'PEMS-SF', 'PenDigits', 'PhonemeSpectra', 'RacketSports', 'SelfRegulationSCP1', 'SelfRegulationSCP2', 'SpokenArabicDigits', 'StandWalkJump', 'UWaveGestureLibrary'], 'raw': ['Animals', 'Geolife', 'GoTrack', 'Hurricanes', 'Vehicles'], 'sequential': ['ClothingAlibaba', 'Promoters', 'SJGS'], 'uts': ['ACSF1', 'Adiac', 'AllGestureWiimoteX', 'AllGestureWiimoteY', 'AllGestureWiimoteZ', 'ArrowHead', 'BME', 'Beef', 'BeetleFly', 'BirdChicken', 'CBF', 'Car', 'Chinatown', 'ChlorineConcentration', 'CinCECGTorso', 'Coffee', 'Computers', 'CricketX', 'CricketY', 'CricketZ', 'Crop', 'DiatomSizeReduction', 'DistalPhalanxOutlineAgeGroup', 'DistalPhalanxOutlineCorrect', 'DistalPhalanxTW', 'DodgerLoopDay', 'DodgerLoopGame', 'DodgerLoopWeekend', 'ECG200', 'ECG5000', 'ECGFiveDays', 'EOGHorizontalSignal', 'EOGVerticalSignal', 'Earthquakes', 'ElectricDevices', 'EthanolLevel', 'FaceAll', 'FaceFour', 'FacesUCR', 'FiftyWords', 'Fish', 'FordA', 'FordB', 'FreezerRegularTrain', 'FreezerSmallTrain', 'Fungi', 'GestureMidAirD1', 'GestureMidAirD2', 'GestureMidAirD3', 'GesturePebbleZ1', 'GesturePebbleZ2', 'GunPoint', 'GunPointAgeSpan', 'GunPointMaleVersusFemale', 'GunPointOldVersusYoung', 'Ham', 'HandOutlines', 'Haptics', 'Herring', 'HouseTwenty', 'InlineSkate', 'InsectEPGRegularTrain', 'InsectEPGSmallTrain', 'InsectWingbeatSound', 'ItalyPowerDemand', 'LargeKitchenAppliances', 'Lightning2', 'Lightning7', 'Mallat', 'Meat', 'MedicalImages', 'MelbournePedestrian', 'MiddlePhalanxOutlineAgeGroup', 'MiddlePhalanxOutlineCorrect', 'MiddlePhalanxTW', 'MixedShapesRegularTrain', 'MixedShapesSmallTrain', 'MoteStrain', 'NonInvasiveFetalECGThorax1', 'NonInvasiveFetalECGThorax2', 'OSULeaf', 'OliveOil', 'PLAID', 'PhalangesOutlinesCorrect', 'Phoneme', 'PickupGestureWiimoteZ', 'PigAirwayPressure', 'PigArtPressure', 'PigCVP', 'Plane', 'PowerCons', 'ProximalPhalanxOutlineAgeGroup', 'ProximalPhalanxOutlineCorrect', 'ProximalPhalanxTW', 'RefrigerationDevices', 'Rock', 'ScreenType', 'SemgHandGenderCh2', 'SemgHandMovementCh2', 'SemgHandSubjectCh2', 'ShakeGestureWiimoteZ', 'ShapeletSim', 'ShapesAll', 'SmallKitchenAppliances', 'SmoothSubspace', 'SonyAIBORobotSurface1', 'SonyAIBORobotSurface2', 'StarLightCurves', 'Strawberry', 'SwedishLeaf', 'Symbols', 'SyntheticControl', 'ToeSegmentation1', 'ToeSegmentation2', 'Trace', 'TwoLeadECG', 'TwoPatterns', 'UMD', 'UWaveGestureLibraryAll', 'UWaveGestureLibraryX', 'UWaveGestureLibraryY', 'UWaveGestureLibraryZ', 'Wafer', 'Wine', 'WordSynonyms', 'Worms', 'WormsTwoClass', 'Yoga']}
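Since repository_datasets returns a plain dict mapping each category to its dataset names, standard Python applies; for example, to count the datasets per category:

for category, names in rd.items():
    print(category, '->', len(names), 'datasets')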
To use helpers for data pre-processing, import from the package matdata.preprocess:
from matdata.preprocess import *
The preprocess module provides functions to work with the data:
Basic functions:
- readDataset: loads a dataset as a pandas DataFrame (from .csv, .parquet, .zip, .ts, or .xes)
- organizeFrame: standardizes the data columns of the DataFrame

Train and test split functions:
- trainTestSplit: splits a dataset (pandas DataFrame) into train/test (70/30% by default)
- kfold_trainTestSplit: splits a dataset (pandas DataFrame) into k-fold train/test (5 folds of 80/20% each by default)
- stratify: extracts trajectories from the dataset, respecting class balance, to create a subset of the data (for when smaller datasets are needed)
- klabels_stratify: k-labels stratification (randomly selects k labels from the dataset)
- joinTrainTest: joins the separate train and test files into one DataFrame

Statistical functions (a usage sketch follows this list):
- printFeaturesJSON: prints a default JSON descriptor file for Movelets methods (version 1 or 2)
- countClasses: calculates statistics from a dataset DataFrame
- dfVariance: calculates a variance rank from a dataset DataFrame
- dfStats: calculates attribute statistics ordered by variance from a dataset DataFrame
- datasetStatistics: generates dataset statistics from a DataFrame as markdown text

Type reading functions:
- csv2df: reads a .csv dataset into a DataFrame
- parquet2df: reads a .parquet dataset into a DataFrame
- zip2df: reads a .zip dataset into a DataFrame (a zip containing trajectory csv files)
- ts2df: reads a .ts dataset into a DataFrame (Time Series data format)
- xes2df: reads a .xes dataset into a DataFrame (event log / event stream file)
- mat2df: TODO; will read a .mat dataset into a DataFrame (multiple aspect trajectory specific file format)

File conversion functions:
- zip2csv: converts .zip files and saves them as .csv files
- df2zip: converts a DataFrame and saves it as .zip files
- any2ts: converts .zip or .csv files and saves them as .ts files
- xes2csv: reads .xes files and converts them to a DataFrame
- convertDataset: default format conversions; reads the dataset files and saves them in .csv and .zip formats, also doing the k-fold split if not present
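Most of the statistical helpers are not demonstrated in the sections below, so here is a minimal usage sketch; it assumes countClasses and dfVariance take the dataset DataFrame directly, as their descriptions suggest (check the package API for the exact signatures):

from matdata.preprocess import readDataset, countClasses, dfVariance

df = readDataset('matdata/assets/sample', file='Foursquare_Sample.csv')
print(countClasses(df))  # assumed: class statistics of the dataset
print(dfVariance(df))    # assumed: variance rank of the attributes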
a) Basic reading the data, and organization:
data_path = 'matdata/assets/sample'
df = readDataset(data_path, file='Foursquare_Sample.csv')
df.head()
|  | tid | lat_lon | date_time | time | rating | price | weather | day | root_type | type |
---|---|---|---|---|---|---|---|---|---|---|
0 | 126 | 40.8331652006224 -73.9418603427692 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Residence | Home (private) |
1 | 126 | 40.8340978041072 -73.9452672225881 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Food | Deli / Bodega |
2 | 126 | 40.8331652006224 -73.9418603427692 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Residence | Home (private) |
3 | 126 | 40.7646959283254 -73.8851974964414 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Food | Fried Chicken Joint |
4 | 126 | 40.7660790376824 -73.8835287094116 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Travel & Transport | Bus Station |
df, space_cols, ll_cols = organizeFrame(df, make_spatials=True)
print('Columns with space: ', space_cols)
print('Columns with lat/lon: ', ll_cols)
df.head()
Columns with space: ['space', 'date_time', 'time', 'rating', 'price', 'weather', 'day', 'root_type', 'type', 'tid']
Columns with lat/lon: ['date_time', 'time', 'rating', 'price', 'weather', 'day', 'root_type', 'type', 'lat', 'lon', 'tid']
|  | tid | space | date_time | time | rating | price | weather | day | root_type | type | lat | lon |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 126 | 40.8331652006224 -73.9418603427692 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Residence | Home (private) | 40.833165 | -73.941860 |
1 | 126 | 40.8340978041072 -73.9452672225881 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Food | Deli / Bodega | 40.834098 | -73.945267 |
2 | 126 | 40.8331652006224 -73.9418603427692 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Residence | Home (private) | 40.833165 | -73.941860 |
3 | 126 | 40.7646959283254 -73.8851974964414 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Food | Fried Chicken Joint | 40.764696 | -73.885197 |
4 | 126 | 40.7660790376824 -73.8835287094116 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Travel & Transport | Bus Station | 40.766079 | -73.883529 |
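The returned column lists are handy for selecting one spatial representation with plain pandas (a minimal sketch):

df_space = df[space_cols]   # variant with the single combined 'space' attribute
df_latlon = df[ll_cols]     # variant with separate 'lat' and 'lon' attributes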
Note: For a better standard, we recommend using the prepare_ds function from the dataset module for classification, as you can indicate the class column:
from matdata.dataset import prepare_ds
df = prepare_ds(df, class_col='root_type') # 'root_type' is then renamed 'label'
df
|  | date_time | time | rating | price | weather | day | type | lat | lon | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
1 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 40.834098 | -73.945267 | 126 | Food |
2 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
3 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 40.764696 | -73.885197 | 126 | Food |
4 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 40.766079 | -73.883529 | 126 | Travel & Transport |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
66957 | 2012-08-10 17:17:37 | 1037 | -1.0 | -1 | Clouds | Friday | General College & University | 40.704733 | -73.987738 | 29563 | College & University |
66958 | 2012-08-10 20:10:59 | 1210 | 8.0 | 2 | Clouds | Friday | Thai Restaurant | 40.695163 | -73.995448 | 29563 | Food |
66959 | 2012-08-11 08:01:20 | 481 | 6.9 | -1 | Clouds | Saturday | Gym | 40.697803 | -73.994145 | 29563 | Outdoors & Recreation |
66960 | 2012-08-11 13:39:39 | 819 | 7.0 | 1 | Clouds | Saturday | Coffee Shop | 40.694673 | -73.994082 | 29563 | Food |
66961 | 2012-08-12 07:56:26 | 476 | 6.9 | -1 | Clouds | Sunday | Gym | 40.697803 | -73.994145 | 29563 | Outdoors & Recreation |
66962 rows × 11 columns
b) Train and test split:
To hold-out split a dataset into train and test (70/30% by default):
train, test = trainTestSplit(df, random_num=1)
train.head()
0%| | 0/9 [00:00<?, ?it/s]
|  | date_time | time | rating | price | weather | day | type | lat | lon | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
1 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 40.834098 | -73.945267 | 126 | Food |
2 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
3 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 40.764696 | -73.885197 | 126 | Food |
4 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 40.766079 | -73.883529 | 126 | Travel & Transport |
If you want to save the split, indicate the output formats and the data path:
trainTestSplit(df, data_path=data_path, outformats=['csv', 'parquet'])
# Reading:
df = readDataset(data_path, file='train.parquet')
df.head()
0%| | 0/9 [00:00<?, ?it/s]
Writing - CSV |TRAIN -
Writing - CSV |TEST -
Writing - Parquet |TRAIN -
Writing - Parquet |TEST -
|  | date_time | time | rating | price | weather | day | type | lat | lon | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
1 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 40.834098 | -73.945267 | 126 | Food |
2 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
3 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 40.764696 | -73.885197 | 126 | Food |
4 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 40.766079 | -73.883529 | 126 | Travel & Transport |
To k-fold split a dataset into train and test:
train, test = kfold_trainTestSplit(df, k=3)
for k in range(len(train)):
print('Shape train/test:', train[k].shape, test[k].shape)
Spliting Data: 0%| | 0/10 [00:00<?, ?it/s]
Shape train/test: (174636, 11) (86969, 11)
Shape train/test: (175715, 11) (85890, 11)
Shape train/test: (172859, 11) (88746, 11)
c) Stratifying the data (example to get 50% of the dataset):
train, test = stratify(df, sample_size=0.5)
print('Shape train/test:', train.shape, test.shape)
train.head()
0%| | 0/9 [00:00<?, ?it/s]
0%| | 0/9 [00:00<?, ?it/s]
Shape train/test: (16432, 11) (7106, 11)
|  | date_time | time | rating | price | weather | day | type | lat | lon | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
1 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 40.834098 | -73.945267 | 126 | Food |
2 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
3 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 40.764696 | -73.885197 | 126 | Food |
4 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 40.766079 | -73.883529 | 126 | Travel & Transport |
k-labels stratifying the data (example keeping the trajectories of 5 randomly selected labels):
train, test = klabels_stratify(df, kl=5)
print('Shape train/test:', train.shape, test.shape)
print('Labels before:', df.label.unique())
print('Labels after:', train.label.unique())
0%| | 0/5 [00:00<?, ?it/s]
Shape train/test: (25824, 11) (11704, 11)
Labels before: ['Residence' 'Food' 'Travel & Transport' 'Professional & Other Places' 'Shop & Service' 'Outdoors & Recreation' 'College & University' 'Arts & Entertainment' 'Nightlife Spot' 'Event']
Labels after: ['Residence' 'Food' 'Travel & Transport' 'Professional & Other Places' 'Shop & Service']
d) Joining train and test files:
df = joinTrainTest(data_path, train_file="train.csv", test_file="test.csv", to_file=True) # Saves 'joined.csv' file
df.head()
Joining train and test data from... matdata/assets/sample
Saving joined dataset as: matdata/assets/sample/joined.csv
Done. --------------------------------------------------------------------------------
|  | date_time | time | rating | price | weather | day | type | lat | lon | tid | label |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
1 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 40.834098 | -73.945267 | 126 | Food |
2 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 40.833165 | -73.941860 | 126 | Residence |
3 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 40.764696 | -73.885197 | 126 | Food |
4 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 40.766079 | -73.883529 | 126 | Travel & Transport |
Note: We standardized all repository datasets by creating a data.parquet file with np.NaN as missing values, for example:
from matdata.preprocess import *
from matdata.dataset import prepare_ds
import numpy as np
data_path = 'matdata/assets/sample'
df.replace('?', np.NaN, inplace=True)
df = prepare_ds(df)
df2parquet(df, data_path, 'data')
Saving dataset as: matdata/assets/sample/data.parquet
Done. --------------------------------------------------------------------------------
|  | date_time | time | rating | price | weather | day | type | lat | lon | label | tid |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-11-12 05:17:18 | 317 | -1.0 | -1 | Clear | Monday | Home (private) | 40.833165 | -73.941860 | Residence | 126 |
1 | 2012-11-12 23:24:55 | 1404 | 8.2 | 1 | Clouds | Monday | Deli / Bodega | 40.834098 | -73.945267 | Food | 126 |
2 | 2012-11-13 00:00:07 | 0 | -1.0 | -1 | Clouds | Tuesday | Home (private) | 40.833165 | -73.941860 | Residence | 126 |
3 | 2012-11-15 17:49:01 | 1069 | 6.6 | 3 | Clear | Thursday | Fried Chicken Joint | 40.764696 | -73.885197 | Food | 126 |
4 | 2012-11-15 18:40:16 | 1120 | -1.0 | -1 | Clear | Thursday | Bus Station | 40.766079 | -73.883529 | Travel & Transport | 126 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20080 | 2012-08-10 17:17:37 | 1037 | -1.0 | -1 | Clouds | Friday | General College & University | 40.704733 | -73.987738 | College & University | 29563 |
20081 | 2012-08-10 20:10:59 | 1210 | 8.0 | 2 | Clouds | Friday | Thai Restaurant | 40.695163 | -73.995448 | Food | 29563 |
20082 | 2012-08-11 08:01:20 | 481 | 6.9 | -1 | Clouds | Saturday | Gym | 40.697803 | -73.994145 | Outdoors & Recreation | 29563 |
20083 | 2012-08-11 13:39:39 | 819 | 7.0 | 1 | Clouds | Saturday | Coffee Shop | 40.694673 | -73.994082 | Food | 29563 |
20084 | 2012-08-12 07:56:26 | 476 | 6.9 | -1 | Clouds | Sunday | Gym | 40.697803 | -73.994145 | Outdoors & Recreation | 29563 |
66962 rows × 11 columns
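By analogy with df2parquet above, the df2zip helper listed in the preprocess functions should save the same standardized frame in .zip format; a sketch, assuming it shares df2parquet's (df, path, name) argument order:

from matdata.preprocess import df2zip

df2zip(df, data_path, 'data')  # assumption: same argument order as df2parquet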
To use helpers for synthetic dataset generation, import from the package matdata.generator:
from matdata.generator import *
The generator module provides the following functions:
- scalerSamplerGenerator: generates trajectory datasets based on real data, on scale intervals
- samplerGenerator: generates a trajectory dataset based on real data
- scalerRandomGenerator: generates trajectory datasets based on random data, on scale intervals
- randomGenerator: generates a trajectory dataset based on random data

a) To generate a sample dataset (default config):
samplerGenerator()
|  | tid | space | time | day | rating | price | weather | root_type | type | label |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 40.7429692382286 -73.8827669620514 | 447 | Monday | 6.1 | -1 | Clouds | Shop & Service | Supermarket | C1 |
1 | 1 | 40.8610785519472 -73.9301037076315 | 486 | Monday | -1.0 | -1 | Clouds | Residence | Home (private) | C1 |
2 | 1 | 40.5774555812085 -73.9812469482422 | 653 | Monday | -1.0 | -1 | Clear | Travel & Transport | Metro Station | C1 |
3 | 1 | 40.8114381522955 -74.0677964687347 | 701 | Monday | 6.8 | -1 | Rain | Arts & Entertainment | Stadium | C1 |
4 | 1 | 40.7811844364976 -73.9732030930065 | 758 | Monday | 9.4 | -1 | Clear | Arts & Entertainment | Science Museum | C1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 10 | 40.7638885711179 -74.0233821808352 | 1385 | Saturday | -1.0 | -1 | Clouds | Travel & Transport | Border Crossing | C1 |
496 | 10 | 40.6891441345215 -73.9303207397461 | 379 | Sunday | -1.0 | 1 | Clouds | Food | Deli / Bodega | C1 |
497 | 10 | 40.7704703500000 -74.0281831400000 | 547 | Sunday | -1.0 | -1 | Clear | Professional & Other Places | Post Office | C1 |
498 | 10 | 40.7774773922087 -73.8146725755643 | 783 | Sunday | -1.0 | -1 | Clear | Shop & Service | Candy Store | C1 |
499 | 10 | 40.8331652006224 -73.9418603427692 | 1033 | Sunday | -1.0 | -1 | Clouds | Residence | Home (private) | C1 |
500 rows × 10 columns
To specify the synthetic dataset parameters:
N=10 # Number of trajectories
M=50 # Number of points by trajectory
C=3 # Number of classes (C1 to Cn)
samplerGenerator(N, M, C)
|  | tid | space | time | day | rating | price | weather | root_type | type | label |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 40.7429692382286 -73.8827669620514 | 447 | Monday | 6.1 | -1 | Clouds | Shop & Service | Supermarket | C1 |
1 | 1 | 40.8610785519472 -73.9301037076315 | 486 | Monday | -1.0 | -1 | Clouds | Residence | Home (private) | C1 |
2 | 1 | 40.5774555812085 -73.9812469482422 | 653 | Monday | -1.0 | -1 | Clear | Travel & Transport | Metro Station | C1 |
3 | 1 | 40.8114381522955 -74.0677964687347 | 701 | Monday | 6.8 | -1 | Rain | Arts & Entertainment | Stadium | C1 |
4 | 1 | 40.7811844364976 -73.9732030930065 | 758 | Monday | 9.4 | -1 | Clear | Arts & Entertainment | Science Museum | C1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 11 | 40.7638885711179 -74.0233821808352 | 1385 | Saturday | -1.0 | -1 | Clouds | Travel & Transport | Border Crossing | C3 |
496 | 11 | 40.6891441345215 -73.9303207397461 | 379 | Sunday | -1.0 | 1 | Clouds | Food | Deli / Bodega | C3 |
497 | 11 | 40.7704703500000 -74.0281831400000 | 547 | Sunday | -1.0 | -1 | Clear | Professional & Other Places | Post Office | C3 |
498 | 11 | 40.7774773922087 -73.8146725755643 | 783 | Sunday | -1.0 | -1 | Clear | Shop & Service | Candy Store | C3 |
499 | 11 | 40.8331652006224 -73.9418603427692 | 1033 | Sunday | -1.0 | -1 | Clouds | Residence | Home (private) | C3 |
500 rows × 10 columns
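The generators return an ordinary trajectory DataFrame, so the preprocess helpers apply directly; for instance, a hold-out split of a synthetic dataset (a sketch reusing the parameters above):

from matdata.preprocess import trainTestSplit

df = samplerGenerator(N, M, C)
train, test = trainTestSplit(df)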
b) To generate a set of sample datasets:
Creates and saves dataset files (including the Movelets JSON descriptor file). Generates sample datasets on an increasing log scale for each parameter, using the middle value for the other configurations.
data_path = 'matdata/assets/sample/samples'
Ns=[100, 3] # Min. number of trajectories: 100, 3 scales (by log increment)
Ms=[10, 3] # Min. number of points: 10, 3 scales (by log increment)
Ls=[8, 3] # Min. number of attributes: 8, 3 scales (by log increment)
Cs=[2, 3] # Min. number of labels: 2, 3 scales (by log increment)
scalerSamplerGenerator(Ns, Ms, Ls, Cs, save_to=data_path)
0%| | 0/12 [00:00<?, ?it/s]
N :: fix. value: 200 scale: [100, 200, 400]
M :: fix. value: 20 scale: [10, 20, 40]
L :: fix. value: 8 scale: [8, 16, 32]
C :: fix. value: 4 scale: [2, 4, 8]
Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV |
c) To generate a random dataset (default config):
randomGenerator()
|  | tid | a1_space | a2_time | a3_n1 | a4_n2 | a5_nominal | a6_day | a7_weather | a8_category | a9_space | a10_time | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 134.5 848.28 | 114 | 949 | 976.98 | PX | Tuesday | Clouds | Outdoors & Recreation | 101.04 778.52 | 1114 | C1 |
1 | 1 | 764.54 255.32 | 985 | -656 | 630.77 | HK | Saturday | Clouds | Shop & Service | 328.42 509.78 | 83 | C1 |
2 | 1 | 495.93 449.94 | 746 | 344 | 695.05 | JM | Saturday | Rain | Travel & Transport | 665.91 179.75 | 1073 | C1 |
3 | 1 | 652.24 789.51 | 1167 | -442 | 450.85 | ZO | Wednesday | Unknown | Nightlife Spot | 149.71 141.68 | 185 | C1 |
4 | 1 | 93.95 28.38 | 1135 | 327 | 523.90 | CU | Monday | Clouds | Professional & Other Places | 866.41 305.93 | 522 | C1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 10 | 647.96 300.5 | 1205 | 561 | 819.12 | ACR | Saturday | Rain | Arts & Entertainment | 515.96 955.66 | 634 | C10 |
496 | 10 | 779.26 933.61 | 485 | 121 | 745.32 | AIZ | Tuesday | Rain | Outdoors & Recreation | 377.83 263.13 | 435 | C10 |
497 | 10 | 398.3 545.19 | 406 | -248 | 180.89 | YH | Monday | Fog | Event | 458.27 665.46 | 874 | C10 |
498 | 10 | 133.52 605.02 | 62 | -485 | 890.42 | DI | Thursday | Clouds | Residence | 263.93 944.8 | 1141 | C10 |
499 | 10 | 387.49 106.12 | 435 | 837 | 398.13 | ACM | Tuesday | Rain | Residence | 916.46 785.13 | 1198 | C10 |
500 rows × 12 columns
To specify the synthetic random dataset parameters:
N=10 # Number of trajectories
M=50 # Number of points by trajectory
L=10 # Number of attributes
C=3 # Number of classes (C1 to Cn)
randomGenerator(N, M, L, C)
|  | tid | a1_space | a2_time | a3_n1 | a4_n2 | a5_nominal | a6_day | a7_weather | a8_category | a9_space | a10_time | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 134.5 848.28 | 114 | 949 | 976.98 | PX | Tuesday | Clouds | Outdoors & Recreation | 101.04 778.52 | 1114 | C1 |
1 | 1 | 764.54 255.32 | 985 | -656 | 630.77 | HK | Saturday | Clouds | Shop & Service | 328.42 509.78 | 83 | C1 |
2 | 1 | 495.93 449.94 | 746 | 344 | 695.05 | JM | Saturday | Rain | Travel & Transport | 665.91 179.75 | 1073 | C1 |
3 | 1 | 652.24 789.51 | 1167 | -442 | 450.85 | ZO | Wednesday | Unknown | Nightlife Spot | 149.71 141.68 | 185 | C1 |
4 | 1 | 93.95 28.38 | 1135 | 327 | 523.90 | CU | Monday | Clouds | Professional & Other Places | 866.41 305.93 | 522 | C1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 11 | 647.96 300.5 | 1205 | 561 | 819.12 | ACR | Saturday | Rain | Arts & Entertainment | 515.96 955.66 | 634 | C3 |
496 | 11 | 779.26 933.61 | 485 | 121 | 745.32 | AIZ | Tuesday | Rain | Outdoors & Recreation | 377.83 263.13 | 435 | C3 |
497 | 11 | 398.3 545.19 | 406 | -248 | 180.89 | YH | Monday | Fog | Event | 458.27 665.46 | 874 | C3 |
498 | 11 | 133.52 605.02 | 62 | -485 | 890.42 | DI | Thursday | Clouds | Residence | 263.93 944.8 | 1141 | C3 |
499 | 11 | 387.49 106.12 | 435 | 837 | 398.13 | ACM | Tuesday | Rain | Residence | 916.46 785.13 | 1198 | C3 |
500 rows × 12 columns
d) To generate a set of random datasets:
Creates and saves dataset files (including the Movelets JSON descriptor file). Generates random datasets on an increasing log scale for each parameter, using the middle value for the other configurations.
data_path = 'matdata/assets/sample/random'
Ns=[100, 3] # Min. number of trajectories: 100, 3 scales (by log increment)
Ms=[10, 3] # Min. number of points: 10, 3 scales (by log increment)
Ls=[8, 3] # Min. number of attributes: 8, 3 scales (by log increment)
Cs=[2, 3] # Min. number of labels: 2, 3 scales (by log increment)
scalerRandomGenerator(Ns, Ms, Ls, Cs, save_to=data_path)
0%| | 0/12 [00:00<?, ?it/s]
N :: fix. value: 200 scale: [100, 200, 400]
M :: fix. value: 20 scale: [10, 20, 40]
L :: fix. value: 8 scale: [8, 16, 32]
C :: fix. value: 4 scale: [2, 4, 8]
Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV | Writing - CSV |
# By Tarlis Portela (2023)