Replication Materials for the Torch-Choice Paper
Author: Tianyu Du
Email: tianyudu@stanford.edu
This repository contains the replication materials for the paper "Torch-Choice: A Library for Choice Models in PyTorch". Due to limited space, some code and output were omitted from the main paper; this repository contains the full versions of the code examples mentioned there.
import warnings
warnings.filterwarnings("ignore")
from time import time
import numpy as np
import pandas as pd
import torch
import torch_choice
from torch_choice import run
from tqdm import tqdm
from torch_choice.data import ChoiceDataset, JointDataset, utils, load_mode_canada_dataset, load_house_cooling_dataset_v1
from torch_choice.model import ConditionalLogitModel, NestedLogitModel
torch_choice.__version__
'1.0.3'
Data Structure
car_choice = pd.read_csv("https://raw.githubusercontent.com/gsbDBI/torch-choice/main/tutorials/public_datasets/car_choice.csv")
car_choice.head()
|   | record_id | session_id | consumer_id | car | purchase | gender | income | speed | discount | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | American | 1 | 1 | 46.699997 | 10 | 0.94 | 90 |
| 1 | 1 | 1 | 1 | Japanese | 0 | 1 | 46.699997 | 8 | 0.94 | 110 |
| 2 | 1 | 1 | 1 | European | 0 | 1 | 46.699997 | 7 | 0.94 | 50 |
| 3 | 1 | 1 | 1 | Korean | 0 | 1 | 46.699997 | 8 | 0.94 | 10 |
| 4 | 2 | 2 | 2 | American | 1 | 1 | 26.100000 | 10 | 0.95 | 100 |
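As the preview shows, the main data frame is in the long format: each purchase record spans one row per available item, with `purchase` flagging the chosen one. A minimal sketch of this layout, using a hypothetical two-record toy frame:

```python
import pandas as pd

# Toy frame in the same long format as car_choice (hypothetical values):
# one row per (purchase record, available item), `purchase` == 1 on the chosen row.
toy = pd.DataFrame({
    'record_id': [1, 1, 1, 2, 2],
    'car':       ['American', 'Japanese', 'European', 'American', 'Japanese'],
    'purchase':  [1, 0, 0, 0, 1],
})

# Each record contributes exactly one chosen item ...
chosen = toy[toy['purchase'] == 1].set_index('record_id')['car']
print(chosen.to_dict())  # {1: 'American', 2: 'Japanese'}

# ... and records may have choice sets of different sizes (here 3 and 2),
# exactly the situation EasyDatasetWrapper reports for car_choice.
set_sizes = toy.groupby('record_id')['car'].count()
print(set_sizes.to_dict())  # {1: 3, 2: 2}
```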
Adding Observables, Method 1: Observables Derived from Columns of the Main Dataset
from torch_choice.utils.easy_data_wrapper import EasyDatasetWrapper
data_wrapper_from_columns = EasyDatasetWrapper(
main_data=car_choice,
purchase_record_column='record_id',
choice_column='purchase',
item_name_column='car',
user_index_column='consumer_id',
session_index_column='session_id',
user_observable_columns=['gender', 'income'],
item_observable_columns=['speed'],
session_observable_columns=['discount'],
itemsession_observable_columns=['price'])
data_wrapper_from_columns.summary()
dataset = data_wrapper_from_columns.choice_dataset
# ChoiceDataset(label=[], item_index=[885], provided_num_items=[], user_index=[885], session_index=[885], item_availability=[885, 4], item_speed=[4, 1], user_gender=[885, 1], user_income=[885, 1], session_discount=[885, 1], itemsession_price=[885, 4, 1], device=cpu)
Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.
* purchase record index range: [1 2 3] ... [883 884 885]
* Space of 4 items:
0 1 2 3
item name American European Japanese Korean
* Number of purchase records/cases: 885.
* Preview of main data frame:
record_id session_id consumer_id car purchase gender \
0 1 1 1 American 1 1
1 1 1 1 Japanese 0 1
2 1 1 1 European 0 1
3 1 1 1 Korean 0 1
4 2 2 2 American 1 1
... ... ... ... ... ... ...
3155 884 884 884 Japanese 1 1
3156 884 884 884 European 0 1
3157 885 885 885 American 1 1
3158 885 885 885 Japanese 0 1
3159 885 885 885 European 0 1
income speed discount price
0 46.699997 10 0.94 90
1 46.699997 8 0.94 110
2 46.699997 7 0.94 50
3 46.699997 8 0.94 10
4 26.100000 10 0.95 100
... ... ... ... ...
3155 20.900000 8 0.89 100
3156 20.900000 7 0.89 40
3157 30.600000 10 0.81 100
3158 30.600000 8 0.81 50
3159 30.600000 7 0.81 40
[3160 rows x 10 columns]
* Preview of ChoiceDataset:
ChoiceDataset(label=[], item_index=[885], user_index=[885], session_index=[885], item_availability=[885, 4], item_speed=[4, 1], user_gender=[885, 1], user_income=[885, 1], session_discount=[885, 1], itemsession_price=[885, 4, 1], device=cpu)
Adding Observables, Method 2: Supplied as Separate DataFrames
# create dataframes for gender and income. The dataframe for a user-specific observable needs a `consumer_id` column.
gender = car_choice.groupby('consumer_id')['gender'].first().reset_index()
income = car_choice.groupby('consumer_id')['income'].first().reset_index()
# alternatively, put gender and income in the same dataframe.
gender_and_income = car_choice.groupby('consumer_id')[['gender', 'income']].first().reset_index()
# speed as an item observable; the dataframe requires a `car` column.
speed = car_choice.groupby('car')['speed'].first().reset_index()
# discount as a session observable; the dataframe requires a `session_id` column.
discount = car_choice.groupby('session_id')['discount'].first().reset_index()
# price as an itemsession observable; the dataframe requires both `car` and `session_id` columns.
price = car_choice[['car', 'session_id', 'price']]
# fill in NaNs for (session, item) pairs where the item was not available in that session.
price = price.pivot(index='car', columns='session_id', values='price').melt(ignore_index=False).reset_index()
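The pivot-then-melt step completes the (car, session) grid, so any pair missing from the raw data surfaces as an explicit NaN row. A self-contained sketch with a hypothetical three-row price frame:

```python
import pandas as pd

# Toy long-format prices (hypothetical): session 2 lacks a price for car 'B'.
df = pd.DataFrame({
    'car':        ['A', 'B', 'A'],
    'session_id': [1, 1, 2],
    'price':      [10.0, 20.0, 11.0],
})

# Pivot to a (car x session) grid, then melt back to long format;
# the missing ('B', 2) pair is now an explicit NaN row.
full = df.pivot(index='car', columns='session_id', values='price').melt(ignore_index=False).reset_index()
print(full)
#   car  session_id  value
# 0   A           1   10.0
# 1   B           1   20.0
# 2   A           2   11.0
# 3   B           2    NaN
```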
data_wrapper_from_dataframes = EasyDatasetWrapper(
main_data=car_choice,
purchase_record_column='record_id',
choice_column='purchase',
item_name_column='car',
user_index_column='consumer_id',
session_index_column='session_id',
user_observable_data={'gender': gender, 'income': income},
# alternatively, supply gender and income as a single dataframe.
# user_observable_data={'gender_and_income': gender_and_income},
item_observable_data={'speed': speed},
session_observable_data={'discount': discount},
itemsession_observable_data={'price': price})
# the second method creates exactly the same ChoiceDataset as the previous method.
assert data_wrapper_from_dataframes.choice_dataset == data_wrapper_from_columns.choice_dataset
Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.
data_wrapper_mixed = EasyDatasetWrapper(
main_data=car_choice,
purchase_record_column='record_id',
choice_column='purchase',
item_name_column='car',
user_index_column='consumer_id',
session_index_column='session_id',
user_observable_data={'gender': gender, 'income': income},
item_observable_data={'speed': speed},
session_observable_data={'discount': discount},
itemsession_observable_columns=['price'])
# these methods create exactly the same choice dataset.
assert data_wrapper_mixed.choice_dataset == data_wrapper_from_columns.choice_dataset == data_wrapper_from_dataframes.choice_dataset
Creating choice dataset from stata format data-frames...
Note: choice sets of different sizes found in different purchase records: {'size 4': 'occurrence 505', 'size 3': 'occurrence 380'}
Finished Creating Choice Dataset.
Constructing a Choice Dataset, Method 2: Building from Tensors
N = 10_000
num_users = 10
num_items = 4
num_sessions = 500
user_obs = torch.randn(num_users, 128)
item_obs = torch.randn(num_items, 64)
useritem_obs = torch.randn(num_users, num_items, 32)
session_obs = torch.randn(num_sessions, 10)
itemsession_obs = torch.randn(num_sessions, num_items, 12)
usersessionitem_obs = torch.randn(num_users, num_sessions, num_items, 8)
item_index = torch.LongTensor(np.random.choice(num_items, size=N))
user_index = torch.LongTensor(np.random.choice(num_users, size=N))
session_index = torch.LongTensor(np.random.choice(num_sessions, size=N))
item_availability = torch.ones(num_sessions, num_items).bool()
dataset = ChoiceDataset(
# required:
item_index=item_index,
# optional:
user_index=user_index, session_index=session_index, item_availability=item_availability,
# observable tensors are supplied as keyword arguments with special prefixes.
user_obs=user_obs, item_obs=item_obs, useritem_obs=useritem_obs, session_obs=session_obs, itemsession_obs=itemsession_obs, usersessionitem_obs=usersessionitem_obs)
ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], useritem_obs=[10, 4, 32], session_obs=[500, 10], itemsession_obs=[500, 4, 12], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
Functionalities of the Choice Dataset
print(f'{dataset.num_users=:}')
# dataset.num_users=10
print(f'{dataset.num_items=:}')
# dataset.num_items=4
print(f'{dataset.num_sessions=:}')
# dataset.num_sessions=500
print(f'{len(dataset)=:}')
# len(dataset)=10000
dataset.num_users=10
dataset.num_items=4
dataset.num_sessions=500
len(dataset)=10000
# clone
print(dataset.item_index[:10])
# tensor([2, 2, 3, 1, 3, 2, 2, 1, 0, 1])
dataset_cloned = dataset.clone()
# modify the cloned dataset.
dataset_cloned.item_index = 99 * torch.ones(len(dataset))  # one entry per purchase record.
print(dataset_cloned.item_index[:10])
# the cloned dataset is changed.
# tensor([99., 99., 99., 99., 99., 99., 99., 99., 99., 99.])
print(dataset.item_index[:10])
# the original dataset does not change.
# tensor([2, 2, 3, 1, 3, 2, 2, 1, 0, 1])
tensor([0, 1, 3, 1, 2, 0, 3, 2, 3, 1])
tensor([99., 99., 99., 99., 99., 99., 99., 99., 99., 99.])
tensor([0, 1, 3, 1, 2, 0, 3, 2, 3, 1])
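The independence shown above comes from tensor cloning: `clone()` copies the underlying storage, while plain assignment merely aliases it. A minimal standalone sketch:

```python
import torch

a = torch.tensor([1, 2, 3])
alias = a          # same storage: in-place writes are visible through both names.
copy = a.clone()   # fresh storage: in-place writes do not propagate back.

copy += 10
print(a)     # tensor([1, 2, 3])  -- unchanged by modifying the clone
alias += 100
print(a)     # tensor([101, 102, 103])  -- changed through the alias
```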
# move to device
print(f'{dataset.device=:}')
# dataset.device=cpu
print(f'{dataset.device=:}')
# dataset.device=cpu
print(f'{dataset.user_index.device=:}')
# dataset.user_index.device=cpu
print(f'{dataset.session_index.device=:}')
# dataset.session_index.device=cpu
if torch.cuda.is_available():
    # please note that this can only be demonstrated when a CUDA device is available.
    dataset = dataset.to('cuda')
    print(f'{dataset.device=:}')
    # dataset.device=cuda:0
    print(f'{dataset.item_index.device=:}')
    # dataset.item_index.device=cuda:0
    print(f'{dataset.user_index.device=:}')
    # dataset.user_index.device=cuda:0
    print(f'{dataset.session_index.device=:}')
    # dataset.session_index.device=cuda:0
dataset._check_device_consistency()
dataset.device=cpu
dataset.device=cpu
dataset.user_index.device=cpu
dataset.session_index.device=cpu
def print_dict_shape(d):
    for key, val in d.items():
        if torch.is_tensor(val):
            print(f'dict.{key}.shape={val.shape}')
print_dict_shape(dataset.x_dict)
dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.useritem_obs.shape=torch.Size([10000, 4, 32])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.itemsession_obs.shape=torch.Size([10000, 4, 12])
dict.usersessionitem_obs.shape=torch.Size([10000, 4, 8])
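All shapes above follow one expansion rule: each observable is expanded to `(len(dataset), num_items, dim)` by indexing with the matching index tensor and broadcasting over the remaining axis. The sketch below reproduces the user-level case with standalone tensors; the index-and-expand logic is an illustration of the rule, not torch-choice's internal implementation:

```python
import torch

num_users, num_items, N, dim = 10, 4, 10_000, 128
user_obs = torch.randn(num_users, dim)
user_index = torch.randint(num_users, (N,))

# Expand the per-user observable to one row per record, broadcast over items:
# (num_users, dim) -> (N, dim) -> (N, num_items, dim).
expanded = user_obs[user_index].unsqueeze(1).expand(-1, num_items, -1)
print(expanded.shape)  # torch.Size([10000, 4, 128])

# Record i carries the observable of the user who made choice i, for every item.
i = 123
assert torch.equal(expanded[i, 0], user_obs[user_index[i]])
```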
# __getitem__ to get batch.
# pick 5 random records as the mini-batch.
dataset = dataset.to('cpu')
indices = torch.LongTensor(np.random.choice(len(dataset), size=5, replace=False))
print(indices)
# tensor([1118, 976, 1956, 290, 8283])
subset = dataset[indices]
print(dataset)
# ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
print(subset)
# ChoiceDataset(label=[], item_index=[5], user_index=[5], session_index=[5], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], session_obs=[500, 10], price_obs=[500, 4, 12], device=cpu)
tensor([6419, 3349, 6741, 3078, 6424])
ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], useritem_obs=[10, 4, 32], session_obs=[500, 10], itemsession_obs=[500, 4, 12], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
ChoiceDataset(label=[], item_index=[5], user_index=[5], session_index=[5], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], useritem_obs=[10, 4, 32], session_obs=[500, 10], itemsession_obs=[500, 4, 12], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
print(subset.item_index)
# tensor([0, 1, 0, 0, 0])
print(dataset.item_index[indices])
# tensor([0, 1, 0, 0, 0])
subset.item_index += 1 # modifying the batch does not change the original dataset.
print(subset.item_index)
# tensor([1, 2, 1, 1, 1])
print(dataset.item_index[indices])
# tensor([0, 1, 0, 0, 0])
tensor([2, 1, 1, 0, 0])
tensor([2, 1, 1, 0, 0])
tensor([3, 2, 2, 1, 1])
tensor([2, 1, 1, 0, 0])
print(subset.item_obs[0, 0])
# tensor(-1.5811)
print(dataset.item_obs[0, 0])
# tensor(-1.5811)
subset.item_obs += 1
print(subset.item_obs[0, 0])
# tensor(-0.5811)
print(dataset.item_obs[0, 0])
# tensor(-1.5811)
tensor(0.1007)
tensor(0.1007)
tensor(1.1007)
tensor(0.1007)
print(id(subset.item_index))
# 140339656298640
print(id(dataset.item_index[indices]))
# 140339656150528
# these two are different objects in memory.
11458049504
11458562704
Chaining Multiple Datasets with JointDataset
item_level_dataset = dataset.clone()
nest_level_dataset = dataset.clone()
joint_dataset = JointDataset(
item=item_level_dataset,
nest=nest_level_dataset)
print(joint_dataset)
JointDataset with 2 sub-datasets: (
item: ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], useritem_obs=[10, 4, 32], session_obs=[500, 10], itemsession_obs=[500, 4, 12], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
nest: ChoiceDataset(label=[], item_index=[10000], user_index=[10000], session_index=[10000], item_availability=[500, 4], user_obs=[10, 128], item_obs=[4, 64], useritem_obs=[10, 4, 32], session_obs=[500, 10], itemsession_obs=[500, 4, 12], usersessionitem_obs=[10, 500, 4, 8], device=cpu)
)
from torch.utils.data.sampler import BatchSampler, SequentialSampler, RandomSampler
shuffle = False  # for demonstration purposes.
batch_size = 32
# Create sampler.
sampler = BatchSampler(
RandomSampler(dataset) if shuffle else SequentialSampler(dataset),
batch_size=batch_size,
drop_last=False)
dataloader = torch.utils.data.DataLoader(dataset,
sampler=sampler,
collate_fn=lambda x: x[0],
pin_memory=(dataset.device == 'cpu'))
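Because the `BatchSampler` is passed as `sampler=` rather than `batch_sampler=`, each "index" the `DataLoader` draws is itself a list of record indices, which `collate_fn=lambda x: x[0]` then unwraps from its singleton list. The index stream alone can be sketched as:

```python
from torch.utils.data.sampler import BatchSampler, SequentialSampler

# A BatchSampler over 10 records with batch_size=4 yields lists of indices;
# the last, shorter batch is kept because drop_last=False.
sampler = BatchSampler(SequentialSampler(range(10)), batch_size=4, drop_last=False)
batches = list(sampler)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```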
print(f'{item_obs.shape=:}')
# item_obs.shape=torch.Size([4, 64])
item_obs_all = item_obs.view(1, num_items, -1).expand(len(dataset), -1, -1)
item_obs_all = item_obs_all.to(dataset.device)
item_index_all = item_index.to(dataset.device)
print(f'{item_obs_all.shape=:}')
# item_obs_all.shape=torch.Size([10000, 4, 64])
item_obs.shape=torch.Size([4, 64])
item_obs_all.shape=torch.Size([10000, 4, 64])
for i, batch in enumerate(dataloader):
    first, last = i * batch_size, min(len(dataset), (i + 1) * batch_size)
    idx = torch.arange(first, last)
    assert torch.all(item_obs_all[idx, :, :] == batch.x_dict['item_obs'])
    assert torch.all(item_index_all[idx] == batch.item_index)
print_dict_shape(dataset.x_dict)
# dict.user_obs.shape=torch.Size([10000, 4, 128])
# dict.item_obs.shape=torch.Size([10000, 4, 64])
# dict.session_obs.shape=torch.Size([10000, 4, 10])
# dict.price_obs.shape=torch.Size([10000, 4, 12])
dict.user_obs.shape=torch.Size([10000, 4, 128])
dict.item_obs.shape=torch.Size([10000, 4, 64])
dict.useritem_obs.shape=torch.Size([10000, 4, 32])
dict.session_obs.shape=torch.Size([10000, 4, 10])
dict.itemsession_obs.shape=torch.Size([10000, 4, 12])
dict.usersessionitem_obs.shape=torch.Size([10000, 4, 8])
Conditional Logit Model
dataset = load_mode_canada_dataset()
print(dataset)
No `session_index` is provided, assume each choice instance is in its own session.
ChoiceDataset(label=[], item_index=[2779], user_index=[], session_index=[2779], item_availability=[], itemsession_cost_freq_ovt=[2779, 4, 3], session_income=[2779, 1], itemsession_ivt=[2779, 4, 1], device=cpu)
# Method 1: specify the model with an R-style formula string.
model = ConditionalLogitModel(
    formula='(itemsession_cost_freq_ovt|constant) + (session_income|item) + (itemsession_ivt|item-full) + (intercept|item)',
    dataset=dataset,
    num_items=4)
# Method 2: specify the same model with dictionaries of coefficient variations and dimensions.
model = ConditionalLogitModel(
    coef_variation_dict={'itemsession_cost_freq_ovt': 'constant',
                         'session_income': 'item',
                         'itemsession_ivt': 'item-full',
                         'intercept': 'item'},
    num_param_dict={'itemsession_cost_freq_ovt': 3,
                    'session_income': 1,
                    'itemsession_ivt': 1,
                    'intercept': 1},
    num_items=4)
# The same model with L1 regularization added on the coefficients.
model = ConditionalLogitModel(
    coef_variation_dict={'itemsession_cost_freq_ovt': 'constant',
                         'session_income': 'item',
                         'itemsession_ivt': 'item-full',
                         'intercept': 'item'},
    num_param_dict={'itemsession_cost_freq_ovt': 3,
                    'session_income': 1,
                    'itemsession_ivt': 1,
                    'intercept': 1},
    num_items=4,
    regularization="L1", regularization_weight=0.5)
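With `regularization="L1"`, the training objective adds a penalty on top of the negative log-likelihood. Assuming the penalty is `regularization_weight` times the L1 norm of all trainable coefficients, it can be sketched with hypothetical stand-in tensors as:

```python
import torch

# Hypothetical coefficient tensors standing in for the model's parameters.
coefs = [torch.tensor([-0.04, 0.09, -0.04]), torch.tensor([1.3, 2.8])]
regularization_weight = 0.5

# L1 penalty: weight * sum of absolute values over all trainable parameters.
l1_penalty = regularization_weight * sum(c.abs().sum() for c in coefs)
print(l1_penalty)  # tensor(2.1350)
```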
from torch_choice import run
run(model, dataset, batch_size=-1, learning_rate=0.01, num_epochs=1000, model_optimizer="LBFGS")
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
==================== model received ====================
ConditionalLogitModel(
(coef_dict): ModuleDict(
(itemsession_cost_freq_ovt[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=3, 3 trainable parameters in total, device=cpu).
(session_income[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, device=cpu).
(itemsession_ivt[item-full]): Coefficient(variation=item-full, num_items=4, num_users=None, num_params=1, 4 trainable parameters in total, device=cpu).
(intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, device=cpu).
)
)
Conditional logistic discrete choice model, expects input features:
X[itemsession_cost_freq_ovt[constant]] with 3 parameters, with constant level variation.
X[session_income[item]] with 1 parameters, with item level variation.
X[itemsession_ivt[item-full]] with 1 parameters, with item-full level variation.
X[intercept[item]] with 1 parameters, with item level variation.
device=cpu
==================== data set received ====================
[Train dataset] ChoiceDataset(label=[], item_index=[2779], user_index=[], session_index=[2779], item_availability=[], itemsession_cost_freq_ovt=[2779, 4, 3], session_income=[2779, 1], itemsession_ivt=[2779, 4, 1], device=cpu)
[Validation dataset] None
[Test dataset] None
| Name | Type | Params
------------------------------------------------
0 | model | ConditionalLogitModel | 13
------------------------------------------------
13 Trainable params
0 Non-trainable params
13 Total params
0.000 Total estimated model params size (MB)
Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 107.14it/s, loss=1.88e+03, v_num=45]
`Trainer.fit` stopped: `max_epochs=1000` reached.
Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 98.73it/s, loss=1.88e+03, v_num=45]
Time taken for training: 18.987757921218872
Skip testing, no test dataset is provided.
==================== model results ====================
Log-likelihood: [Training] -1874.63818359375, [Validation] N/A, [Test] N/A
| Coefficient | Estimation | Std. Err. | z-value | Pr(>|z|) | Significance |
|:--------------------------------------|-------------:|------------:|-------------:|------------:|:---------------|
| itemsession_cost_freq_ovt[constant]_0 | -0.0372949 | 0.00709483 | -5.25663 | 1.46723e-07 | *** |
| itemsession_cost_freq_ovt[constant]_1 | 0.0934485 | 0.00509605 | 18.3374 | 0 | *** |
| itemsession_cost_freq_ovt[constant]_2 | -0.0427757 | 0.00322198 | -13.2762 | 0 | *** |
| session_income[item]_0 | -0.0862389 | 0.0183019 | -4.71202 | 2.4527e-06 | *** |
| session_income[item]_1 | -0.0269126 | 0.00384874 | -6.99258 | 2.69873e-12 | *** |
| session_income[item]_2 | -0.0370584 | 0.00406312 | -9.12069 | 0 | *** |
| itemsession_ivt[item-full]_0 | 0.0593796 | 0.0100867 | 5.88689 | 3.93536e-09 | *** |
| itemsession_ivt[item-full]_1 | -0.00634707 | 0.0042809 | -1.48265 | 0.138168 | |
| itemsession_ivt[item-full]_2 | -0.00583223 | 0.00189433 | -3.07879 | 0.00207844 | ** |
| itemsession_ivt[item-full]_3 | -0.00137813 | 0.00118697 | -1.16105 | 0.245622 | |
| intercept[item]_0 | -9.98532e-09 | 1.26823 | -7.8734e-09 | 1 | |
| intercept[item]_1 | 1.32592 | 0.703708 | 1.88419 | 0.0595399 | |
| intercept[item]_2 | 2.8192 | 0.618182 | 4.56047 | 5.10383e-06 | *** |
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
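The z-values and p-values in the table follow the usual Wald construction: z = estimate / std. err., with a two-sided p-value of 2(1 − Φ(|z|)). A quick standard-library check against the first row of the table:

```python
import math

def wald_test(estimate, std_err):
    """z-statistic and two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    z = estimate / std_err
    # 2 * (1 - Phi(|z|)), written via erfc for accuracy in the far tail.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# First row above: itemsession_cost_freq_ovt[constant]_0.
z, p = wald_test(-0.0372949, 0.00709483)
print(round(z, 5))   # -5.25663, matching the table
print(f'{p:.3e}')    # ~1.467e-07, matching the table
```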
ConditionalLogitModel(
(coef_dict): ModuleDict(
(itemsession_cost_freq_ovt[constant]): Coefficient(variation=constant, num_items=4, num_users=None, num_params=3, 3 trainable parameters in total, device=cpu).
(session_income[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, device=cpu).
(itemsession_ivt[item-full]): Coefficient(variation=item-full, num_items=4, num_users=None, num_params=1, 4 trainable parameters in total, device=cpu).
(intercept[item]): Coefficient(variation=item, num_items=4, num_users=None, num_params=1, 3 trainable parameters in total, device=cpu).
)
)
Conditional logistic discrete choice model, expects input features:
X[itemsession_cost_freq_ovt[constant]] with 3 parameters, with constant level variation.
X[session_income[item]] with 1 parameters, with item level variation.
X[itemsession_ivt[item-full]] with 1 parameters, with item-full level variation.
X[intercept[item]] with 1 parameters, with item level variation.
device=cpu
Nested Logit Model
The code demo for nested logit models in the paper was abridged; please refer to the nested logit model tutorial in the torch-choice documentation for executable code.