Models¶
These are the models available for use in the model evaluation, training, and household training link tasks.
Attributes for all models:
threshold – Type: float. Alpha threshold (model hyperparameter).

threshold_ratio – Type: float. Beta threshold (de-duplication distance ratio).

Any parameters available in the model as defined in the Spark documentation can be passed as params using the label given in the Spark docs. Commonly used parameters are listed below with descriptive explanations from the Spark docs.
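For example, to pass through a Spark parameter that is not listed on this page, add it under the label used in the Spark docs. The following is a minimal sketch, assuming a random forest model and the subsamplingRate parameter from the Spark RandomForestClassifier documentation; the values shown are illustrative only:

model_parameters = {
type = "random_forest",
subsamplingRate = 0.8,
threshold = 0.5,
threshold_ratio = 1.0
}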
random_forest¶
Uses pyspark.ml.classification.RandomForestClassifier. Returns probability as an array.
Parameters:
maxDepth – Type: int. Maximum depth of the tree. Spark default value is 5.

numTrees – Type: int. The number of trees to train. Spark default value is 20, must be >= 1.

featureSubsetStrategy – Type: string. Per the Spark docs: “The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].”
model_parameters = {
type = "random_forest",
maxDepth = 5,
numTrees = 75,
featureSubsetStrategy = "sqrt",
threshold = 0.15,
threshold_ratio = 1.0
}
probit¶
Uses pyspark.ml.regression.GeneralizedLinearRegression with family="binomial" and link="probit".
model_parameters = {
type = "probit",
threshold = 0.85,
threshold_ratio = 1.2
}
logistic_regression¶
Uses pyspark.ml.classification.LogisticRegression.
chosen_model = {
type = "logistic_regression",
threshold = 0.5,
threshold_ratio = 1.0
}
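LogisticRegression parameters from the Spark docs can also be passed through in the same way. As a sketch, the configuration above could include regParam, the Spark regularization parameter; the value shown is illustrative only:

chosen_model = {
type = "logistic_regression",
regParam = 0.01,
threshold = 0.5,
threshold_ratio = 1.0
}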
decision_tree¶
Uses pyspark.ml.classification.DecisionTreeClassifier.
Parameters:
maxDepth – Type: int. Maximum depth of the tree.

minInstancesPerNode – Type: int. Per the Spark docs: “Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.”

maxBins – Type: int. Per the Spark docs: “Max number of bins for discretizing continuous features. Must be >= 2 and >= number of categories for any categorical feature.”
chosen_model = {
type = "decision_tree",
maxDepth = 6,
minInstancesPerNode = 2,
maxBins = 4
}
gradient_boosted_trees¶
Uses pyspark.ml.classification.GBTClassifier.
Parameters:
maxDepth – Type: int. Maximum depth of the tree.

minInstancesPerNode – Type: int. Per the Spark docs: “Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.”

maxBins – Type: int. Per the Spark docs: “Max number of bins for discretizing continuous features. Must be >= 2 and >= number of categories for any categorical feature.”
chosen_model = {
type = "gradient_boosted_trees",
maxDepth = 4,
minInstancesPerNode = 1,
maxBins = 6,
threshold = 0.7,
threshold_ratio = 1.3
}