Report generated on {{ time_signature }}
Below is an overview of the dataset used in this evaluation:
Note: The feature summary is only for the first 10 features.
{% endif %}Below is a list of the preprocessing steps applied to the dataset before any model was trained. The steps are grouped by the task they perform. Many steps can be configured through parameters passed to Mamut Classifier; others are selected dynamically from the dataset's characteristics (e.g. PowerTransformer for skewed features).
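As an illustration of how a step can be chosen from dataset characteristics, the sketch below flags features whose skewness exceeds a cutoff. It is only a sketch: the function names and the threshold of 1.0 are assumptions made for this example and are not taken from Mamut Classifier.

```python
def sample_skewness(values):
    """Return the moment-based skewness coefficient g1 of a numeric sequence."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant feature: treat as not skewed
    return m3 / m2 ** 1.5


def skewed_features(columns, threshold=1.0):
    """Return the names of features whose absolute skewness exceeds the threshold.

    The threshold is a hypothetical default chosen for this example.
    """
    return [name for name, values in columns.items()
            if abs(sample_skewness(values)) > threshold]


columns = {
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 1.0, 100.0],
}
flagged = skewed_features(columns)
print(flagged)
```

Features flagged this way would be natural candidates for a power transform, while roughly symmetric features could be left to a plain scaler.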
Pipeline:
Below you will see the results of the Principal Component Analysis (PCA) performed on the dataset. PCA is a dimensionality reduction technique that re-expresses the features as a smaller number of uncorrelated components. The results below show the contribution of each feature to the principal components.
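To make "contribution of each feature" concrete, the sketch below computes the first principal component of a tiny dataset with power iteration on the covariance matrix. The entries of the returned unit vector are the per-feature weights (often called loadings). This is a pure-Python illustration under simplifying assumptions, not the routine used to produce this report.

```python
import math


def first_component(rows, iterations=200):
    """Approximate the first principal component of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Assumes the all-ones start vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance at all: nothing to decompose
        v = [x / norm for x in w]
    return v


# Features 0 and 1 move together; feature 2 is constant.
rows = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0]]
loadings = first_component(rows)
print(loadings)
```

Here the two correlated features share the component equally and the constant feature contributes nothing, which is the pattern a loadings table makes visible.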
Below you will see the results of the model evaluation. The list of models tested during this training session includes:
Each model's hyperparameters were tuned with Optuna.
The optimizer used for tuning was: {{ optimizer }}. Optimization was performed with
respect to the metric: {{ metric }} for {{ n_trials }} iterations. To inspect any model's
hyperparameters, retrieve the model from the mamut.raw_fitted_models_ field and call .get_params() on it.
After tuning, models were evaluated on the {{ evaluation_dataset }} set.
{% if evaluation_dataset == 'holdout' %}
This final holdout set was not used for model or ensemble selection; the selected model remains
{{ best_model }} based on validation performance.
{% else %}
The selected model was {{ best_model }} based on validation performance.
{% endif %}
This section records whether the report uses validation or final holdout data, summarizes the repeated stratified validation design used for score stability checks, and lists high-signal leakage warnings. Holdout metrics should be treated as final evidence; validation metrics should be treated as model selection evidence.
{{ validation_integrity | safe }}
This table confirms or challenges the validation-selected model using the evidence below. A challenge is advisory: final holdout evidence should trigger a rerun of model selection or a new final holdout, not an automatic silent promotion.
{{ selection_guidance | safe }}
The selected MAMUT estimator is refit with fold-local preprocessing and compared against simple baselines on the same {{ evaluation_dataset }} set. A useful model should clear these baselines by a practical margin, not only by a small numerical difference.
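The idea of clearing a baseline by a practical margin can be sketched as follows. The majority-class baseline and the 0.02 margin are assumptions chosen for illustration; the baselines and thresholds actually used in this report may differ.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_eval, [majority_label] * len(y_eval))


def clears_baseline(model_score, baseline_score, min_margin=0.02):
    """True if the model beats the baseline by at least `min_margin` (hypothetical default)."""
    return model_score - baseline_score >= min_margin


y_train = [0, 0, 0, 1, 1]
y_eval = [0, 0, 1, 1, 0, 0, 1, 0, 0, 0]      # 7 zeros, 3 ones
model_pred = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # one mistake

baseline_score = majority_baseline_accuracy(y_train, y_eval)
model_score = accuracy(y_eval, model_pred)
print(baseline_score, model_score, clears_baseline(model_score, baseline_score))
```

On imbalanced data the majority-class baseline can already score well, so the margin check is more informative than the raw model score.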
{{ baseline_comparison | safe }}
Stability is estimated with repeated stratified cross-validation on the modeling data. Confidence intervals are approximate t-intervals over fold scores, clipped to the valid metric range, and should be read as an uncertainty signal, not as a formal guarantee of production performance.
{{ score_stability | safe }}
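A clipped t-interval over fold scores can be sketched as below. The sketch assumes a 95% level, a metric bounded in [0, 1], and a small hard-coded table of critical values; the function name and these choices are illustrative and not taken from the report generator.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for a few degrees of freedom.
T_CRIT_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}


def clipped_t_interval(fold_scores, lower=0.0, upper=1.0):
    """Return (mean, low, high) for a 95% t-interval clipped to [lower, upper]."""
    n = len(fold_scores)
    t_crit = T_CRIT_95[n - 1]  # raises KeyError for fold counts not in the table
    mean = statistics.mean(fold_scores)
    half_width = t_crit * statistics.stdev(fold_scores) / math.sqrt(n)
    return mean, max(lower, mean - half_width), min(upper, mean + half_width)


scores = [0.97, 0.99, 1.00, 0.98, 1.00]
mean, low, high = clipped_t_interval(scores)
print(mean, low, high)
```

In this example the unclipped upper bound would exceed 1.0, so it is clipped to the top of the metric range. Fold scores from repeated cross-validation share training data, which is one reason the interval is only approximate.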
Below you will see the evaluation of the ensemble model. The ensemble was created by combining individual models using the {{ ensemble_method }} method. {% if ensemble_method == 'Stacking' %} The meta-learner used was RandomForestClassifier. {% endif %} Ensemble members were selected with a greedy approach with respect to the metric: {{ metric }}, and the best-scoring ensemble was kept.
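One common form of greedy ensemble selection is forward selection: start from an empty ensemble and repeatedly add the model that most improves the metric, stopping when no addition helps. The sketch below uses averaged probabilities and accuracy on binary labels. The combination rule, stopping rule, and model names are assumptions for illustration and may not match the procedure used here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def averaged_labels(prob_lists, threshold=0.5):
    """Average positive-class probabilities across models and threshold them."""
    return [int(sum(probs) / len(probs) >= threshold) for probs in zip(*prob_lists)]


def greedy_select(probabilities, y_true):
    """Forward selection: add the model that most improves accuracy, if any."""
    selected, best_score = [], float("-inf")
    remaining = sorted(probabilities)
    while remaining:
        candidate, candidate_score = None, best_score
        for name in remaining:
            members = selected + [name]
            preds = averaged_labels([probabilities[m] for m in members])
            score = accuracy(y_true, preds)
            if score > candidate_score:  # strict improvement required
                candidate, candidate_score = name, score
        if candidate is None:
            break  # no remaining model improves the ensemble
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


y_true = [1, 0, 1, 0]
probabilities = {  # hypothetical validation-set probabilities per model
    "model_a": [0.9, 0.6, 0.4, 0.1],
    "model_b": [0.8, 0.1, 0.9, 0.7],
    "model_c": [0.2, 0.3, 0.6, 0.2],
}
selected, score = greedy_select(probabilities, y_true)
print(selected, score)
```

The weakest single model can still be added second when its errors complement those of the first pick, which is the main reason to select members jointly instead of ranking them individually.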
Ensemble Stacking Model:
The table below shows the results of the greedy ensemble on the {{ evaluation_dataset }} set.
{{ ensemble_summary | safe }}
Below you will see the importance of each feature in the dataset. Feature importance is calculated using the {{ feature_importance_method }} method, which depends on the model type: for example, tree-based models use Gini importance, while linear models use their coefficients. Importances are computed on the training set and indicate which features the model relies on most.
Below you will see the SHAP values for the best model. SHAP values provide a way to interpret the impact of each feature on the model's output. Each point on the summary plot represents one feature value of one observation. Its position on the y-axis is determined by the feature's importance and its position on the x-axis by the Shapley value. The color represents the value of the feature, from low to high. Overlapping points are jittered along the y-axis, which gives a sense of the distribution of the Shapley values per feature.
1. X-Axis Spread: A wider spread means the feature's impact on the prediction varies more across observations.
2. Relative Impact: Points to the right of zero (positive SHAP values) push the model's output higher, while points to the left (negative SHAP values) push it lower.
3. Overall Importance: Features on the Y-Axis are ordered by importance, with the most important features at the top.
4. Comparative Importance: Features whose points are widely spread or consistently shifted away from zero tend to matter more to the model’s predictions.
{% if not binary %}5. Multiclass Classification: SHAP values can only be displayed for each class separately, so they are shown for the class with label 0.
{% endif %}
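To show what a single Shapley value measures, the sketch below computes exact Shapley values for a toy linear model by enumerating every feature coalition, then ranks features by mean absolute value, the same ordering idea as the y-axis of the summary plot. The toy model, the background values, and the replace-with-background convention are assumptions for this example; SHAP explainers typically use faster, model-specific or approximate algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one observation `x`.

    Features outside a coalition are replaced by their `background` value.
    """
    n = len(x)

    def value(coalition):
        return predict([x[i] if i in coalition else background[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def predict(features):
    """Toy linear model: the third feature has no effect."""
    return 2.0 * features[0] - 1.0 * features[1] + 0.0 * features[2]


background = [1.0, 1.0, 1.0]
observations = [[3.0, 4.0, 5.0], [0.0, 2.0, 9.0]]
all_phi = [shapley_values(predict, x, background) for x in observations]

# Global importance: mean absolute Shapley value per feature, largest first.
mean_abs = [sum(abs(p[i]) for p in all_phi) / len(all_phi) for i in range(3)]
ranking = sorted(range(3), key=lambda i: mean_abs[i], reverse=True)
print(all_phi, ranking)
```

For each observation the values sum to the difference between the model's prediction and its prediction at the background point, which is why positive and negative points on the plot can be read as pushes above or below a reference output.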