📈 Evaluate Performance

Focus question: How accurate and reliable are my model’s predictions?

1 Performance Metrics

Model Error (Training)

(Regression accuracy metrics)
Metric    output_1    output_2
R2            0.99        0.94
RMSE          4.09        8.69
MAE           2.08        5.56
MSE          16.74       75.48
MEDAE         0.68        3.67

Table 1: Regression accuracy metrics calculated on the training dataset.
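
For reference, the metrics above can be reproduced with scikit-learn. The sketch below assumes the training observations and model predictions for a single output are available as NumPy arrays `y_true` and `y_pred` (placeholder names, not part of the report's tooling):

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

def regression_metrics(y_true, y_pred):
    """Compute the accuracy metrics reported in Table 1 for one output."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),   # RMSE is the square root of MSE
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "MEDAE": median_absolute_error(y_true, y_pred),
    }
```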

Model Uncertainty (Training)

(Uncertainty calibration metrics)
Metric    output_1    output_2
PICP          1.00        1.00
MPIW         28.91       77.42

Table 2: Uncertainty calibration metrics calculated on the training dataset.
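
PICP (Prediction Interval Coverage Probability) is the fraction of observed values that fall inside their prediction intervals; MPIW (Mean Prediction Interval Width) is the average width of those intervals. A minimal sketch, assuming per-point interval bounds `lower` and `upper` and observations `y` (illustrative names):

```python
import numpy as np

def picp(y, lower, upper):
    """Prediction Interval Coverage Probability: fraction of observations
    that fall inside their prediction interval."""
    return float(np.mean((y >= lower) & (y <= upper)))

def mpiw(lower, upper):
    """Mean Prediction Interval Width: average width of the intervals."""
    return float(np.mean(upper - lower))
```

A PICP of 1.00, as reported for both outputs, means every training observation lies inside its interval; the MPIW values indicate how wide the intervals are, on average, to achieve that coverage.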

Model Error (Cross-validated)

(Regression accuracy metrics with confidence intervals)
Metric             output_1             output_2
R2            0.614 ± 0.560        0.668 ± 0.715
RMSE        19.284 ± 18.748      18.170 ± 19.353
MSE       451.914 ± 672.890    416.823 ± 862.163
MAE        12.498 ± 11.715      12.105 ± 10.642
MEDAE       3.223 ± 4.098        6.068 ± 4.849

Table 3: Regression accuracy metrics calculated using cross-validation.

Model Uncertainty (Cross-validated)

(Uncertainty calibration metrics with confidence intervals)
Metric            output_1            output_2
PICP         0.828 ± 0.745       0.903 ± 0.585
MPIW        44.807 ± 70.132     79.239 ± 74.415

Table 4: Uncertainty calibration metrics calculated using cross-validation.
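
The report does not state exactly how the ± values in Tables 3 and 4 are constructed. One common approach, sketched below, scores the model on each held-out fold and reports each metric as mean ± standard deviation across folds; `model`, `X`, and `y` are placeholders, and the same fold-wise pattern applies to PICP and MPIW for a model that produces predictive intervals:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

def cv_metric_summary(model, X, y, n_splits=5, random_state=0):
    """Score the model on each held-out fold and summarize as mean ± std."""
    r2s, rmses = [], []
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in cv.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        r2s.append(r2_score(y[test_idx], pred))
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return {
        "R2": (np.mean(r2s), np.std(r2s)),
        "RMSE": (np.mean(rmses), np.std(rmses)),
    }
```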

2 Residual Analysis

Residuals Scatter by Predicted Value

(Residuals plotted against predicted values; two panels: (a) output_1, (b) output_2)

Figure 1: Residuals plotted against predicted values. This plot helps identify patterns or biases in the residuals, such as systematic errors or heteroscedasticity, which can indicate problems with the model's predictions.
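
A sketch of how such a residual scatter can be drawn for one output with matplotlib, assuming arrays `y_true` and `y_pred` (illustrative names):

```python
import matplotlib.pyplot as plt

def plot_residuals_vs_predicted(y_true, y_pred, ax=None):
    """Scatter residuals (observed − predicted) against predicted values."""
    if ax is None:
        ax = plt.gca()
    residuals = y_true - y_pred
    ax.scatter(y_pred, residuals, alpha=0.6)
    ax.axhline(0.0, color="black", linestyle="--")  # zero-error reference line
    ax.set_xlabel("Predicted value")
    ax.set_ylabel("Residual (observed − predicted)")
    return ax
```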

Residual Distribution

(KDE of residuals; two panels: (a) output_1, (b) output_2)

Figure 2: Density curve of residuals (observed − predicted) with rug marks to show individual errors; used to assess bias, spread, and normality of prediction errors.
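
A corresponding density plot can be drawn with seaborn's KDE and rug utilities; the sketch below again assumes `y_true` and `y_pred` arrays and is not the report's own plotting code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_residual_density(y_true, y_pred, ax=None):
    """KDE of residuals (observed − predicted) with a rug of individual errors."""
    if ax is None:
        ax = plt.gca()
    residuals = y_true - y_pred
    sns.kdeplot(x=residuals, fill=True, ax=ax)      # smoothed error distribution
    sns.rugplot(x=residuals, ax=ax)                 # one tick per individual residual
    ax.axvline(0.0, color="black", linestyle="--")  # an unbiased model centers here
    ax.set_xlabel("Residual (observed − predicted)")
    return ax
```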

3 Model Uncertainty

Observed vs. Predicted (Training)

(Model predictions compared to true values with posterior uncertainty; two panels: (a) output_1, (b) output_2)

Figure 3: Observed vs. Predicted plot with error bars (posterior standard deviation) for the training dataset.
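
A sketch of this plot type, assuming per-point predictive means `mu`, posterior standard deviations `sigma`, and observations `y` (illustrative names); the same code applies to the cross-validated version in Figure 4:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_observed_vs_predicted(y, mu, sigma, ax=None):
    """Observed vs. predicted values with ±1σ error bars and a y = x reference."""
    if ax is None:
        ax = plt.gca()
    ax.errorbar(y, mu, yerr=sigma, fmt="o", alpha=0.6, capsize=2)
    lims = [min(np.min(y), np.min(mu)), max(np.max(y), np.max(mu))]
    ax.plot(lims, lims, "k--")  # perfect-prediction (y = x) line
    ax.set_xlabel("Observed")
    ax.set_ylabel("Predicted (±1σ)")
    return ax
```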

Observed vs. Predicted (Cross-validated)

(Model predictions compared to true values with posterior uncertainty; two panels: (a) output_1, (b) output_2)

Figure 4: Observed vs. Predicted plot with error bars (posterior standard deviation) using cross-validation.

Prediction Intervals (Training)

(Predictions ordered by output with confidence intervals; two panels: (a) output_1, (b) output_2)

Figure 5: Prediction intervals plot with confidence bounds calculated on the training dataset.
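
A sketch of this interval plot for one output, assuming predictive means `mu`, interval bounds `lower` and `upper`, and observations `y`, with samples ordered by observed value (illustrative names; the same code covers the cross-validated version in Figure 6):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_prediction_intervals(y, mu, lower, upper, ax=None):
    """Prediction intervals and observations, ordered by observed output value."""
    if ax is None:
        ax = plt.gca()
    order = np.argsort(y)          # sort samples by observed value
    idx = np.arange(len(y))
    ax.fill_between(idx, lower[order], upper[order], alpha=0.3,
                    label="prediction interval")
    ax.plot(idx, mu[order], label="predicted mean")
    ax.scatter(idx, y[order], s=10, color="black", label="observed")
    ax.set_xlabel("Sample (ordered by observed value)")
    ax.set_ylabel("Output")
    ax.legend()
    return ax
```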

Prediction Intervals (Cross-validated)

(Predictions ordered by output with confidence intervals; two panels: (a) output_1, (b) output_2)

Figure 6: Prediction intervals plot with confidence bounds calculated using cross-validation.

Prediction Uncertainty Distribution

(KDE of predictive uncertainty; two panels: (a) output_1, (b) output_2)

Figure 7: Density curve of model predictive uncertainty values (e.g., standard deviation or interval half-width) with rug marks; summarizes how confident the model is across predictions.

Uncertainty Calibration

(Predicted vs. observed confidence; two panels: (a) output_1, (b) output_2)

Figure 8: Calibration curve comparing nominal coverage of symmetric prediction intervals (mean ± z·σ) to the empirical proportion of observed values within those intervals. The shaded area quantifies total miscalibration.
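
A sketch of the underlying calculation, assuming Gaussian predictive distributions with per-point mean `mu` and standard deviation `sigma` (illustrative names): for each nominal coverage level, count how often the observation falls within mean ± z·σ, then integrate the absolute gap between empirical and nominal coverage:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

def coverage_calibration(y, mu, sigma, levels=np.linspace(0.01, 0.99, 50)):
    """Nominal vs. empirical coverage of mean ± z·σ intervals, plus miscalibration area."""
    z = norm.ppf(0.5 + levels / 2.0)   # half-width multiplier for each nominal level
    inside = np.abs(y[:, None] - mu[:, None]) <= z[None, :] * sigma[:, None]
    empirical = inside.mean(axis=0)    # observed coverage at each level
    miscalibration_area = trapezoid(np.abs(empirical - levels), levels)
    return levels, empirical, miscalibration_area
```

A perfectly calibrated model tracks the diagonal; empirical coverage above the nominal level indicates conservative (overly wide) intervals, while coverage below it indicates overconfidence.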