📊 Explore Data

Focus Question: What’s in my dataset, how are my variables distributed and related, and where are values missing?

1 Training Data

Training Data

(Completed Experiments Overview)

    input_1  input_2        input_3  input_4  input_5           input_6  input_7    input_8  output_1  output_2
 0  100.0    DMAc           10.00    1.0      L29 | DPPF        0.015    (TMS)3SiH  0.6      9.0       77.0
 1  100.0    DMSO           10.00    1.0      L29 | DPPF        0.015    (TMS)3SiH  0.6      0.0       91.0
 2  100.0    NMP            10.00    1.0      L29 | DPPF        0.015    (TMS)3SiH  0.6      12.0      77.0
 3  100.0    DMF            10.00    1.0      L29 | DPPF        0.015    (TMS)3SiH  0.6      2.0       78.0
 4  100.0    DMPU           10.00    1.0      L29 | DPPF        0.015    (TMS)3SiH  0.6      6.0       91.0
 5  100.0    Propionitrile  10.00    1.0      L29 | DPPF        0.015    (TMS)3SiH  0.6      0.0       98.0
 6  100.0    DMAc           10.00    1.0      L33 | XantPhos    0.015    (TMS)3SiH  0.6      69.0      19.0
 7  100.0    DMSO           10.00    1.0      L33 | XantPhos    0.015    (TMS)3SiH  0.6      0.0       86.0
 8  100.0    NMP            10.00    1.0      L33 | XantPhos    0.015    (TMS)3SiH  0.6      54.0      37.0
 9  100.0    DMF            10.00    1.0      L33 | XantPhos    0.015    (TMS)3SiH  0.6      6.0       87.0
10  100.0    DMPU           10.00    1.0      L33 | XantPhos    0.015    (TMS)3SiH  0.6      11.0      84.0
11  100.0    Propionitrile  10.00    1.0      L33 | XantPhos    0.015    (TMS)3SiH  0.6      0.0       94.0
12  100.0    DMAc           10.00    1.0      L59 | N-XantPhos  0.015    (TMS)3SiH  0.6      76.0      23.0
13  100.0    DMSO           10.00    1.0      L59 | N-XantPhos  0.015    (TMS)3SiH  0.6      0.0       91.0
14  100.0    NMP            10.00    1.0      L59 | N-XantPhos  0.015    (TMS)3SiH  0.6      0.0       100.0
15  100.0    DMF            10.00    1.0      L59 | N-XantPhos  0.015    (TMS)3SiH  0.6      0.0       90.0
16  100.0    DMPU           10.00    1.0      L59 | N-XantPhos  0.015    (TMS)3SiH  0.6      12.0      76.0
17  100.0    Propionitrile  10.00    1.0      L59 | N-XantPhos  0.015    (TMS)3SiH  0.6      0.0       96.0
18  100.0    DMAc           10.00    1.0      DPPP              0.015    (TMS)3SiH  0.6      89.0      2.0
19  100.0    DMSO           10.00    1.0      DPPP              0.015    (TMS)3SiH  0.6      1.0       91.0
20  100.0    NMP            10.00    1.0      DPPP              0.015    (TMS)3SiH  0.6      33.0      61.0
21  100.0    DMF            10.00    1.0      DPPP              0.015    (TMS)3SiH  0.6      32.0      62.0
22  100.0    DMPU           10.00    1.0      DPPP              0.015    (TMS)3SiH  0.6      38.0      51.0
23  100.0    Propionitrile  10.00    1.0      DPPP              0.015    (TMS)3SiH  0.6      0.0       90.0
24  100.0    DMAc           10.00    1.0      L29 | DPPF        0.015    PMHS       0.6      5.0       93.0
25  100.0    DMSO           10.00    1.0      L29 | DPPF        0.015    PMHS       0.6      4.0       94.0
26  100.0    NMP            10.00    1.0      L29 | DPPF        0.015    PMHS       0.6      6.0       85.0
27  100.0    DMF            10.00    1.0      L29 | DPPF        0.015    PMHS       0.6      7.0       80.0
28  100.0    DMPU           10.00    1.0      L29 | DPPF        0.015    PMHS       0.6      2.0       95.0
29  100.0    Propionitrile  10.00    1.0      L29 | DPPF        0.015    PMHS       0.6      15.0      88.0
30  100.0    DMAc           10.00    1.0      L33 | XantPhos    0.015    PMHS       0.6      5.0       89.0
31  100.0    DMSO           10.00    1.0      L33 | XantPhos    0.015    PMHS       0.6      18.0      71.0
32  100.0    NMP            10.00    1.0      L33 | XantPhos    0.015    PMHS       0.6      3.0       85.0
33  100.0    DMF            10.00    1.0      L33 | XantPhos    0.015    PMHS       0.6      8.0       79.0
34  100.0    DMPU           10.00    1.0      L33 | XantPhos    0.015    PMHS       0.6      3.0       95.0
35  100.0    Propionitrile  10.00    1.0      L33 | XantPhos    0.015    PMHS       0.6      4.0       90.0
36  100.0    DMAc           10.00    1.0      L59 | N-XantPhos  0.015    PMHS       0.6      0.0       91.0
37  100.0    DMSO           10.00    1.0      L59 | N-XantPhos  0.015    PMHS       0.6      5.0       92.0
38  100.0    NMP            10.00    1.0      L59 | N-XantPhos  0.015    PMHS       0.6      2.0       90.0
39  100.0    DMF            10.00    1.0      L59 | N-XantPhos  0.015    PMHS       0.6      2.0       88.0
40  100.0    DMPU           10.00    1.0      L59 | N-XantPhos  0.015    PMHS       0.6      0.0       99.0
41  100.0    Propionitrile  10.00    1.0      L59 | N-XantPhos  0.015    PMHS       0.6      1.0       93.0
42  100.0    DMAc           10.00    1.0      DPPP              0.015    PMHS       0.6      5.0       92.0
43  100.0    DMSO           10.00    1.0      DPPP              0.015    PMHS       0.6      10.0      82.0
44  100.0    NMP            10.00    1.0      DPPP              0.015    PMHS       0.6      3.0       86.0
45  100.0    DMF            10.00    1.0      DPPP              0.015    PMHS       0.6      5.0       88.0
46  100.0    DMPU           10.00    1.0      DPPP              0.015    PMHS       0.6      1.0       85.0
47  100.0    Propionitrile  10.00    1.0      DPPP              0.015    PMHS       0.6      7.0       85.0
48  110.0    DMAc           7.50     0.9      DPPP              0.020    (TMS)3SiH  0.3      72.0      13.0
49  120.0    DMAc           9.20     1.1      DPPP              0.036    (TMS)3SiH  0.6      86.0      0.0
50  110.0    DMAc           13.20    0.8      DPPP              0.037    (TMS)3SiH  0.1      66.0      20.0
51  90.0     DMAc           9.84     1.1      DPPP              0.037    (TMS)3SiH  0.6      76.0      21.0
52  100.0    DMAc           11.60    0.5      DPPP              0.024    (TMS)3SiH  0.2      60.0      30.0
53  110.0    DMAc           13.20    0.7      DPPP              0.041    (TMS)3SiH  0.2      89.0      0.0
54  90.0     DMAc           13.20    1.5      DPPP              0.040    PMHS       0.2      0.0       98.0
55  100.0    NMP            13.20    0.8      DPPP              0.042    (TMS)3SiH  0.2      87.0      7.0
56  110.0    DMAc           7.00     1.0      DPPP              0.015    (TMS)3SiH  0.1      84.0      0.0
57  110.0    DMAc           6.00     1.0      DPPP              0.015    (TMS)3SiH  0.1      85.0      0.0
58  110.0    DMAc           12.00    1.0      DPPP              0.015    (TMS)3SiH  0.1      87.0      0.0
59  100.0    DMAc           6.00     1.0      DPPP              0.037    (TMS)3SiH  0.1      82.0      14.0
60  100.0    DMAc           9.00     1.0      DPPP              0.030    (TMS)3SiH  0.1      86.0      0.0
61  90.0     DMAc           10.00    1.0      DPPP              0.015    (TMS)3SiH  0.1      88.0      0.0
62  120.0    NMP            13.00    1.0      DPPP              0.015    (TMS)3SiH  0.1      87.0      0.0
63  110.0    DMPU           10.00    1.0      DPPP              0.015    (TMS)3SiH  0.1      87.0      0.0
Table 1: Table of completed experiments in the training dataset.

2 Data Overview

Data Quality Warnings

(Warnings from training data)

Warning: Correlation
Message: Highly correlated pairs (|corr| > 0.90):

Outputs 'output_1' and 'output_2': -0.99.

Highly correlated inputs or outputs may reduce model robustness or optimization speed. Consider removing one of the variables if they carry redundant information.

Warning: Outliers
Message: The following columns contain values that are unusually far from typical values:

'output_1': [66.0, 69.0, 72.0, 76.0, 82.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 60.0]
'output_2': [0.0, 2.0, 7.0, 13.0, 14.0]

Why does this matter? Outliers can affect the accuracy of machine learning models by skewing results or causing the model to focus too much on rare, extreme values. Consider reviewing these outliers to see if they represent errors in the data. If they are not errors, allow a few optimization rounds to better map the trends.
Table 2: Table summary of data quality warnings identified in the training dataset.
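
The report does not state which rule produced the outlier lists in Table 2. As one common option, the sketch below flags values with a median/MAD modified z-score. The rule itself, the function name `mad_outliers`, and the 3.5 cutoff are assumptions for illustration, so the result will not necessarily match the lists above. The sample is output_1 for runs 0 to 11 in Table 1.

```python
from statistics import median

def mad_outliers(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff.

    Assumed rule (not confirmed by the report): 0.6745 * |x - median| / MAD.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > cutoff]

# output_1 for runs 0-11, copied from Table 1.
output_1_sample = [9.0, 0.0, 12.0, 2.0, 6.0, 0.0, 69.0, 0.0, 54.0, 6.0, 11.0, 0.0]
flagged = mad_outliers(output_1_sample)
print(flagged)
```

A median-based rule is used here because output_1 is strongly skewed (median 6.5, maximum 89.0 in Table 3), which makes mean-based thresholds less informative.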

Summary of Numeric Columns

(Numeric variable summary statistics)

       input_1  input_3  input_4  input_6  input_8  output_1  output_2
count   64.000   64.000   64.000   64.000   64.000    64.000    64.000
mean   101.250   10.062    0.991    0.018    0.500    28.016    63.594
std      5.195    1.292    0.105    0.008    0.193    34.952    36.803
min     90.000    6.000    0.500    0.015    0.100     0.000     0.000
25%    100.000   10.000    1.000    0.015    0.600     2.000    22.500
50%    100.000   10.000    1.000    0.015    0.600     6.500    85.000
75%    100.000   10.000    1.000    0.015    0.600    66.750    91.000
max    120.000   13.200    1.500    0.042    0.600    89.000   100.000
Table 3: Table of summary statistics for numeric variables in the dataset.
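
As a cross-check, the input_1 column of Table 3 can be recomputed from the values listed in Table 1. This is a minimal sketch using only the Python standard library. It assumes the sample standard deviation (n - 1 denominator) and linearly interpolated quartiles, which the reported figures are consistent with; the variable names are illustrative.

```python
from statistics import mean, quantiles, stdev

# input_1 as listed in Table 1: runs 0-47 are all 100.0, runs 48-63 vary.
input_1 = [100.0] * 48 + [
    110.0, 120.0, 110.0, 90.0, 100.0, 110.0, 90.0, 100.0,
    110.0, 110.0, 110.0, 100.0, 100.0, 90.0, 120.0, 110.0,
]

# "inclusive" interpolates linearly between the min and max of the data.
q25, q50, q75 = quantiles(input_1, n=4, method="inclusive")

numeric_summary = {
    "count": len(input_1),
    "mean": round(mean(input_1), 3),
    "std": round(stdev(input_1), 3),  # sample standard deviation
    "min": min(input_1),
    "25%": q25,
    "50%": q50,
    "75%": q75,
    "max": max(input_1),
}
print(numeric_summary)
```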

Summary of Categorical Columns

(Categorical Variable Summary Statistics)

        input_2  input_5    input_7
count        64       64         64
unique        6        4          2
top        DMAc     DPPP  (TMS)3SiH
freq         21       28         39
Table 4: Table of summary statistics for categorical variables in the dataset.
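
The four rows of Table 4 (count, unique, top, freq) can be reproduced with a frequency count. The sketch below does this for input_7, rebuilt from the run order in Table 1; the variable names are illustrative and not taken from the report.

```python
from collections import Counter

# input_7 in run order from Table 1: runs 0-23 use (TMS)3SiH, runs 24-47 use
# PMHS, and runs 48-63 use (TMS)3SiH except run 54, which uses PMHS.
input_7 = (
    ["(TMS)3SiH"] * 24
    + ["PMHS"] * 24
    + ["(TMS)3SiH"] * 6
    + ["PMHS"]
    + ["(TMS)3SiH"] * 9
)

counts = Counter(input_7)
top, freq = counts.most_common(1)[0]  # most frequent level and its count

categorical_summary = {
    "count": len(input_7),
    "unique": len(counts),
    "top": top,
    "freq": freq,
}
print(categorical_summary)
```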

3 Explore Variable Distributions

Variable Distributions

(Histogram matrix)

Figure 1: Histogram matrix showing the distribution of all variables in the dataset.

Category–Output Comparisons

(Violin plots)
(a) input_2-vs-output_1
(b) input_2-vs-output_2
(c) input_5-vs-output_1
(d) input_5-vs-output_2
(e) input_7-vs-output_1
(f) input_7-vs-output_2

Figure 2: Violin plots comparing category levels to output variables.
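
A violin plot summarizes the distribution of an output within each category level. The grouping step behind such a comparison might look like the sketch below, which groups output_1 by input_7 for the DPPP runs 18 to 23 and 42 to 47 in Table 1 and reports the per-level median. The choice of subset and of the median as the summary is an assumption for illustration, not the report's plotting code.

```python
from collections import defaultdict
from statistics import median

# (input_7, output_1) pairs for the DPPP runs 18-23 and 42-47 in Table 1.
rows = [
    ("(TMS)3SiH", 89.0), ("(TMS)3SiH", 1.0), ("(TMS)3SiH", 33.0),
    ("(TMS)3SiH", 32.0), ("(TMS)3SiH", 38.0), ("(TMS)3SiH", 0.0),
    ("PMHS", 5.0), ("PMHS", 10.0), ("PMHS", 3.0),
    ("PMHS", 5.0), ("PMHS", 1.0), ("PMHS", 7.0),
]

# Collect the output values observed for each category level.
groups = defaultdict(list)
for level, value in rows:
    groups[level].append(value)

medians = {level: median(values) for level, values in groups.items()}
print(medians)
```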

4 Variable Relationships

Variable Correlation

(Correlation heatmap)

Figure 3: Heatmap showing pairwise correlations between numeric variables.
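
The pairwise check behind the |corr| > 0.90 warning in the data quality section can be sketched as follows. The report does not say which correlation coefficient it uses, so Pearson correlation is an assumption here, as are the function names `pearson` and `highly_correlated`. The sample covers runs 18 to 23 in Table 1, where input_1 is constant and is therefore skipped.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    return sxy / sqrt(sxx * syy)

def highly_correlated(columns, threshold=0.90):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if r is not None and abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Runs 18-23 from Table 1 (DPPP with (TMS)3SiH, one run per solvent).
sample = {
    "input_1": [100.0] * 6,
    "output_1": [89.0, 1.0, 33.0, 32.0, 38.0, 0.0],
    "output_2": [2.0, 91.0, 61.0, 62.0, 51.0, 90.0],
}
flagged_pairs = highly_correlated(sample)
print(flagged_pairs)
```

Skipping constant columns matters for this dataset because several inputs take a single value across large blocks of runs, which leaves the coefficient undefined on those subsets.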

Variable Association

(Mixed-type association heatmap)

Figure 4: Heatmap showing associations between mixed variable types.