Data & Methods
Comprehensive overview of our data preprocessing, machine learning workflows, and optimization methodologies for ceramic-filled UV resin formulation development.
Choose the right machine learning approach for your specific resin formulation project. Our interactive guide helps you select the optimal model based on your data size, task type, available knowledge, and interpretability requirements.
Key Features:
- Dynamic recommendations based on data size (10-2000+ samples)
- Task-specific guidance for prediction vs. optimization scenarios
- Integration strategies for ontology and expert knowledge
- Bilingual support (English/German) with instant switching
- Detailed explanations of model categories and trade-offs
Missing Value Handling
Missing data points are addressed through multiple strategies: deletion of incomplete records (when they make up less than 5% of the data), K-Nearest Neighbor imputation for continuous variables, and mode imputation for categorical variables.
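A minimal sketch of these strategies, assuming a pandas DataFrame df whose numeric columns hold the continuous measurements and whose remaining columns are categorical:
from sklearn.impute import KNNImputer
# Delete incomplete records only if they make up less than 5% of the data
incomplete = df.isna().any(axis=1)
if incomplete.mean() < 0.05:
    df = df.dropna()
else:
    # KNN imputation for continuous variables
    num_cols = df.select_dtypes(include='number').columns
    df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
    # Mode imputation for categorical variables
    for col in df.select_dtypes(exclude='number').columns:
        df[col] = df[col].fillna(df[col].mode()[0])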
Outlier Detection & Treatment
Statistical outliers are identified using the Interquartile Range (IQR) method and the 3-sigma rule. Physical plausibility checks ensure that viscosity values remain within instrument limits and that cure depths do not exceed the film thickness.
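A minimal sketch of the IQR and 3-sigma checks on one property column, assuming viscosity measurements in df['viscosity'] (column name and instrument limits below are placeholders):
col = df['viscosity']
# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
# 3-sigma rule: flag values more than 3 standard deviations from the mean
sigma_outliers = (col - col.mean()).abs() > 3 * col.std()
# Physical plausibility: keep values within instrument limits (placeholder bounds)
instrument_min, instrument_max = 0.0, 1e5
implausible = (col < instrument_min) | (col > instrument_max)
df_clean = df[~(iqr_outliers | sigma_outliers | implausible)]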
Feature Engineering
- One-hot encoding for categorical variables (dispersant type, thickener class)
- Derived features: Solid volume fraction calculated from mass fraction and component densities
- Reactive monomer fraction: Monomer weight / (Monomer weight + Oligomer weight)
- Sedimentation rate: (Sed₂₄ₕ - Sed₄ₕ) / 20 h
Python Implementation Example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer
# Load and preprocess data
df = pd.read_excel('R_T-161-X.xlsx')
# Calculate solid volume fraction (assuming binder density rho_b is known or estimated)
# w_s = mass fraction of solids, rho_s = particle density, rho_b = binder density
rho_b = 1.1 # Example binder density in g/cm3
w_s = df['solids_content'] / 100.0
w_b = 1.0 - w_s
df['solid_volume_fraction'] = (w_s / df['particle_density']) / ((w_s / df['particle_density']) + (w_b / rho_b))
df['reactive_monomer_fraction'] = df['monomer_weight'] / (df['monomer_weight'] + df['oligomer_weight'])
# Handle missing values (Note: In production, fit on train set only to avoid leakage)
imputer = KNNImputer(n_neighbors=5)
df_numeric = imputer.fit_transform(df.select_dtypes(include=[np.number]))
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_numeric)
Algorithm Selection
Baseline Models:
- Linear Regression (Ridge/Lasso)
- Decision Tree Regressor
- k-Nearest Neighbors (KNN)
Advanced Models:
- Random Forest Regressor
- Gradient Boosting (XGBoost/LightGBM)
- Support Vector Regression (SVR)
- Gaussian Process Regression (GPR)
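A minimal comparison sketch for a few of these models, assuming the scaled feature matrix X_scaled from the preprocessing example and a single target vector y (e.g., viscosity):
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
models = {
    'Ridge': Ridge(),
    'KNN': KNeighborsRegressor(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    scores = -cross_val_score(model, X_scaled, y, cv=5,
                              scoring='neg_root_mean_squared_error')
    print(f"{name}: RMSE = {scores.mean():.3f} +/- {scores.std():.3f}")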
Hyperparameter Optimization
Systematic optimization using GridSearchCV for small parameter spaces, RandomizedSearchCV for larger spaces, and Bayesian Optimization (using scikit-optimize) for efficient exploration of high-dimensional parameter spaces.
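As an alternative to the grid search shown below, a minimal sketch of Bayesian hyperparameter search with scikit-optimize's BayesSearchCV (a single-output Random Forest and the train split are assumed):
from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.ensemble import RandomForestRegressor
search_spaces = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(5, 30),
    'min_samples_split': Integer(2, 10),
}
bayes_search = BayesSearchCV(RandomForestRegressor(random_state=42),
                             search_spaces, n_iter=30, cv=5,
                             scoring='neg_mean_squared_error', random_state=42)
# bayes_search.fit(X_train, y_train)
# print(bayes_search.best_params_)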
Model Training Example:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.multioutput import MultiOutputRegressor
# Define parameter grid (using estimator__ prefix for wrapped regressor)
param_grid = {
'estimator__n_estimators': [100, 200, 300],
'estimator__max_depth': [10, 20, None],
'estimator__min_samples_split': [2, 5, 10],
'estimator__min_samples_leaf': [1, 2, 4]
}
# Multi-output regression for viscosity, cure depth, etc.
rf = RandomForestRegressor(random_state=42)
multi_rf = MultiOutputRegressor(rf)
# Grid search with cross-validation
grid_search = GridSearchCV(multi_rf, param_grid, cv=5,
scoring='neg_mean_squared_error', n_jobs=-1)
# Note: Ensure X_train and y_train are properly split before fitting
# grid_search.fit(X_train, y_train)
# print(f"Best parameters: {grid_search.best_params_}")Cross-Validation Strategy
5-fold cross-validation on training data to prevent overfitting. Performance metrics include Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R² score for each target variable.
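A minimal sketch of this evaluation with cross_validate, assuming X_scaled and a single target y; in the multi-output case the metrics would be computed per target:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor
scoring = {'rmse': 'neg_root_mean_squared_error',
           'mae': 'neg_mean_absolute_error',
           'r2': 'r2'}
cv_results = cross_validate(RandomForestRegressor(random_state=42),
                            X_scaled, y, cv=5, scoring=scoring)
print(f"RMSE: {-cv_results['test_rmse'].mean():.3f}")
print(f"MAE:  {-cv_results['test_mae'].mean():.3f}")
print(f"R2:   {cv_results['test_r2'].mean():.3f}")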
Feature Importance Analysis
- Tree-based importance: Built-in feature_importances_ from Random Forest/XGBoost
- Permutation importance: Model-agnostic method measuring performance degradation
- SHAP values: Local explanations showing individual prediction contributions
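A minimal sketch of the first two methods on a fitted Random Forest (SHAP follows the same pattern via shap.TreeExplainer); rf_fitted, X_test, y_test, and feature_names are assumed to exist from earlier steps:
import pandas as pd
from sklearn.inspection import permutation_importance
# Built-in impurity-based importance
tree_importance = pd.Series(rf_fitted.feature_importances_, index=feature_names)
# Model-agnostic permutation importance on held-out data
perm = permutation_importance(rf_fitted, X_test, y_test,
                              n_repeats=10, random_state=42)
perm_importance = pd.Series(perm.importances_mean, index=feature_names)
print(tree_importance.sort_values(ascending=False).head(10))
print(perm_importance.sort_values(ascending=False).head(10))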
Residual Analysis
Systematic examination of prediction residuals to identify biases, heteroscedasticity, and model limitations across different parameter ranges.
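A minimal sketch of these residual checks, assuming test targets y_test and model predictions y_pred:
import numpy as np
import matplotlib.pyplot as plt
residuals = y_test - y_pred
# Systematic bias: the mean residual should be close to zero
print(f"Mean residual: {np.mean(residuals):.3f}")
# Heteroscedasticity: residual spread should not grow with the predicted value
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.show()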
Multi-Objective Optimization
Simultaneous optimization of conflicting objectives (e.g., minimize viscosity while maximizing cure depth) using Pareto front analysis to identify optimal trade-offs.
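A minimal sketch of extracting the Pareto front from candidate formulations, assuming each row of the objectives array holds (viscosity, -cure_depth) so that both columns are minimized:
import numpy as np
def pareto_front(objectives):
    # A candidate is Pareto-optimal if no other candidate is at least as good
    # in every objective and strictly better in at least one
    n = objectives.shape[0]
    is_optimal = np.ones(n, dtype=bool)
    for i in range(n):
        dominated = np.all(objectives <= objectives[i], axis=1) & \
                    np.any(objectives < objectives[i], axis=1)
        if dominated.any():
            is_optimal[i] = False
    return is_optimal
# Example usage with predicted property arrays (assumed):
# mask = pareto_front(np.column_stack([pred_viscosity, -pred_cure_depth]))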
Bayesian Optimization
Efficient exploration of parameter space using Gaussian Process surrogate models and acquisition functions (Expected Improvement, Upper Confidence Bound) to minimize experimental iterations.
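A minimal sketch with scikit-optimize's gp_minimize and the Expected Improvement acquisition function; the two formulation variables, their bounds, and the predicted_viscosity helper are illustrative assumptions:
from skopt import gp_minimize
from skopt.space import Real
# Hypothetical search space over formulation variables
space = [Real(40.0, 60.0, name='solids_content'),
         Real(0.5, 3.0, name='dispersant_wt_pct')]
def objective(params):
    # In practice this would query the trained property model(s) or a new experiment
    solids, dispersant = params
    return predicted_viscosity(solids, dispersant)  # hypothetical helper
# result = gp_minimize(objective, space, acq_func='EI',
#                      n_calls=30, random_state=42)
# print(result.x, result.fun)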
Weighted Sum Optimization Example:
import numpy as np
from scipy.optimize import minimize
# Define objective function (Scalarization)
# Goal: Minimize viscosity while maximizing cure depth (minimize negative cure depth)
def objective(x, model_viscosity, model_cure_depth, w1=0.5, w2=0.5):
    # Predict properties for input x
    viscosity = model_viscosity.predict([x])[0]
    cure_depth = model_cure_depth.predict([x])[0]
    # Normalize objectives to comparable scales before weighting
    # (Simplified example)
    obj1 = viscosity / 10.0  # Normalize viscosity
    obj2 = -cure_depth / 100.0  # Normalize and negate cure depth
    return w1 * obj1 + w2 * obj2
# Optimization (x0: initial guess for the formulation variables, bounds: variable bounds; define both before calling)
# res = minimize(objective, x0, args=(model_viscosity, model_cure_depth),
# method='SLSQP', bounds=bounds)