Data & Methods

Comprehensive overview of our data preprocessing, machine learning workflows, and optimization methodologies for ceramic-filled UV resin formulation development.

Database & Master Data Management

Manage the master data that powers this application. Add, edit, and view raw materials, devices, operators, and measurement protocols. This section is intended for administrative use.

Interactive Model Selection Guide

Choose the right machine learning approach for your specific resin formulation project. Our interactive guide helps you select the optimal model based on your data size, task type, available knowledge, and interpretability requirements.

Key Features:

  • Dynamic recommendations based on data size (10-2000+ samples)
  • Task-specific guidance for prediction vs. optimization scenarios
  • Integration strategies for ontology and expert knowledge
  • Bilingual support (English/German) with instant switching
  • Detailed explanations of model categories and trade-offs

Data Preprocessing & Feature Engineering

Missing Value Handling

Missing data points are addressed through multiple strategies: deletion of incomplete records (used only when fewer than 5% of records are affected), K-Nearest Neighbor (KNN) imputation for continuous variables, and mode imputation for categorical variables.
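
A minimal sketch of these strategies, assuming df is the loaded measurement table (the 5% threshold and column selection mirror the rules above):

import numpy as np
from sklearn.impute import KNNImputer

# Deletion is acceptable only when few records are incomplete (<5%)
if df.isna().any(axis=1).mean() < 0.05:
    df = df.dropna()
else:
    # KNN imputation for continuous variables
    num_cols = df.select_dtypes(include=[np.number]).columns
    df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
    # Mode imputation for categorical variables
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].fillna(df[col].mode()[0])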

Outlier Detection & Treatment

Statistical outliers are identified using the Interquartile Range (IQR) method and 3-sigma rule. Physical plausibility checks ensure viscosity values remain within instrument limits and cure depths don't exceed film thickness.
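
A sketch of both detectors plus a plausibility filter; the column names and limits are illustrative, not taken from the measurement protocol:

def iqr_outliers(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return (s < q1 - k * (q3 - q1)) | (s > q3 + k * (q3 - q1))

def sigma_outliers(s, k=3):
    """Flag values more than k standard deviations from the mean."""
    return (s - s.mean()).abs() > k * s.std()

# Physical plausibility: viscosity must be positive and cure depth
# cannot exceed the measured film thickness
implausible = (df['viscosity'] <= 0) | (df['cure_depth'] > df['film_thickness'])
mask = iqr_outliers(df['viscosity']) | sigma_outliers(df['viscosity']) | implausible
df_clean = df[~mask]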

Feature Engineering

  • One-hot encoding for categorical variables (dispersant type, thickener class)
  • Derived features: Volume fraction from solids content and particle density
  • Reactive monomer fraction: Monomer weight / (Monomer + Oligomer)
  • Sedimentation rate: (Sed₂₄ₕ - Sed₄ₕ) / 20h

Python Implementation Example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer

# Load data
df = pd.read_excel('R_T-161-X.xlsx')

# Derived features. The volume fraction uses a two-phase rule of mixtures
# and assumes a 'resin_density' column alongside 'particle_density'.
w = df['solids_content'] / 100  # weight fraction of solids
df['volume_fraction'] = (w / df['particle_density']) / (
    w / df['particle_density'] + (1 - w) / df['resin_density'])
df['reactive_monomer_fraction'] = df['monomer_weight'] / (
    df['monomer_weight'] + df['oligomer_weight'])

# One-hot encode categorical variables (dispersant type, thickener class)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_categorical = encoder.fit_transform(df.select_dtypes(include=['object']))

# Handle missing values in numeric columns
imputer = KNNImputer(n_neighbors=5)
df_numeric = imputer.fit_transform(df.select_dtypes(include=[np.number]))

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_numeric)

Model Selection & Training

Algorithm Selection

Baseline Models:

  • Linear Regression (Ridge/Lasso)
  • Decision Tree Regressor
  • k-Nearest Neighbors (KNN)

Advanced Models:

  • Random Forest Regressor
  • Gradient Boosting (XGBoost/LightGBM)
  • Support Vector Regression (SVR)
  • Gaussian Process Regression (GPR)

Hyperparameter Optimization

Systematic optimization using GridSearchCV for small parameter spaces, RandomizedSearchCV for larger spaces, and Bayesian Optimization (using scikit-optimize) for efficient exploration of high-dimensional parameter spaces.

Model Training Example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

# Parameter grid. Keys carry the 'estimator__' prefix because the forest
# is wrapped in a MultiOutputRegressor.
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__max_depth': [10, 20, None],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4]
}

# Multi-output regression for viscosity, cure depth, etc.
rf = RandomForestRegressor(random_state=42)
multi_rf = MultiOutputRegressor(rf)

# Grid search with cross-validation
grid_search = GridSearchCV(multi_rf, param_grid, cv=5,
                           scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_}")
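
For larger or continuous search spaces, the same search can be run with scikit-optimize's BayesSearchCV instead of an exhaustive grid; the bounds below are an illustrative sketch, not tuned values:

from skopt import BayesSearchCV
from skopt.space import Integer

# Gaussian-process-guided search; n_iter evaluations instead of the full grid
bayes_search = BayesSearchCV(
    MultiOutputRegressor(RandomForestRegressor(random_state=42)),
    {
        'estimator__n_estimators': Integer(100, 500),
        'estimator__max_depth': Integer(5, 50),
        'estimator__min_samples_split': Integer(2, 10),
    },
    n_iter=30, cv=5, scoring='neg_mean_squared_error',
    n_jobs=-1, random_state=42)
bayes_search.fit(X_train, y_train)
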
Model Evaluation & Interpretability

Cross-Validation Strategy

5-fold cross-validation on training data to prevent overfitting. Performance metrics include Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R² score for each target variable.
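
A minimal sketch of this evaluation with scikit-learn's cross_validate, reusing the multi_rf model defined above:

from sklearn.model_selection import cross_validate

# 5-fold CV with all three metrics; sklearn reports errors as negated losses
scores = cross_validate(
    multi_rf, X_train, y_train, cv=5,
    scoring=('neg_root_mean_squared_error', 'neg_mean_absolute_error', 'r2'))

print(f"RMSE: {-scores['test_neg_root_mean_squared_error'].mean():.3f}")
print(f"MAE:  {-scores['test_neg_mean_absolute_error'].mean():.3f}")
print(f"R²:   {scores['test_r2'].mean():.3f}")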

Feature Importance Analysis

  • Tree-based importance: Built-in feature_importances_ from Random Forest/XGBoost
  • Permutation importance: Model-agnostic method measuring performance degradation
  • SHAP values: Local explanations showing individual prediction contributions
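
The latter two methods can be sketched as follows, assuming a held-out split (X_test, y_test) and the fitted grid_search from above:

from sklearn.inspection import permutation_importance
import shap

# Model-agnostic permutation importance on held-out data
perm = permutation_importance(grid_search.best_estimator_, X_test, y_test,
                              n_repeats=10, random_state=42)
print(perm.importances_mean)

# SHAP values for one target's tree model (MultiOutputRegressor fits
# one estimator per target; index 0 here stands for viscosity)
rf_single = grid_search.best_estimator_.estimators_[0]
explainer = shap.TreeExplainer(rf_single)
shap.summary_plot(explainer.shap_values(X_test), X_test)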

Residual Analysis

Systematic examination of prediction residuals to identify systematic biases, heteroscedasticity, and model limitations across different parameter ranges.
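
A short sketch of such a check, assuming NumPy arrays from the held-out split above; curvature in the scatter suggests systematic bias, a funnel shape heteroscedasticity:

import matplotlib.pyplot as plt

y_pred = grid_search.best_estimator_.predict(X_test)
residuals = y_test - y_pred

# Residuals vs. predictions for the first target (viscosity)
plt.scatter(y_pred[:, 0], residuals[:, 0], alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted viscosity')
plt.ylabel('Residual')
plt.show()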

Recipe Optimization & Inverse Modeling

Multi-Objective Optimization

Simultaneous optimization of conflicting objectives (e.g., minimize viscosity while maximizing cure depth) using Pareto front analysis to identify optimal trade-offs.

Bayesian Optimization

Efficient exploration of parameter space using Gaussian Process surrogate models and acquisition functions (Expected Improvement, Upper Confidence Bound) to minimize experimental iterations.
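
A minimal single-objective sketch using scikit-optimize's gp_minimize; the two design variables and their bounds are illustrative, and model_cure_depth is assumed to be a fitted regressor:

from skopt import gp_minimize

def cure_depth_loss(x):
    # Negate because gp_minimize minimizes and we want to maximize cure depth
    return -model_cure_depth.predict([x])[0]

result = gp_minimize(cure_depth_loss,
                     dimensions=[(0.1, 0.6),    # solids volume fraction
                                 (0.0, 2.0)],   # dispersant wt%
                     acq_func='EI', n_calls=30, random_state=42)
print(result.x, -result.fun)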

Pareto Front Calculation:

import numpy as np
from scipy.optimize import minimize

def is_pareto_optimal(costs):
    """Find Pareto-optimal points in multi-objective (minimization) space."""
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            # Keep only points that beat point i in at least one objective
            is_efficient[is_efficient] = np.any(costs[is_efficient] < c, axis=1)
            is_efficient[i] = True
    return is_efficient

# Multi-objective evaluation (model_viscosity, model_cure_depth and the
# bounds array are assumed to be defined elsewhere)
def objectives(x):
    viscosity = model_viscosity.predict([x])[0]
    cure_depth = model_cure_depth.predict([x])[0]
    return np.array([viscosity, -cure_depth])  # Minimize viscosity, maximize cure depth

# scipy's minimize needs a scalar objective, so each run minimizes a
# randomly weighted sum of the objectives (weighted-sum scalarization)
candidates = []
for _ in range(1000):
    w = np.random.dirichlet(np.ones(2))
    x0 = np.random.uniform(bounds[:, 0], bounds[:, 1])
    result = minimize(lambda x: w @ objectives(x), x0, bounds=bounds)
    if result.success:
        candidates.append(result.x)

# Filter candidates down to the Pareto-optimal set
costs = np.array([objectives(x) for x in candidates])
pareto_front = np.array(candidates)[is_pareto_optimal(costs)]