Computational Mechanics (2024) 74:151–168
https://doi.org/10.1007/s00466-023-02427-3

ORIG INAL PAPER

Knowledge-based modeling of simulation behavior for Bayesian
optimization

Felix Huber1,2 · Paul-Christian Bürkner3,4 · Dominik Göddeke1,3 ·Miriam Schulte2,3

Received: 17 June 2023 / Accepted: 27 November 2023 / Published online: 4 January 2024
© The Author(s) 2024

Abstract
Numerical simulations consist of many components that affect the simulation accuracy and the required computational
resources. However, finding an optimal combination of components and their parameters under constraints can be a difficult,
time-consuming and often manual process. Classical adaptivity does not fully solve the problem, as it comes with signifi-
cant implementation cost and is difficult to expand to multi-dimensional parameter spaces. Also, many existing data-based
optimization approaches treat the optimization problem as a black-box, thus requiring a large amount of data. We present a
constrained, model-based Bayesian optimization approach that avoids black-box models by leveraging existing knowledge
about the simulation components and properties of the simulation behavior. The main focus of this paper is on the stochastic
modeling ansatz for simulation error and run time as optimization objective and constraint, respectively. To account for data
covering multiple orders of magnitude, our approach operates on a logarithmic scale. The models use a priori knowledge of
the simulation components such as convergence orders and run time estimates. Together with suitable priors for the model
parameters, the model is able to make accurate predictions of the simulation behavior. Reliably modeling the simulation
behavior yields a fast optimization procedure because it enables the optimizer to quickly indicate promising parameter val-
ues. We test our approach experimentally using the multi-scale muscle simulation framework OpenDiHu and show that we
successfully optimize the time step widths in a time splitting approach in terms of minimizing the overall error under run time
constraints.

Keywords Bayesian optimization · Constrained optimization · Numerical simulation · Bayesian modeling · Multi-scale
simulation

B Felix Huber
felix.huber@mathematik.uni-stuttgart.de

Paul-Christian Bürkner
paul.buerkner@gmail.com

Dominik Göddeke
dominik.goeddeke@mathematik.uni-stuttgart.de

Miriam Schulte
miriam.schulte@ipvs.uni-stuttgart.de

1 Institute of Applied Analysis and Numerical Simulation,
University of Stuttgart, Allmandring 5b, 70569 Stuttgart,
Germany

2 Institute for Parallel and Distributed Systems,
University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart,
Germany

3 Stuttgart Center for Simulation Science,
University of Stuttgart, Universitätsstraße 32, 70569 Stuttgart,
Germany

4 Department of Statistics, TU Dortmund University,
Vogelpothsweg 78, 44227 Dortmund, Germany

1 Introduction

To solve complex multi-X problems numerically, simula-
tions are typically composed of many components, which
in turn are parameterized in many ways. For instance, there
are components that capture individual physics phenomena
on their individual spatial and temporal scales, realized by
individual software solutions, and there are components that
capture their complex and challenging interplay across scales
and aspects specific to the used hardware. Such components
can include discretization parameters, solver accuracies, cou-
pling schemes and hardware specific algorithms. In an ideal
world, such a parameterization of the involved numerical
schemes, their implementation, and their interplay would be
determined automatically, to achieve an overall optimal bal-
ance between the quality of solution and time to solution.
The quality can, e.g., be measured in terms of accuracy or
precision of the simulation result, or, more generally speak-
ing, in terms of gained knowledge. The optimal setting of
parameters must satisfy various constraints, some given by

123

http://crossmark.crossref.org/dialog/?doi=10.1007/s00466-023-02427-3&domain=pdf
http://orcid.org/0000-0003-4704-1339
http://orcid.org/0000-0001-5765-8995
http://orcid.org/0000-0002-1552-497X
http://orcid.org/0000-0003-4594-6142


152 Computational Mechanics (2024) 74:151–168

the simulated problem, some given by the interplay of com-
ponents, some given by the implementation choices, and
some induced by the simulation pipeline itself. Examples
include constraints on the simulation time and error, time step
width limitations for stability reasons, hardware constraints
that effect suitable algorithmsor parallelizationmethods. The
state of the art in this scenario is based on employing expert
knowledge and expert experience from the methods side and
the domain side. However, typical users of simulations are
not always equipped with this knowledge or experience.

In this paper, we use “simulations” as an encompass-
ing term for finding numerical approximations to forward
problems and deliberately restrict the discussion to exem-
plary multi-scale problems governed by PDEs. We assume
an existing understanding of the underlying simulation com-
ponents, and focusmostly on the interplay of the components.
This is already a challenging problem, because the efficiency
of one component often depends on choices for other com-
ponents and simply trying every parameter combination is
infeasible. Choosing good combinations by expert “intu-
ition” requires years of experience—which is naturally not
given for new challenging simulations.

1.1 Motivating example

Abasic example for combinations of components and param-
eters is the balancing of multiple resolution parameters to
minimize the numerical error without wasting resources by
refining too much in one direction. Such problems arise for
different time step widths in time splitting schemes for multi-
physics or multi-scale simulations or when balancing spatial
and temporal discretization widths. If the error induced by
one of the discretization parameters is already orders of mag-
nitude smaller than the error induced by the other one, further
refinement of the first only increases the computational cost
without impacting the overall error in a measurable manner.
Even if stability bounds, such as the Courant–Friedrichs–
Lewy (CFL) condition, limit the relation of the discretization
widths in some settings, such bounds only ensure numerical
stability and do not ensure spatial and temporal error balanc-
ing.

In all these scenarios, we typically have some “textbook”
information at hand about asymptotic error behavior in the
form of convergence orders, but we do not know the respec-
tive constants, which makes the problem of choosing several
discretization widths non-trivial. More complex examples,
whichwe envision to solvewith similarmethods as those pre-
sented in this paper, include considering all kinds of optimal
combinations of further methodological components such as
the choices of the discretization scheme, discretizationwidth,
linear solver and stopping criterion of the solver. These ques-
tions can also be addressed in terms of classical simulation

adaptivity, while in this paper, we aim to establish an alter-
native, minimally invasive, more data-driven approach.

1.2 General formulation

We can formalize the general problem we want to address
in the framework of optimization comprising an objective
function f : � → R and a constraint function g : � → R

with an upper bound gmax. Typically, the objective function
is the simulation error and the constraint is the run time, or
vice versa. The admissible set � parameterizes the simula-
tion components and can be either continuous or discrete.
The stopping criterion of a linear solver is an example for
a continuous parameter, while a solver type is chosen from
a discrete set. However, depending on the setting, continu-
ous parameters can also be restricted to a discrete set if, for
example, we only allow powers of 1

2 as discretization width.
The optimization functions f and g are unknown and have to
be learned from function evaluations, each of which means
that we have to execute a simulation in our setting. This also
implies that the feasible region {x ∈ � | g(x) ≤ gmax} is
not explicitly given and has to be learned together with the
constraint function g. The final optimization problem then
reads

argmin
x∈�

f (x) s.t. g(x) ≤ gmax.

1.3 Knowledge-infusedmodeling for f and g

Without knowing anything about f and g, the optimization
problem is difficult to solve and severely ill-posed. How-
ever, typical simulation behavior of subcomponents is not a
black-box function: The individual components used inside
a simulation are often numerical methods for which qualita-
tive behavior is known a priori at least asymptotically, e.g., in
terms of convergence rates, error reduction or computational
complexity. For example, an Euler scheme is first order in
time,while aCrank–Nicolson scheme is secondorder in time.
This knowledge is very valuable, as it gives a good idea of
how the simulation is expected to behave over a large range of
the optimization space. Thus, we suggest to explicitly exploit
this knowledge during the optimization. To predict f and g
quantitatively, we learn models for f and g from data and the
a priori knowledge. As simulations are usually performed in
the pre-asymptotic regime of the numerical methods and as
measurements of, e.g., run time, are usually noisy, we addi-
tionally enhance the model by terms that allow for deviations
from the ideal asymptotic behavior.

123


Computational Mechanics (2024) 74:151–168 153

1.4 Challenges

The main challenges of solving the outlined optimization
problem by infusing a priori knowledge about the qualitative
behavior of f and g are:

Expensive evaluations The optimizer has to work with
data that are potentially expensive to generate and are, thus,
scarce: Each evaluation of f and g corresponds to executing
at least one simulation run and additional computations to
estimate the simulation error and run time. Such simulation
runs and the computation of error estimators are expensive.
Therefore, the number of such expensive evaluations has
to be kept small to keep the optimization cost acceptable.
This, however, requires function models that can extrapo-
late with sufficient accuracy from these evaluation points to
usually further away regions where we might find the opti-
mum. Overall, the cost of running a simulation for specific
parameters is assumed to be the dominating cost factor dur-
ing the optimization. Therefore, it is important to devise a
data-efficient optimization that uses available data optimally
during the optimization, i.e., an optimization that carefully
chooses new evaluation points in combination with a model
ansatz that allows for data extrapolation instead of only inter-
polation.

Scales covered by function values The optimization has to
work with data spanning many orders of magnitude. Typi-
cal values from simulations such as error norms or run times
can quickly converge to zero or increase to very large val-
ues, depending on the parameters of the simulation, e.g., in
terms of step widths of discretizations. While this is in gen-
eral not necessarily a problem for fitting such functions in a
regression approach, it becomes an issue in our case due to
the deviations from the expected ideal asymptotic behavior
that we mentioned earlier. Their scale and also the scale of
model uncertainties between data points depend on the scale
of the predicted data values. This means, that we assume
larger absolute deviations and uncertainties for large values
and smaller deviations for smaller values. As the function
values of f and g span over multiple orders of magnitude,
this influences the modeling process, as the model has to be
accurate on all scales.

Non-convexity and local minima Simulations often consist
of many different interacting components which influence
the simulation outcome and cannot be optimized individu-
ally. Interactions of components typically also cause both the
objective function and the feasible region to be non-convex.
In particular, the optimumcan be in the interior of the feasible
region and we might have to face several local minima. An
examplewith an optimum inside the feasible regionwould be
a simulationwhere a finer temporal discretization reduces the
run time as it decreases the number of iterations of iterative

solvers because the smaller time step width makes the prob-
lem in each time step easier to solve. Several local optima
might be caused by sweet spots where, e.g., errors cancel out
for specific combinations of parameters.

1.5 Summary of contributions

We propose an optimization approach specifically suited for
data from a simulation context. The approach infuses a pri-
ori knowledge in themodels and combines thesemodels with
Bayesian optimization. We make the following specific con-
tributions:

Bayesian optimizationWe suggest a Bayesian optimization
approach: In each optimization step, known data are used to
build a stochastic model of the optimization functions, which
is then used to select new evaluation points. This allows the
optimizer to work with scarce data and gives the flexibility
to adjust the internal model.

Knowledge infusion To build accurate models of the opti-
mization functions, we infuse available a priori knowledge
about the objective function f and the constraint g into the
modeling process. This allows the model to make extrapo-
lated predictions further away from known data points and
reduces the number of function evaluations necessary to find
a good approximation of the optimum.

Logarithmic scaling To deal with data ranging across multi-
ple orders of magnitude, we model the optimization function
on a logarithmic scale. This also allows modeling uncer-
tainties and deviations from the a priori infused asymptotic
behavior as additive terms on the logarithmic scale, which
corresponds to a multiplicative factor on the original scale.
Modeling on a logarithmic scale in addition ensures that func-
tion values in our model are always positive for quantities
such as simulation errors or run times.

Prior choice We show how a weakly informative prior can
be applied for the qualitative a priori knowledge on the log-
arithmic scale. This matches the modeling approach on the
logarithmic scale and significantly reduces spurious trends
in the posterior distribution.

2 Related work

The need to optimize simulation components is not new
[25]. However, often the parameters are optimized manu-
ally, heuristically or set to values taken from literature and
only adjustedwhen problems arise. Automatically optimized
parameters are typically present in the context of adaptiv-
ity. Adaptivity adjusts parameters during the runtime of a
simulation, frequently using local criteria. Examples include

123


154 Computational Mechanics (2024) 74:151–168

adaptive time step sizes [20]while ensuring stability based on
CFL conditions, adaptive stopping criteria for linear solvers
[31] or inexact non-linear Newton solvers [11] based on
residuals, or spatial grid adaptivity based on local error
estimators, using hierarchical surpluses [9] or a posteriori
estimates [4, 12].

Performance modeling techniques that focus on hardware
aspects, such as the roofline model [10, 43], complement
these mathematical approaches and help to select well-
performing algorithms for the used hardware.

Adaptive methods often restrict adaptivity to selected
parameters that can easily be changed during the simulation
[11, 20, 31], andhence neglect the interplay betweendifferent
simulation components. Usually, they do not allow to opti-
mize a global objective such as a global error norm under
global constraints such as limited overall runtime. Also, it
is sometimes easier to optimize parameters on subproblems
[8] instead of the full simulation setup. This means that
many parameters remain unchanged during the simulation
and reaching an overall optimum can not be expected. While
adaptive methods can change solver properties dynamically,
they usually come at a relatively high implementation cost
and are not always feasible for all parameters [4, 12].

In contrast to dynamical adaptivity, external parameter
optimization can improve existing simulations without mod-
ifying the underlying algorithms. This noninvasive approach
is particularly useful when the parameters remain constant
throughout the simulations. To solve a global parameter opti-
mization problem externally, many different methods can
be used: When evaluations of the objective and constraint
functions are relatively cheap, many evaluations of these
functions are feasible and direct grid-based or random search
methods can be used [5, 6]. In our setting, each function eval-
uation is expensive because each evaluation corresponds to
a simulation run. Therefore, the optimization methods have
to be more efficient in terms of knowledge gained from an
evaluation, but can spend more time in selecting suitable
data points. Model-based optimization methods address this
issue: They build an internal model of the unknown objective
and constraint functions that is used as an easy to evaluate
surrogate in the optimization process. Commonly used mod-
els include random forests [13, 23], geometric models [32]
and Gaussian processes [38]. Such models can then be used
in a Bayesian setting to simultaneously approach the opti-
mum and improve the models by selecting new data points
according to different selection strategies [14, 21, 24, 33, 39,
42]. However, in many of these settings little is known about
the functions to be optimized which are therefore commonly
treated as black-box functions with few prior assumptions
[44]. This is in contrast to our setting,where knowledge about
the simulation can be used:

The simulation behavior is usually not a black-box and
stronger a priori knowledge is available. For example, a pri-

ori knowledge in the form of global trends can be included
by parameterizing the mean function of a Gaussian process
with additional basis functions [34–36] which also yields a
Gaussian process posterior. However, this approach does not
ensure that the model is, e.g., positive and it is difficult to
model deviations relative to the global trend. Approaches as
in [1] and [2] make use of a known composite structure in
the objective function by only modeling the unknown com-
ponents as Gaussian processes and computing the known
parts explicitly. This approach does not directly transfer to
our setting as it does not model a global trend. Other, more
flexible properties that can be expressed as linear constraints
for a Gaussian process such as monotony or convexity are
discussed in [26, 27].

Outside the simulation parameter optimization context,
optimization problems with unknown constraints are often
solved bymodeling the objective and the constraint explicitly
inside a Bayesian optimization as in [16, 17], or combined
with classical constrained optimization algorithms as in [19].

3 Knowledge-infusedmodeling of objectives
and constraints

To deal with scarce and expensive data, we use a Bayesian
optimization approach for active learning. In this section,
we explain all components of this approach with a special
focus on the model ansatz for the stochastic approximation
of the objective and constraint functions f and g including
the choice of a suitable prior for the respective model coeffi-
cients.

3.1 Bayesian optimization overview

The Bayesian optimization approach approximates the opti-
mal parameter setting for our numerical methods in an iter-
ative process. In each iteration, it simultaneously improves
the models for the objective and the constraint while trying
to get closer to the optimum. Each iteration consists of the
following steps:

Based on N known data points {(xi , f (xi ), g(xi ))}Ni=1,
we build a model of the objective and constraint functions
using the knowledge-basedmodel described later in Sect. 3.2.
More specifically, we approximate the posterior distribution
for these models.

Then, an acquisition function is evaluated based on the
model and its predictions. The acquisition function is used to
approach the optimum by selecting a new promising eval-
uation point. For this, the acquisition function has to be
maximized, either by using an optimization method (such
as in [3]) or by evaluating all possible evaluation points if
the parameter space is discrete and small enough. The acqui-
sition function steers the optimization process by choosing

123


Computational Mechanics (2024) 74:151–168 155

points that are closer to the optimum or that reduce the model
uncertainty.

Last, the optimization functions f and g are evaluated at
the new evaluation point. This is the cost-dominating step
in our setting, as it requires at least one new simulation run.
In our setting, this step also includes additional computa-
tions such as estimating the simulation error or measuring
the run time. The simulation result yields a new data point
(xi+1, f (xi+1), g(xi+1)) for the next optimization step.

The optimization can be stopped after a fixed number of
iterations or when the acquisition function becomes small,
which indicates that little further improvement is expected.
For a more detailed introduction into Bayesian optimization,
we refer to [15, 17, 37].

Note that,whilemanydifferent acquisition functions exist,
they all depend on a good model for the optimization func-
tions in order to give good results. Therefore, we consider
the model ansatz as a natural starting point to add a priori
knowledge into the optimization in order to improve the per-
formance and to use sparse and expensive data efficiently.

In the following, we discuss how models for numerical
methods can be build that exploit the available knowledge.

3.2 Model ansatz

Our model for both the objective and the constraint consists
of two parts: A global model μ that is based on the a priori
knowledge and a local model s for deviations from the global
model. We combine both model components to approximate
the optimization functions on a logarithmic scale. We model
the objective f : � → R in a stochastic settingwith a random
variable as

ln f̂ (x) = μ(x) + s(x). (1)

We now give an overview on each sub-model for f . The
modeling of the constraint g is done analogously.

As global model μ(x) we use the logarithm of a linear
combination of q basis functions ϕ j representing the a pri-
ori knowledge, e.g., on asymptotic convergence rates of a
numerical method,

μ(x) = ln
( q∑
j=1

β jϕ j (x)
)

with coefficients β = (β1, . . . , βq) ∈ B ⊆ R
q which are

restricted to a set B such that the linear combination is posi-
tive. The global model is meant to give a good representation
of global trends in the optimization functions on a logarithmic
scale, which enables the model to make predictions further
away from known data points. Hence, promising parameter
regions can be identified for further investigation in the opti-
mization process and less promising regions can be ruled

out based on only few data points. As an example, the error
of a second order time stepping scheme with step width h
as parameter could use the basis function h2 in the global
model with x = h to represent the asymptotic convergence
order for h → 0. Similarly, the basis function h−1 might
represent the expected run time scaling in the global model
for the run time constraint in this example. The coefficients
β = (β1, . . . , βq) ∈ R

q are random variables inferred in our
Bayesian approach as discussed in Sect. 3.3.

The local model s : � → R accounts for deviations from
the global model and is modeled as a zero-mean Gaussian
process with covariance kernel function k : � × � → R:

s ∼ GP(0, k)

This allows corrections to μ due to non-modeled higher
order terms, pre-asymptotic behavior or local effects such
as error cancellation of different simulation components. In
the current model, we do not include measurement noise
such that our measured data are exact function evaluations
of f and g. However, measurement noise can be included by
an extra additive term in the model, representing pointwise
deviations. In our experiments we consider an optimization
problem, where we model pointwise deviations.

3.2.1 Logarithmic scaling

Modeling on a logarithmic scale has several advantages:
First, it allows the model to represent values accurately over
the whole parameter range. This is essential when model-
ing simulation behavior, which often covers a large range of
scales as seen in our example fromSect. 4, wherewe consider
the numerical error and run time depending on discretiza-
tion resolutions. Second, the additive correction term from
the local model accounts for corrections and noise relative
to the global model. Applying the corrections on the loga-
rithmic scale implies, that deviations from the local model
scale proportionally to themagnitudeof themodeled function
f . Therefore, further adjustments to the Gaussian process
to account for the local order of magnitude of data values
are often not necessary and using a stationary kernel for the
Gaussian process is sufficient to capture scale-dependent cor-
rections and noise. Third, the model output, and thus the
predictive distribution, on the non-logarithmic original scale
is always positive. Therefore, the model is well-suited for
simulation quantities such as simulation errors or run times
that are known to be positive.

Intuitively, these advantages are also the reason why con-
vergence results or run time scaling plots are usually shown
and discussed on a logarithmic scale: It is often much easier
for analysts to derive conclusions on the polynomial degree
of trends from such a representation.

123


156 Computational Mechanics (2024) 74:151–168

3.2.2 Global model

In the global model, the parameters β j of the linear combi-
nation are unknown and have to be learned by the Bayesian
model. Because the modeled quantity is known to be posi-
tive, the coefficients β j of the global model are restricted to
a set

B ⊆
{
β ∈ R

q | ∀x∈�

q∑
j=1

β jϕ j (x) > 0

}
.

This ensures that the global model is positive for all eval-
uation points and, thus, the logarithm in the model is
well-defined.Knowing that the coefficients represent individ-
ual error components or run time scaling, choosing β j > 0
for j = 1, . . . , q is often sufficient.

Using a linear combination of basis functions in the global
model is sufficient for our experiments in Sect. 5. Generally,
there is much flexibility when choosing a suitable global
model and other parameterized functions can be used to
describe the global trends.

3.2.3 Local model

The local model consists of a zero-mean Gaussian process
with kernel function k. As kernel we suggest to use stan-
dard radial basis functions, such as a Gaussian or Matérn
kernels [34], to model corrections to be locally correlated in
the parameter space. This means that the correlation of two
points depends on their distance relative to the length scale of
the kernel. For parameters such as discretization widths, that
are typically varied on a logarithmic scale, e.g., as powers
of 1

2 , we suggest using a kernel that operates on this scale
as well. For this purpose, we transform the kernel k to a
kernel on the logarithmic scale k̃ by transforming its inputs:
k̃(x, x ′):=k(ln x, ln x ′).

3.3 Posterior and predictive distribution

A drawback of modelling on the logarithmic scale is that
the model becomes nonlinear in the unknown coefficients
β j and, therefore, the model prediction and its stochastic
quantities such as mean value and standard deviation cannot
be computed analytically. In the following,we shortly outline
the calculation of the posterior distribution and the model
prediction.

Our Bayesian approach computes the posterior distribu-
tion p(β | x, y) of the model coefficients β = (β j )

q
j=1 based

on the evaluated data y = (ln f (xi ))Ni=1 at x = (xi )Ni=1 and
the Bayesian theorem:

p(β | y, x) = p(y | β, x) p(β)

p(y | x)

The posterior is computed using the prior p(β) for the coef-
ficients and the likelihood p(y | β, x) of observing the data
given the coefficients β. The term p(y | x) does not depend
on the coefficients β and is therefore a constant scaling fac-
tor. In Sect. 3.4, we go more into detail on how the prior
distribution for β is chosen.

The likelihood p(y | β, x) of the data is implied by the
local model. Given fixed values for β, the measured values
are evaluations of a realization of a Gaussian process with
mean μ and kernel k

ln f̂ ∼ GP(μ, k) (2)

because the observations are the sum of the global and the
local model as given in (1), i.e., ln f̂ (x) = μ(x) + s(x).
For a finite number of observations y, the function values are
distributed according to a normal distribution

y ∼ N(m, �)

with mean m = (μ(xi ))Ni=1 and the covariance matrix � =
(k(xi , x j ))Ni, j=1. In other words, the likelihood p(y | β, x)
is the density of the normal distributionN(m, �), where the
global model serves as a parameterized mean function and
the local model defines the covariance.

To predict the simulation behavior at unknown evaluation
points, the predictive distribution

p(ln f̂ (x) | y, x)
=

∫

R
q
p(ln f̂ (x) | β, y, x) p(β | y, x) dβ (3)

is computed using the posterior distribution p(β | y, x) of
the coefficients and the likelihood p(ln f̂ (x) | β, y, x) as
given in (2).

The global model allows the model to extrapolate reli-
ably, while the local model ensures that locally correlated
deviations from the trend are representable in the model.
The predictive distribution also allows quantifying themodel
prediction and uncertainties for the predicted simulation
behavior by computing expected values, standard deviations
or quantiles of the predicted values.

Due to the nonlinearity of our model caused by the log-
arithmic scaling, the predictive distribution (3) cannot be
computed analytically. Therefore, we approximate the pos-
terior distribution of the coefficients β j with Markov chain
Monte Carlo sampling based on the prior distribution p(β)

and the likelihood p(y | β, x). In particular, we use Hamil-
tonian Monte Carlo sampling with Stan [40]. The resulting
samples are then used to approximate the posterior predictive

123


Computational Mechanics (2024) 74:151–168 157

distribution (3) as

p(ln f̂ (x) | y, x) ≈ 1

|S|
∑
β∈S

p(ln f̂ (x) | β, y, x),

where S denotes the set of posterior samples, and |S| the
number of samples. Importantly, sampling from the posterior
and the predictive distribution is comparably cheap as we do
not run new simulations.

For a more detailed introduction to Bayesian modeling,
we refer to [18].

3.4 Prior for the global model coefficients

By introducing a parameterized global model μ, we have
to specify a prior for the coefficients β j in our Bayesian
setting. When the order of magnitude of these coefficients is
unknown, we want to apply a weakly informative prior that
admits a large range of values.

However, in our case applying a weakly informative prior
directly to the coefficients β j leads to unintended results:
The logarithm in the global model implies a prior density for
the global model that is skewed to larger function values on
a logarithmic scale. Therefore, the prior has a spurious bias
towards large values for μ(x), which hinders the extrapola-
tion properties of the global model. We illustrate this effect
experimentally in Sect. 5.1.2.

Therefore, it is important onwhich scale wewant the prior
to be weakly informative. As we model the unknown func-
tions on a logarithmic scale, we suggest applying a weakly
informative prior also on the logarithmic scale, i.e., to the
logarithm of the coefficients ln β j .

Applying the prior to β j on the logarithmic scale leads
to the desired property that the predictive distribution
p(ln f̂ (x) | y, x), as given in (3), becomes nearly invari-
ant to scaling the data f = ( f (xi ))Ni=1 and the coefficients β j

by a factor b > 0 simultaneously.
For this,wefirst show that the likelihood p(y | β) is invari-

ant to this scaling.With given coefficientsβ, the likelihood of
the observed data y is a normal distribution with the global
model as mean and the covariance given by the Gaussian
process in the local model as discussed in Sect. 3.3. Using
y = ln f and the definition of the global model μ for given
coefficients β, we can write the likelihood for the scaled data
and coefficients as

p(ln(bf) | bβ, x) = N(ln(bf) | μ(x | bβ),�)

= N(ln f + ln b | μ(x | β) + ln b, �)

= N(ln f | μ(x | β),�)

= p(ln f | β, x)

where N(y | m, �) denotes the probability density of the
normal distribution N(m, �) evaluated at y, and μ(x | β)

is the global model for given coefficients β evaluated at the
data points x.

Having shown that the likelihood is scaling invariant, the
remaining term affecting the shape of the predictive distri-
bution is the prior p(β). Therefore, we want to use a prior,
that results in an approximately scaling invariant predictive
distribution. This can be achieved by a weakly informative
prior on the logarithmic coefficients. To show this, we write
the predictive distribution in terms of β̃:= ln β and perform a
variable substitution to β̃b:= ln(bβ) = ln β + ln b to rescale
the coefficients:

p(ln( f̂ (x)) | ln f, x)
=

∫

R
q
p(ln( f̂ (x)) | eβ̃ , ln f, x)

· p(ln f | eβ̃ , x) p(β̃)

p(ln f | x) dβ̃

=
∫

R
q
p(ln( f̂ (x)) |b−1eβ̃b ,ln f, x)

· p(ln f |b−1eβ̃b , x)p(β̃b−ln b)

p(ln f | x) dβ̃b.

This substitution usually leads to a Jacobi determinant, that
alters the prior distribution for the new variables. However,
because the scaling invariance corresponds to a translation
on the logarithmic scale, the Jacobi determinant becomes 1
for the logarithmic parameterization and does not perturb the
prior. Using the scaling invariance of the likelihood, the terms
can be rewritten as

=
∫

R
q
p(ln( f̂ (x)) | eβ̃b , ln f, x)

· p(ln(bf) | eβ̃b , x) p(β̃b − ln b)

p(ln f | x) dβ̃b.

If the prior density is rather flat over a large range of the
parameter space, the prior density p(β̃ − ln b) is similar to
p(β̃). Applying a similar argument to themarginal likelihood
that serves as a normalization factor, shows the approximate
scaling invariance:

≈
∫

R
q
p(ln(b f̂ (x)) | eβ̃b , ln f, x)

· p(ln(bf) | eβ̃b , x) p(β̃b)

p(ln(bf) | x) dβ̃b

= p(ln(b f̂ (x)) | ln(bf), x).

Therefore, we use a weakly informative prior on the param-
eters ln β j on the logarithmic scale to achieve an approxi-

123


158 Computational Mechanics (2024) 74:151–168

mately scaling invariant prediction. An example for such a
prior is a normal distribution with a large variance. A similar
argument is made in [7] to get an approximately invariant
posterior distribution from a data-translated likelihood by
choosing a suitable scale for the prior.

3.5 Acquisition function

New evaluation points for the optimization are chosen
by maximizing the acquisition function. The acquisition
function uses the predictive distribution to compute how
promising possible evaluation points are for the optimiza-
tion. As the focus of this work lies on the general modeling
aspects for the optimization, we do not discuss the broad
spectrum of acquisition functions that can be used. In our
experiments, we use a constrained version of the expected
improvement acquisition function [16, 17]

aEI(x) = E
(
max(ln f ∗

N − ln f̂ (x), 0) | x, y)

· P(ĝ(x) ≤ gmax | x, y) (4)

where

f ∗
N = min

i=1,...,N
g(xi )≤gmax

f (xi )

is the smallest feasible value evaluated so far. The acquisi-
tion function consists of two terms that are computed over
the predictive distributions (3) of f̂ (x) and ĝ(x), accounting
for the minimization goal of the objective and the con-
straint in the optimization problem. The first term is the
unconstrained expected improvement function, that favors
evaluation points, that improve the objective function over
f ∗
N , ignoring the feasibility of such a point. To make the

acquisition function aware of the constraint, the second term
P(ĝ(x) ≤ gmax | x, y) is the probability that an evaluation
point is feasible, to ensure that the optimizer stays close to
the feasible region. Evaluations which violate the constraint
ĝ(x) ≤ gmax with a high probability are penalized with a
low probability. Analogously to our modeling approach, we
apply the acquisition function on the logarithmic scale, to
measure the improvement over the current minimal value
f ∗
N on a logarithmic scale.

4 Test application

To evaluate our modeling approach on an exemplary test
application, we use the multi-physics multi-scale biome-
chanics simulation framework OpenDiHu1 [29, 30]. Typical

1 https://github.com/maierbn/opendihu.

simulations ofmuscle activation startingwith the signal com-
ing from the spinal cord and leading to the actual contraction
or release of a skeletal muscle comprise the following steps:
On the subcellular level, the signal triggers a range of bio-
electrical-chemical reactions, resulting in the generation of a
transmembrane voltage. On the meso-scale, this transmem-
brane voltage leads to a traveling activation signal on the
individual muscle fibers. A muscle consists of many thou-
sands of muscle fibers and the activation signals control the
contraction of each muscle fiber. On the macro-scale, i.e.,
the scale of the whole muscle, this contraction is modeled
by classical 3D mechanics. Obviously, the three compo-
nents each have their own temporal and spatial scales. The
components are coupled through homogenization and are
implemented in separate software subcomponents. Since a
single muscle can comprise up to several hundred thousand
individual fibers [28, 30], the whole simulation is only fea-
sible in a massively parallel environment.

For simplicity reasons, we limit ourselves to a setup
with only two scales in this paper: We simulate the gen-
eration of an electrical activation signal and its traveling
along a single muscle fiber. While this is only a subset of
the full simulation and essentially removes parallelism from
the picture, it already demonstrates many typical challenges
that arise in simulations when multiple numerical compo-
nents are combined: The simulation consists of two coupled
sub-models, each having an individual time discretization.
Because numerical errors in the models influence each other,
an optimal discretization for eachmodel can not be computed
a priori, and classical adaptivity on each scale cannot exploit
its full potential.

4.1 Model equations and discretization

In detail, the simulation of a muscle fiber consists of two
coupled models that build a reaction–diffusion system:

∂t u = σ

AC
∂2x u − 1

C
Iion(u, z) (5)

∂tz = Z(u, z) (6)

The bidomain Eq. (5) models the homogenized transmem-
brane voltage u along a one-dimensional muscle fiber. The
subcellular model (6) describes how cells along the muscle
fiber react to the voltage, which results in a transmembrane
current Iion. The surface-to-volume ratio A, the electrical
capacitance C of the membrane and the conductivity of the
fiber σ are assumed to be constant for each fiber and are
listed in Table 1. The subcellular model is described by the
Hodgkin–Huxley model [22] and has additional state vari-
ables z which describe, for example, ion concentrations in
the cells. The system (5) and (6) is discretized in space with
a one-dimensional equidistant grid and linear Lagrange finite

123

https://github.com/maierbn/opendihu


Computational Mechanics (2024) 74:151–168 159

Table 1 Physical parameters for the bidomain equation (5) in the
OpenDiHu simulation. The length of the simulated fiber is lfiber

σ A C lfiber[
mS cm−1

] [
cm−1

] [
μF cm−2

]
[cm]

3.828 500 0.58 11.9

elements for the diffusion equation. The reaction equations
(6) are solved at each grid point.

In the time dimension, the system of equations is decou-
pled by a second order Strang operator splitting [41], which
allows to use an individual time stepping scheme for each
subproblem. Each splitting step of width hs consists of three
steps that are performed on each interval ]tn, tn + hs]:

1. First, we solve the reaction equations

∂t u = − 1

C
Iion(u, z)

∂tz = Z(u, z)
(7)

numerically on the interval ]tn, tn + hs
2 ]with z and u from

time tn as initial values.
2. Second, we solve the diffusion equation

∂t u = σ

AC
∂2x u (8)

numerically on the interval ]tn, tn + hs] using the values
of u at time tn + hs

2 from the previous step. The state z of
the subcellular model remains unchanged.

3. Third, we solve the reaction equations (7) again, this time
on the interval ]tn + hs

2 , tn + hs] with z and u from time
tn+1 = tn + hs from the previous step as initial values.

Because the linear diffusion equation (8) is a stiff problem, it
is solved with an implicit Euler method or a Crank–Nicolson
method with step width h1. The nonlinear reaction term (7)
is a system of nonlinear ordinary differential equations and
solved with Heun’s method with step width h0

2 in phase 1
and 3 of each splitting step. The time step width hs of the
Strang splitting is set to the maximum of h0 and h1 to avoid
the splitting error to become larger than necessary.

4.2 Optimization problem

The optimization goal is to find time step widths h0 and h1
for the subproblems, such that we minimize the simulation
error within an upper limit for the simulation run time. This
means, we solve

argmin
(h0,h1)∈�

f (h0, h1) s.t. g(h0, h1) ≤ gmax (9)

for a given run time limit gmax. We choose

� = {2−i × 4 × 10−3 ms | i = 0, . . . , 14}2

as the discrete admissible set for the time step widths h0 and
h1 to ensure that the time step width of each solver divides
the splitting time step width hs = max(h0, h1).

In space, we discretize the muscle fiber with 1190 lin-
ear Lagrange elements. To ensure that only the discretization
errors dominate the final result, we solve the linear system
that arises from the diffusion equation (8) with a conju-
gate gradient solver with a tight relative residual tolerance
of 10−14.

For each evaluation, we simulate 4 × 10−2 ms of physical
time starting from an initial activation wave profile. As our
example application is part of a larger simulation example in
OpenDiHu that is coupled to the transmembrane voltage, we
use the L2-error of theu variable at thefinal time4 × 10−2 ms
for the objective function. The error is estimated by comput-
ing the L2-difference against a very fine reference simulation
with a time discretization of h0 = h1 = 2−17×4 × 10−3 ms.

4.3 Error and run timemodeling

To build individual models for the error f and the run time
g in the optimization problem (9), we use the modeling
approach from Sect. 3:

ln f̂ (h0, h1) = μ f (h0, h1) + s f (h0, h1)

ln ĝ(h0, h1) = μg(h0, h1) + sg(h0, h1)

It combines global model components μ f and μg with local
deviation models s f and sg on a logarithmic scale.

To build the global models μ f and μg , we use a pri-
ori knowledge directly stemming from a priori asymptotic
convergence results and known asymptotic run time scaling.
The convergence order of the simulation error depends on
the specific solver combination. The Strang splitting, Heun’s
method and the Crank–Nicolson method are second order
accurate, whereas the implicit Euler method is first order
accurate. This leads to an ansatz of multivariate polynomials
in the two step widths. The model

μIE
f (h0, h1) = ln

(
cc + c0h

2
0 + c1h1 + cs max(h0, h1)

2)

constitutes the global model for the error when using the
implicit Euler scheme for the diffusion problem (8) and

μCN
f (h0, h1) = ln

(
cc + c0h

2
0 + c1h

2
1 + cs max(h0, h1)

2)

is applied if we use the second order accurate Crank–
Nicolson scheme instead. In both cases, the term cc represents
the fact that the simulation error will not become arbitrarily

123


160 Computational Mechanics (2024) 74:151–168

small by reducing the time step width alone, because other
error sources, e.g., induced by floating point arithmetic and
inaccuracies in the linear solver, also affect the outcome. cs is
the splitting error. For each time stepping scheme, a smaller
time step width results in more time steps to be computed
while the cost of the splitting itself is negligible. Therefore,
we use

μg(h0, h1) = ln
(
dc + d0h

−1
0 + d1h

−1
1

)

for the globalmodel of the run timeconstraint in both settings.
The coefficient dc represents constant run time parts such
as set up and initialization times. For all global models, we
restrict the coefficients to be positive.

The local model is based on a Gaussian process with the
kernel

k(x, x ′) = α2 exp

(
−

∥∥x − x ′∥∥2
2	2

)
+ σ 2δx,x ′ ,

which consists of two terms. The first term models local cor-
relation using a Gaussian kernel with variance α2 and length
scale 	. With the second term with the Kronecker delta δ,
we add additional pointwise noise with standard deviation
σ to account for pointwise, uncorrelated deviations from
the global model that cannot be explained by the Gaussian
kernel. Such pointwise deviations are caused by numerical
inaccuracies and fluctuations caused by the many layers of
complex computations in the simulation that affect the simu-
lation error and run time andmake the optimization functions
non-smooth. Not taking these effects into account can cause
instabilities in the modeling process.

4.4 Simplified one-dimensional analytical test case

To validate our modelling approach in the next section,
we construct a simpler one-dimensional example function,
where the ground truth is known, but with structural similar-
ity to the OpenDiHu example. More specifically, we assume
that the simple one-dimensional function has a global trend
consisting of a constant and a quadratic term. The constant
corresponds to underlying offsets, the quadratic term to the
convergence order in the real example.A local deviation from
this trend is represented by a squared exponential function
such that the objective function reads:

f (h) = 0.1 + 1.5h2(1 − 0.8e−(2.2−h)2) (10)

The coefficients are chosen to be on different scales so that
the effect of the prior is better visible. The constants in the
squared exponential functions are chosen such that the local
deviation is clearly visible in the experimental results.

The a priori knowledge for the model consists of the basis
functions ϕ1(h) = 1 and ϕ2(h) = h2 and, thus, the global
model reads

μ f (h) = ln
(
c1 + c2 · h2)

with c1, c2 > 0. The local model uses the same Gaussian
kernel as the test application in Sect. 4.3.

5 Experiments

To evaluate our modeling approach, we split our experiments
in two parts. We start with validating the model in Sect. 5.1
by showing how the log scale, the global model and the
prior affect the model predictions. In Sect. 5.2, we then show
how the model can be used inside a Bayesian optimization
approach.

5.1 Model validation

To examine the effect of the model and the prior, we employ
two examples: First, the simple one-dimensional example
function from Sect. 4.4, and second, the OpenDiHu example
we presented in Sect. 4.

In the following, we use the simple one-dimensional
example function to show the usefulness of the a priori
knowledge in the global model and the logarithmic scale in
Sect. 5.1.1. In Sect. 5.1.2 we show the effect of the prior on
the model quality, followed by an evaluation of the overall
model quality for OpenDiHu in Sect. 5.1.3.

In both setups, the structure of the deviations from the
global trend is assumed to be known for simplicity reasons.
The variables α, 	, and σ in the local model are selected a
priori and are not subject to the Bayesian inference.

5.1.1 Influence of the global model and the logarithmic
scale

In this first experiment, we show the effect of including a
global model representing our a priori knowledge on the
global trend and of modeling on a logarithmic scale. In par-
ticular, we show that our model is able to represent the global
trend, that the predicted values are positive, and that the mag-
nitude of the model uncertainties is relative to the magnitude
of the global trend. For our experiments, we use the simple
one-dimensional function defined in (10) and five given eval-
uation points for the parameters h chosen equidistantly on the
logarithmic scale between ln(0.25) and ln(4). The respective
known data expose the quadratic trend clearly, while the con-
stant term is more difficult to deduce from the evaluated data.

123


Computational Mechanics (2024) 74:151–168 161

Table 2 Priors and parameters
used in the model validation
experiments with the
one-dimensional problem (10)
in Sect. 5.1.1 and Sect. 5.1.2

Model Global model prior on c1 and c2 Local model HMC samples
	 α2 σ 2

Gaussian 0.25 0.1 0 Analytic

Linear c1, c2 ∼ Flat 0.25 0.1 0 Analytic

Logarithmic ln c1, ln c2 ∼ Student-t(4, 0, 7) 0.25 0.5 0 8000

Logarithmic c1, c2 ∼ Student-t(4, 0, 30) 0.25 0.5–8 0 8000

Logarithmic ln c1, ln c2 ∼ Student-t(4, 0, 7) 0.25 0.5–8 0 8000

We compare the model results of three different modeling
approaches using the parameters as listed in Table 2:

Gaussian process The Gaussian process model acts on the
original, linear scale and completely omits a global model
component:

f̂ (h) = s(h)

Themodel uses the radial basis function fromSect. 4.3 as ker-
nel. The predictive distribution can be computed analytically
for this model.

Linear The linear model extends the Gaussian process model
by using additional basis functions to expand the ansatz space
of the Gaussian process regression as in [34], i.e., it is also
on a linear scale, but includes a global model:

f̂ (h) = c1 + c2 · h2 + s(h)

Also for this model, the predictive distribution can be com-
puted analytically.

Logarithmic The logarithmic model represents the complete
model on the logarithmic scale as introduced above in Sect.
4.4. Due to the nonlinearity induced by the logarithmic
scaling, the prediction of this model can not be computed
analytically and requires sampling.

To evaluate the quality of the model predictions, we com-
pare the model predictions as the predicted mean and the
model uncertainties as the 50% confidence interval between
the 0.25 and 0.75 quantiles of the predictive distribution for
all three models in Fig. 1.

Themodel consisting only of a Gaussian process is unable
to represent the global trend because themean quickly decays
to zero further away from the evaluated data points. Further,
the predicted values are allowed to be negative which con-
tradicts the knowledge that errors and runtimes have positive
values. This also causes the lower quantiles to quickly decay
to zero and even reach into negative values.

Adding the prior knowledge linearly as additional basis
functions allows the linear model to capture the global trend
reasonably well. However, the confidence interval becomes
relatively large for small predicted values, because the model
is additive on a linear scale, which implies that the model

fits larger values more accurately in terms of relative error
than it does fit smaller values. Further, the posterior distri-
butions still permit negative values, which are not possible
for quantities such as errors and run times that are known to
be always positive. Alternative approaches that operate on a
linear scale, for example scaling the kernel functions relative
to the global model, at least also suffer from the latter draw-
back and increase the complexity as they require determining
a suitable kernel scaling.

Both of these problems are resolved when the model is
built on a logarithmic scale: The posterior distribution only
permits positive values and the posterior uncertainty scales
nicely relative to themean of the prediction. In contrast to the
previous models, the model uncertainties for the logarithmic
model cannot reach negative values and stay positive and
follow the trend. Note that the combination of a global and
a local model is also able to approximate the local minimum
as the local model approximates the local deviation from the
global trend.

5.1.2 Effect of the prior on the model quality

In our second experiment, we assess how the scale on which
we apply aweakly informative prior to the coefficients affects
the extrapolation quality of our model. We use the same
setup with the one-dimensional function (10) and compare
the model predictions for models with two different priors
on c1 and c2 for our logarithmic model from Sect. 4.4: First,
we use a wide Student-t distribution as weakly informative
prior directly for the coefficients:

c1, c2 ∼ Student-t(4, 0, 30)

The arguments define the degrees of freedoms, the mean and
the standard deviation of the distribution. Second, we use a
weakly uninformative prior for the logarithms of the model
coefficients:

ln c1, ln c2 ∼ Student-t(4, 0, 7)

Due to the different scales, the distribution on the logarithmic
scale uses a smaller variance.

123


162 Computational Mechanics (2024) 74:151–168

Fig. 1 Model validation: Verification of the usefulness of the global
model and the logarithmic scale: Model prediction means E( f̂ | x, y)
(blue lines) and model uncertainties as the 50% confidence interval of
the predictive distribution (blue shaded areas) for the 1d example (10)
using a Gaussian process model, a linear model, and our logarithmic

model. The unknown function f is shown in gray with the 5 evalu-
ation points marked by black crosses. The mean of the global model
E(μ f | x, y) is shown as dotted line. The Gaussian process model does
not have a global model. (Color figure online)

In addition to the two prior choices, we vary the variance
α2 of the Gaussian process in the local model, which corre-
sponds to assuming different degrees of deviations from the
global model. Table 2 summarizes the used model parame-
ters. For the resulting combinations of priors and local model
variances, we compare the extrapolation quality of the result-
ing global model predictions.

Figure2 shows how the prior affects the model prediction.
For small variances in the local model the scale on which

the weakly informative prior on the global model coefficients
is applied has only a small effect on themodel and the overall
model is constrained by the local model. However, when we
assumemore deviations from the globalmodel, i.e., for larger
variances, the influence of the data points decreases and the
effect of the prior becomes more pronounced:

When using aweakly informative prior for the coefficients
c1 and c2, i.e. on the linear scale, the mean of the global
model moves towards too large values for larger local model
variances. Hence, the global model does not represent the
trend of the data and the model is unable to predict mean-
ingful values outside the range of known data points. This is
caused by the nonlinearity in the global model as discussed
above in Sect. 3.4: The weakly informative prior on c1 and
c2 becomes more informative on the logarithmic scale as it
implies a preference for larger values. This effect is visu-
alized in Fig. 3, where the probability density of a random
variable is transformed to the logarithmic scale.

Using a weakly informative prior for ln c1 and ln c2 avoids
this problem, as it is directly applied on the correct scale. As
a result, the global model stays closer to the data points also
for larger prior variances in the Gaussian process. This also

leads to more meaningful predictions further away from the
data.

5.1.3 Model quality for OpenDiHu

We now show that our modeling approach is also able to rep-
resent data from the two-dimensional OpenDiHu example
from Sect. 4.2. For this we compute a maximum likelihood
estimation (MLE) for the model parameters using data for
all time step combinations in � and a simulated time of
4 × 10−2 ms. The MLE optimizes the coefficients β j for the
basis functions, and the prior variance α2, the length scale
	 and the additional noise σ of the kernel. The parameters
are optimized with a restarted L-BFGS method using the
optimize capability of Stan starting from randomized ini-
tial values. The simulation error is approximated as the L2

difference to a finer discretization as described in Sect. 4.2.
To measure the run time, we run the simulation 4 times and
take the minimum value. This is done to reduce the influence
of fluctuations caused by caching effects because the simu-
lation loads muscle geometry data from the hard drive and
compiles the subcellular model at runtime. While simulating
4 × 10−2 ms reduces the computational cost, the time inter-
val is not representative of a typical simulation, that simulates
multiple milliseconds to seconds. As the simulation behav-
ior does not change significantly over the simulated time, we
scale the run time to a more realistic simulated time of 4ms
to reduce the influence of the constant overhead from starting
the simulation and loading required input data. This is done
by measuring the run time for the entire simulation and the
time stepping schemes separately. The time spent in the time
stepping schemes is then scaled by a factor of 100.

123


Computational Mechanics (2024) 74:151–168 163

Fig. 2 Assessment of the effect of the prior on themodel quality:Model
prediction means E( f̂ | x, y) (blue lines) and model uncertainties as
the 50% confidence interval of the predictive distribution (blue shaded
areas) for the 1d example (10) for different prior variances α2 of the
Gaussian process using a weakly informative prior for the coefficients

on either the linear scale (c1, c2) or logarithmic scale (ln c1, ln c2). The
unknown function f is shown in gray with the 5 evaluation points xi
marked by black crosses. The mean of the global model E(μ f | x, y)
is shown as dotted line. (Color figure online)

Fig. 3 Visualization of a prior distribution transformed to the logarith-
mic scale. Left: Probability density function pB for a random variable B
that is distributed following a truncated normal distributionwithmean 2,

standard deviation 2 and a lower bound of 0. Right: Transformed prob-
ability density pB′ for B ′ = ln B. The transformed density is clearly
skewed towards larger values

123


164 Computational Mechanics (2024) 74:151–168

The estimated parameters are then used to visualize the
individual model components. In particular, it is of interest,
how well the global model captures the trends in the data,
and how the local model makes adjustments to the model to
match the data points.

Figure 4 shows the truedata (first column) from theCrank–
Nicolson and implicit Euler setup together with the model
components resulting from the MLE parameters.

The results show, that the global model (second column)
is able to capture the optimization function well and cap-
tures most features of the global behavior of the data. The
locally correlated component of the local model (third col-
umn) makes smaller local adjustments to the model. The
remaining pointwise noise (last column), caused by the σ

term in the kernel, is relatively small.
The data for the Crank–Nicolson setup also show that the

error decreases when both time step widths are of the same
size, which is visible along the diagonal h0 = h1. This effect
is caused by error cancellation between the splitting error and
the integration errors of the subproblems. The error cancel-
lation cannot be represented by the global model with only
positive coefficients c0 and c1 and has therefore be repre-
sented in the local model. The largest adjustments from the
local model are therefore concentrated along the diagonal
where cancellation effects have stronger influence.

The error cancellation could be included in the global
model by either allowing negative values for c0 and c1, or
by adding additional basis functions, that model cancellation
effects in a more general way. As such strong error cancel-
lation is a rather special property of our example problem
and often not known a priori, we continue with the model
with positive coefficients in the optimization experiments. In
the next experiment we see that this model is suitable in an
optimization loop.

5.2 Optimization results

Having seen that the model is able to represent the data,
we now show how the model works in conjunction with a
Bayesian optimization where data are scarce and expensive
to obtain. For this we perform an optimization of the exam-
ple application from Sect. 4 to minimize the simulation error
under runtime constraints.

For the sake of simplicity, we assume that the general
structure of the local model and the noise is known, by using
the values of the MLE results from Sect. 5.1.3, rounded to
two significant digits, as fixed values for the variance, length
scale and noise of the kernel in the local model. Table 3 lists
the used values and the priors used for the coefficients in the
global model. We use the expected improvement acquisition
function from Sect. 3.5 and perform up to 15 optimization
steps. The optimization ends earlier if the acquisition func-
tion becomes smaller than 10−5.

In a first experiment we use data from a setup with Heun’s
method for the reaction term and Crank–Nicolson as implicit
solver for the diffusion term. To reduce the computational
cost, the data for the simulation error and run time are
measured for a simulated time of 4 × 10−2 ms and scaled
to a more realistic simulated time of 4ms as described in
Sect. 5.1.3.

Figure5a shows optimization steps starting with four ini-
tial data points on coarse discretizations. We can see that the
global trend is captured well from only few data points, even
though the global model is unable to predict the details of the
error cancellation along the diagonal. Nonetheless, the opti-
mizer directly explores points close to the global optimum
and finds the minimum within 4 steps. In this example, the
optimizer is able to find the minimum so quickly, because
the optimization functions can be approximated well by the
global model alone and thus the deviations from the trend
are small. This allows accurate extrapolation using only the
global model.

In the next experiment we therefore use data, in which
the global model alone is not able to represent the optimiza-
tion functions. For this, we use data from the setup with the
implicit Euler method for the diffusion term. Again, we scale
the run time of the time stepping schemes to a simulated time
of 4ms. To increase the noise level, we decompose the full
data into the global, local and noise components by comput-
ing a MLE. The local models of the error and run time are
then scaled by a factor of 10 and the components are recom-
bined to create new optimization functions which are then
used in the optimization. Figure5b shows, that the unknown
functions are more complex and cannot be described by the
global model alone. Therefore, it takes the optimizer more
steps to approximate the optimization functions accurately.
Even though the models are able to extrapolate further away
from the data, it can be clearly seen, that it takesmultiple steps
until the model for the constraint function describes the con-
straint boundary well. The details in the constraint function
can only be described by the local model once enough data
close to the constrained boundary is available. This shows,
that our model can also model more complex functions by
combining the global and local model. We also see, that
the optimizer evaluates more points further away from the
constrained boundary. This is caused by the larger kernel
magnitude α, meaning the model assumes more uncertainty
in the local model, which leads to a wider predictive distri-
bution.

6 Discussion

We have shown that Bayesian optimization in combination
with a knowledge-based model ansatz for simulation error
and run time can successfully be used to find optimal numer-

123


Computational Mechanics (2024) 74:151–168 165

Fig. 4 Model validation on experimental data: Results of the maximum
likelihood estimation of the model for the OpenDiHu example (9) for
the setup with the Crank–Nicolson (CN) and implicit Euler (IE) method
as solver for the diffusion term. The data for f and g (first column) is

well-represented by the respective global model exp(μ f ) and exp(μg)

(second column) while the local model exp(s f ) and exp(sg) with a cor-
related component (third column) and pointwise noise (last column) is
able to represent local deviations

123


166 Computational Mechanics (2024) 74:151–168

Table 3 Priors and model
parameters used in the
optimization experiments in
Sect. 5.2 for the setup with the
Crank–Nicolson (CN) and
implicit Euler (IE) method as
solver for the diffusion term. All
models use 24.000 Monte Carlo
samples

logarithmic global model local model
model prior for coefficients 	 α2 σ 2

CN error ln c j ∼ Student-t(4, 0, 7) 1.2 3.6 × 10−2 7.7 × 10−3

CN run time ln d j ∼ Student-t(4, 0, 7) 1.1 1.7 × 10−1 2.9 × 10−4

IE error ln c j ∼ Student-t(4, 0, 7) 1.2 2.6 × 100 8.4 × 10−8

IE run time ln d j ∼ Student-t(4, 0, 7) 1.1 2.4 × 101 4.4 × 10−4

Fig. 5 Optimization steps for the OpenDiHu example using the
expected improvement acquisition function (4). The four initial points
on coarse discretizations are marked by blue crosses. Big plots: Objec-
tive f (blue) and constraint function g (dashed red, gray denoting the
infeasible set). The true minimum is marked with a blue circle. Con-
secutive evaluation points are connected by straight lines to visualize

the evaluation order of the optimizer. Smaller plots: Expected values
E( f̂ | x, y) and E(ĝ | x, y) of the models of optimization functions
in the first, fifth and tenth step for the respective Crank–Nicolson and
implicit Euler setup.Thenext evaluationpoint ismarked in black. (Color
figure online)

ical parameter combinations minimizing the global error
under run time constraints. In contrast to classical adaptiv-
ity, our approach is non-invasive, i.e., requires no changes in
the simulation code itself, requires only very few simulation
runs to generate the required data, and is generalizable to
other objective functions and constraints as well as to more
than two parameters. Two key ingredientsmake it well-suited
to model simulation behavior: A knowledge-based global
model for global trends, and a local model that accounts for
local deviations from the global model. By including prior
knowledge such as convergence orders or run time scaling,
themodel can be used tomakemeaningful predictions further
away from the data, which reduces the number of steps, and
thus simulation runs, needed to find a good parameter com-
bination. Modeling the data on a logarithmic scale ensures
accurate predictions for values spanning multiple orders of
magnitude, which is common for simulation errors and run
times. Using a weakly informative prior for the coefficients
of the global model on a logarithmic scale avoids an artificial
bias in the model predictions.

In our numerical experiments, we have computed the sim-
ulation errors by comparing against a finer discretization

for simplicity reasons, which can be computationally expen-
sive for many applications. In general, and in particular for
more complex problems, the evaluation of the optimization
functions can be non-trivial.While simulation-internal a pos-
teriori error estimators can be more efficient to evaluate the
simulation error they are not always available or come with
additional implementation cost. Data such as run time mea-
surements usually include measurement noise, which means
that the optimization has the additional trade-off between
precise measurements and the cost of running the simulation
multiple times.

For the simulation components in the OpenDiHu applica-
tion, good a priori knowledge is available. For more complex
problems, this might be more difficult, but for most solver
and discretization schemes, at least the asymptotic behavior
is generally known, and can be used as a starting point. The
quality of the model and the efficiency of the optimization
depend on the prior knowledge that is available. Motivated
by our example application, we used a linear combination of
basis functions in the global model. However, an interesting
extension of our approach is to investigate other forms of
global trends: As we use Monte Carlo to sample from the

123


Computational Mechanics (2024) 74:151–168 167

posterior distribution, the linear combination can easily be
replaced by other parameterized functions and correspond-
ing priors for the coefficients, if they describe the global
trends more accurately. For instance, for components where
the asymptotic convergence rate is not known a priori, such
as linear and nonlinear solvers, one could model the conver-
gence rate itself with an additional coefficient in the global
model and a suitable prior.

In addition, the performance of the optimization depends
also on the kernel parameters of the Gaussian process in the
local model. Too large or too small variances and kernel
widths that do not match the data, lead to a reduced qual-
ity of the data acquisition and consequently to excessive or
insufficient exploration during the optimization and reduced
model quality. If the values of the kernel parameters are not
known, they can either be estimated in each step using the
evaluated data, or they can be included into theBayesian opti-
mization to treat all parameters in a unified Bayesian setting
without altering our method significantly.

An obvious next step in our research is to take a closer look
at the acquisition function, in particular in terms of includ-
ing the cost of data point evaluations which vary drastically
with the chosen resolution. Further, the role of objective and
constraint does not matter for our modeling approach. How-
ever, exchanging objective and constraint, i.e., optimizing
run time under accuracy constraints, leads to a different rela-
tion between feasible parameters, and parameters that are
cheap to evaluate. The feasible region, that is all simulation
runs that result in errors below the given bound, would con-
tain potentially very expensive data evaluations such that it is
favorable to use an acquisition function that approaches the
optimum from the cheaper infeasible region.

In general, our approachdoes not exclude adaptivity inside
the simulations. Typical adaptive algorithms rely on param-
eters such as threshold values that control the accuracy of
the model. Controlling these with our presented approach
would allow an interesting combination of adaptive methods
and external optimization for the remaining parameters that
influence the simulation behavior. A simple example is the
optimization of resolutions in time and space in an external
Bayesian setting and an internal adaptive setting of stop-
ping criteria for linear and non-linear solvers based on error
estimators. The internal adaptivity would be part of the sim-
ulation implementation and, as such, be mirrored in the data
used in the external Bayesian optimization.

An interesting additional benefit of using a model-based
approach, is that after the optimization, the resulting model
can be used for additional insight in the simulation behav-
ior. For example, the model can be used to perform a rough
sensitivity analysis of the found optimizer: For instance, one
could estimate how much extra run time would be needed to
reduce the error by a certain factor, or to see, which simula-
tion component dominates the simulation error.

Acknowledgements Funded by Deutsche Forschungsgemeinschaft
(DFG, German Research Foundation) under Germany’s Excellence
Strategy - EXC 2075 – 390740016. We acknowledge the support by
the Stuttgart Center for Simulation Science (SimTech).

Author Contributions Conceptualization: FH; Data curation: FH; For-
mal analysis: FH, PCB, DG, MS; Funding acquisition: DG, MS;
Investigation: FH;Methodology: FH, PCB; Project administration: FH,
DG,MS; Software: FH; Supervision: DG,MS;Visualization: FH;Writ-
ing original-draft: FH, DG, MS; Writing - review & editing: FH, PCB,
DG, MS.

Funding Open Access funding enabled and organized by Projekt
DEAL.

Data availability The data for the OpenDiHu experiments are available
at https://doi.org/10.18419/darus-3550.

Declarations

Conflict of interest The authors have no competing interests to declare
that are relevant to the content of this article.

Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing, adap-
tation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indi-
cate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your
intended use is not permitted by statutory regulation or exceeds the
permitted use, youwill need to obtain permission directly from the copy-
right holder. To view a copy of this licence, visit http://creativecomm
ons.org/licenses/by/4.0/.

References

1. Astudillo R, Frazier P (2019) Bayesian optimization of composite
functions. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of
the 36th International Conference on Machine Learning, vol 97.
PMLR, pp 354–363

2. Astudillo R, Frazier PI (2021) Thinking inside the box: A tutorial
on grey-box Bayesian optimization. In: Proceedings of the 2021
Winter Simulation Conference

3. BalandatM, Karrer B, JiangDR et al (2020) BoTorch: a framework
for efficient Monte-Carlo Bayesian optimization. In: Larochelle H,
Ranzato M, Hadsell R et al (eds) Advances in neural informa-
tion processing systems, vol 33. Curran Associates, New York, pp
21524–21538

4. Bangerth W, Rannacher R (2003) Adaptive finite element methods
for differential equations. Lectures in Mathematics. Birkhäuser,
Chan. https://doi.org/10.1007/978-3-0348-7605-6

5. Bergstra J, Bengio Y (2012) Random search for hyper-parameter
optimization. J Mach Learn Res 13(Feb):281–305

6. Bergstra JS, Bardenet R, Bengio Y et al (2011) Algorithms for
hyper-parameter optimization. In: Shawe-Taylor J, Zemel RS,
Bartlett PL et al (eds) Advances in neural information processing
systems, vol 24. Curran Associates, New York, pp 2546–2554

7. BoxGEP,TiaoGC (1992)Bayesian inference in statistical analysis.
Wiley, chap 1:3

123

https://doi.org/10.18419/darus-3550
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
https://doi.org/10.1007/978-3-0348-7605-6


168 Computational Mechanics (2024) 74:151–168

8. Bradley CP, Emamy N, Ertl T et al (2018) Enabling detailed,
biophysics-based skeletal muscle models on HPC systems. Front
Physiol 9:816. https://doi.org/10.3389/fphys.2018.00816

9. Bungartz HJ, Griebel M (2004) Sparse grids. Acta Numer 13:147–
269. https://doi.org/10.1017/S0962492904000182

10. Choi JW, Bedard D, Fowler R, et al (2013) A roofline model of
energy. In: 27th International parallel and distributed processing
symposium (IPDPS 2013), pp 661–672. https://doi.org/10.1109/
IPDPS.2013.77

11. Ern A, Vohralík M (2013) Adaptive inexact Newton methods
with a posteriori stopping criteria for nonlinear diffusion PDEs.
SIAMJSciComput 35(4):A1761–A1791. https://doi.org/10.1137/
120896918

12. Estep D, Ginting V, Ropp D et al (2008) An a posteriori-a pri-
ori analysis of multiscale operator splitting. SIAM J Numer Anal
46(3):1116–1146. https://doi.org/10.1137/07068237X

13. FeurerM,KleinA, EggenspergerK et al (2015) Efficient and robust
automatedmachine learning. In: Cortes C, Lawrence N, Lee D et al
(eds) Advances in neural information processing systems, vol 28.
Curran Associates, New York

14. Frazier P, Powell W, Dayanik S (2009) The knowledge-gradient
policy for correlated normal beliefs. INFORMS J Comput
21(4):599–613. https://doi.org/10.1287/ijoc.1080.0314

15. Frazier PI (2018) A tutorial on Bayesian optimization. arxiv
16. Gardner JR, KusnerMJ, Xu ZE, et al (2014) Bayesian optimization

with inequality constraints. In: XingEP, Jebara T (eds) Proceedings
of the 31st international conference on machine learning, vol 32.
PMLR, pp 937–945

17. Gelbart MA, Snoek J, Adams RP (2014) Bayesian optimization
with unknown constraints. In: Nevin L. Zhang JT (ed) Proceed-
ings of the 30th conference on uncertainty in artificial intelligence.
AUAI Press, pp 250–259

18. Gelman A, Carlin JB, Stern HS et al (2013) Bayesian data analysis,
3rd edn. Chapman and Hall/CRC, Boca Raton. https://doi.org/10.
1201/b16018

19. GramacyRB,GrayGA,LeDigabel S et al (2016)Modeling an aug-
mented Lagrangian for blackbox constrained optimization. Tech-
nometrics 58(1):1–11. https://doi.org/10.1080/00401706.2015.
1014065

20. Hairer E, Wanner G (1993) Solving ordinary differential equations
I: nonstiff problems, 2nd edn. Springer, Berlin. https://doi.org/10.
1007/978-3-540-78862-1

21. Hennig P, Schuler CJ (2012) Entropy search for information-
efficient global optimization. J Mach Learn Res 13(1):1809–1837

22. HodgkinAL,HuxleyAF (1952)Aquantitative description ofmem-
brane current and its application to conduction and excitation in
nerve. J Physiol 117(4):500–544. https://doi.org/10.1113/jphysiol.
1952.sp004764

23. Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-
based optimization for general algorithm configuration. In: Coello
CA (ed) Learning and intelligent optimization. Springer, Berlin, pp
507–523. https://doi.org/10.1007/978-3-642-25566-3_40

24. JonesDR,SchonlauM,WelchWJ (1998)Efficient global optimiza-
tion of expensive black-box functions. J Global Optim 13(4):455–
492. https://doi.org/10.1023/A:1008306431147

25. Leube PC, de Barros FPJ, Nowak W et al (2013) Towards optimal
allocation of computer resources: trade-offs between uncertainty
quantification, discretization and model reduction. Environ Model
Softw 50:97–107. https://doi.org/10.1016/j.envsoft.2013.08.008

26. López-Lopera AF, Bachoc F, Durrande N et al (2018) Finite-
dimensional Gaussian approximation with linear inequality con-
straints. SIAM/ASA J Uncertain Quantif 6(3):1224–1255. https://
doi.org/10.1137/17M1153157

27. Maatouk H, Bay X (2017) Gaussian process emulators for
computer experiments with inequality constraints. Math Geosci
49(5):557–582. https://doi.org/10.1007/s11004-017-9673-2

28. MacDougall JD, SaleDG,AlwaySE et al (1984)Muscle fiber num-
ber in biceps Brachii in bodybuilders and control subjects. J Appl
Physiol 57(5):1399–1403. https://doi.org/10.1152/jappl.1984.57.
5.1399

29. Maier B, Emamy N, Krämer A, et al (2019) Highly parallel
multi-physics simulation of muscular activation and EMG. In: Pro-
ceedings of 8th international conference on computationalmethods
for coupled problems in science and engineering. coupled problems
2019. International Centre for Numerical Methods in Engineering,
pp 610–621

30. Maier B, Göddeke D, Huber F, et al (2022) OpenDiHu: an effi-
cient and scalable framework for biophysical simulations of the
neuromuscular system (submitted)

31. Meidner D, Rannacher R, Vihharev J (2009) Goal-oriented error
control of the iterative solution of finite element equations. J Numer
Math 17(2):143–172. https://doi.org/10.1515/JNUM.2009.009

32. Meißner J, Göddeke D, Herschel M (2023) Data-informed opti-
mization for parameter and scheme selection in numerical simu-
lations. In: Proceedings of the 2023 international conference on
management of data. Association for computing machinery

33. Picheny V (2014) A stepwise uncertainty reduction approach to
constrained global optimization. In: Kaski S, Corander J (eds)
Proceedings of the 17th international conference on artificial intel-
ligence and statistics, vol 33. PMLR, pp 787–795

34. Rasmussen CE, Williams CKI (2006) Gaussian processes for
machine learning. MIT Press, Cambridge

35. Schaback R (1997) Native Hilbert spaces for radial basis functions
I. In: MüllerMW, BuhmannMD,Mache DH et al (eds) New devel-
opments in approximation theory: international series of numerical
mathematics, vol 132. Birkhäuser, Cham, pp 255–282. https://doi.
org/10.1007/978-3-0348-8696-3_16

36. ScheuererM, SchabackR, SchlatherM (2013) Interpolation of spa-
tial data: a stochastic or a deterministic problem? Eur J Appl Math
24(4):601–629. https://doi.org/10.1017/S0956792513000016

37. Shahriari B, Swersky K, Wang Z et al (2016) Taking the human
out of the loop: a review of Bayesian optimization. Proc IEEE
104(1):148–175. https://doi.org/10.1109/JPROC.2015.2494218

38. Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian opti-
mization ofmachine learning algorithms. In: Pereira F,BurgesCJC,
Bottou L et al (eds) Advances in neural information processing sys-
tems, vol 25. Curran Associates, New York p, pp 2951–2959

39. Srinivas N, Krause A, Kakade SM et al (2012) Information-
theoretic regret bounds for Gaussian process optimization in the
bandit setting. IEEE Trans Inf Theory 58(5):3250–3265. https://
doi.org/10.1109/TIT.2011.2182033

40. Stan Development Team (2021) Stan modeling language users
guide and reference manual, 2.27. https://mc-stan.org

41. Strang G (1968) On the construction and comparison of difference
schemes. SIAM J Numer Anal 5(3):506–517. https://doi.org/10.
1137/0705041

42. Wang Z, Jegelka S (2017) Max-value entropy search for efficient
Bayesian optimization. In: Precup D, Teh YW (eds) Proceedings
of the 34th international conference on machine learning, vol 70.
PMLR, pp 3627–3635

43. Williams S, Waterman A, Patterson D (2009) Roofline: an insight-
ful visual performancemodel formulticore architectures. Commun
ACM 52(4):65–76. https://doi.org/10.1145/1498765.1498785

44. Yang L, Shami A (2020) On hyperparameter optimization of
machine learning algorithms: theory and practice. Neurocomput-
ing 415:295–316. https://doi.org/10.1016/j.neucom.2020.07.061

Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.

123

https://doi.org/10.3389/fphys.2018.00816
https://doi.org/10.1017/S0962492904000182
https://doi.org/10.1109/IPDPS.2013.77
https://doi.org/10.1109/IPDPS.2013.77
https://doi.org/10.1137/120896918
https://doi.org/10.1137/120896918
https://doi.org/10.1137/07068237X
https://doi.org/10.1287/ijoc.1080.0314
https://doi.org/10.1201/b16018
https://doi.org/10.1201/b16018
https://doi.org/10.1080/00401706.2015.1014065
https://doi.org/10.1080/00401706.2015.1014065
https://doi.org/10.1007/978-3-540-78862-1
https://doi.org/10.1007/978-3-540-78862-1
https://doi.org/10.1113/jphysiol.1952.sp004764
https://doi.org/10.1113/jphysiol.1952.sp004764
https://doi.org/10.1007/978-3-642-25566-3_40
https://doi.org/10.1023/A:1008306431147
https://doi.org/10.1016/j.envsoft.2013.08.008
https://doi.org/10.1137/17M1153157
https://doi.org/10.1137/17M1153157
https://doi.org/10.1007/s11004-017-9673-2
https://doi.org/10.1152/jappl.1984.57.5.1399
https://doi.org/10.1152/jappl.1984.57.5.1399
https://doi.org/10.1515/JNUM.2009.009
https://doi.org/10.1007/978-3-0348-8696-3_16
https://doi.org/10.1007/978-3-0348-8696-3_16
https://doi.org/10.1017/S0956792513000016
https://doi.org/10.1109/JPROC.2015.2494218
https://doi.org/10.1109/TIT.2011.2182033
https://doi.org/10.1109/TIT.2011.2182033
https://mc-stan.org
https://doi.org/10.1137/0705041
https://doi.org/10.1137/0705041
https://doi.org/10.1145/1498765.1498785
https://doi.org/10.1016/j.neucom.2020.07.061

	Knowledge-based modeling of simulation behavior for Bayesian optimization
	Abstract
	1 Introduction
	1.1 Motivating example
	1.2 General formulation
	1.3 Knowledge-infused modeling for f and g
	1.4 Challenges
	1.5 Summary of contributions

	2 Related work
	3 Knowledge-infused modeling of objectives and constraints
	3.1 Bayesian optimization overview
	3.2 Model ansatz
	3.2.1 Logarithmic scaling
	3.2.2 Global model
	3.2.3 Local model

	3.3 Posterior and predictive distribution
	3.4 Prior for the global model coefficients
	3.5 Acquisition function

	4 Test application
	4.1 Model equations and discretization
	4.2 Optimization problem
	4.3 Error and run time modeling
	4.4 Simplified one-dimensional analytical test case

	5 Experiments
	5.1 Model validation
	5.1.1 Influence of the global model and the logarithmic scale
	5.1.2 Effect of the prior on the model quality
	5.1.3 Model quality for OpenDiHu

	5.2 Optimization results

	6 Discussion
	Acknowledgements
	References