Computational Optimization and Applications (2025) 91:1263–1308
https://doi.org/10.1007/s10589-025-00684-x

On the forward–backward method with nonmonotone
linesearch for infinite-dimensional nonsmooth nonconvex
problems

Behzad Azmi1 ·Marco Bernreuther2

Received: 31 May 2023 / Accepted: 9 April 2025 / Published online: 8 May 2025
© The Author(s) 2025

Abstract
This paper provides a comprehensive study of the nonmonotone forward–backward
splitting (FBS) method for solving a class of nonsmooth composite problems in
Hilbert spaces. The objective function is the sum of a Fréchet differentiable (not
necessarily convex) function and a proper lower semicontinuous convex (not neces-
sarily smooth) function. These problems appear, for example, frequently in the context
of optimal control of nonlinear partial differential equations (PDEs) with nonsmooth
sparsity-promoting cost functionals. We discuss the convergence and complexity of
FBS equipped with the nonmonotone linesearch under different conditions. In par-
ticular, R-linear convergence will be derived under quadratic growth-type conditions.
We also investigate the applicability of the algorithm to problems governed by PDEs.
Numerical experiments are also given that justify our theoretical findings.

Keywords Nonsmooth nonconvex optimization · Forward–backward algorithm ·
Infinite-dimensional problems · Nonmonotone linesearch · Quadratic growth ·
PDE-constrained optimization

Mathematics Subject Classification 90C26 · 49M41 · 65K05 · 65K15 · 49M37 ·
65J22

B Behzad Azmi
behzad.azmi@uni-konstanz.de

Marco Bernreuther
marco.bernreuther@iff.uni-stuttgart.de

1 Department of Mathematics and Statistics, University of Konstanz, Universitätsstraße 10, 78457
Konstanz, Germany

2 Institute for Industrial Manufacturing and Management IFF, University of Stuttgart, Allmandring 35,
70569 Stuttgart, Germany

123

http://crossmark.crossref.org/dialog/?doi=10.1007/s10589-025-00684-x&domain=pdf


1264 B. Azmi, M. Bernreuther

1 Introduction

In this work, we are concerned with the following composite problem

min
u∈H

�(u) := F(u) + R(u), (P)

where H is a real Hilbert space, F is a continuously Fréchet differentiable function
(possibly nonconvex), andR is a convex function whose proximal operator is assumed
to be explicitly computable. The precise details will be introduced in Sect. 2. Problems
of the form (P) appear in several fields of application such as optimal control problems
[1, 2], system identification, signal, and image processing [3], machine learning, and
statistics [4].

Arguably one of the most well-known algorithms for solving problem (P) is
forward–backward splitting (FBS), also known as the proximal gradient method [3,
5], which is the generalization of the classical gradient method for problems with an
additional nonsmooth term (see (7)). The iterations of FBS are defined by

uk+1 = Prox 1
αk

R

(
uk − 1

αk
∇F(uk)

)
for k ∈ N0, (1)

where the step-sizes αk > 0 are supposed to be chosen in a way that guarantees
convergence of the algorithm and accelerates it. It is known that the global convergence
of FBS is to be sublinear of order (1/k) for the convex case [5], where k stands for the
number of iterations. This order can be improved to (1/k2) using an inertial variant
of the algorithm based on Nesterov’s accelerated techniques [6, 7]. Convergence of
the iterates of FBS to a critical point of problem (P), even for the nonconvex case, has
been shown for functions� satisfying the Kurdyka–Łojasiewicz property e.g., [8–12]
or quadratic growth error bounds e.g., [13–16]. Due to the simplicity and efficiency
of FBS, its convergence, complexity, and applicability have been extensively studied,
making it an ongoing area of active research. Numerous step-size strategies have
been proposed and analyzed within the context of FBS under various assumptions
and structures. For structured nonconvex finite-dimensional problems, works such as
[17–19] have focused on the inertial method and constant step-size strategies, while
[20–25] have explored linesearch strategies based on function evaluations, as well as
quasi-Newton and Newton-type directions. Additionally, for problem (P) posed in an
infinite-dimensional Hilbert space, references such as [13, 26–31] have investigated
FBS assuming the convexity of F , with [31] also addressing D.C. programming. In
case where F is not necessarily convex, we can also mention [32, 33], which explore
a class of inexact trust region methods to ensure the convergence of FBS and apply
them to problems governed by PDEs.

As is well-known for smooth problems, nonmonotone linesearch strategies appear
to be numerically efficient in situations where a monotone scheme is forced to prop-
agate along the bottom of a narrow, curved valley. Additionally, because they allow
for increases in function values, they can be effectively combined with spectral gradi-
ent methods, such as the Barzilai-Borwein (BB) step-sizes [34, 35], which naturally
exhibit nonmonotone behavior. Building on the nonmonotone approach developed by

123


On the forward–backward method with nonmonotone… 1265

Grippo et al. [36], Wright et al. [37] proposed a nonmonotone FBS for nonsmooth
convex composite problems posed in R

n . The convergence of this approach, known
as SpaRSA, was further studied in [38], where the authors demonstrated the order of
(1/k) convergence and R-linear convergence for convex and strongly convex func-
tions, respectively. Very recently, in [39], the authors proved convergence to a critical
point of this scheme for finite-dimensional nonsmooth nonconvex composite functions
under milder conditions, specifically, local Lipschitz continuity of the gradient for the
smooth part and uniform continuity for the objective function on its level sets.

From a computational perspective, both classical gradient and FBS methods can be
accelerated if incorporated with the BB step-sizes. These step-sizes approximate the
curvature of the Hessian for the smooth part of the objective function and are partic-
ularly efficient in the context of PDE-constrained optimization [34, 40]. For strongly
quadratic functionsF , R-linear convergence of the BB step-sizes has been established
for (P) provided the condition number of the quadratic operator is sufficiently small
(< 2); see [34, Remark 3.4]. However, even for strongly quadratic functions F with
condition numbers larger than 2, convergence of the BB step-sizes is not clear. Thus, to
guarantee convergence, one needs to combine the BB step-sizes with a nonmonotone
linesearch, which relies on function evaluations.

Weak and strong convergence can be distinguished only in the infinite-dimensional
setting. In practice, numerical solutions to infinite-dimensional problems are typically
obtained by implementing algorithms for finite-dimensional approximations. How-
ever, analyzing the convergence of these algorithms in the infinite-dimensional context
is crucial for ensuring the numerical robustness and stability of the finite-dimensional
approximations. Such properties are expressed by the so-called mesh-independent
principle (MIP); see, e.g., [40–46]. Roughly speaking, MIP uses results from infinite-
dimensional convergence to predict the convergence properties of finite-dimensional
approximations (discretized problems). Additionally, MIP offers a theoretical foun-
dation for developing refinement strategies; see, e.g., [47].

In light of the above discussion, we investigate the well-posedness, convergence,
and complexity of the nonmonotone FBS for problems of the form (P) posed in
infinite-dimensional Hilbert spaces, under various assumptions such as nonconvex-
ity, convexity, and quadratic growth-type conditions. These results also apply to
finite-dimensional problems, where our convergence and complexity analyses and
assumptions differ from those in previous works [37–39]. Notably, the convergence of
this approach under quadratic growth-type conditions, and its worst-case evaluation
complexity for reaching an approximate stationary point have not been previously
investigated. We will clearly outline our contributions in the next section.

In particular, we investigate the applicability of the nonmonotone FBS for prob-
lems governed by PDEs. Despite its numerical efficiency, to the best of our knowledge,
this approach has not yet been studied for infinite-dimensional problems. For these
problems, due to the lack of compactness in the strong topology, we need to explore
convergence in the weak topology, which involves studying notions of sequential
continuity for some operators with respect to the weak topology. Here, the nonmono-
tonicity of the approach, and working with the weak topology are two issues that we
overcomewith our analysis using the Lipschitz continuity of the gradientmapping (see
Definition 1). We will also highlight the challenges and differences that arise when

123


1266 B. Azmi, M. Bernreuther

considering infinite-dimensional spaces after each proof. Additionally, we investigate
and discuss the applicability of our assumptions and, consequently, our results to two
examples of nonsmooth nonconvex problems with PDEs.

Contributions

More precisely, the contributions of this work can be summarized as follows:

(i) Starting with the nonconvex case, we prove well-posedness of the algorithm
evenwithout Lipschitz continuity of the gradient ofF . Under the global Lipschitz
continuity of the gradient ofF , we prove the global convergence of the algorithm
with complexity (1/

√
k). To be more precise, we show that the norm of the prox-

gradient mapping of iterates vanishes with complexity (1/
√

k).We also establish
that every weak sequential cluster point of iterates is a stationary point.

(ii) We derive the worst-case evaluation complexity of finding an εtol-stationary
point. More precisely, we give estimates on the maximal number of the objec-
tive function and prox-grad operator evaluations for computing an approximate
stationary point with a user-defined accuracy threshold εtol > 0.

(iii) In the convex setting, relying on the concept of quasi-Fejer sequences,we are able
to extend previously established results to global convergence, both in terms of
function values and iterateswith respect to theweak sequential topology. Further,
we show that the convergence is sublinear of order (1/k) in function values.

(iv) Under quadratic growth-type conditions, we show global R-linear convergence,
both in terms of function values and iterates. The proof of the latter is more deli-
cate for infinite-dimensional problems since the transition from weak sequential
convergence to strong convergence is not straightforward.

(v) Finally, aiming at optimizationproblemsgovernedbyPDEs,wediscuss the valid-
ity of the convergence results of the algorithmwithout strongLipschitz continuity
of ∇F . Our theoretical framework is supported by two nonsmooth nonconvex
problems governed by PDEs, including semilinear elliptic and parabolic equa-
tions. We also show that our results are applicable to these problems and report
on related numerical experiments.

Outline of the paper

The rest of this paper is organized as follows: Sect. 2 presents the preliminaries,
assumptions on the optimization problem (P), the algorithm, and the nonmonotone
linesearch strategy. Section 3 investigates the convergence and complexity of the algo-
rithmcomprehensively under different conditions. Section 4 discusses the applicability
of the results from the previous section to PDE-constrained optimization problems.
Finally, numerical experiments are reported in Sect. 5 that justify our theoretical find-
ings. To improve the readability of the paper, we provide proof of some results from
Sect. 3 in Appendix A.

123


On the forward–backward method with nonmonotone… 1267

Notation

Throughout this paper, the Hilbert space H is endowed with the scalar product (·, ·)H

and the induced norm ‖ · ‖H . For a radius r > 0 and ū ∈ H , we define Br (ū) := {u ∈
H : ‖u−ū‖H < r}.We also denotewith PS : H → H , the orthogonal projection onto

the set S ⊂ H . Further for every �̃ ∈ R,wedefine
[
� ≤ �̃

]
:= {u ∈ H : �(u) ≤ �̃}.

ByArgmin� we denote the set ofminimizers of�. Sometimeswewill use set-valued
(in-)equalities, i.e. A ≤ B for A, B ⊂ H , meaning that a ≤ b for all a ∈ A and b ∈ B.

We call a sequence {uk}k ⊂ H quasi-Féjer monotone with respect to a non-empty
set S ⊂ H , if for every v ∈ S it holds that

‖uk+1 − v‖2H ≤ ‖uk − v‖2H + εk,

where {εk}k ⊂ R≥0 is a summable sequence, i.e.,
∑∞

k=0 εk < ∞. To avoid confusion
in the notation, the derivative and gradient of F are defined by F ′ : H → H ′ and
∇F : H → H , respectively. In this case, for every u ∈ H we identify ∇F(u) ∈
H by the unique Riesz representative of F ′(u) ∈ H ′, with H ′ denoting the dual
space of H . We also use the notion of the convex subdifferential for convex function
φ : H → R ∪ {+∞} at u ∈ dom φ := {u ∈ H : φ(u) < +∞} defined as
∂φ(u) := {w ∈ H : φ(v) − φ(u) ≥ (w, v − u)H for all v ∈ H}.

2 Problem formulation and algorithm

In this section, the precise theoretical framework and the algorithmwill be introduced.
First, we state the precise assumptions on (P).

Assumption 1 For problem (P),

A1: R : H → R ∪ {+∞} is proper, convex, and lower semicontinuous.
A2: F : H → R ∪ {+∞} is continuously Fréchet differentiable on int(domF) con-

taining domR, that is, domR ⊆ int(domF).
A3: ∇F : int(domF) → H is globally LF ′-Lipschitz continuous.
A4: The gradient ∇F : int(domF) → H is weak-to-strong sequentially continuous.

Conditions A1–A2 are standard in order to guarantee the well-posedness of the algo-
rithm. A3 will be used to derive complexity results for iterations and we will discuss
the relaxation of this condition for a large class of problems governed by PDEs later
in Sect. 3.4. We use A4 to show that every weak sequential accumulation point of
iterates belongs to S∗. Note that, if dim(H) < ∞, A4 follows from A2.

According to Assumption 1, the Fermat principle [48, Proposition 9.1.5] for (P)
reads as follows: If u∗ ∈ H is a local minimizer of �, then

− ∇F(u∗) ∈ ∂R(u∗). (2)

123


1268 B. Azmi, M. Bernreuther

Further, with similar arguments as in the proof of [26, Theorem 26.2], the condition
(2) can be equivalently expressed as

u∗ = Prox 1
α
R(u∗ − 1

α
∇F(u∗)) for some α > 0, (3)

where the proximal operator Prox 1
α
R : H → H is defined by

Prox 1
α
R(u) := argmin

v∈H

(
R(v) + α

2
‖v − u‖2H

)
.

This operator is well-defined due to A1. We also define the set of critical points by

S∗ := {u ∈ H : −∇F(u) ∈ ∂R(u)}.

Now we turn our attention towards formalizing (1) and the proposed step-size update
strategy by a nonmonotone linesearch method. Therefore we introduce the prox-grad
operator and the gradient mapping in the Hilbert space setting analogously to, e.g., [5,
14].

Definition 1 For every α ∈ R>0, we define

1. the prox-grad operator Tα : int(domF) → domR with u �→ Prox 1
α
R(u −

1
α
∇F(u)).

2. the gradient mapping Gα : int(domF) → H with u �→ α(u − Tα(u)).

Due to Definition 1 and by some simple computations, we obtain for every u ∈
int(domF) that

Gα(u) − ∇F(u) ∈ ∂R(Tα(u)). (4)

Further, if we define for every u ∈ int(domF), w ∈ H , and α ∈ R>0

Qα(w, u) := F(u) + (∇F(u), w − u)H + α

2
‖w − u‖2H + R(w),

then Tα(u) is the unique minimizer of Qα(·, u), i.e.

Tα(u) = argmin
w∈H

Qα(w, u).

As a consequence, we can write

(∇F(u), Tα(u) − u)H + 1

2α
‖Gα(u)‖2H + R(Tα(u)) ≤ R(u). (5)

By means of the prox-grad operator and the gradient mapping, the iterations (1) can
be reformulated as follows:

123


On the forward–backward method with nonmonotone… 1269

uk+1 = Tαk (uk) for k ∈ N0, (6)

uk+1 = uk − 1

αk
Gαk (uk) for k ∈ N0. (7)

In this case, for every u0 ∈ int(domF), the iterations are well-defined, that is {uk}k ⊂
domR ⊆ H . Further, due to (7), we can see the analogy between the proximal
gradient method and classical gradient descent for smooth minimization. Using (3)
and the notion of the gradient mapping, we also can characterize the critical points of
(P), as follows.

Proposition 1 For u∗ ∈ int(domF), it holds that Gα(u∗) = 0 for some α ∈ R>0 if
and only if u∗ ∈ S∗.

ThereforeGα(ū) = 0 defines a natural termination condition analogously to the smooth
case.

There are many possible choices for the step-size αk in our iterative scheme. In this
work, we will consider step-size updates by a BB-type update rule or a combination
of them with a nonmonotone linesearch. To be more precise, as the initial trial step-
size within the nonmonotone linesearch, we will choose one of the spectral gradient
BB-types step-sizes defined by

αBB1a
k := (uk−uk−1,∇F(uk )−∇F(uk−1))H

(uk−uk−1,uk−uk−1)H
,

αBB2a
k := (∇F(uk )−∇F(uk−1),∇F(uk )−∇F(uk−1))H

(uk−uk−1,∇F(uk )−∇F(uk−1))H
,

αBB1b
k := (uk−uk−1,Gαk−1 (uk )−Gαk−1 (uk−1))H

(uk−uk−1,uk−uk−1)H
,

αBB2b
k := (Gαk−1 (uk )−Gαk−1 (uk−1),Gαk−1 (uk )−Gαk−1 (uk−1))H

(uk−uk−1,Gαk−1 (uk )−Gαk−1 (uk−1))H
.

The first two strategies correspond to the BB-method presented e.g. in [40]. The last
two novel BB-methods modify the classical BB-method and try to incorporate full
first-order information by using the gradient mapping and not only the gradient of F .
Furthermore, we will use so-called alternating BB-type update rules given by

αABBa := αBB1a for k even and αBB2a for k odd,

αABBb := αBB1b for k even and αBB2b for k odd.

Remark 1 In the case of R = 0, we have Gl(u) = ∇F(u) for every l > 0 and
u ∈ H . Therefore, many of the introduced step-size updates above are identical, i.e.
αBB1b = αBB1a, αBB2b = αBB2a, and αABBb = αABBa.

For the nonmonotone linesearch update, we consider

�(uk+1) ≤ max
0≤ j≤m(k)

�(uk− j ) − δ

αk
‖Gαk (uk)‖2H , (8)

123


1270 B. Azmi, M. Bernreuther

where 0 < δ < 1 and memory m : N0 → N0 satisfies

m(0) = 0 and m(k) = min{m(k − 1) + 1, mmax} for k ∈ N, (9)

with a given upper bound mmax ∈ N0. Similarly to [49], we also define the functions
	 : N0 → N0 and ν : N0 → N0 with

	(k) := k − arg max
0≤ j≤m(k)

�(uk− j ) for k ≥ 0 with k − m(k) ≤ 	(k) ≤ k,

ν(k) := 	(kmmax + k) for k ≥ 0.

Thus, by these notations, we have max
0≤ j≤m(k)

�(uk− j ) = �(u	(k)) and (8) can be

rewritten as

�(uk+1) ≤ �(u	(k)) − δ

αk
‖Gαk (uk)‖2H . (10)

Further we set

αk = αint,kη
ik , (11)

where η > 1. The initial step-size αint,k > 0 is chosen by the BB-method and a
lower bound αinf > 0 is given to ensure nonnegativity. To limit the initial step-size
numerically, also an upper bound αsup > αinf is employed. In the update, we choose
the smallest integer ik ∈ N0 in (11), such that (8) is satisfied. This procedure is
summarized in Algorithm 1.

Algorithm 1 Nonmonotone FBS
Input: 0 < δ < 1, mmax ∈ N0, η > 1, αsup > αinf > 0, u0 ∈ H , and α0 > 0.
Output: A stationary point u∗ ∈ H of (P).
1: Set k = 0;
2: while ‖Gαk (uk )‖H > 0 do
3: if k = 0 then
4: Set αint,k := max{αinf ,min{αsup, α0};
5: else
6: Compute αBBk according to a BB-method and set

αint,k := max{αinf ,min{αsup, αBBk }};

7: end if
8: Set αk = αint,kηik , where ik ≥ 0 is the smallest integer for which (8) holds;

9: Set uk+1 = uk − 1
αk

Gαk (uk ) and k = k + 1 ;
10: end while

Note that by Proposition 1, the criterion ‖Gαk (uk)‖H ≤ εtol with some tolerance
εtol > 0 is a reasonable stopping criterion in the numerical realization of Algorithm 1.

123


On the forward–backward method with nonmonotone… 1271

3 Convergence and complexity analysis

In this section, we present a detailed convergence and complexity analysis for Algo-
rithm 1 under different assumptions and conditions.

3.1 General case

We start by summarizing useful properties of the gradient mapping.

Lemma 2 Suppose that A1–A2 hold, then we have the following properties:

P1 For every l1 ≥ l2 > 0 and u ∈ int(domF) it holds that 1
l1

‖Gl1(u)‖H ≤
1
l2

‖Gl2(u)‖H .
P2 For every l1 ≥ l2 > 0 and u ∈ int(domF) it holds that ‖Gl1(u)‖H ≥ ‖Gl2(u)‖H .
P3 Assume in addition that A3 holds, then the gradient mapping is Lipschitz contin-

uous, that is

‖Gl(u) − Gl(v)‖H ≤ (2l + LF ′)‖u − v‖H ,

for every l > 0 and v, u ∈ int(domF).

Proof The proof follows using the same arguments, e.g., given in [5, Theorem 10.9,
Lemma 10.10] for finite-dimensional problems. ��

Furthermore, thewell-known sufficient decrease condition can be formulated for the
gradient mapping analogously to the finite-dimensional case presented in [5, Lemma
10.4].

Lemma 3 (Sufficient decrease lemma) Suppose that A1–A3 hold, then for every u ∈
int(domF) and l ∈ (

LF ′
2 ,∞), we have

�(Tl(u)) ≤ �(u) − l − LF ′
2

l2
‖Gl(u)‖2H .

The previous lemmas allow us to summarize some important properties of Algo-
rithm 1.

Lemma 4 Suppose that A1–A2 hold. Then for every k ≥ 0, the following statements
hold true:

(i) For δ ∈ (0, 1
2 ), the nonmonotone linesearch is well-defined. That is, there exists

α > 0 such that for every uk ∈ int(domF) and αk ∈ [α,∞) the nonmonotone
rule (8) holds and thus, the nonmonotone linesearch terminates after finitely many
iterations.

Assume, in addition, A3 holds.

(i) Then the statement of (i) is true for α := LF ′
2(1−δ)

and every fixed δ ∈ (0, 1).

123


1272 B. Azmi, M. Bernreuther

(ii) The step-sizes are uniformly bounded from above. That is, for every k ≥ 1, we
have αk ≤ α with α := max{ ηLF ′

2(1−δ)
, αsup}.

(iii) It holds ‖Gαk+1(uk+1)‖H ≤ CG‖Gαk (uk)‖H , where CG := 3α+LF ′
αinf

.

Proof (i) If uk ∈ S∗, then the claim clearly holds true for every α > 0. Thus, we
assume that uk /∈ S∗. We suppose also on contrary there does not exist α > 0 for
which the claim holds true. Then the nonmonotone linesearch generates a sequence
of step-sizes αk,i := αint,kη

i with i ∈ N0 satisfying αk,i → ∞ and

δ
αk,i

‖Gαk,i (uk)‖2H > �(u	(k)) − �(Tαk,i (uk)) ≥ �(uk) − �(Tαk,i (uk)). (12)

First, we note that Tαk,i (uk) → uk as i → ∞. This follows from the fact that
Prox 1

αk,i
R(uk) → uk as i → ∞ (see e.g., [26, Theorem 23.47]) and

‖Tαk,i (uk) − uk‖H ≤ ‖Prox 1
αk,i

R(uk − 1
αk,i

∇F(uk)) − Prox 1
αk,i

R(uk)‖H

+ ‖Prox 1
αk,i

R(uk) − uk‖H ≤ ‖Prox 1
αk,i

R(uk) − uk‖H + 1
αk,i

‖∇F(uk)‖H ,

wherewe have used the firm nonexpansiveness of the proximal operator. Further, using
(12) and the mean value theorem for F , we obtain for every i ∈ N0, that

δ
αk,i

‖Gαk,i (uk)‖2H ≥ �(uk) − �(Tαk,i (uk))

≥ R(uk) − R(Tαk,i (uk)) + (∇F(uk + ti (Tαk,i (uk) − uk)), uk − Tαk,i (uk))H ,

where ti ∈ (0, 1) for all i ∈ N0. Using (5) we obtain that

δ
αk,i

‖Gαk,i (uk)‖2H ≥ 1
2αk,i

‖Gαk,i (uk)‖2H
+ (∇F

(
uk + ti (Tαk,i (uk) − uk)

)− ∇F(uk), Tαk,i (uk) − uk
)

H

≥ 1
2αk,i

‖Gαk,i (uk)‖2H − 1
αk,i

‖∇F
(
uk+ti (Tαk,i (uk) − uk)

)−∇F(uk)‖H ‖Gαk,i (uk)‖H .

Together with the fact that δ < 1
2 , we obtain

( 12 − δ)‖Gαk,i (uk)‖H ≤ ‖∇F
(
uk + ti (Tαk,i (uk) − uk)

)− ∇F(uk)‖H .

Sending i → ∞ and using the continuity of ∇F , we obtain that

lim
i→∞ ‖Gαk,i (uk)‖H = 0.

Finally, using P2, we can infer that ‖Gαinf (uk)‖H ≤ ‖Gαk,i (uk)‖H and thus
‖Gαinf (uk)‖H = 0. Now Proposition 1 implies uk ∈ S∗. This contradicts our assump-
tion, which concludes the proof.

123


On the forward–backward method with nonmonotone… 1273

(ii) For every given αk ≥ α >
LF ′
2 we can invoke Lemma 3 and write

�(Tαk (uk)) ≤ �(uk) − αk− LF ′
2

α2
k

‖Gαk (uk)‖2H

≤ max
0≤ j≤m(k)

�(uk− j ) − αk− LF ′
2

α2
k

‖Gαk (uk)‖2H .

Thus, (8) holds since αk satisfies
αk− LF ′

2
α2

k
≥ δ

αk
by assumption.

(iii) To derive α, we consider the following cases:

• ik = 0: In this case, due to (11), we have αk = αint,k ≤ αsup.
• ik ≥ 1: In this case, (8) holds for αk and uk . Then due to (11), we can write

�(Tαk
η

(uk)) > max
0≤ j≤m(k)

�(uk− j ) − δη
αk

‖Gαk
η

(uk)‖2H
≥ �(uk) − δη

αk
‖Gαk

η
(uk)‖2H .

(13)

Further, by assumingwithout loss of generality that αk
η

≥ LF ′
2 ,we canuseLemma3

with l = αk
η
and u = uk to obtain

�(Tαk
η

(uk)) ≤ �(uk) −
αk
η

− LF ′
2

(
αk
η

)2
‖Gαk

η
(uk)‖2H . (14)

Combining (13) and (14), we obtain δη
αk

>

αk
η

− LF ′
2

(
αk
η

)2
and as a consequence, αk <

ηLF ′
2(1−δ)

.

Summarizing the two above cases, we can conclude the second part with α :=
max{ ηLF ′

2(1−δ)
, αsup}.

(iv) Finally for proving the last part, we use Lemma 2 and (iii) to obtain that

‖Gαk+1(uk+1)‖H
P2≤ ‖Gα(uk+1)‖H ≤ ‖Gα(uk+1) − Gα(uk)‖H + ‖Gα(uk)‖H

P3≤ (2α + LF ′)‖uk+1 − uk‖H +‖Gα(uk)‖H
Def. 1≤ (2α+LF ′ )

αk
‖Gαk (uk)‖H +‖Gα(uk)‖H

P1≤ (3α+LF ′ )
αk

‖Gαk (uk)‖H ≤ 3α+LF ′
αinf

‖Gαk (uk)‖H .

Setting CG := 3α+LF ′
αinf

, the proof is complete. ��
Before stating the main convergence result of this section, we present a lemma

concerning the subsequence {uν(k)}k and well-posedness of Algorithm 1.

123


1274 B. Azmi, M. Bernreuther

Lemma 5 Suppose that the sequences {uk}k and {αk}k ⊂ R>0 are generated by Algo-
rithm 1. Then the following properties hold:

L1 {uν(k)}k is a subsequence of {uk}k with ν(k) − ν(k − 1) ≤ 2mmax + 1 and ν(k) ≤
(mmax + 1)k. Further, for every k ∈ N0, it holds �(uk) ≤ �(u

ν(
⌈

k
mmax+1

⌉
)
).

L2 For every k ≥ 1, we have

�(uν(k)) ≤ �(uν(k−1)) − δ
αν(k)−1

‖Gν(k)−1(uν(k)−1)‖2H , (15)

and in particular the subsequence {�(uν(k))}k is monotonically decreasing.
L3 Assume that � is bounded from below, i.e. inf

u∈H
�(u) > −∞. Then we have

∞∑
k=1

1
αν(k)−1

‖Gν(k)−1(uν(k)−1)‖2H < ∞ and lim inf
k→∞

1
αk

‖Gαk (uk)‖2H = 0. (16)

In particular, if mmax = 0, we have

∞∑
k=0

1
αk

‖Gαk (uk)‖2H < ∞ and lim
k→∞

1
αk

‖Gαk (uk)‖2H = 0. (17)

Proof (L1) Using the fact that 	(k) ≥ k − mmax for every k, we obtain

ν(k) = 	(kmmax + k) ≥ kmmax + k − mmax

> (k − 1)mmax + (k − 1) ≥ 	((k − 1)mmax + (k − 1)) = ν(k − 1),

and thus {uν(k)}k ⊂ {uk}k . Moreover, we have

ν(k) = 	(kmmax + k) ≤ kmmax + k,

and

ν(k) − ν(k − 1) = 	(kmmax + k) − 	((k − 1)mmax + (k − 1))

≤ kmmax + k − ((k − 1)mmax + (k − 1) − mmax) = 2mmax + 1.

Further, for a given k ∈ N0, we have 0 ≤
⌈

k
mmax+1

⌉
(mmax +1)−k ≤ mmax, and thus

�(uk) ≤ �(u
	(
⌈

k
mmax+1

⌉
(mmax+1))

) = �(u
ν(
⌈

k
mmax+1

⌉
)
).

This completes the verification of L1.
(L2) Inserting ν(k) − 1 in place of k in (10), we obtain

�(uν(k)) ≤ �(u	(ν(k)−1)) − δ
αν(k)−1

‖Gαν(k)−1(uν(k)−1)‖2H . (18)

123


On the forward–backward method with nonmonotone… 1275

Further, we can write

�(u	(k+1)) = max
0≤ j≤m(k+1)

�(uk+1− j ) ≤ max
0≤ j≤m(k)+1

�(uk+1− j )

≤ max

[
�(uk+1), max

1≤ j≤m(k)+1
�(uk+1− j )

]

≤ max
[
�(u	(k)) − δ

αk
‖Gαk (uk)‖2H , �(u	(k))

]
≤ �(u	(k)),

(19)

where we have used m(k + 1) ≤ m(k) + 1. Therefore, {�(u	(k))}k is decreasing and
we can write

�(u	(ν(k)−1)) = �(u	(	(kmmax+k)−1)) ≤ �(u	(kmmax+k−mmax−1))

= �(u	((k−1)mmax+(k−1))) = �(uν(k−1)).
(20)

Together with (18) we can conclude (15) and, thus, L2 holds.
(L3) Assume that Algorithm 1 does not converge after finitely many iterations.

Summing (15) up for k = 1, . . . , k′, we obtain

k′∑
k=1

δ
αν(k)−1

‖Gαν(k)−1(uν(k)−1)‖2H ≤
k′∑

k=1

�(uν(k−1)) − �(uν(k))

≤ �(uν(0)) − �(uν(k′)).

(21)

Sending k′ to infinity and using the fact that� is bounded from below,we can conclude
(16). Similarly (17) follows by using the fact that formmax = 0 it holds ν(k) = 	(k) =
k. Thus, using (16), we can conclude the proof. ��

Finally we are ready to present our main convergence result of this section.

Theorem 6 Suppose that A1–A3 hold and that � is bounded from below. Then, for
the sequence {uk}k ⊂ H generated by Algorithm 1 with {αk}k ⊂ R>0, the following
statements hold true:

(i) Either Algorithm 1 terminates after finitely many iterations with a stationary
point of (P) or the sequence {‖Gαk (uk)‖H }k converges to zero, that is

lim
k→∞ ‖Gαk (uk)‖H = 0. (22)

(ii) The following inequality holds true

min
0≤i≤k

‖Gαinf (ui )‖H ≤ Cmmax
G

√
α(mmax+1)(�(u0)−�(uk ))

kδ
. (23)

(iii) If, in addition, A4 holds, every weak sequential accumulation point of {uk}k ⊂ H
is a stationary point of (P).

123


1276 B. Azmi, M. Bernreuther

Proof (i) Note that if Algorithm1 converges after finitelymany iterations, by definition
of the stopping criterion and Proposition 1, a stationary point of (P) has been found.
Now assume that Algorithm 1 does not converge after finitely many iterations. Then,
using L3, and the fact that αk ≤ α for all k by (iii) from Lemma 4, we arrive at

lim
k→∞ ‖Gαν(k)−1(uν(k)−1)‖H = 0. (24)

It remains now to show that (22) holds. To show this, we will successively use (iv) of

Lemma 4. Let k ≥ 0 be arbitrary. Using the fact that 0 ≤ k −
⌊

k
mmax+1

⌋
(mmax + 1) ≤

mmax and

⌊
k

mmax+1

⌋
(mmax + 1) − 	

(⌊
k

mmax+1

⌋
(mmax + 1)

)
≤ mmax,

we can write

‖Gαk (uk)‖H ≤ Cmmax
G ‖Gα⌊ k

mmax+1

⌋
(mmax+1)

(u⌊ k
mmax+1

⌋
(mmax+1)

)‖H

≤ C2mmax
G ‖Gα

	
(⌊
k

mmax+1

⌋
(mmax+1)

)(u
	
(⌊

k
mmax+1

⌋
(mmax+1)

))‖H

≤ C2mmax
G ‖Gα

ν

(⌊
k

mmax+1

⌋)(u
ν
(⌊

k
mmax+1

⌋))‖H

≤ C2mmax+1
G ‖Gα

ν

(⌊
k

mmax+1

⌋)
−1

(u
ν
(⌊

k
mmax+1

⌋)
−1

)‖H .

(25)

Finally, sending k to∞ and using (24), we obtain (22) and the proof of (i) is complete.
(ii) Due to the facts that �(uk) ≤ �(u

ν(
⌈

k
mmax+1

⌉
)
) (by L1), �(u0) = �(uν(0)),

and by successively using (8), we obtain that

�(uk) − �(u0) ≤ �(u
ν(
⌈

k
mmax+1

⌉
)
) − �(u0)

≤ �(u
ν(
⌈

k
mmax+1

⌉
)
) − �(u

ν(
⌈

k
mmax+1

⌉
−1)

) + �(u
ν(
⌈

k
mmax+1

⌉
−1)

) − · · ·
− �(uν(1)) + �(uν(1)) − �(u0)

≤

⌈
k

mmax+1

⌉
∑
i=1

− δ
αν(i)−1

‖Gαν(i)−1(uν(i)−1)‖2H ≤

⌈
k

mmax+1

⌉
∑
i=1

− δ
α
‖Gαν(i)−1(uν(i)−1)‖2H .

(26)

123


On the forward–backward method with nonmonotone… 1277

Thus, using L1, we can infer that

kδ
α(mmax+1) min

0≤i≤
⌈

k
mmax+1

⌉
(mmax+1)−1

‖Gαi (ui )‖2H

≤
⌈

k
mmax+1

⌉
δ

α
min

1≤i≤
⌈

k
mmax+1

⌉ ‖Gαν(i)−1(uν(i)−1)‖2H

≤

⌈
k

mmax+1

⌉
∑
i=1

δ
α
‖Gαν(i)−1(uν(i)−1)‖2H ≤ �(u0) − �(uk),

and this yields

min
0≤i≤

⌈
k

mmax+1

⌉
(mmax+1)−1

‖Gαi (ui )‖H ≤
√

α(mmax+1)(�(u0)−�(uk ))
kδ

. (27)

Further, using the second part of Lemma 4 and the fact that
⌈

k
mmax+1

⌉
(mmax + 1) −

1 − k < mmax, we can deduce that

min
0≤i≤k

‖Gαi (ui )‖H ≤ Cmmax
G min

0≤i≤
⌈

k
mmax+1

⌉
(mmax+1)−1

‖Gαi (ui )‖H . (28)

Finally, (23) follows from (27), (28), P2, and the fact that αk ≥ αinf for all k ≥ 0.
(iii) We show that every weak sequential accumulation point of {uk}k ⊂ H is a

stationary point of (P). In other words, we suppose that uki ⇀u∗ for a subsequence
{uki }i ⊂ {uk}k and u∗ ∈ H . Then we show that u∗ ∈ S∗. For the sake of convenience,
we use the same notation for the subsequence as for the sequence itself. To begin, due
to (22) in (i), P2, and the fact that αk ≥ αinf , we can conclude that

lim
k→∞ ‖Gαinf (uk)‖H = 0 and lim

k→∞ ‖Tαinf (uk) − uk‖H = 0. (29)

Applying (4) for α = αinf and u = uk , we obtain for every k ∈ N0 that

Gαinf (uk) − ∇F(uk) ∈ ∂R(Tαinf (uk)). (30)

Due to A4, we can infer that uk⇀u∗ implies ∇F(uk) → ∇F(u∗) and, thus, using
(29) we have Gαinf (uk) − ∇F(uk) → −∇F(u∗) as k → ∞. Due to (29), we also
can conclude that Tαinf (uk)⇀u∗ as k → ∞. Therefore, sending k → ∞ in (30)
and using the fact that the graph of ∂R is sequentially closed [26, Proposition 16.26]
under the weak topology for domain and the strong topology for codomain, we obtain
−∇F(u∗) ∈ ∂R(u∗). This completes the proof. ��
Remark 2 In the case that dim(H) < ∞, the statement of (iii) in Theorem 6 holds
true for every (strong) accumulation point of {uk}k without requiring A4.

123


1278 B. Azmi, M. Bernreuther

In the next theorem, we derive an estimate that reflects the worst-case complexity
of the required function and gradient-like evaluations of Algorithm 1 to find an εtol-
stationary point. The proof is inspired by the one given in [50, Theorem3.4] for smooth
problems.

Theorem 7 (Worst-case complexity) Suppose that A1–A3 hold and that � is bounded
from below with �̄ := infu∈H �(u) > −∞. Then for a given tolerance εtol > 0,
Algorithm 1 requires at most

k f
max :=

⌊
γ

f
comp(�(u0)−�̄)

ε2tol

⌋
(31)

function evaluations of � and

kg
max :=

⌊
γ

g
comp(�(u0)−�̄)

ε2tol

⌋
(32)

Gradient-like Gα(·) evaluations to find an iterate uk satisfying ‖Gαk (uk)‖H ≤ εtol,
where

γ f
comp := (mmax+1)C2mmax

G
γdecr

and γ g
comp := (mmax+1)αC2mmax

G
δ

,

with

γdecr := min
{

δ
αsup

,
2(1−δ)δ
n1ηLF ′

}
and n1 :=

⌊∣∣∣logη

(
ηLF ′

2αinf (1−δ)

)∣∣∣⌋ .

Proof The proof can be found in Appendix A. 1. ��
This finishes our considerations of convergence and complexity in the general

setting.

3.2 Convex case

In this section, higher-order convergence rates will be shown in two cases of addi-
tional structural assumptions. Firstly, we consider the case where F is also convex.
Afterwards, the case of a quadratic growth assumption on � will be investigated. We
assume the following modified version of Assumption 1.

Assumption 2 Assume that A1–A3 hold. Further, instead of A4, assume that

A’4: F : H → R is convex.

Under Assumption 2, the whole function � is convex. In this case, we can conclude
that the set of minimizers of � coincides with S∗ provided that S∗ �= ∅ and that S∗ is
closed and convex. Further, we have

S∗ = Argmin� = (∂�)−1(0).

123


On the forward–backward method with nonmonotone… 1279

The associated minimal function value is denoted by �∗ ∈ R.
Next we prove an auxiliary lemma which will be used later.

Lemma 8 Suppose that Assumption 2 holds, S∗ �= ∅, and {uk}k is generated by Algo-
rithm 1. Then for every λ ∈ [0, 1], it holds

�(uν(k)) − �∗ ≤ (1 − λ)
(
�(uν(k−1)) − �∗)+ αλ2

2 dist2(uν(k)−1, S∗)

+ C̃
αν(k)−1

‖Gαν(k)−1(uν(k)−1)‖2H ,
(33)

where C̃ := LF ′
2αinf

. Further, we have the following inequality for the initial iterations

�(uν(1)) − �∗ ≤ C0 dist
2(u0, S∗), (34)

with constant C0 which is independent of u0.

Proof The proof can be found in Appendix A. 2. ��
Now we are ready to provide the main convergence result for the convex case. For

the sake of convenience in the presentation, we set

Ek := �(uk) − �∗ for every k ≥ 0,

and use this notation in the remainder of this section.

Theorem 9 (Global convergence and O(k−1) complexity for the convex case) Sup-
pose that Assumption 2 holds, S∗ �= ∅, and {uk}k is generated by Algorithm 1. Then
the following statements hold true:

(i) Every weak sequential accumulation point of {uk}k belongs to S∗.
(ii) {uk}k converges weakly to a minimizer u∗ ∈ S∗ and its “shadow” sequence

converges strongly to u∗, that is PS∗uk → u∗.
(iii) It holds �(uk) → �∗ as k → ∞ and for large enough k ≥ 0, there exist

constants ρ1 > 0 and ρ2 > 0 such that

�(uk) − �∗ ≤ ρ1

ρ2 + k
. (35)

Proof (i) We suppose that a subsequence {uki }i with uki ⇀u∗ is given. We will show
that u∗ ∈ S∗. To show this, we prove that there exists a vanishing sequence of sub-
gradients {wki }i ⊂ H , i.e. wki → 0, corresponding to {uki }i with wki ∈ ∂�(uki ) for
every i ∈ N. Using (4), we define

wk+1 := Gαk (uk) + ∇F(uk+1) − ∇F(uk) ∈ ∂�(uk+1). (36)

Further, since S∗ �= ∅, � is bounded from below. Hence, we can use (i) of Theorem 6
and (22) holds. This, together with (36) and the boundedness of αk , implies

‖wk+1‖H ≤ ‖Gαk (uk)‖H + ‖∇F(uk+1) − ∇F(uk)‖H ≤ (1 + LF ′
αk

)‖Gαk (uk)‖H → 0,

123


1280 B. Azmi, M. Bernreuther

as k → ∞. In particular, we can infer that wki → 0 as i → ∞. Using the fact that the
graph of ∂� is sequentially closed [26, Proposition 16.26] under the weak topology
for domain and the strong topology for codomain together withwki → 0 and uki ⇀u∗,
we arrive at 0 ∈ ∂�(u∗) and therefore u∗ ∈ S∗.

(ii) We show that uk⇀u∗ with u∗ ∈ S∗. In this matter, we show that {uk}k is a
quasi-Fejér sequence with respect to S∗ �= ∅. Using (4), we can write for every k ∈ N0
that

0 ∈ αk(uk+1 − uk) + ∂R(uk+1) + ∇F(uk). (37)

Further, since R is convex and ∇F is Lipschitz continuous, by the Haddad-Bailon
Theorem [26, Corollary 18.16, p. 270], we have

(∇F(v) − ∇F(w), v − w)H ≥ LF ′−1‖∇F(v) − ∇F(w)‖2H .

Further, we can write for every v,w, z ∈ int(domF) that

(∇F(v) − ∇F(w), z − w)H

= (∇F(v) − ∇F(w), v − w)H + (∇F(v) − ∇F(w), z − v)H

≥ LF ′−1‖∇F(v) − ∇F(w)‖2H − ‖∇F(v) − ∇F(w)‖H ‖z − v‖H

≥ − LF ′
4 ‖z − v‖2H ,

(38)

where in the last line we have used the fact that f (x) := LF ′−1x2 − x‖z − v‖H is
strictly convex and attains its global minimum at x∗ = LF ′

2 ‖z − v‖H .
Using (38) for uk+1, uk , and any u∗ ∈ S∗ in place of v, z, and w, respectively, we

obtain

(∇F(uk) − ∇F(u∗), uk+1 − u∗)H ≥ − LF ′
4 ‖uk+1 − uk‖2H .

Together with the fact that ∂R is monotone, cf. [26], and (2), we can write

(η + ∇F(uk), uk+1 − u∗)H ≥ − LF ′
4 ‖uk+1 − uk‖2H for all η ∈ ∂R(uk+1). (39)

Using (37) and (39), we can deduce that

(uk+1 − uk, uk+1 − u∗)H ≤ LF ′
4αk

‖uk+1 − uk‖2H . (40)

Further, using (40) and the fact that

(w − v,w − z)H = 1
2‖w − v‖2H − 1

2‖v − z‖2H + 1
2‖w − z‖2H for all z, w, v ∈ H ,

123


On the forward–backward method with nonmonotone… 1281

we can infer that

1
2‖uk+1 − u∗‖2H − 1

2‖uk − u∗‖2H ≤
(

LF ′
4αk

− 1
2

)
‖uk+1 − uk‖2H

≤ LF ′
4α3

inf
‖Gαk (uk)‖2H ,

(41)

where it can be seen from (21) and (25) that

∞∑
k=mmax+1

‖Gαk (uk)‖2H ≤ (mmax + 1)C4mmax+2
G

∞∑
k=1

‖Gαν(k)−1(uν(k)−1)‖H < ∞.

Therefore, the sequence {uk}k ⊂ H is quasi-Fejér monotone with respect to S∗ and
since, due to (i), every weak sequential accumulation point of {uk}k ⊂ H belongs to
S∗ �= ∅, we can conclude by [51, Proposition 1(3)] that {uk}k is weakly convergent
and it has a unique accumulation point.

In addition, since S∗ is closed and convex, we can conclude, due to [52, Proposition
3.6 (iv)], that {PS∗uk}k converges strongly to a point û ∈ S∗. Moreover, since u∗ −
PS∗uk → u∗ − û and uk − PS∗uk⇀u∗ − û, it follows from the definition of orthogonal
projection that ‖u∗ − û‖2H = limk→∞(u∗ − PS∗uk, uk − PS∗uk)H ≤ 0. Hence, we
obtain that u∗ = û.

(iii) The proof of this part is inspired by the one in [38, Theorem 3.2.]. First, we
show that �(uk) → �∗. Due to (ii), uk⇀u∗ for some u∗ ∈ S∗. As in the proof of (i),
there exists a sequence wk → 0 with wk ∈ ∂�(uk). Therefore, we can write

�(uk) ≤ �∗ + (wk, uk − u∗)H for every k ∈ N0. (42)

Sending k → ∞ in (42) and using the facts that uk⇀u∗ and wk → 0 and the weak
sequential lower semicontinuity of �, we can conclude that

�∗ ≤ lim inf
k→∞ �(uk) ≤ lim sup

k→∞
�(uk) ≤ �∗.

Hence, �(uk) → �∗.
Next, we turn to the verification of (35) for a large enough k. Due to (33) of

Lemma 8, for every λ ∈ [0, 1], it holds

Eν(k) ≤ (1 − λ)Eν(k−1) + αλ2

2 dist2(uν(k)−1, S∗) + C̃
αν(k)−1

‖Gαν(k)−1(uν(k)−1)‖2H .

(43)

Since {uk}k is quasi-Féjer monotone with respect to S∗ �= ∅, due to [52, Proposition
3.6 (ii)], the sequence dist2(uk, S∗) is convergent. Therefore, for a positive constant
κ > 0, we have

dist2(uν(k)−1, S∗) ≤ κ, for all k ≥ 1. (44)

123


1282 B. Azmi, M. Bernreuther

Further, using (43), (44), and L2, we can write

Eν(k) ≤ (1 − λ)Eν(k−1) + αλ2κ
2 + C̃

δ

(
Eν(k−1) − Eν(k)

)
, (45)

where the expression on the right-hand side is strictly convex in λ, since

d2

dλ2

(
(1 − λ)Eν(k−1) + αλ2κ

2

)
= ακ > 0.

Therefore, it possesses the unique minimizer λ = Eν(k−1)
ακ

. Since {Eν(k)}k → 0, this

implies that for large enough k, we can set λ = Eν(k−1)
ακ

≤ 1 in (45) and obtain that

Eν(k) ≤ Eν(k−1) − E2
ν(k−1)
2ακ

+ C̃
δ

(
Eν(k−1) − Eν(k)

)
.

Together with the fact that Eν(k) ≤ Eν(k−1), we can write

Eν(k) ≤ Eν(k−1) − Eν(k−1)Eν(k)

2ακ
+ C̃

δ

(
Eν(k−1) − Eν(k)

)

and, thus, obtain

Eν(k) ≤
(
1 + δ

2ακ(C̃+δ)
Eν(k−1)

)−1
Eν(k−1). (46)

Further, (46) can be expressed as

1
Eν(k)

≥ 1
Eν(k−1)

+ δ

2ακ(C̃+δ)
.

Applying this inequality recursively for integers k1 and k2 with k2 ≥ k1 and large

enough k1 satisfying
Eν(k1)

ακ
≤ 1, we obtain

1
Eν(k2)

≥ 1
Eν(k1)

+ δ(k2−k1)
2ακ(C̃+δ)

,

and by easy computations, also

Eν(k2) ≤ 2ακ(C̃+δ)Eν(k1)

2ακ(C̃+δ)+Eν(k1)δ(k2−k1)
.

Thus, for k ≥ 0 large enough, we set k2 =
⌈

k
mmax+1

⌉
and obtain

Ek ≤ E
ν(
⌈

k
mmax+1

⌉
)
≤ 2ακ(C̃+δ)Eν(k1)

2ακ(C̃+δ)+Eν(k1)δ(
⌈

k
mmax+1

⌉
−k1)

≤ 2ακ(C̃+δ)Eν(k1)

−k1Eν(k1)δ+2ακ(C̃+δ)+Eν(k1)δ(
k

mmax+1 )
.

123


On the forward–backward method with nonmonotone… 1283

Therefore, (35) follows by setting

ρ1 := (mmax + 1)δ−12ακ(C̃ + δ)

and

ρ2 := (mmax + 1)(δEν(k1))
−1
(
2ακ(C̃ + δ) − k1Eν(k1)δ

)
.

Thus, the proof is complete. ��
Comparing Theorem 9 to Theorem 6, one obtains weak convergence of the whole
sequence and convergence of the associated objective function evaluations with rate
1/k.

Remark 3 Note that for the case where dim(H) < ∞, due to the equivalence of the
weak and strong topologies, it follows from (ii) of Theorem 9 that uk converges to u∗
in the strong topology. Moreover, we will later use the convergence PS∗uk → u∗ of
the shadow sequence to derive the strong convergence uk → u∗ under the quadratic
growth conditions. This is obviously no longer necessary when dim(H) < ∞.

3.3 Convergence under quadratic growth-type conditions

In this section, we turn our attention towards quadratic growth-type conditions and
study convergence of Algorithm 1 under these conditions.

Definition 2 (Quadratic growth condition) We say that � satisfies the quadratic
growth condition, if

�(u) − �∗ ≥ γ�,� dist2(u, S∗) for all u ∈ � ∩ dom� (47)

holds with a set � ⊂ H , a constant γ�,� > 0, and S∗ �= ∅. We refer to this notion
as global if � = H , and as local, if for u∗ ∈ S∗, r ∈ (0,∞], and ω > 0, we have
� = Br (u∗) ∩ [�∗ < � + ω]. Additionally, � is said to satisfy the strong quadratic
growth condition at u∗ if S∗ = {u∗} on �. That is,

�(u) − �(u∗) ≥ γ�,�‖u − u∗‖2H for all u ∈ � ∩ dom�. (48)

The quadratic growth condition is a geometrical assumption which describes the flat-
ness of the objective function around its minimizers. Roughly speaking, this condition
is considered as a relaxation of the strong convexity condition and allows us to obtain
faster rates of convergence (linear) and also convergence in the strong topology for the
iteration sequence. It is also closely related to the notion of Tikhonov well-posedness
[53]. The relationship between the quadratic growth condition and the so-called met-
ric subregularity of the subdifferential has been investigated e.g. in [10, 54–57]. The
strong quadratic growth condition (48) is said to be the quadratic functional growth
property in [16] provided that � is continuously differentiable over a closed convex

123


1284 B. Azmi, M. Bernreuther

set. In [15, 58], � is also called 2-conditional on � if it satisfies the quadratic growth
condition (47). This property was recently proved in [10, Theorem 5] to be equivalent
with the case where � satisfies the Kurdyka-Łojasiewicz inequality with order 1/2.

Theorem 10 Suppose that Assumption 2 and the quadratic growth condition (47) hold
for � := [� < �∗+ω] with ω > 0. Then, for the sequence of iterates {uk}k generated
by Algorithm 1, there exists k̄ ∈ N such that for every k ≥ k̄, we have

�(uk) − �∗ ≤ Ccσ
k, (49)

and

dist2(uk, S∗) ≤ Cdσ k, (50)

where the constants Cc > 0, Cd > 0, and 0 < σ < 1 are independent of k.
Further, there exists u∗ ∈ S∗ such that uk converges in the strong topology to u∗,

i.e. uk → u∗, and it holds

‖uk − u∗‖2H ≤ C pσ
k, (51)

with a constant C p > 0 which is independent of u∗ and k.

Proof First, due to L2 and (iii) from Theorem 9, the sequence {�(uν(k))}k is monoton-
ically decreasing and converges to �∗. Further, using (20), we can deduce for k ≥ 1
that

�(uν(k)−1) ≤ �(u	(ν(k)−1)) ≤ �(uν(k−1)). (52)

Thus, for given ω > 0, there exists k̄ω ∈ N such that

�(uν(k)−1)) ∈ [� < �∗ + ω] for all k ≥ k̄ω.

Next, we show that

Eν(k) ≤ θEν(k−1) for all k ≥ k̄ω, (53)

with a constant θ ∈ (0, 1) independent of k.
Let an arbitrary k ≥ k̄ω be given. To show (53), we choose an arbitrary ζ satisfying

0 < ζ < min{ 1
δ
, 1
2C̃

,
γ�,�

2αC̃
}.

Now we consider the following cases:

• The inequality 1
αν(k)−1

‖Gν(k)−1(uν(k)−1)‖2H ≥ ζEν(k−1) holds. In this case,
using L2, we obtain

Eν(k) ≤ Eν(k−1) − δ
αν(k)−1

‖Gν(k)−1(uν(k)−1)‖2H ≤ (1 − ζ δ)Eν(k−1),

123


On the forward–backward method with nonmonotone… 1285

where due to the choice of ζ we have 1 − ζ δ < 1.
• The inequality

1
αν(k)−1

‖Gν(k)−1(uν(k)−1)‖2H < ζEν(k−1) (54)

holds. This case is more delicate. Using the quadratic growth condition (47) and
(52), we obtain for k ≥ k̄ω that

dist2(uν(k)−1, S∗) ≤ 1
γ�,�

Eν(k)−1 ≤ 1
γ�,�

Eν(k−1). (55)

Now, using Lemma 8, (33), and (55), we can write for every λ ∈ [0, 1] that

Eν(k) ≤
(

C̃ζ + 1 − λ + α
2γ�,�

λ2
)
Eν(k−1), (56)

where it can be easily seen that the minimum of
(

C̃ζ + 1 − λ + α
2γ�,�

λ2
)
attained

at λ∗ := min{1, γ�,�

α
} is strictly smaller than 1 and therefore (53) holds for k ≥ k̄ω.

Let now k ≥ k̄ := k̄ω(mmax + 1) be given. By successively applying (53), we
obtain

Ek
L1≤ E

ν(
⌈

k
mmax+1

⌉
)
≤ θ1−k̄ωθ

⌈
k

mmax+1

⌉
Eν(k̄ω−1)

≤ θ1−k̄ωθ
k

mmax+1 Eν(k̄ω−1) ≤ θ1−k̄ω

(
θ

1
mmax+1

)k

Eν(k̄ω−1).

(57)

Together with the quadratic growth condition (47), we have

dist2(uk, S∗) ≤ 1
γ�,�

Ek ≤ θ1−k̄ω

γ�,�

(
θ

1
mmax+1

)k

Eν(k̄ω−1). (58)

Thus, for every k ≥ k̄, the iterates uk stay in �.

Setting Cc := θ1−k̄ωEν(k̄ω−1), Cd := γ −1
�,�θ1−k̄ωEν(k̄ω−1), and σ := θ

1
mmax+1 < 1,

we are finished with the verification of (49) and (50).
Next, we show that uk → u∗ for uk⇀u∗ with u∗ ∈ S∗ given in Theorem 9 (ii).

Due to (50), dist(uk, S∗) → 0 and we can infer that uk − PS∗uk → 0. This together
with fact that PS∗uk → u∗ (see (i i) in Theorem 9) leads to uk → u∗.

Finally, we show that (51) holds true. To see this, let p ∈ N be arbitrary, using
Young’s inequality we have

‖uk − uk+p‖2H ≤ 2
(
‖uk − PS∗uk‖2H + ‖uk+p − PS∗uk‖2H

)

= 2
(
dist2(uk, S∗) + ‖uk+p − PS∗uk‖2H

)
.

(59)

123


1286 B. Azmi, M. Bernreuther

Using (41), we obtain for an arbitrary u∗ ∈ S∗ that

‖uk+p − u∗‖2H ≤ ‖uk − u∗‖2H + LF ′
2α2

inf

k+p−1∑
j=k

‖Gα j (u j )‖2H .

In particular, we have for u∗ = PS∗uk that

‖uk+p − PS∗uk‖2H ≤ dist2(uk, S∗) + LF ′
2α2

inf

k+p−1∑
j=k

‖Gα j (u j )‖2H . (60)

Further, using L2 and (25), we can write for large enough k that

k+p−1∑
j=k

‖Gα j (u j )‖2H ≤ C4mmax+2
G

k+p−1∑
j=k

‖Gα
ν

(⌊
j

mmax+1

⌋)
−1

(u
ν

(⌊
j

mmax+1

⌋)
−1

)‖2H

≤ C4mmax+2
G αδ−1

k+p−1∑
j=k

δ
α

ν

(⌊
j

mmax+1

⌋)
−1

‖Gα
ν

(⌊
j

mmax+1

⌋)
−1

(u
ν

(⌊
j

mmax+1

⌋)
−1

)‖2H

≤ C4mmax+2
G αδ−1

⎛
⎝E

ν
(⌊

k
mmax+1

⌋
−1
) − E

ν

(⌊
k+p

mmax+1

⌋)
⎞
⎠ .

(61)

Combining (59), (60), (61), and setting C̄ p := C4mmax+2
G αLF ′

δα2
inf

, we arrive at

‖uk − uk+p‖2H ≤ 4 dist2(uk, S∗) + C̄ p

⎛
⎝E

ν
(⌊

k
mmax+1

⌋
−1
) − E

ν

(⌊
k+p

mmax+1

⌋)
⎞
⎠ .

Sending p → ∞ and using Theorem 9 (iii), (50), and similar computations as in (57),
we obtain for every k ≥ k̄ that

‖uk − u∗‖2H ≤ 4 dist2(uk, S∗) + C̄ pE
ν
(⌊

k
mmax+1

⌋
−1
)

≤ 4Cdσ k + C̄ pθ
−1−k̄ωσ kEν(k̄ω−1) = C pσ

k .

Thus, (51) holds true with C p := 4Cd + C̄ pθ
−1−k̄ωEν(k̄ω−1) and this completes the

proof. ��
Corollary 11 Suppose that Assumption 2 holds and the quadratic growth condition
(47) is satisfied for � := Br (u∗)∩[� < �∗ +ω] with r , ω ∈ (0,∞). Further assume
that, {uk}k generated by Algorithm 1, converges strongly to some u∗ ∈ S∗. Then there

123


On the forward–backward method with nonmonotone… 1287

exists k̄ ∈ N such that (49)–(51) hold with constants Cc > 0, Cd > 0, C p > 0, and
0 < σ < 1, which are independent of k and u∗.

Proof The proof proceeds along the lines of the proof of Theorem 10 with the differ-
ence that one also needs to be sure that {uk}k ⊂ Br (u∗) for a large enough k̄ ∈ N.
This follows from the fact that uk → u∗. ��

Remark 4 Due to the equivalence of weak and strong convergence in finite-
dimensional spaces, the assumption of Corollary 11 automatically holds if dim(H) <

∞. For the case that dim(H) = ∞, it is not clear how to guarantee that {uk}k ⊂ Br (u∗)
for large enough k ∈ N.

In many cases of PDE-constrained optimization, the assumptions of the following
corollary are very likely satisfied.

Corollary 12 Suppose that A1–A3 in Assumption 1 hold and that F is convex on
Br (u∗) with u∗ ∈ S∗ and r ∈ (0,∞). Further, we assume that the strong quadratic
growth condition (48) holds for � := Br (u∗)∩[� < �∗ +ω] with ω ∈ (0,∞). Then,
the sequence of iterations {uk}k generated by Algorithm 1 converges locally R-linear
with respect to the strong topology. In other words, there exists a radius r0 ≤ r such
that for every u0 ∈ Br0(u

∗) ∩ [� < �∗ + ω] and k ≥ 1, we have

‖uk − u∗‖2H ≤ CRσ k‖u0 − u∗‖2H (62)

and

�(uk) − �(u∗) ≤ σ k (�(u0) − �∗) , (63)

where the constants CR > 0 and 0 < σ < 1 are independent of u0 and k.

Proof By similar argument as in the proof of Theorem 10, we can show that for any
uν(k)−1 ∈ Br (u∗) ∩ [� < �∗ + ω] with k ≥ 1, it holds

Eν(k) ≤ θEν(k−1), (64)

with θ ∈ (0, 1) independent of k.
Let now k ≥ 1 be given.Assuming ui ∈ Br (u∗)∩[� < �∗+ω] for i = 0, . . . , k−1

and successively applying (64), we obtain

Ek
L2≤ E

ν(
⌈

k
mmax+1

⌉
)
≤ θ−1θ

⌈
k

mmax+1

⌉
Eν(1) ≤ θ−1θ

k
mmax+1 Eν(1)

≤ θ−1
(

θ
1

mmax+1

)k

Eν(1).

(65)

123


1288 B. Azmi, M. Bernreuther

Together with the quadratic growth condition (48) and (34) from Lemma 8, we have

‖uk − u∗‖2H ≤ 1
γ�,�

Ek ≤ 1
θγ�,�

(
θ

1
mmax+1

)k

Eν(1)

≤ C0
θγ�,�

(
θ

1
mmax+1

)k

‖u0 − u∗‖2H .

(66)

Thus, for every u0 ∈ Br0(u
∗) with r0 ≤

(
θγ�,�

C0

) 1
2 r , iterates uk stay in Br (u∗) for

k ≥ 0 and, as a consequence, the inequalities (65) and (66) are well-defined. By setting

CR := C0
θγ�,�

and σ := θ
1

mmax+1 < 1, we are finished with the verification of (62).
As in (65), we can also write for every k ≥ 0 and u0 ∈ Br0(u

∗) ∩ [� < �∗ + ω]
that

Ek ≤ E
ν(
⌈

k
mmax+1

⌉
)
≤ θ

⌈
k

mmax+1

⌉
Eν(0) ≤ θ

k
mmax+1 Eν(0) ≤ σ kEν(0).

Thus (63) holds also true and this completes the proof. ��
Compared to Corollary 11, we do not need to assume the strong convergence of the

sequence {uk}k in Corollary 12 and the following corollary.

Corollary 13 Suppose that A1–A3 in Assumption 1 hold and F is locally strongly
convex with a constant κ > 0 on Br (u∗) with some r ∈ (0,∞) and u∗ ∈ S∗. Then,
the sequence of iterates {uk}k generated by Algorithm 1 converges locally R-linear in
the strong topology to u∗. In other words, there exists r0 ≤ r such that (62) and (63)
hold for every u0 ∈ Br0(u

∗).
Proof Since � is locally strongly convex, using [59, Proposition 3.23], we can write

�(u) − �(v) ≥ (w, u − v)H + κ
2‖u − v‖2H for all u, v ∈ Br (u

∗) and w ∈ ∂�(v).

(67)

Setting v = u∗ and w = 0, we can easily see that the strong quadratic growth
property (48) holds for � := Br (u∗) and γ�,� := κ

2 . Therefore, the proof follows by
Corollary 12. ��
Remark 5 Note that, if the strong convexity of F in Corollary 13 holds globally, then
R-linear convergence is also obtained globally. More precisely, (62) and (63) hold for
every u0 ∈ H .

3.4 On relaxing the global Lipschitz continuity of∇F for problems governed by
PDEs

In this section,we analyze a list of assumptions satisfied for a large class of optimization
problems governed by PDEs. We will study the applicability of these assumptions in
Sect. 4.

123


On the forward–backward method with nonmonotone… 1289

Assumption 3 For problem (P),

H1: � : H → R ∪ {+∞} is bounded from below and radially unbounded, i.e.
lim‖u‖H →∞ �(u) = ∞.

H2: R : H → R ∪ {+∞} is proper, convex, and lower semicontinuous.
H3: F : H → R ∪ {+∞} is continuously Fréchet differentiable on int(domF)

containing domR, that is, domR ⊆ int(domF).
H4: ∇F : int(domF) → H is LF ′-Lipschitz continuous on every weakly sequen-

tially compact subset of domR.
H5: F is weakly sequentially lower semicontinuous (wlsc) and∇F : int(domF) →

H is weak-to-strong sequentially continuous.

Here, compared to Assumption 1, we do not impose the global Lipschitz condition
on ∇F . Nevertheless, Assumption 3 will be sufficient to reproduce the results of
Sect. 3.1. Furthermore, H1, H2, and H5 impose a set of conditions that ensure the
existence of a global minimizer to (P), as will be shown in the following.

Proposition 14 Suppose that H1, H2, and H5 hold. Then, problem (P) possesses a
global minimizer ū ∈ H, and as a consequence, S∗ �= ∅. Further, every level set of �

is weakly sequentially compact.

Proof The proof uses standard arguments based on the direct methods in the calculus
of variations. By H1, there exists

�̄ := inf
u∈H

�(u),

and a minimizing sequence {un}n ⊂ H with �(un) → �̄ for n → ∞. Due to
H1 and [26, Proposition 11.11]), every level set of � is bounded and, thus, {un}n

admits a weakly convergent subsequence {unk }k with unk ⇀ū ∈ H on [� ≤ �(ũ)]
for some ũ ∈ domR. Properties H2 and H5 imply that � is wlsc, and thus, �(ū) ≤
lim infk→∞ �(unk ) = �̄. This shows that�(ū) = �̄, i.e. ū ∈ H is a globalminimizer
of (P). Hence S∗ �= ∅. Further, since � is wlsc, every level set of � is weakly
sequentially closed. Together, with the boundedness of the level sets, we conclude
that every level set of � is weakly sequentially compact. ��

Comparing the properties given in Assumptions 1 and Assumption 3 in detail, we
realize that besides the global Lipschitz continuity of ∇F , the remaining properties of
Assumption 1 follow from those of Assumption 3. Further, due to Lemma 4(i) and
Lemma 5, Algorithm 1 is well-defined, even without the Lipschitz continuity of ∇F .
Further, it can be seen from (10) and Definition 1, that for every u0 ∈ domR, the
whole sequence {uk}k generated by Algorithm 1 stays in U0 := [� ≤ �(u0)]. Since
U0 is weakly sequentially compact and U0 ⊂ domR, using H4 we can deduce that
∇F is LF ′-Lipschitz continuous on U0 with {uk}k ⊂ U0. Therefore all the results and
estimates in Sect. 3.1 can be easily applied in the presence of Assumption 3. Further,
similar observations are also valid for the results given in Sect. 3.2, provided that we
replace H5 by A’4 and analogously also for Sect. 3.3.

123


1290 B. Azmi, M. Bernreuther

4 Application to PDE-constrained optimization

In this section,we investigate the applicability of our theoretical results to twoproblems
governed by semilinear elliptic and parabolic PDEs. For each example, we discuss the
justification of Assumption 3.

4.1 Elliptic problem

As a first example, we consider the following semilinear elliptic problem

min
(y,u)

1

2
‖y − yd‖2L2(�)

+ σ

2
‖u‖2L2(�)

+ λ‖u‖L1(�) (PE)

subject to

⎧⎪⎨
⎪⎩

−κ�y + exp(y) = u in �,

y = 0 on ∂�,

ua ≤ u ≤ ub a.e. in �,

(68)

on a bounded domain � ⊂ R
n with n = 2, 3, which is either convex or possesses a

C1,1-boundary ∂�. Here the parameter κ > 0 stands for the diffusion, the parameters
σ, λ > 0 weigh the cost terms, ua, ub ∈ R with ua < 0 < ub are control bounds,
and yd ∈ L2(�) denotes the desired state. First, we show that problem (PE) can be
rewritten in the form (P). It is well-known, cf. [60, Theorems 2.7 and 2.12], that for
given u ∈ H := L2(�), the state equation (68) is uniquely solvable in the weak sense,
i.e., y(u) ∈ W ∩C(�)with W := H1

0 (�), and the control-to-state operator u �→ y(u)

is well-defined and twice continuously Fréchet differentiable from H to W ∩ C(�).
This operator is typically constructed by applying the implicit function theorem on
the following equality constraints

E(y, u) = 0 in Z ′ := H−1(�) with E(y, u) := −κ�y + exp(y) − u.

Here, E : W × H → Z ′ is twice continuously Fréchet differentiable, for every u ∈ H ,
Ey(y, u) ∈ L(W , Z ′) has a bounded inverse, and Eu(y, u) ∈ L(H , Z ′) is continuous.
Then, by defining

F(u) := 1

2
‖y(u) − yd‖2L2(�)

, R(u) := σ

2
‖u‖2L2(�)

+ λ‖u‖L1(�) + δU (u), (69)

with the indicator function δU of U := {u : ua ≤ u ≤ ub a.e. in �} ⊂ H , problem
(PE) has the form (P). Further, by standard computations as in [61, Chapter 1.6], it
can be shown that

∇F(u) = −p(u) in H , (70)

123


On the forward–backward method with nonmonotone… 1291

where p = p(u) ∈ W (see e.g., [60, Theorem 3.2]) is the weak solution of the adjoint
equation

{
−κ�p + exp(y)p = −(y − yd) in �,

p = 0 on ∂�,
(71)

with y = y(u). Now it remains to verify Assumption 3 for the reduced problem (P)
with (69). Properties H1 and H2 are clearly satisfied. H3 follows using the chain
rule and the continuous Fréchet differentiability of the control-to-state mapping y(u).
Note that

∇F(u) = (y′(u))∗J ′(y(u)) with J (y) := 1

2
‖y − yd‖2L2(�)

, (72)

where y′(u)h = −E−1
y (y(u), u)Eu(y(u), u)h for every h ∈ H and the superscript “*”

denotes the adjoint operator. It remains to verify H4 and H5. Since the control-to-
state operator u → y(u) and J are twice continuously Fréchet differentiable, using

(72), and the compact embedding W
c

↪−→ H , one obtains that the second Fréchet
derivative of F is bounded on bounded sets. Therefore ∇F is Lipschitz continuous
on any bounded set in H . Thus, H4 holds. Finally using the weak formulation of the
state (68) and adjoint (71) equations and [60, Theorem 2.11], it can be shown that

y(uk)⇀y(ū) and p(uk)⇀p(ū) in W as uk⇀ū in H . This together with W
c

↪−→ H ,
(70), and (69), implies thatF(uk) → F(ū) and ∇F(uk) → ∇F(ū) in W as uk⇀ū in
H . Hence, H5 holds. This finishes the verification of Assumption 3. Note that using
(70) and (3), the Fermat principle can be expressed as

u∗ = Prox 1
α
R(u∗ + 1

α
p(u∗)) for some α > 0,

where p(u∗) is the solution of (71) for u = u∗ and Prox 1
α
R can be characterized in a

pointwise a.e. sense. Due to [26, Proposition 24.13], we have for any α > 0

[
Prox 1

α
R(u)

]
(x) = Prox 1

α
R(u(x)) for almost all x ∈ �,

where a pointwise a.e. closed form representation of the proximal operator is given by

[
Prox 1

α
R(u)

]
(x) = min{max{

⎧⎪⎨
⎪⎩

1
C1

(u(x) − C2), u(x) > C2,
1

C1
(u(x) + C2), u(x) < −C2,

0, otherwise,

, ua}, ub},

withC1 := 1+σ/α andC2 := λ/α. The calculations are made similarly to [4, Section
6].

123


1292 B. Azmi, M. Bernreuther

4.2 Parabolic problem

As a second example, we will consider the following semilinear parabolic problem

min
(y,u)

1

2
‖y − yd‖2L2(0,T ;L2(�))

+ λ‖u‖L1(0,T ;L1(�)) (PP)

subject to

⎧⎪⎪⎪⎨
⎪⎪⎪⎩

ẏ − κ�y + y3 = u in (0, T ) × �,

y = 0 on (0, T ) × ∂�,

y(0) = y0 in �,

ua ≤ u ≤ ub a.e. in �,

(73)

where � ⊂ R
n with n = 2, 3 is a bounded domain with Lipschitz boundary ∂� and

T > 0. Further, κ > 0, λ ≥ 0, and the control bounds ua, ub ∈ R are defined as in
problem (PE). We also consider the desired state yd ∈ L2(0, T ; L2(�)) and initial
function y0 ∈ H1

0 (�).
Similarly to the problem (PE), we define the control-to-state operator u �→ y(u)

from H := L2(0, T ; L2(�)) to W := L2(0, T ; H2(�)∩ H1
0 (�))∩ H1(0, T ; L2(�)).

The well-posedness and twice continuous Fréchet differentiability of the control-to-
state operator can be established by similar arguments as given in [1, 62]. Compared
to the elliptic problem (PE), we define

F(u) := 1

2
‖y(u) − yd‖2L2(0,T ;L2(�))

, R(u) := λ‖u‖L1(0,T ;L1(�)) + δU (u), (74)

with U := {u : ua ≤ u ≤ ub a.e. in (0, T ) × �} ⊂ H and also consider

E(y, u) :=
(

ẏ − κ�y + y3 − u
y(0)

)
with Z ′ := L2(0, T ; L2(�)) × H1

0 (�).

In this case, (70) holds with p = p(u) ∈ W as the strong solution of the adjoint
equation

⎧⎪⎨
⎪⎩

− ṗ − κ�p + 3y2 p = −(y − yd) in (0, T ) × �,

p = 0 on ∂�,

p(T ) = 0 in �,

(75)

with y = y(u). The verification of Assumption 3 for (PP) follows by the similar

arguments given for (PE) and using the compact embedding W
c

↪−→ L2(0, T ; H1
0 (�))

and consequently W
c

↪−→ H . Moreover, the closed form pointwise a.e. representation

123


On the forward–backward method with nonmonotone… 1293

of the proximal operator is given by

Prox 1
α
R(u)(t, x) = min{max{

⎧⎪⎨
⎪⎩

u(t, x) − C2, u(t, x) > C2,

u(t, x) + C2, u(t, x) < −C2,

0, otherwise,

, ua}, ub},

where C2 = λ/α.

4.3 Discussion on the quadratic growth condition

In the remainder of the section, we present a situation in which F is locally strongly
convex and, as a consequence,Algorithm1 converges locally R-linear byCorollary 13.
We consider the case that for u∗ ∈ S∗, F is twice continuously Fréchet differentiable
and its second derivative satisfies

F ′′(u∗)(h, h) ≥ C‖h‖2H for all h ∈ H , (76)

with some C > 0. For the two previous problems, we express F ′′(u∗) in terms of
solutions to PDEs and discuss when F is locally strongly convex.

For both problems (PE) and (PP), the control-to-state operator u �→ y(u) from H
to W is twice continuously Fréchet differentiable, and by similar computation as in
[63], its second derivative can be expressed as

y′′(u)(h, q) = −E−1
y (y(u), u)Eyy(y(u), u)(y′(u)h, y′(u)q) for all h, q ∈ H , (77)

since the control operator is linear.
To be able to use Corollary 13 for problem (PE), we first redefine F and R as

follows

F(u) := 1

2
‖y(u) − yd‖2L2(�)

+ σ

2
‖u‖2L2(�)

, R(u) := λ‖u‖L1(�) + δU (u).

In this case, with the same arguments given in Sect. 4.1, (PE) can be rewritten as (P)
for which H1–H4 hold. Further, using the chain rule and (77), we can write for every
h, q ∈ H that

F ′′(u)(h, q) = 〈J ′′(y(u))y′(u)h, y′(u)q〉W ′,W

+ 〈−E−∗
y (y(u), u)J ′(y(u)), Eyy(y(u), u)(y′(u)h, y′(u)q)〉Z ,Z ′ + σ(h, q)H ,

where 〈·, ·〉 stands for the dual pairing and J was defined in (72). Hence,F ′′(u∗)with
u∗ ∈ S∗ can be expressed as

F ′′(u∗)(h, h) = ‖yh‖2L2(�)
+
∫

�

p∗ exp(y∗)(yh)2dx + σ‖h‖2H , (78)

123


1294 B. Azmi, M. Bernreuther

where y∗ = y(u∗), p∗ = p(u∗), and yh ∈ W is the weak solution of the linearized
state equation

{
−κ�yh + exp(y∗)yh = h in �,

yh = 0 on ∂�.

Note that, F is locally strongly convex on a neighborhood of u∗ ∈ S∗ provided that
(76) holds for some C > 0. Further, using (78), (76) can be equivalently rewritten as

σ‖h‖2L2(�)
+ ‖yh‖2L2(�)

+
∫

�

p∗ exp(y∗)(yh)2dx ≥ C‖h‖2L2(�)
for all h ∈ H .

(79)

We observe that the only term in (79) that can spoil the positive definiteness ofF ′′(u∗)
and, hence, local strong convexity ofF , is the term involving p∗. This term originates
from the nonlinearity in the state equation. Note that either a small enough adjoint
p∗ or a large enough parameter σ ensure that (79) and equivalently (76) hold. A
small enough adjoint can occur, for instance, if ‖y∗ − yd‖2

L2(�)
is sufficiently small.

Similarly, for the second example (PP), we obtain for F and R defined in (74) that

F ′′(u∗)(h, h) = ‖yh‖2L2(0,T ;L2(�))
+

T∫
0

∫
�

6p∗y∗(yh)2dxdt, (80)

where yh ∈ W is the weak solution to the linearized state equation

⎧⎪⎨
⎪⎩

ẏh − κ�yh + 3(y∗)2yh = h in (0, T ) × �,

yh = 0 on (0, T ) × ∂�,

yh(T ) = 0 in �,

and p∗ solves the weak formulation of (75) for y∗. For this problem, due to the absence
of the term σ

2 ‖·‖2H in the objective function, the local strong convexity is not clear. For
the elliptic problem (PE), the authors in [64] have studied a weaker condition which
implies that the quadratic growth condition (48) is satisfied. Furthermore from [65,
Section 3], it especially follows that for σ = 0 condition (79) cannot be satisfied for
the elliptic problem (PE).

5 Numerical experiments

In this section, we report on the numerical experiments for the problems (PE) and (PP)
in order to verify the capabilities of Algorithm 1 numerically. Throughout, we use
‖Gαk (uk)‖H ≤ εtol with some tolerance εtol > 0 as the termination condition, as it is
proposed in Sect. 2. Our codes have been implemented in Python 3 and use FEniCS

123


On the forward–backward method with nonmonotone… 1295

Table 1 Example 1: Parameter setting

Optimization problem Algorithm 1

� κ σ λ yd ua ub εtol αinf αsup α0 η δ mmax

(0, 1)2 10−2 10−4 10−3 See Fig. 1a −3 2 10−6 10−4 102 10 8 0.9 8

Fig. 1 Example 1: The desired state yd , the optimal state, and the optimal control (left to right)

(see [66]) for the matrix assembly. Sparse memory management and computations
have been implemented with SciPy (see [67]). All computations below have been run
on an Ubuntu 22.04 notebook with 32 GB main memory and an Intel Core i7-8565U
CPU.

We will also compare different step-size approaches for the iterations (1) with
respect to gradient-like evaluations and function evaluations as introduced in The-
orem 7. We consider a fixed step-size, different combination of BB-type step-sizes
presented in Sect. 2 (without linesearch method), and BB-type step-sizes incorporated
with the (non-)monotone linesearch approach.

Note that for problems governed by nonlinear PDEs such as (PE) and (PP), any
gradient-like Gα evaluation requires solving a nonlinear state equation and a linear
adjoint equation and any function � evaluation is involved with solving a nonlinear
state equation. Furthermore, the number of gradient-like evaluations corresponds to
the number of iterations k of Algorithm 1.

Example 1 (Elliptic problem) In this example, we consider problem (PE). For the
spatial discretization, we follow a discretize-before-optimize approach and use P1-
type finite elements on a Friedrichs-Keller triangulation of the spatial domain �. To
efficiently evaluate the nonlinearity, we resort to mass lumping. For the numerical
tests, we choose the parameters summarized in Table 1. Note that for the fixed step-
size approach, we set αk = α0 in (1) for all k ∈ N0.

The desired state, the optimal state, and the optimal control are illustrated in Fig. 1.
We can see that the bounds are active for the control, though no strong sparsity is
promoted, due to the choices of λ and σ .

We compare the different BB-type step-sizes presented in Sect. 2with the fixed step-
size approach and with a (non-)monotone linesearch approach. The results regarding
computational time, function evaluations, and gradient-like evaluations are gathered

123


1296 B. Azmi, M. Bernreuther

Fig. 2 Example 1: Convergence of Algorithm 1. “Error” refers to ‖Gαk (uk )‖H at the current iterate

Fig. 3 Example 1: Example of
non-convergence without
linesearch. “Error” refers to
‖Gαk (uk )‖H at the current
iterate

in Table 2. For the linesearchmethods, the most volatile of the novel BB-type step-size
updates, i.e. BB1b, is used as the initial trial step-sizes within Algorithm 1.

As can be seen from Table 2, for this example, all other approaches outperform
the one with fixed step-size by a huge margin of about two orders of magnitude. The
alternatingBB-methods appear to bemore efficient compared to the single BB-updates
for both the old and novel step-sizes. Furthermore, except in the case of BB1, the
novel step-sizes outperform the old ones and, overall, they appear to be competitive.
All of these considerations are valid for both computational time and gradient-like
evaluations. Function evaluations are only needed if a linesearch method is used.
Compared to the BB1b method, the nonmonotone linesearch method needs about 250
fewer gradient-like evaluations, but this comes at the cost of 887 additional function
evaluations for the linesearch and this results in an increased overall computational
time. Compared to the monotone (mmax = 0) linesearch, the nonmonotone approach
performs significantly better. The convergence behavior is also visualized in Fig. 2.

Figure 3 presents an example, where a BB-type step-size update without linesearch
fails to converge. In this example, we start the algorithms with α0 = 1 instead of
α0 = 10. This confirms the necessity of incorporating a linesearch strategy to ensure
convergence.

Example 2 (Parabolic problem) In this example, we consider (PP). Here, the spatial
domain is discretized in the samemanner as in Example 1. For the temporal discretiza-
tion, we use the Crank Nicolson/Adams Bashforth scheme [68]. In this scheme, the

123


On the forward–backward method with nonmonotone… 1297

Ta
bl
e
2

E
xa
m
pl
e
1:

N
um

er
ic
al
re
su
lts

fo
r
fix

ed
st
ep
-s
iz
e,
di
ff
er
en
ts
pe
ct
ra
lg

ra
di
en
tm

et
ho
ds

as
in
tr
od
uc
ed

in
Se
ct
.2
,a
nd

a
(n
on

-)
m
on

ot
on

e
lin

es
ea
rc
h
m
et
ho

d
w
ith

re
sp
ec
t

to
B
B
1b

Fi
xe
d

B
B
1a

B
B
2a

A
B
B
a

B
B
1b

B
B
2b

A
B
B
b

no
nm

on
.L

S
(B

B
1b

)
m
on

.L
S
(B

B
1b

)

gr
ad
.-
lik

e
ev
al
.

15
3,
66

2
61

8
10

46
57

1
94

1
60

8
38

3
69

7
99

1

fu
n.
ev
al
.

0
0

0
0

0
0

0
88

7
15

27

tim
e
[s
]

1.
56

·1
03

1.
01

·1
01

1.
36

·1
01

1.
55

·1
01

8.
39

·1
00

7.
92

·1
00

5.
65

·1
00

2.
21

·1
01

3.
30

·1
01

123


1298 B. Azmi, M. Bernreuther

Table 3 Example 2: Parameter setting

Optimization problem Algorithm 1

� T κ λ yd ua ub εtol αinf αsup α0 η δ mmax

(0, 1)2 1 10−2 10−2 See Fig. 4 −100 100 10−6 10−4 102 10 4 0.8 4

Fig. 4 Example 2: Snapshots of the desired state yd at time instances 0, 0.25, 0.5 and 0.75

Fig. 5 Example 2: Snapshots of the optimal state at time instances 0, 0.25, 0.5 and 0.75

Fig. 6 Example 2: Snapshots of the optimal control at time instances 0, 0.25, 0.5 and 0.75

implicit Crank Nicolson scheme is used except for the nonlinear terms which are
treated using the explicit Adams Bashforth scheme and mass lumping. For the numer-
ical tests, we choose the parameters summarized in Table 3. For the fixed step-size
approach, we choose αk = α0 for k ∈ N0, similarly to the previous example.

The desired state, the optimal state, and the optimal control at time instances t =
0, 0.25, 0.5, 0.75 are depicted in Figs. 4, 5, and 6, respectively. We can see sparsity in
space for the control. Note that sparsity in time can also be observed in the sense that
the control stays zero on an interval between time instances t = 0.25 and t = 0.75.

Similarly to Example 1, we compare the different BB-type step-sizes presented
in Sect. 2 with the fixed step-size approach and with a (non-)monotone linesearch

123


On the forward–backward method with nonmonotone… 1299

Ta
bl
e
4

E
xa
m
pl
e
2:

N
um

er
ic
al
re
su
lts

fo
r
fix

ed
st
ep
-s
iz
e,
di
ff
er
en
ts
pe
ct
ra
lg

ra
di
en
tm

et
ho
ds

as
in
tr
od
uc
ed

in
Se
ct
.2
,a
nd

a
(n
on

-)
m
on

ot
on

e
lin

es
ea
rc
h
m
et
ho

d
w
ith

re
sp
ec
t

to
B
B
1b

Fi
xe
d

B
B
1a

B
B
2a

A
B
B
a

B
B
1b

B
B
2b

A
B
B
b

no
nm

on
.L

S
(B

B
1b

)
m
on

.L
S
(B

B
1b

)

gr
ad
.-
lik

e
ev
al
.

51
5,
04

4
37

5
75

8
78

4
47

0
44

5
22

1
46

3
47

1

fu
n.
ev
al
.

0
0

0
0

0
0

0
63

9
71

3

tim
e
[s
]

5.
66

·1
04

4.
28

·1
01

8.
70

·1
01

8.
89

·1
01

5.
35

·1
01

5.
08

·1
01

2.
50

·1
01

8.
65

·1
01

9.
57

·1
01

123


1300 B. Azmi, M. Bernreuther

Fig. 7 Example 2: convergence of Algorithm 1. “Error” refers to ‖Gαk (uk )‖H at the current iterate

approach, and the results regarding computational time, function evaluations, and
gradient-like evaluations are presented in Table 4. To reveal the efficiency of the
linesearch strategy, we incorporate the least efficient BB-type step-size, namely BB1b,
with the linesearch strategy.

All considerations and observations fromExample 1 hold also true for this example.
That is, the iterations (1) for the choice of the BB-type step-sizes significantly outper-
form those with fixed step-size, as is to be expected. The novel alternating BB-method
outperforms the other approaches. The nonmonotone linesearch method outperforms
the monotone version. It also outperforms the BB1b version concerning gradient-like
evaluations, but the cost of additional 639 function evaluations again does not pay off
for overall computational time. The convergence behavior is also visualized in Fig. 7.

To summarize, all the numerical results from the above examples show the capa-
bilities of Algorithm 1 and the necessity for the use of a linesearch method. From the
above example, we can conclude that the incorporation of nonmonotone linesearch
and BB step-sizes leads to an efficient algorithm for a class of nonsmooth nonconvex
PDE-constrained optimization problems. Moreover, the convergence behavior is far
better than what our worst-case complexity results suggest. In particular, when com-
bined with the BB methods, this is typical behavior, see e.g., [34, 35, 40]. Whether
the strong quadratic growth condition helps to accelerate convergence in the elliptic
example (PE) towards the end is not entirely clear, although the orange curves in
Figs. 2c and 3 suggest it.

Conclusion

We have studied the nonmonotone FBS method for a class of nonsmooth infinite-
dimensional composite problems in Hilbert spaces. Starting with the general noncon-
vex setting, we have established global convergence with complexity (1/

√
k) and also

provided a worst-case complexity analysis. Under additional convexity assumption,
convergence has been improved to the sublinear of order (1/k) in function values. If
additionally, a quadratic growth-type condition is satisfied, we have shown R-linear
convergence, both in function values and iterates. Additional difficulties arising in the
transition from weak to strong convergence in the infinite-dimensional setting have
been discussed in detail. Finally, the nonmonotone FBS and the novel BB step-size

123


On the forward–backward method with nonmonotone… 1301

rules exploiting the nonsmooth part were successfully tested for elliptic and parabolic
PDE-constrained problems.

Appendix A Proofs

A. 1 Proof of Theorem 7

Here, we restrict ourselves mainly to deriving (31). Justification of (32) follows by
similar arguments. The proof is carried out by determining the minimum reduction of
the objective function between iteration step uk+1 and its maximal predecessor, i.e.
u	(k), with respect to (11) as long as Algorithm 1 has not been terminated. Note that
u	(k) can be calculated without further evaluations of the objective function by storing
previous iterates. When using Algorithm 1, we have the following two cases:

• Assume that (8) in Step 4 of Algorithm 1 holds for αk := αint,k and therefore
ik = 0. Then we have a decrease

�(u	(k)) − �(uk+1) ≥ δ
αint,k

‖Gαk (uk)‖2H ≥ δ
αsup

‖Gαk (uk)‖2H .

In this case, only one additional function evaluation is necessary to obtain this
decrease.

• Assume that (8) in Step 4 of Algorithm 1 fails to hold for αint,k and therefore
ik ≥ 1 backtracking steps are performed. Due to (iii) of Lemma 4 we have αk =
αint,kη

ik <
ηLF ′
2(1−δ)

, and, as a consequence, we can bound ik as follows ik <∣∣∣logη

(
ηLF ′

2αint,k (1−δ)

)∣∣∣ . Hence, Step 4 requires at most n1 :=
⌊∣∣∣logη

(
ηLF ′

2αinf (1−δ)

)∣∣∣⌋
function evaluations. At the time that the linesearch strategy terminates in this step,
we have the decrease

�(u	(k)) − �(uk+1) ≥ δ
αk

‖Gαk (uk)‖2H >
2(1−δ)δ
ηLF ′ ‖Gαk (uk)‖2H .

Gathering the two above cases, we can see that function value decrease per function
evaluation is given, in the worst case, by

�(u	(k)) − �(uk+1) ≥ γdecr‖Gαk (uk)‖2H ,

123


1302 B. Azmi, M. Bernreuther

with γdecr := min
{

δ
αsup

,
2(1−δ)δ
n1ηLF ′

}
. Further, using similar arguments as in the proof of

Theorem 6, we obtain

�(u0) − �̄ ≥ �(u0) − �(uk) ≥ �(uν(0)) − �(u
ν(
⌈

k
mmax+1

⌉
)
)

=

⌈
k

mmax+1

⌉
∑
i=1

�(uν(i−1)) − �(uν(i)) ≥

⌈
k

mmax+1

⌉
∑
i=1

γdecr‖Gαν(i)−1(uν(i)−1))‖2H

≥ γdecr

⌈
k

mmax+1

⌉
min

1≤i≤
⌈

k
mmax+1

⌉ ‖Gαν(i)−1(uν(i)−1)‖2H

≥
(

γdecrk
mmax+1

)
min

0≤i≤
⌈

k
mmax+1

⌉
(mmax+1)−1

‖Gαi (ui )‖2H

≥
(

γdecrk

C2mmax
G (mmax+1)

)
min
0≤i≤k

‖Gαi (ui )‖2H .

(A1)

To find an εtol-stationary point, we assume that up to the current iteration k Algorithm 1
has not been terminated, i.e. ‖Gαi (ui )‖H > ε for all i = 0, . . . , k. In this case, using

(A1), we can write �(u0) − �̄ >

(
γdecrk

C2mmax
G (mmax+1)

)
ε2tol, and, therefore, the total

number of function evaluations of � in Algorithm 1 is bounded from above by

k f
max ≤

⌊
C2mmax

G (mmax+1)(�(u0)−�̄)

γdecrε
2
tol

⌋
=
⌊

γ
f
comp(�(u0)−�̄)

ε2tol

⌋
.

Thus, we are finished with the verification of (31). Similarly, using (23), it can be
easily shown that the total number of Gαk (uk) evaluations is bounded by

kg
max ≤

⌊
C2mmax

G α(mmax+1)(�(u0)−�̄)

δε2tol

⌋
=
⌊

γ
g
comp(�(u0)−�̄)

ε2tol

⌋
,

and, thus, the proof is complete.

A. 2 Proof of Lemma 8

First, we show for every k ≥ 1 that

�(uk) ≤ min
w∈H

Qαk−1(w, uk−1) + LF ′
2αinfαk−1

‖Gαk−1(uk−1)‖2H . (A2)

Due to A3, the descent lemma, see [59, Lemma 1.30], implies

F(uk) ≤ F(uk−1) + (∇F(uk), uk − uk−1)H + LF ′
2α2

k−1
‖Gαk−1(uk−1)‖2H .

123


On the forward–backward method with nonmonotone… 1303

Using the definition of Qαk−1 and the fact that uk is the minimizer of Qαk−1(·, uk−1),
we obtain

�(uk) ≤ min
w∈H

Qαk−1(w, uk−1) +
(

LF ′
2α2

k−1
− 1

2αk−1

)
‖Gαk−1(uk−1)‖2H

≤ min
w∈H

Qαk−1(w, uk−1) + LF ′
2α2

k−1
‖Gαk−1(uk−1)‖2H ,

and, thus, (A2) follows from the fact that αk ≥ αinf for all k ∈ N0. Due to the convexity
of F , we can write

min
w∈H

Qαk−1(w, uk−1) ≤ min
w∈H

{�(w) + αk−1
2 ‖w − uk−1‖2H }. (A3)

Further, due to the convexity of �, we obtain for every w = (1− λ)uk−1 + λu∗ with
λ ∈ [0, 1] and u∗ ∈ S∗ that

min
w∈H

{�(w) + αk−1
2 ‖w − uk−1‖2H } ≤ �((1 − λ)uk−1 + λu∗) + αk−1λ

2

2 ‖uk−1 − u∗‖2H
≤ (1 − λ)�(uk−1) + λ�∗ + αk−1λ

2

2 ‖uk−1 − u∗‖2H .

Combining with (A2) and (A3), we can write

�(uk) ≤ (1 − λ)�(uk−1) + λ�∗+αk−1λ
2

2 ‖uk−1−u∗‖2H + LF ′
2αinfαk−1

‖Gαk−1(uk−1)‖2H
≤ (1 − λ)�(u	(k−1)) + λ�∗ + αk−1λ

2

2 ‖uk−1 − u∗‖2H + LF ′
2αinfαk−1

‖Gαk−1(uk−1)‖2H .

(A4)

Now, inserting ν(k) in the place of k in (A4) and using (20), we have

�(uν(k)) ≤ (1 − λ)�(uν(k−1)) + λ�∗ + αν(k)−1λ
2

2 ‖uν(k)−1 − u∗‖2H
+ LF ′

2αinfαν(k)−1
‖Gαν(k)−1(uν(k)−1)‖2H .

(A5)

Subtracting�∗ from both sides of (A5), setting C̃ := LF ′
2αinf

, and using (iii) of Lemma 4,
we obtain

�(uν(k)) − �∗ ≤ (1 − λ)
(
�(uν(k−1)) − �∗)+ αλ2

2 ‖uν(k)−1 − u∗‖2H
+ C̃

αν(k)−1
‖Gαν(k)−1(uν(k)−1)‖2H .

(A6)

Finally, (33) follows from the fact that u∗ ∈ S∗ was arbitrary and S∗ is non-empty,
closed, and convex.

Nowwe deal with the verification of (34). Let an arbitrary u∗ ∈ S∗ be given, setting
k = 1 and λ = 1 in (A6), we obtain

�(uν(1)) − �∗ ≤ α
2 ‖uν(1)−1 − u∗‖2H + C̃

αν(1)−1
‖Gαν(1)−1(uν(1)−1)‖2H . (A7)

123


1304 B. Azmi, M. Bernreuther

Using, the fact that 0 ≤ ν(1) − 1 ≤ mmax (see L1), the firm nonexpansiveness of the
proximal operator, and the Lipschitz continuity of ∇F , we can write that

‖uν(1)−1 − u∗‖H ≤
(
1 + LF ′

αν(1)−2

)
‖uν(1)−2 − u∗‖H

≤
(
1 + LF ′

αinf

)
‖uν(1)−2 − u∗‖H ≤ · · · ≤

(
1 + LF ′

αinf

)mmax ‖u0 − u∗‖H .

(A8)

Further, using (iv) of Lemma 4 successively, we obtain that

‖Gαν(1)−1(uν(1)−1)‖H ≤ Cmmax
G ‖Gα0(u0)‖H . (A9)

Combining (A7), (A8), and (A9), we can write

�(uν(1)) − �∗ ≤ α
2

(
1 + LF ′

αinf

)2mmax ‖u0 − u∗‖2H + C̃C2mmax
G
αinf

‖Gα0(u0)‖2H
≤ α

2

(
1 + LF ′

αinf

)2mmax ‖u0 − u∗‖2H + α2C̃C2mmax
G

αinf
‖u1 − u0‖2H .

(A10)

Further, the firm nonexpansiveness of the proximal operator and the Lipschitz conti-
nuity of ∇F again imply that

‖u1 − u0‖H ≤ ‖u0 − u∗‖H + ‖u1 − u∗‖H

≤ 2‖u0 − u∗‖H + LF ′
α0

‖u0 − u∗‖H ≤ (2 + LF ′
αinf

)‖u0 − u∗‖H .
(A11)

Thus, combining (A10) and (A11) and setting

C0 := α
2

(
1 + LF ′

αinf

)2mmax + α2C̃C2mmax
G

αinf

(
2 + LF ′

αinf

)2
,

we arrive at �(uν(1)) − �∗ ≤ C0‖u0 − u∗‖2H . Hence, (34) follows from the fact that
u∗ ∈ S∗ is arbitrary.

Acknowledgements The work of Marco Bernreuther was supported by the German Research Foundation
(DFG) under grant number VO 1658/5-2 within the priority program “Non-smooth and Complementarity-
based Distributed Parameter Systems: Simulation and Hierarchical Optimization” (SPP 1962).

Funding Open Access funding enabled and organized by Projekt DEAL.

Data Availability The solution data generated during the current study are available from the corresponding
author on reasonable request.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

OpenAccess This article is licensedunder aCreativeCommonsAttribution 4.0 InternationalLicense,which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included

123


On the forward–backward method with nonmonotone… 1305

in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Azmi, B., Kunisch, K., Rodrigues, S.S.: Saturated feedback stabilizability to trajectories for the schlögl
parabolic equation. IEEE Trans. Autom. Control (2023). https://doi.org/10.1109/TAC.2023.3247511

2. O’Donoghue, B., Stathopoulos, G., Boyd, S.: A splitting method for optimal control. IEEE Trans.
Control Syst. Technol. 21(6), 2432–2442 (2013). https://doi.org/10.1109/TCST.2012.2231960

3. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algo-
rithms for Inverse Problems in Science and Engineering. Springer Optimization Application, vol. 49,
pp. 185–212. Springer, New York (2011). https://doi.org/10.1007/978-1-4419-9569-8_10

4. Parikh, N., Boyd, S., et al.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
5. Beck, A.: First-order Methods in Optimization. MOS-SIAM Series on Optimization, vol. 25, p. 475.

Society for Industrial and AppliedMathematics (SIAM), Philadelphia (2017). https://doi.org/10.1137/
1.9781611974997.ch1

6. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542

7. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161
(2013). https://doi.org/10.1007/s10107-012-0629-5

8. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving
analytic features. Math. Program. 116(1–2), 5–16 (2009). https://doi.org/10.1007/s10107-007-0133-
5

9. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame
problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods.
Math. Program. 137(1–2), 91–129 (2013). https://doi.org/10.1007/s10107-011-0484-9

10. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order
descent methods for convex functions. Math. Program. 165(2), 471–507 (2017). https://doi.org/10.
1007/s10107-016-1091-6

11. Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka-
łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874–900 (2015).
https://doi.org/10.1007/s10957-014-0642-3

12. Ioan Boţ, R., Robert Csetnek, E.: A forward–backward dynamical approach to the minimization of
the sum of a nonsmooth convex with a smooth nonconvex function. ESAIM Control Optim. Calc. Var.
24(2), 463–477 (2018). https://doi.org/10.1051/cocv/2017020

13. Bello-Cruz, Y., Li, G., Nghia, T.: On the linear convergence of forward–backward splitting method:
Part I-convergence analysis. J. Optim. Theory Appl. 188(2), 378–401 (2021). https://doi.org/10.1007/
s10957-020-01787-7

14. Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal
methods. Math. Oper. Res. 43(3), 919–948 (2018). https://doi.org/10.1287/moor.2017.0889

15. Garrigos, G., Rosasco, L., Villa, S.: Convergence of the forward–backward algorithm: beyond the
worst-case with the help of geometry. Math. Program. (2022). https://doi.org/10.1007/s10107-022-
01809-4

16. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first ordermethods for non-strongly convex
optimization. Math. Program. 175(1–2), 69–107 (2019). https://doi.org/10.1007/s10107-018-1232-1

17. Boţ, R.I., Csetnek, E.R., László, S.C.: An inertial forward–backward algorithm for the minimization
of the sum of two nonconvex functions. EURO J. Comput. Optim. 4(1), 3–25 (2016). https://doi.org/
10.1007/s13675-015-0045-8

18. Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization.
SIAM J. Optim. 25(4), 2434–2460 (2015). https://doi.org/10.1137/140998135

19. Boţ, R.I., Csetnek, E.R., Nguyen, D.-K.: A proximal minimization algorithm for structured noncon-
vex and nonsmooth problems. SIAM J. Optim. 29(2), 1300–1328 (2019). https://doi.org/10.1137/
18M1190689

123

http://creativecommons.org/licenses/by/4.0/
https://doi.org/10.1109/TAC.2023.3247511
https://doi.org/10.1109/TCST.2012.2231960
https://doi.org/10.1007/978-1-4419-9569-8_10
https://doi.org/10.1137/1.9781611974997.ch1
https://doi.org/10.1137/1.9781611974997.ch1
https://doi.org/10.1137/080716542
https://doi.org/10.1007/s10107-012-0629-5
https://doi.org/10.1007/s10107-007-0133-5
https://doi.org/10.1007/s10107-007-0133-5
https://doi.org/10.1007/s10107-011-0484-9
https://doi.org/10.1007/s10107-016-1091-6
https://doi.org/10.1007/s10107-016-1091-6
https://doi.org/10.1007/s10957-014-0642-3
https://doi.org/10.1051/cocv/2017020
https://doi.org/10.1007/s10957-020-01787-7
https://doi.org/10.1007/s10957-020-01787-7
https://doi.org/10.1287/moor.2017.0889
https://doi.org/10.1007/s10107-022-01809-4
https://doi.org/10.1007/s10107-022-01809-4
https://doi.org/10.1007/s10107-018-1232-1
https://doi.org/10.1007/s13675-015-0045-8
https://doi.org/10.1007/s13675-015-0045-8
https://doi.org/10.1137/140998135
https://doi.org/10.1137/18M1190689
https://doi.org/10.1137/18M1190689


1306 B. Azmi, M. Bernreuther

20. Ahookhosh, M., Themelis, A., Patrinos, P.: A Bregman forward–backward linesearch algorithm for
nonconvex composite optimization: superlinear convergence to nonisolated local minima. SIAM J.
Optim. 31(1), 653–685 (2021). https://doi.org/10.1137/19M1264783

21. DeMarchi, A.: Proximal gradientmethods beyondmonotony. J. NonsmoothAnal. Optim. 4, 18 (2023).
https://doi.org/10.46298/jnsao-2023-10290

22. Bonettini, S., Loris, I., Porta, F., Prato, M.: Variable metric inexact line-search-based methods for non-
smooth optimization. SIAM J. Optim. 26(2), 891–921 (2016). https://doi.org/10.1137/15M1019325

23. Scheinberg, K., Tang, X.: Practical inexact proximal quasi-Newton method with global complexity
analysis. Math. Program. 160(1–2), 495–529 (2016). https://doi.org/10.1007/s10107-016-0997-3

24. Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex
functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–
2303 (2018). https://doi.org/10.1137/16M1080240

25. Kanzow, C., Lechner, T.: Globalized inexact proximal Newton-type methods for nonconvex composite
functions. Comput. Optim. Appl. 78(2), 377–410 (2021). https://doi.org/10.1007/s10589-020-00243-
6

26. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
CMS Books in Mathematics, 2nd edn., p. 619. Springer, New York (2017). https://doi.org/10.1007/
978-3-319-48311-5

27. Bello-Cruz, Y., Li, G., Nghia, T.: Quadratic growth conditions and uniqueness of optimal solution to
lasso. J. Optim. Theory Appl. 194(1), 167–190 (2022). https://doi.org/10.1007/s10957-022-02013-2

28. Salzo, S.: The variable metric forward–backward splitting algorithm under mild differentiability
assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017). https://doi.org/10.1137/16M1073741

29. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward-backward
method is actually faster than 1/k2. SIAM J. Optim. 26(3), 1824–1834 (2016). https://doi.org/10.
1137/15M1046095

30. Clason, C., Valkonen, T.: Introduction to Nonsmooth Analysis and Optimization (2020). https://arxiv.
org/abs/2001.00216

31. Banert, S., Boţ, R.I.: A general double-proximal gradient algorithm for dc programming. Math. Pro-
gram. 178(1–2), 301–326 (2019). https://doi.org/10.1007/s10107-018-1292-2

32. Baraldi, R.J., Kouri, D.P.: A proximal trust-region method for nonsmooth optimization with inexact
function and gradient evaluations. Math. Program. 201(1–2), 559–598 (2023). https://doi.org/10.1007/
s10107-022-01915-3

33. Baraldi, R.J., Kouri, D.P.: Local convergence analysis of an inexact trust-region method for nonsmooth
optimization. Optim. Lett. 18(3), 663–680 (2024). https://doi.org/10.1007/s11590-023-02092-8

34. Azmi, B., Kunisch, K.: Analysis of the Barzilai–Borwein step-sizes for problems in Hilbert spaces. J.
Optim. Theory Appl. 185(3), 819–844 (2020). https://doi.org/10.1007/s10957-020-01677-y

35. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148
(1988). https://doi.org/10.1093/imanum/8.1.141

36. Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton’s method.
SIAM J. Numer. Anal. 23(4), 707–716 (1986). https://doi.org/10.1137/0723046

37. Wright, S.J., Nowak, R.D., Figueiredo, M.: Sparse reconstruction by separable approximation. IEEE
Trans. Signal Process. 57(7), 2479–2493 (2009). https://doi.org/10.1109/TSP.2009.2016892

38. Hager, W.W., Phan, D.T., Zhang, H.: Gradient-based methods for sparse recovery. SIAM J. Imaging
Sci. 4(1), 146–165 (2011). https://doi.org/10.1137/090775063

39. Kanzow, C., Mehlitz, P.: Convergence properties of monotone and nonmonotone proximal gradient
methods revisited. J. Optim. Theory Appl. 195(2), 624–646 (2022). https://doi.org/10.1007/s10957-
022-02101-3

40. Azmi, B., Kunisch, K.: On the convergence and mesh-independent property of the Barzilai–Borwein
method for PDE-constrained optimization. IMA J. Numer. Anal. 42(4), 2984–3021 (2022). https://doi.
org/10.1093/imanum/drab056

41. Allgower, E.L., Böhmer,K., Potra, F.A., Rheinboldt,W.C.:Amesh-independence principle for operator
equations and their discretizations. SIAM J. Numer. Anal. 23(1), 160–169 (1986). https://doi.org/10.
1137/0723011

42. Heinkenschloss, M.: Mesh independence for nonlinear least squares problems with norm constraints.
SIAM J. Optim. 3(1), 81–117 (1993). https://doi.org/10.1137/0803005

43. Hintermüller, M., Ulbrich, M.: A mesh-independence result for semismooth Newton methods. Math.
Program. 101(1), 151–184 (2004). https://doi.org/10.1007/s10107-004-0540-9

123

https://doi.org/10.1137/19M1264783
https://doi.org/10.46298/jnsao-2023-10290
https://doi.org/10.1137/15M1019325
https://doi.org/10.1007/s10107-016-0997-3
https://doi.org/10.1137/16M1080240
https://doi.org/10.1007/s10589-020-00243-6
https://doi.org/10.1007/s10589-020-00243-6
https://doi.org/10.1007/978-3-319-48311-5
https://doi.org/10.1007/978-3-319-48311-5
https://doi.org/10.1007/s10957-022-02013-2
https://doi.org/10.1137/16M1073741
https://doi.org/10.1137/15M1046095
https://doi.org/10.1137/15M1046095
https://arxiv.org/abs/2001.00216
https://arxiv.org/abs/2001.00216
https://doi.org/10.1007/s10107-018-1292-2
https://doi.org/10.1007/s10107-022-01915-3
https://doi.org/10.1007/s10107-022-01915-3
https://doi.org/10.1007/s11590-023-02092-8
https://doi.org/10.1007/s10957-020-01677-y
https://doi.org/10.1093/imanum/8.1.141
https://doi.org/10.1137/0723046
https://doi.org/10.1109/TSP.2009.2016892
https://doi.org/10.1137/090775063
https://doi.org/10.1007/s10957-022-02101-3
https://doi.org/10.1007/s10957-022-02101-3
https://doi.org/10.1093/imanum/drab056
https://doi.org/10.1093/imanum/drab056
https://doi.org/10.1137/0723011
https://doi.org/10.1137/0723011
https://doi.org/10.1137/0803005
https://doi.org/10.1007/s10107-004-0540-9


On the forward–backward method with nonmonotone… 1307

44. Kelley, C.T., Sachs, E.W.: Quasi-Newton methods and unconstrained optimal control problems. SIAM
J. Control. Optim. 25(6), 1503–1516 (1987). https://doi.org/10.1137/0325083

45. Kelley, C.T., Sachs, E.W.: Approximate quasi-Newton methods. Math. Program. 48(1), 41–70 (1990)
46. Volkwein, S.: Mesh-independence for an augmented Lagrangian-SQPmethod in Hilbert spaces. SIAM

J. Control. Optim. 38(3), 767–785 (2000). https://doi.org/10.1137/S0363012998334468
47. Allgower, E.L., Böhmer, K.: Application of the mesh independence principle to mesh refinement

strategies. SIAM J. Numer. Anal. 24(6), 1335–1351 (1987). https://doi.org/10.1137/0724086
48. Schirotzek, W.: Nonsmooth Analysis. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-

540-71333-3
49. Hüther, B.: Global convergence of algorithms with nonmonotone line search strategy in unconstrained

optimization. Results Math. 41(3–4), 320–333 (2002). https://doi.org/10.1007/BF03322774
50. Cartis, C., Sampaio, P.R., Toint, P.L.: Worst-case evaluation complexity of non-monotone gradient-

related algorithms for unconstrained optimization. Optimization 64(5), 1349–1361 (2015). https://doi.
org/10.1080/02331934.2013.869809

51. Alber, Y.I., Iusem, A.N., Solodov, M.V.: On the projected subgradient method for nonsmooth con-
vex optimization in a Hilbert space. Math. Program. 81(1), 23–35 (1998). https://doi.org/10.1007/
BF01584842

52. Combettes, P.L.: Quasi-Fejérian analysis of some optimization algorithms. In: Inherently Parallel
Algorithms in Feasibility andOptimization andTheirApplications (Haifa, 2000). Studies inComputing
Mathematics, vol. 8, pp. 115–152. North-Holland, Amsterdam (2001). https://doi.org/10.1016/S1570-
579X(01)80010-0

53. Dontchev, A.L., Zolezzi, T.: Well-posed Optimization Problems. Lecture Notes in Mathematics, vol.
1543, p. 421. Springer, Berlin (1993). https://doi.org/10.1007/BFb0084195

54. Aragón Artacho, F.J., Geoffroy, M.H.: Characterization of metric regularity of subdifferentials. J.
Convex Anal. 15(2), 365–380 (2008)

55. Artacho, F., Geoffroy, M.H.: Metric subregularity of the convex subdifferential in Banach spaces. J.
Nonlinear Convex Anal. 15(1), 35–47 (2014)

56. Azé, D., Corvellec, J.-N.: Nonlinear local error bounds via a change of metric. J. Fixed Point Theory
Appl. 16(1–2), 351–372 (2014). https://doi.org/10.1007/s11784-015-0220-9

57. Drusvyatskiy, D., Mordukhovich, B.S., Nghia, T.: Second-order growth, tilt stability, and metric reg-
ularity of the subdifferential. J. Convex Anal. 21(4), 1165–1192 (2014)

58. Garrigos, G., Rosasco, L., Villa, S.: Thresholding gradient methods in Hilbert spaces: support identi-
fication and linear convergence. ESAIM Control Optim. Calc. Var. 26, 28–20 (2020). https://doi.org/
10.1051/cocv/2019011

59. Peypouquet, J.: Convex Optimization in Normed Spaces. SpringerBriefs in Optimization. Theory,
Methods and Examples. With a Foreword by Hedy Attouch, p. 124. Springer, Cham (2015). https://
doi.org/10.1007/978-3-319-13710-0

60. Casas, E., Mateos, M., Rösch, A.: Analysis of control problems of nonmontone semilinear elliptic
equations. ESAIMControl Optim. Calc. Var. 26, 80–21 (2020). https://doi.org/10.1051/cocv/2020032

61. Hinze, M., Pinnau, R., Ulbrich, M., Ulbrich, S.: Optimization with PDE constraints. In: Mathematical
Modelling: Theory and Applications, vol. 23, p. 270. Springer, Dordrecht (2009). https://doi.org/10.
1007/978-1-4020-8839-1

62. Kunisch, K., Rodrigues, S.S.: Global stabilizability to trajectories for the schlögl equation in a sobolev
norm. Discrete Contin. Dyn. Syst. (2023). https://doi.org/10.3934/dcds.2023017

63. Hinze, M., Kunisch, K.: Second order methods for optimal control of time-dependent fluid flow. SIAM
J. Control. Optim. 40(3), 925–946 (2001). https://doi.org/10.1137/S0363012999361810

64. Casas, E.: A review on sparse solutions in optimal control of partial differential equations. SeMA J.
74(3), 319–344 (2017). https://doi.org/10.1007/s40324-017-0121-5

65. Casas, E.: Second order analysis for bang-bang control problems of PDEs. SIAM J. Control. Optim.
50(4), 2355–2372 (2012). https://doi.org/10.1137/120862892

66. Alnæs, M., Blechta, J., Hake, J., Johansson, A., Kehlet, B., Logg, A., Richardson, C., Ring, J., Rognes,
M.E., Wells, G.N.: The fenics project version 1.5. Arch. Numer. Softw. 3(100) (2015)

67. Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E.,
Peterson, P.,Weckesser,W., Bright, J., et al.: Scipy 1.0: fundamental algorithms for scientific computing
in python. Nat. Methods 17(3), 261–272 (2020)

123

https://doi.org/10.1137/0325083
https://doi.org/10.1137/S0363012998334468
https://doi.org/10.1137/0724086
https://doi.org/10.1007/978-3-540-71333-3
https://doi.org/10.1007/978-3-540-71333-3
https://doi.org/10.1007/BF03322774
https://doi.org/10.1080/02331934.2013.869809
https://doi.org/10.1080/02331934.2013.869809
https://doi.org/10.1007/BF01584842
https://doi.org/10.1007/BF01584842
https://doi.org/10.1016/S1570-579X(01)80010-0
https://doi.org/10.1016/S1570-579X(01)80010-0
https://doi.org/10.1007/BFb0084195
https://doi.org/10.1007/s11784-015-0220-9
https://doi.org/10.1051/cocv/2019011
https://doi.org/10.1051/cocv/2019011
https://doi.org/10.1007/978-3-319-13710-0
https://doi.org/10.1007/978-3-319-13710-0
https://doi.org/10.1051/cocv/2020032
https://doi.org/10.1007/978-1-4020-8839-1
https://doi.org/10.1007/978-1-4020-8839-1
https://doi.org/10.3934/dcds.2023017
https://doi.org/10.1137/S0363012999361810
https://doi.org/10.1007/s40324-017-0121-5
https://doi.org/10.1137/120862892


1308 B. Azmi, M. Bernreuther

68. He, Y.: Two-level method based on finite element and Crank–Nicolson extrapolation for the time-
dependent Navier–Stokes equations. SIAM J. Numer. Anal. 41(4), 1263–1285 (2003). https://doi.org/
10.1137/S0036142901385659

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

123

https://doi.org/10.1137/S0036142901385659
https://doi.org/10.1137/S0036142901385659

	On the forward–backward method with nonmonotone linesearch for infinite-dimensional nonsmooth nonconvex problems
	Abstract
	1 Introduction
	Contributions
	Outline of the paper
	Notation
	2 Problem formulation and algorithm
	3 Convergence and complexity analysis
	3.1 General case
	3.2 Convex case
	3.3 Convergence under quadratic growth-type conditions
	3.4 On relaxing the global Lipschitz continuity of  mathcalF for problems governed by PDEs


	4 Application to PDE-constrained optimization
	4.1 Elliptic problem
	4.2 Parabolic problem
	4.3 Discussion on the quadratic growth condition

	5 Numerical experiments
	Conclusion
	Appendix A Proofs
	A. 1 Proof of Theorem 7
	A. 2 Proof of Lemma 8
	Acknowledgements
	References