arXiv:1311.4296v1 [cs.LG] 18 Nov 2013 Reflection methods for user-friendly submodular optimization Stefanie Jegelka UC Berkeley Berkeley, CA, USA Francis Bach INRIA - ENS Paris, France Suvrit Sra MPI for Intelligent Systems T ̈ ubingen, Germany Abstract Recently, it has become evident that submodularity naturally captures widely oc- curring concepts in machine learning, signal processing and computer vision. Con- sequently, there is need for efficient optimization procedures for submodular func- tions, especially for minimization problems. While general submodular minimiza- tion is challenging, we propose a new method that exploits existing decomposabil- ity of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parame- ter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruc- tion. In our experiments, we illustrate the benefits of our method on two image segmentation tasks. 1 Introduction Submodularity is a rich combinatorial concept that expresses widely occurring phenomena such as diminishing marginal costs and preferences for grouping. A set function F : 2 V → R on a set V is submodular if for all subsets S, T ⊆ V , we have F ( S ∪ T ) + F ( S ∩ T ) ≤ F ( S ) + F ( T ) . Submodular functions underlie the goals of numerous problems in machine learning, computer vi- sion and signal processing [1]. Several problems in these areas can be phrased as submodular op- timization tasks: notable examples include graph cut-based image segmentation [9], sensor place- ment [32], or document summarization [34]. A longer list of examples may be found in [1]. The theoretical complexity of submodular optimization is well-understood: unconstrained minimiza- tion of submodular set functions is polynomial-time [21] while submodular maximization is NP- hard. Algorithmically, however, the picture is different. Generic submodular maximization admits efficient algorithms that can attain approximate optima with global guarantees; these algorithms are typically based on local search techniques [18, 38]. In contrast, although polynomial-time solvable, submodular function minimization (SFM) which seeks to solve min S ⊆ V F ( S ) , (1) poses substantial algorithmic difficulties. This is partly due to the fact that one is commonly inter- ested in an exact solution (or an arbitrarily close approximation thereof), and “polynomial-time” is not necessarily equivalent to “practically fast”. Submodular minimization algorithms may be obtained from two main perspectives: combinatorial and continuous . Combinatorial algorithms for SFM typically use close connections to matroid and 1 maximum flow methods; the currently theoretically fastest combinatorial algorithm for SFM scales as O ( n 6 + n 5 τ ) , where τ is the time to evaluate the function oracle [40] (for an overview of other algorithms, see e.g., [36]). These combinatorial algorithms are typically nontrivial to implement. Continuous methods offer an alternative by instead minimizing a convex extension . 
This idea ex- ploits the fundamental connection between a submodular function F and its Lov ́ asz extension f [35], which is continuous and convex. The SFM problem (1) is then equivalent to min x ∈ [0 , 1] n f ( x ) . (2) The Lov ́ asz extension f is nonsmooth, so we might have to resort to subgradient methods. While a fundamental result of Edmonds [17] demonstrates that a subgradient of f can be computed in O ( n log n ) time, subgradient methods can be sensitive to choices of the step size, and can be slow. They theoretically converge at a rate of O (1 / √ t ) (after t iterations). The “smoothing technique” of [39] does not in general apply here because computing a smoothed gradient is equivalent to solving the submodular minimization problem. We discuss this issue further in Section 2. An alternative to minimizing the Lov ́ asz extension directly on [0 , 1] n is to consider a slightly modi- fied convex problem. Specifically, the exact solution of the discrete problem min S ⊆ V F ( S ) and of its nonsmooth convex relaxation min x ∈ [0 , 1] n f ( x ) may be found as a level set S 0 = { k | x ∗ k > 0 } of the unique point x ∗ that minimizes the strongly convex function [1, 12]: f ( x ) + 1 2 ‖ x ‖ 2 . (3) We will refer to the minimization of (3) as the proximal problem due to its close similarity to prox- imity operators used in convex optimization [14]. When F is a cut function, (3) becomes a total variation problem (see, e.g., [11] and references therein) that also occurs in other regularization problems [1]. Two noteworthy points about (3) are: (i) addition of the strongly convex component 1 2 ‖ x ‖ 2 ; (ii) the ensuing removal of the box-constraints x ∈ [0 , 1] n . These changes allow us to consider a convex dual which is amenable to smooth optimization techniques. Typical approaches to generic SFM include Frank-Wolfe methods [19] that have cheap iterations and O (1 /t ) convergence, but can be quite slow in practice (Section 5); or the minimum-norm- point/Fujishige-Wolfe algorithm [22] that has expensive iterations but finite convergence. Other recent methods are approximate [26]. In contrast to several iterative methods based on convex relax- ations, we seek to obtain exact discrete solutions. To the best of our knowledge, all generic algorithms that use only submodularity are several orders of magnitude slower than specialized algorithms when they exist (e.g., for graph cuts). However, the submodular function is not always generic and given via a black-box, but has known structure. Fol- lowing [29, 31, 41, 44], we make the assumption that F ( S ) = ∑ r i =1 F i ( S ) is a sum of sufficiently “simple” functions (see Sec. 3). This structure allows the use of (parallelizable) dual decomposition techniques for the problem in Eq. (2), with [13, 41] or without [31] Nesterov’s smoothing technique, or with direct smoothing [44] techniques. But existing approaches typically have two drawbacks: (a) they use smoothing or step-size parameters whose selection may be critical and quite tedious; and (b) they still exhibit slow convergence (see Section 5). These drawbacks arise from working with formulation (2). Our main insight is that, despite seem- ingly counter-intuitive, the proximal problem (3) offers a much more user-friendly tool for solving (1) than its natural convex counterpart (2), both in implementation and running time. We approach Problem (3) via its dual. 
This allows decomposition techniques which combine well with orthogonal projection and reflection methods that (a) exhibit faster convergence, (b) are easily parallelizable, (c) require no extra hyperparameters, and (d) are extremely easy to implement. The main three algorithms that we consider are: (i) dual block-coordinate descent (equivalently, primal-dual proximal-Dykstra), which was already shown to be extremely efficient for total varia- tion problems [3] that are special cases of Problem (3); (ii) Douglas-Rachford splitting using the careful variant of [5], which for our formulation (Section 4.2) requires no hyper-parameters; and (iii) accelerated projected gradient [6]. We will see these alternative algorithms can offer speedups beyond known efficiencies. Our observations have two implications: first, from the viewpoint of solving Problem (3), they offers speedups for often occurring denoising and reconstruction prob- lems that employ total variation. Second, our experiments suggest that projection and reflection methods can work very well for solving the combinatorial problem (1). 2 In summary, we make the following contributions: (1) In Section 3, we cast the problem of minimizing decomposable submodular functions as an or- thogonal projection problem and show how existing optimization techniques may be brought to bear on this problem, to obtain fast, easy-to-code and easily parallelizable algorithms. In addition, we show examples of classes of functions amenable to our approach. In particular, for simple functions, i.e., those for which minimizing F ( S ) − a ( S ) is easy for all vectors 1 a ∈ R n , the problem in Eq. (3) may be solved in O (log 1 ε ) calls to such minimization routines, to reach a precision ε (Section 2, 3, Appendix B). (2) In Section 5, we demonstrate the empirical gains of using accelerated proximal methods, Douglas-Rachford and block coordinate descent methods over existing approaches: fewer hyperparameters and faster convergence. 2 Review of relevant results from submodular analysis The relevant concepts we review here are the Lov ́ asz extension, base polytopes of submodular func- tions, and relationships between proximal and discrete problems. For more details, see [1, 21]. Lov ́ asz extension and convexity. The power set 2 V may be naturally identified with the vertices of the hypercube, i.e., { 0 , 1 } n . The Lov ́ asz extension f of any set function is defined by linear inter- polation, so that for any S ⊂ V , F ( S ) = f (1 S ) . It may be computed in closed form once the com- ponents of x are sorted: if x σ (1) > · · · > x σ ( n ) , then f ( x ) = ∑ n k =1 x σ ( k ) [ F ( { σ (1) , . . . , σ ( k ) } ) − F ( { σ (1) , . . . , σ ( k − 1) } ) ] [35]. For the graph cut function, f is the total variation. In this paper, we are going to use two important results: (a) if the set function F is submodular, then its Lov ́ asz extension f is convex, and (b) minimizing the set function F is equivalent to minimizing f ( x ) with respect to x ∈ [0 , 1] n . Given x ∈ [0 , 1] n , all of its level sets may be considered and the function may be evaluated (at most n times) to obtain a set S . Moreover, for a submodular function, the Lov ́ asz extension happens to be the support function of the base polytope B ( F ) defined as B ( F ) = { y ∈ R n | ∀ S ⊂ V, y ( S ) 6 F ( S ) and y ( V ) = F ( V ) } , that is f ( x ) = max y ∈ B ( F ) y ⊤ x [17]. 
A maximizer of y ⊤ x (and hence the value of f ( x ) ), may be computed by the “greedy algorithm”, which first sorts the components of w in decreasing order x σ (1) > · · · > x σ ( n ) , and then compute y σ ( k ) = F ( { σ (1) , . . . , σ ( k ) } ) − F ( { σ (1) , . . . , σ ( k − 1) } ) . In other words, a linear function can be maximized over B ( F ) in time O ( n log n + nτ ) (note that the term nτ may be improved in many special cases). This is crucial for exploiting convex duality. Dual of discrete problem. We may derive a dual problem to the discrete problem in Eq. (1) and the convex nonsmooth problem in Eq. (2), as follows: min S ⊆ V F ( S ) = min x ∈ [0 , 1] n f ( x ) = min x ∈ [0 , 1] n max y ∈ B ( F ) y ⊤ x = max y ∈ B ( F ) min x ∈ [0 , 1] n y ⊤ x = max y ∈ B ( F ) ( y ) − ( V ) , (4) where ( y ) − = min { y, 0 } applied elementwise. This allows to obtain dual certificates of optimality from any y ∈ B ( F ) and x ∈ [0 , 1] n . Proximal problem. The optimization problem (3), i.e., min x ∈ R n f ( x ) + 1 2 ‖ x ‖ 2 , has intricate re- lations to the SFM problem [12]. Given the unique optimal solution x ∗ of (3), the maximal (resp. minimal) optimizer of the SFM problem is the set S ∗ of nonnegative (resp. positive) elements of x ∗ . More precisely, solving (3) is equivalent to minimizing F ( S ) + μ | S | for all μ ∈ R . A solution S μ ⊆ V is obtained from a solution x ∗ as S ∗ μ = { i | x ∗ i > μ } . Conversely, x ∗ may be obtained from all S ∗ μ as x ∗ k = sup { μ ∈ R | k ∈ S ∗ μ } for all k ∈ V . Moreover, if x is an ε -optimal solution of Eq. (3), then we may construct √ εn -optimal solutions for all S μ [1; Prop. 10.5]. In practice, the duality gap of the discrete problem is usually much lower than that of the proximal version of the same problem, as we will see in Section 5. Note that the problem in Eq. (3) provides much more information than Eq. (2), as all μ -parameterized discrete problems are solved. The dual problem of Problem (3) reads as follows: min x ∈ R n f ( x ) + 1 2 ‖ x ‖ 2 2 = min x ∈ R n max y ∈ B ( F ) y ⊤ x + 1 2 ‖ x ‖ 2 2 = max y ∈ B ( F ) min x ∈ R n y ⊤ x + 1 2 ‖ x ‖ 2 2 = max y ∈ B ( F ) − 1 2 ‖ y ‖ 2 2 , where primal and dual variables are linked as x = − y . Observe that this dual problem is equivalent to finding the orthogonal projection of 0 onto B ( F ) . 1 Every vector a ∈ R n may be viewed as a modular (linear) set function: a ( S ) , ∑ i ∈ S a ( i ) . 3 Divide-and-conquer strategies for the proximal problems. Given a solution x ∗ of the proximal problem, we have seen how to get S ∗ μ for any μ by simply thresholding x ∗ at μ . Conversely, one can recover x ∗ exactly from at most n well-chosen values of μ . A known divide-and-conquer strategy [21, 23] hinges upon the fact that for any μ , one can easily see which components of x ∗ are greater or smaller than μ by computing S ∗ μ . The resulting algorithm makes O ( n ) calls to the submodular function oracle. In Appendix B, we extend an alternative approach by Tarjan et al. [45] for cuts to general submodular functions and obtain a solution to (3) up to precision ε in O (min { n, log 1 ε } ) iterations. This result is particularly useful if our function F is a sum of functions for each of which by itself the SFM problem is easy. Beyond squared ℓ 2 -norms, our algorithm equally applies to computing all minimizers of f ( x ) + ∑ p j =1 h j ( x j ) for arbitrary smooth strictly convex functions h j , j = 1 , . . . , n . 
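Before moving on to decompositions, the greedy algorithm of Section 2 is simple enough to state as code. The following is a minimal sketch (ours, not the authors' implementation), assuming the submodular function is given as a black-box set-function oracle F with F(∅) = 0; it returns both the value f(x) of the Lovász extension and a maximizer y of y⊤x over B(F), i.e., a subgradient of f at x.

```python
import numpy as np

def greedy_lovasz(F, x):
    """Return (f(x), y) with y in argmax_{y in B(F)} y^T x (Edmonds' greedy algorithm)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)            # indices sorted so that x is in decreasing order
    y = np.zeros(len(x))
    S = set()
    prev = F(frozenset(S))            # F(empty set), assumed to be 0
    for k in order:
        S.add(int(k))
        cur = F(frozenset(S))
        y[int(k)] = cur - prev        # marginal value F(S_k) - F(S_{k-1})
        prev = cur
    return float(y @ x), y

# Toy usage with a concave function of cardinality (hence submodular):
# F = lambda S: len(S) ** 0.5
# value, y = greedy_lovasz(F, [0.3, -0.2, 0.8])
```

The sort dominates the cost, giving the O(n log n + nτ) running time quoted above; the returned y is exactly the subgradient used by the primal subgradient method of Section 3.1.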
3 Decomposition of submodular functions Following [29, 31, 41, 44], we assume that our function F may be decomposed as the sum F ( S ) = ∑ r j =1 F j ( S ) of r “simple” functions. In this paper, by “simple” we mean functions G for which G ( S ) − a ( S ) can be minimized efficiently for all vectors a ∈ R n (more precisely, we require that S 7 → G ( S ∪ T ) − a ( S ) can be minimized efficiently over all subsets of V \ T , for any T ⊆ V and a ). Efficiency may arise from the functional form of G , or from the fact that G has small support. For such functions, Problems (1) and (3) become min S ⊆ V ∑ r j =1 F j ( S ) = min x ∈ [0 , 1] n ∑ r j =1 f j ( x ) min x ∈ R n ∑ r j =1 f j ( x ) + 1 2 ‖ x ‖ 2 2 . (5) The key to the algorithms presented here is to be able to minimize 1 2 ‖ x − z ‖ 2 2 + f j ( x ) , or equivalently, to orthogonally project z onto B ( F j ) : min 1 2 ‖ y − z ‖ 2 2 subject to y ∈ B ( F j ) . We next sketch some examples of functions F and their decompositions into simple functions F j . As shown at the end of Section 2, projecting onto B ( F j ) is easy as soon as the corresponding submodular minimization problems are easy. Here we outline some cases for which specialized fast algorithms are known. Graph cuts. A widely used class of submodular functions are graph cuts. Graphs may be decom- posed into substructures such as trees, simple paths or single edges. Message passing algorithms apply to trees, while the proximal problem for paths is very efficiently solved by [3]. For single edges, it is solvable in closed form. Tree decompositions are common in graphical models, whereas path decompositions are frequently used for TV problems [3]. Concave functions. Another important class of submodular functions is that of concave functions of cardinality, i.e., F j ( S ) = h ( | S | ) for a concave function h . Problem (3) for such functions may be solved in O ( n log n ) time (see [20] and Appendix B). Functions of this class have been used in [26, 28, 44]. Such functions also include covering functions [44]. Hierarchical functions. Here, the ground set corresponds to the leaves of a rooted, undirected tree. Each node has a weight, and the cost of a set of nodes S ⊆ V is the sum of the weights of all nodes in the smallest subtree (including the root) that spans S . This class of functions too admits to solve the proximal problem in O ( n log n ) time [24, 25]. Related tree functions have been considered in [27], where the elements v of the ground set are arranged in a tree of height d and each have a weight w ( v ) . Let desc( v ) be the set of descendants of v in the tree. Then F ( S ) = ∑ v ∈ V w ( v )1[dec( v ) ∩ S 6 = ∅ ] . Jenatton et al. [27] show how to solve the proximal problem for such a function in time O ( nd ) . Small support. Any general, potentially slower algorithm such as the minimum-norm-point algo- rithm can be applied if the support of each F j is only a small subset of the ground set. 3.1 Dual decomposition of the nonsmooth problem We first review existing dual decomposition techniques for the nonsmooth problem (1). We always assume that F = ∑ r j =1 F j , and define H r := ∏ r j =1 R n ≃ R n × r . We follow [31] to derive a dual formulation (see Appendix A): Lemma 1. The dual of Problem (1) may be written in terms of variables λ 1 , . . . , λ r ∈ R n as max ∑ r j =1 g j ( λ j ) s.t. λ ∈ { ( λ 1 , . . . , λ r ) ∈ H r | ∑ r j =1 λ j = 0 } (6) 4 where g j ( λ j ) = min S ⊂ V F j ( S ) − λ j ( S ) is a nonsmooth concave function. 
The dual is the maximization of a nonsmooth concave function over a convex set, onto which it is easy to project: the projection of a vector y has j -th block equal to y j − 1 r ∑ r k =1 y k . Moreover, in our setup, functions g j and their subgradients may be computed efficiently through SFM. We consider several existing alternatives for the minimization of f ( x ) on x ∈ [0 , 1] n , most of which use Lemma 1. Computing subgradients for any f j means calling the greedy algorithm, which runs in time O ( n log n ) . All of the following algorithms require the tuning of an appropriate step size. Primal subgradient descent (primal-sgd) : Agnostic to any decomposition properties, we may apply a standard simple subgradient method to f . A subgradient of f may be obtained from the subgradients of the components f j . This algorithm converges at rate O (1 / √ t ) . Dual subgradient descent (dual-sgd) [31]: Applying a subgradient method to the nonsmooth dual in Lemma 1 leads to a convergence rate of O (1 / √ t ) . Computing a subgradient requires minimizing the submodular functions F j individually. In simulations, following [31], we consider a step-size rule similar to Polyak’s rule (dual-sgd-P) [7], as well as a decaying step-size (dual-sgd-F), and use discrete optimization for all F j . Primal smoothing (primal-smooth) [44]: The nonsmooth primal may be smoothed in several ways by smoothing the f j individually; one example is ̃ f ε j ( x j ) = max y j ∈ B ( F j ) y ⊤ j x j − ε 2 ‖ y j ‖ 2 . This leads to a function that is (1 /ε ) -smooth. Computing ̃ f ε j means solving the proximal problem for F j . The convergence rate is O (1 /t ) , but, apart from the step size which may be set relatively easily, the smoothing constant ε needs to be defined. Dual smoothing (dual-smooth) : Instead of the primal, the dual (6) may be smoothed, e.g., by entropy [10, 41] applied to each g j as ̃ g ε j ( λ j ) = min x ∈ [0 , 1] n f j ( x ) + εh ( x ) where h ( x ) is a negative entropy. Again, the convergence rate is O (1 /t ) but there are two free parameters (in particular the smoothing constant ε which is hard to tune). This method too requires solving proximal problems for all F j in each iteration. Dual smoothing with entropy also admits coordinate descent methods [37] that exploit the decom- position, but we do not compare to those here. 3.2 Dual decomposition methods for proximal problems We may also consider Eq. (3) and first derive a dual problem using the same technique as in Sec- tion 3.1. Lemma 2 (proved in Appendix A) formally presents our dual formulation as a best approx- imation problem. The primal variable can be recovered as x = − ∑ j y j . Lemma 2. The dual of Eq. (3) may be written as the best approximation problem min λ,y ‖ y − λ ‖ 2 2 s.t. λ ∈ { ( λ 1 , . . . , λ r ) ∈ H r | ∑ r j =1 λ j = 0 } , y ∈ ∏ r j =1 B ( F j ) . (7) We can actually eliminate the λ j and obtain the simpler looking dual problem max y − 1 2 ∥ ∥ ∥∑ r j =1 y j ∥ ∥ ∥ 2 2 s.t. y j ∈ B ( F j ) , j ∈ { 1 , . . . , r } (8) Such a dual was also used in [43]. In Section 5, we will see the effect of solving one of these duals or the other. For the simpler dual (8) the case r = 2 is of special interest; it reads max y 1 ∈ B ( F 1 ) , y 2 ∈ B ( F 2 ) − 1 2 ‖ y 1 + y 2 ‖ 2 2 ⇐⇒ min y 1 ∈ B ( F 1 ) , − y 2 ∈− B ( F 2 ) ‖ y 1 − ( − y 2 ) ‖ 2 . 
(9) We write Problem (9) in this suggestive form to highlight its key geometric structure: it is, like (7), a best approximation problem : i.e., the problem of finding the closest point between the polytopes B ( F 1 ) and − B ( F 2 ) . Notice, however, that (7) is very different from (9)—the former operates in a product space while the latter does not, a difference that can have impact in practice (see Section 5). We are now ready to present algorithms that exploit our dual formulations. 4 Algorithms We describe a few competing methods for solving our smooth dual formulations. We describe the details for the special 2-block case (9); the same arguments apply to the block dual from Lemma 2. 5 4.1 Block coordinate descent or proximal-Dykstra Perhaps the simplest approach to solving (9) (viewed as a minimization problem) is to use a block coordinate descent (BCD) procedure, which in this case performs the alternating projections: y k +1 1 ← argmin y 1 ∈ B ( F 1 ) ‖ y 1 − ( − y k 2 ) ‖ 2 2 ; y k +1 2 ← argmin y 2 ∈ B ( F 2 ) ‖ y 2 − ( − y k +1 1 ) ‖ 2 . (10) The iterations for solving (8) are analogous. This BCD method (applied to (9)) is equivalent to applying the so-called proximal-Dykstra method [14] to the primal problem. This may be seen by comparing the iterates. Notice that the BCD iteration (10) is nothing but alternating projections onto the convex polyhedra B ( F 1 ) and B ( F 2 ) . There exists a large body of literature studying method of alternating projections—we refer the interested reader to the monograph [15] for further details. However, despite its attractive simplicity, it is known that BCD (in its alternating projections form), can converge arbitrarily slowly [5] depending on the relative orientation of the convex sets onto which one projects. Thus, we turn to a potentially more effective method. 4.2 Douglas-Rachford splitting The Douglas-Rachford (DR) splitting method [16] includes algorithms like ADMM as a special case [14]. It avoids the slowdowns alluded to above by replacing alternating projections with alter- nating “reflections”. Formally, DR applies to convex problems of the form [4, 14] min x φ 1 ( x ) + φ 2 ( x ) , (11) subject to the qualification ri(dom φ 1 ) ∩ ri(dom φ 2 ) 6 = ∅ . To solve (11), DR starts with some z 0 , and performs the three-step iteration (for k ≥ 0 ): 1 . x k = prox φ 2 ( z k ); 2 . v k = prox φ 1 (2 x k − z k ); 3 . z k +1 = z k + γ k ( v k − z k ) , (12) where γ k ∈ [0 , 2] is a sequence of scalars that satisfy ∑ k γ k (2 − γ k ) = ∞ . The sequence { x k } produced by iteration (12) can be shown to converge to a solution of (11) [4; Thm. 25.6]. Introducing the reflection operator R φ := 2 prox φ − I , and setting γ k = 1 , the DR iteration (12) may be written in a more symmetric form as x k = prox φ 2 ( z k ) , z k +1 = 1 2 [ R φ 1 R φ 2 + I] z k , k ≥ 0 . (13) Applying DR to the duals (7) or (9), requires first putting them in the form (11), either by introducing extra variables or by going back to the primal, which is unnecessary. This is where the special structure of our dual problem proves crucial, a recognition that is subtle yet remarkably important. Instead of applying DR to (9), consider the closely related problem min y δ 1 ( y ) + δ − 2 ( y ) , (14) where δ 1 , δ − 2 are indicator functions for B ( F 1 ) and − B ( F 2 ) , respectively. Applying DR directly to (14) does not work because usually ri(dom δ 1 ) ∩ ri(dom δ 2 ) = ∅ . Indeed, applying DR to (14) generates iterates that diverge to infinity [5; Thm. 3.13(ii)]. 
Fortunately, even though the DR iterates for (14) may diverge, Bauschke et al. [5] show how to extract convergent sequences from these iterates, which actually solve the corresponding best approximation problem; for us this is nothing but the dual (9) that we wanted to solve in the first place. Theorem 3, which is a simplified version of [5; Thm. 3.13], formalizes the above discussion.

Theorem 3. [5] Let A and B be nonempty polyhedral convex sets. Let Π_A (Π_B) denote orthogonal projection onto A (B), and let R_A := 2Π_A − I (similarly R_B) be the corresponding reflection operator. Let {z^k} be the sequence generated by the DR method (13) applied to (14). If A ∩ B ≠ ∅, then {z^k}_{k≥0} converges weakly to a fixed point of the operator T := (1/2)[R_A R_B + I]; otherwise ‖z^k‖_2 → ∞. The sequences {x^k} and {Π_A Π_B z^k} are bounded; the weak cluster points of either of the two sequences

{(Π_A R_B z^k, x^k)}_{k≥0} and {(Π_A x^k, x^k)}_{k≥0}, (15)

are solutions of the best approximation problem min_{a,b} ‖a − b‖ such that a ∈ A and b ∈ B.

The key consequence of Theorem 3 is that we can apply DR with impunity to (14), and extract from its iterates the optimal solution to problem (9) (from which recovering the primal is trivial). The most important feature of solving the dual (9) in this way is that absolutely no step-size tuning is required, making the method very practical and user-friendly (see also Appendix D).

[Figure 1 shows segmentation results for the slowest and fastest projection method (pBCD at iterations 1 and 7, DR at iterations 1 and 4), with duality gaps:
                    pBCD, iter 1   pBCD, iter 7   DR, iter 1    DR, iter 4
  smooth gap ν_s    3.4 · 10^6     4.4 · 10^5     4.17 · 10^5   8.05 · 10^4
  discrete gap ν_d  4.6 · 10^3     5.5 · 10^2     6.6 · 10^3    5.9 · 10^−1 ]
Figure 1: Segmentation results for the slowest and fastest projection method, with smooth (ν_s) and discrete (ν_d) duality gaps. Note how the background noise disappears only for small duality gaps.

5 Experiments

We empirically compare the proposed projection methods² to the (smoothed) subgradient methods discussed in Section 3.1. For solving the proximal problem, we apply block coordinate descent (BCD) and Douglas-Rachford (DR) to Problem (8) if applicable, and also to (7) (BCD-para, DR-para). In addition, we use acceleration to solve (8) or (9) [6]. The main iteration cost of all methods except for the primal subgradient method is the orthogonal projection onto the polytopes B(F_j), and therefore the number of iterations is a suitable criterion for comparisons. The primal subgradient method uses the greedy algorithm in each iteration, which runs in O(n log n). However, as we will see, its convergence is so slow that it counteracts any benefit that may arise from not using projections. We do not include Frank-Wolfe methods here, since Frank-Wolfe is equivalent to subgradient descent on the primal and converges correspondingly slowly.

As benchmark problems, we use (i) graph cut problems for segmentation, or MAP inference in a 4-neighborhood grid-structured MRF, and (ii) concave functions similar to those used in [44], but together with graph cut functions. The segmentation problems (i) are set up in a fairly standard way on a 4-neighbor grid graph, with unary potentials derived from Gaussian Mixture Models of color features. The weight of graph edge (i, j) is a function of exp(−‖y_i − y_j‖^2), where y_i is the RGB color vector of pixel i. The functions in (i) decompose as sums over vertical and horizontal paths. All horizontal paths are independent and can be solved together in parallel, and similarly all vertical paths.
The functions in (ii) are constructed by extracting regions R_j via superpixels [33] and, for each R_j, defining the function F_j(S) = |S| |R_j \ S|. We use 200 and 500 regions. The problems have size 640 × 427. Hence, for (i) we have r = 640 + 427 (but solve it as r = 2) and for (ii) r = 640 + 427 + 500 (solved as r = 3).

For algorithms working with formulation (7), we compute an improved smooth duality gap of a current primal solution x = −Σ_j y_j as follows: find y′ ∈ argmax_{y ∈ B(F)} x⊤y (so that f(x) = x⊤y′), and find an improved x′ by minimizing z⊤y′ + (1/2)‖z‖^2 over z subject to the constraint that z has the same ordering as x [1]. The constraint ensures that (x′)⊤y′ = f(x′). This is an isotonic regression problem and can be solved in time O(n) using the "pool adjacent violators" algorithm [1] (a code sketch of this step appears below, after the experimental results). The gap is then f(x′) + (1/2)‖x′‖^2 − (−(1/2)‖y′‖^2). For computing the discrete gap, we find the best level set S_i of x and, using y′ = −x ∈ B(F), compute min_i F(S_i) − (y′)_−(V).

Two functions (r = 2). Figure 2 shows the duality gaps for the discrete and smooth (where applicable) problems for two instances of segmentation problems. The algorithms working with the proximal problems are much faster than the ones directly solving the nonsmooth problem. In particular, DR converges extremely fast, faster even than BCD, which is known to be a state-of-the-art algorithm for this problem [3]. This, in itself, is a new insight for solving TV problems. We also see that the discrete gap shrinks faster than the smooth gap, i.e., obtaining the optimal discrete solution does not require solving the smooth problem to extremely high accuracy. Figure 1 illustrates example results for different gaps.

More functions (r > 2). Figure 3 shows example results for four problems of sums of concave and cut functions. Here, we can only run DR-para. Overall, BCD, DR-para and the accelerated gradient method perform very well.
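As an aside on the gap computation described earlier in this section: the ordered minimization of z⊤y′ + (1/2)‖z‖^2 is the isotonic regression of −y′ under the ordering induced by x. Below is a minimal pool-adjacent-violators sketch (ours, not the authors' implementation); the function names are our own.

```python
import numpy as np

def pav_nonincreasing(v):
    """Pool adjacent violators: argmin_z ||z - v||^2 s.t. z[0] >= z[1] >= ...  (O(n))."""
    vals, sizes = [], []
    for t in np.asarray(v, dtype=float):
        vals.append(t); sizes.append(1)
        while len(vals) > 1 and vals[-2] < vals[-1]:   # pool blocks that violate monotonicity
            total = sizes[-2] + sizes[-1]
            vals[-2] = (vals[-2] * sizes[-2] + vals[-1] * sizes[-1]) / total
            sizes[-2] = total
            vals.pop(); sizes.pop()
    return np.concatenate([np.full(s, val) for val, s in zip(vals, sizes)])

def improved_primal(x, y_prime):
    """argmin_z z^T y' + 0.5 ||z||^2 s.t. z has the same ordering as x."""
    x = np.asarray(x, dtype=float)
    y_prime = np.asarray(y_prime, dtype=float)
    order = np.argsort(-x)                              # decreasing order induced by x
    z = np.empty_like(x)
    z[order] = pav_nonincreasing(-y_prime[order])       # isotonic regression of -y'
    return z
```

The improved smooth gap is then f(x′) + (1/2)‖x′‖^2 + (1/2)‖y′‖^2 with x′ = improved_primal(x, y′), exactly as described above.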
² Code and data corresponding to this paper are available at https://sites.google.com/site/mloptstat/drsubmod

[Figure 2 panels: log10(duality gap) versus iteration; legends dual-sgd-P, dual-sgd-F, dual-smooth, primal-smooth, primal-sgd (nonsmooth problem) and grad-accel, BCD, DR, BCD-para, DR-para (smooth problem).]
Figure 2: Comparison of convergence behaviors. Left: discrete duality gaps for various optimization schemes for the nonsmooth problem, from 1 to 1000 iterations. Middle: discrete duality gaps for various optimization schemes for the smooth problem, from 1 to 100 iterations. Right: corresponding continuous duality gaps. From top to bottom: four different images.

[Figure 3 panels: log10(duality gap) versus iteration for dual-sgd-P, DR-para, BCD, BCD-para and grad-accel on four problems.]
Figure 3: Convergence behavior for graph cut plus concave functions.

[Figure 4 panels: speedup factor versus number of cores for 40 iterations of DR, two instances.]
Figure 4: Speedup due to parallel processing for two instances.

            Maxflow   DR 1-thread   DR 2-thread   DR 4-thread
  image 1   0.39      1.61 (4.13)   0.93 (2.39)   0.65 (1.66)
  image 2   0.32      1.74 (5.45)   0.99 (3.10)   0.69 (2.16)
  image 3   0.40      3.45 (8.61)   1.93 (4.82)   1.31 (3.27)
  image 4   0.38      3.38 (8.88)   1.90 (5.00)   1.29 (3.38)
  average   0.37      2.55          1.44          0.98

Table 1: Running times (in seconds) for the optimized C++ Maxflow code of [8, 9, 30] and our DR for graph cut using one or multiple threads. The last row is the average, and the numbers in parentheses indicate the factor relative to the Maxflow time.
Parallel speedups. If we aim for parallel methods, then again DR outperforms BCD. Figure 4 (right) shows the speedup gained from parallel processing for r = 2. Using 8 cores, we obtain a 5-fold speed-up.

Running time compared to graph cuts. Table 1 shows the running times of our DR method (implemented in Matlab/C++) and the Maxflow code of [8, 9, 30] (using the wrapper [2]) for the four graph cut (segmentation) instances above, on a MacBook Air with a 2 GHz Intel Core i7. The running times are averages over 5 repetitions. DR was run for 10, 10, 21, and 20 iterations, respectively. DR is slower than the specialized code by a factor of 2-9. Given that, unlike the combinatorial algorithm, DR solves the full regularization path, is parallelizable, is generic, and extends straightforwardly to a variety of functions, this is remarkable.

In summary, our experiments suggest that projection methods can be extremely useful for solving the combinatorial submodular minimization problem. Of the tested methods, DR, cyclic BCD and accelerated gradient perform very well. For parallelism, applying DR to (9) converges much faster than BCD on the same problem.

6 Conclusion

We have presented a novel approach to submodular function minimization based on the equivalence with a best approximation problem. The use of reflection methods avoids any hyperparameters and reduces the number of iterations significantly, suggesting the suitability of reflection methods for combinatorial problems. Given the natural parallelization abilities of our approach, it would be interesting to perform detailed empirical comparisons with existing parallel implementations of graph cuts (e.g., [42]). Moreover, generalizing the relationships between combinatorial optimization problems and convex problems beyond submodular functions would enable the application of our framework to other common situations such as multiple labels (see, e.g., [31]).

Acknowledgments. This research was in part funded by the Office of Naval Research under contract/grant number N00014-11-1-0688, by NSF CISE Expeditions award CCF-1139158, by DARPA XData Award FA8750-12-2-0331, and the European Research Council (SIERRA project), as well as gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware and Yahoo!. We would like to thank Martin Jaggi, Simon Lacoste-Julien and Mark Schmidt for discussions.

References

[1] F. Bach. Learning with submodular functions: A convex optimization perspective. arXiv preprint arXiv:1111.6453v2, 2013.
[2] S. Bagon. Matlab wrapper for graph cut, 2006.
[3] A. Barbero and S. Sra. Fast Newton-type methods for total variation regularization. In Int. Conference on Machine Learning (ICML), 2011.
[4] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
[5] H. H. Bauschke, P. L. Combettes, and D. R. Luke. Finding best approximation pairs relative to two closed convex sets in Hilbert spaces. Journal of Approximation Theory, 127(2):178-192, 2004.
[6] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[7] D. P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
[8] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans.
Pattern Analysis and Machine Intelligence , 26 (9):1124–1137, 2004. [9] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence , 23(11):1222–1239, 2001. [10] B.Savchynskyy, S.Schmidt, J.H.Kappes, and C.Schn ̈ orr. Efficient MRF energy minimization via adaptive diminishing smoothing. In Conference on Uncertainty in Artificial Intelligence (UAI) , 2012. [11] A. Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision , 20(1):89–97, 2004. [12] A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric maximum flows. Int. Journal of Comp. Vision , 84(3):288–307, 2009. [13] F. Chudak and K. Nagano. Efficient solutions to relaxations of combinatorial problems with submodular penalties via the Lov ́ asz extension and non-smooth convex optimization. In ACM- SIAM Symp. on Discrete Algorithms (SODA) , 2007. [14] P. L. Combettes and J.-C. Pesquet. Proximal Splitting Methods in Signal Processing. In Fixed- Point Algorithms for Inverse Problems in Science and Engineering , pages 185–212. Springer, 2011. [15] F. R. Deutsch. Best Approximation in Inner Product Spaces . Springer Verlag, first edition, 2001. [16] J. Douglas and H. H. Rachford. On the numerical solution of the heat conduction problem in 2 and 3 space variables. Trans. of the American Mathematical Society , 82:421–439, 1956. [17] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In Combinatorial opti- mization - Eureka, you shrink! , pages 11–26. Springer, 2003. [18] U. Feige, V. S. Mirrokni, and J. Vondrak. Maximizing non-monotone submodular functions. SIAM Journal on Computing , 40(4):1133–1153, 2011. [19] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly , 3:95–110, 1956. [20] S. Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research , pages 186–196, 1980. [21] S. Fujishige. Submodular Functions and Optimization . Elsevier, 2005. [22] S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization , 7:3–17, 2011. [23] H. Groenevelt. Two algorithms for maximizing a separable concave function over a polyma- troid feasible region. European Journal of Operational Research , 54(2):227–236, 1991. [24] D.S. Hochbaum and S.-P. Hong. About strongly polynomial time algorithms for quadratic optimization over submodular constraints. Mathematical Programming , pages 269–309, 1995. 10 [25] S. Iwata and N. Zuiki. A network flow approach to cost allocation for rooted trees. Networks , 44:297–301, 2004. [26] S. Jegelka, H. Lin, and J. Bilmes. On fast approximate submodular minimization. In Neural Information Processing Systems (NIPS) , 2011. [27] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research , pages 2297–2334, 2011. [28] P. Kohli, L. Ladick ́ y, and P. Torr. Robust higher order potentials for enforcing label consistency. Int. Journal of Comp. Vision , 82, 2009. [29] V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Applied Mathematics , 160(15), 2012. [30] V. Kolmogorov and R. Zabih. What energy functions can be minimized by graph cuts? IEEE Trans. Pattern Analysis and Machine Intelligence , 26(2):147–159, 2004. [31] N. Komodakis, N. Paragios, and G. 
Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Trans. Pattern Analysis and Machine Intelligence , 33(3):531–552, 2011. [32] A. Krause and C. Guestrin. Submodularity and its applications in optimized information gath- ering. ACM Transactions on Intelligent Systems and Technology , 2(4), 2011. [33] A. Levinshtein, A. Stere, K.N. Kutulakos, D.J. Fleet, S.J. Dickinson, and K. Siddiqi. Tur- bopixels: Fast superpixels using geometric flows. IEEE Trans. Pattern Analysis and Machine Intelligence , 31(12), 2009. [34] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT) , 2011. [35] L. Lov ́ asz. Submodular functions and convexity. Mathematical programming: the state of the art, Bonn , pages 235–257, 1982. [36] S. T. McCormick. Submodular function minimization. Discrete Optimization , 12:321–391, 2005. [37] O. Meshi, T. Jaakkola, and A. Globerson. Convergence rate analysis of MAP coordinate mini- mization algorithms. In Neural Information Processing Systems (NIPS) , 2012. [38] G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming , 14(1):265–294, 1978. [39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming , 103 (1):127–152, 2005. [40] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming , 118(2):237–251, 2009. [41] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schn ̈ orr. A study of Nesterov’s scheme for Lagrangian decomposition and MAP labeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2011. [42] A. Shekhovtsov and V. Hlav ́ ac. A distributed mincut/maxflow algorithm combining path aug- mentation and push-relabel. In Energy Minimization Methods in Computer Vision and Pattern Recognition , 2011. [43] P. Stobbe. Convex Analysis for Minimizing and Learning Submodular Set functions . PhD thesis, California Institute of Technology, 2013. [44] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In Neural Information Processing Systems (NIPS) , 2010. [45] R. Tarjan, J. Ward, B. Zhang, Y. Zhou, and J. Mao. Balancing applied to maximum network flow problems. In European Symp. on Algorithms (ESA) , pages 612–623, 2006. 11 A Derivations of Dual Problems A.1 Proof of Lemma 1 Proof. To derive the non-smooth dual problem, we follow [31] and use Lagrangian duality: min x ∈ [0 , 1] n f ( x ) = min x ∈ [0 , 1] n ∑ r j =1 f j ( x ) = min x 1 ,...,x r ∈ [0 , 1] n ∑ r j =1 f j ( x j ) such that x 1 = · · · = x r = min x ∈ R n , x 1 ,...,x r ∈ [0 , 1] n max ( λ j ) ∑ r j =1 f j ( x j ) + ∑ r j =1 λ ⊤ j ( x − x j ) = max ∑ r j =1 λ j =0 ∑ r j =1 min x j ∈ [0 , 1] n { f j ( x j ) − λ ⊤ j x j } = max ∑ r j =1 λ j =0 ∑ r j =1 max y j ∈ B ( F j ) ( y j − λ j ) − ( V ) = max ∑ r j =1 λ j =0 ∑ r j =1 g j ( λ j ) , where g j ( λ j ) = min A ⊂ V F j ( A ) − λ j ( A ) is a nonsmooth concave function, which may be computed efficiently through submodular function minimization. A.2 Proof of Lemma 2 Proof. The proof follows a similar saddle-point approach. 
min x ∈ R n f ( x ) + 1 2 ‖ x ‖ 2 2 = min x ∈ R n ∑ r j =1 f j ( x ) + 1 2 ‖ x ‖ 2 2 = min x 1 ,...,x r ∈ R n ∑ r j =1 { f j ( x j ) + 1 2 r ‖ x j ‖ 2 2 } such that x 1 = · · · = x r = min x ∈ R n , x 1 ,...,x r ∈ R n max λ j ∑ r j =1 { f j ( x j ) + 1 2 r ‖ x j ‖ 2 2 + λ ⊤ j ( x − x j ) } = max ∑ r j =1 λ j =0 ∑ r j =1 min x j ∈ R n { f j ( x j ) − λ ⊤ j x j + 1 2 r ‖ x j ‖ 2 2 } = max ∑ r j =1 λ j =0 ∑ r j =1 min x j ∈ R n { max y j ∈ B ( F j ) x ⊤ j y j − λ ⊤ j x j + 1 2 r ‖ x j ‖ 2 2 } = max ∑ r j =1 λ j =0 ∑ r j =1 max y j ∈ B ( F j ) − r 2 ‖ y j − λ j ‖ 2 2 = max ∑ r j =1 λ j =0 max y j ∈ B ( F j ) − r 2 r ∑ j =1 ‖ y j − λ j ‖ 2 2 . (16) Writing (16) as a minimization problem and ignoring constants completes the proof. B Divide-and-conquer algorithm for parametric submodular minimization B.1 Description of the algorithm The optimal solution x ∗ of our proximal problem min x ∈ R n f ( x ) + ‖ x ‖ 2 indicates the minimizers of F ( S ) − λ | S | for all λ ∈ R . Those minimizers form a chain S ∅ ⊂ S 1 ⊂ . . . ⊂ S k = V . The solutions are the level sets of the optimal solution x ∗ . Here, we extend the approach of Tarjan et al. [45] for parametric max-flow to all submodular func- tions and all monotone strictly convex functions beyond the square functions used in the main paper. More precisely, we consider a submodular function F defined on V = { 1 , . . . , n } and n differ- entiable strictly convex functions h i such that their Fenchel-conjugates h ∗ i have full domain, for 12 i ∈ { 1 , . . . , n } . The functions h ∗ i are then differentiable. We consider the following problem: min x ∈ R n f ( x ) + n ∑ i =1 h i ( x i ) = min x ∈ R n max y ∈ B ( F ) y ⊤ x + n ∑ i =1 h i ( x i ) (17) = max y ∈ B ( F ) min x ∈ R n y ⊤ x + n ∑ i =1 h i ( x i ) (18) = max y ∈ B ( F ) − n ∑ i =1 h ∗ i ( − y i ) . (19) The optimality conditions are 1. y ∈ B ( F ) , 2. y ⊤ x = f ( x ) , 3. − y i = h ′ i ( x i ) ⇔ x i = ( h ∗ i ) ′ ( − y i ) . Let τ ( V ) be the time for minimizing the submodular function F ( S ) + a ( S ) on the ground set V (for any a ∈ R n ). For our complexity analysis, we make the assumption that minimizing the (contracted) function F S,a ( T ) , F ( S ∪ T ) − F ( S ) + a ( T ) on the smaller ground set U ⊆ V \ S (for any a ∈ R n , S ⊆ V , U ⊆ V \ S ) takes time at most | U | | V | τ ( V ) . This is a reasonable assumption, because it essentially says that τ ( V ) grows at least linearly in the size of V . To our knowledge, even fast algorithms for special submodular functions take at least linear time. We will also use the notation F ( S | T ) , F ( S ∪ T ) − F ( S ) . For recursions, we use the restriction F S : 2 S → R , F S ( T ) = F ( T ) of F to S and the contraction F S : 2 V \ S → R , F S ( T ) = F ( T | S ) of F on S . Algorithm 1: Recursive Divide-and-Conquer SplitInterval ( λ min , λ max , V , F , i ) if i even then // unbalanced split λ ← argmin λ ∑ i h i ( λ ) − λF ( V ) A ← argmin T ⊆ V F ( T ) + ∑ i ∈ T h ′ i ( λ ) if S = ∅ or S = V then return x = λ 1 V end else // balanced split λ ← ( λ min + λ max ) / 2 S ← argmin T ⊆ V F ( T ) + ∑ i ∈ T h ′ i ( λ ) if S = ∅ then x ← SplitInterval ( λ min , λ , V , F , i + 1 ) return x end if S = V then x ← SplitInterval ( λ , λ max , V , F , i + 1 ) return x end end // S 6 = ∅ and S 6 = V x S ← SplitInterval ( λ min , λ , S , F A , i + 1 ) x V \ S ← SplitInterval ( λ , λ max , V \ S , F S , i + 1 ) return [ x S , x V \ S ] Algorithm 1 is a divide-and-conquer algorithm. 
In each recursive call, it takes an interval [ λ min , λ max ] in which all components of the optimal solution lie and either (a) shortens the search 13 interval for any break point, (b) finds the optimal (constant) value of x on a range of elements, or (c) recursively splits the problem into a set S and V \ S with corresponding ranges for the values of x ∗ and finds the optimal values of x on the two subsets. B.2 Review of related results The goal of this appendix is to show Proposition 4 below. We first start by reviewing existing results regarding separable problems on the base polyhedron (see [1] for details). It is known that if y ∈ B ( F ) , then y k ∈ [ F ( V ) − F ( V \{ k } ) , F ( { k } ) ] ; thus, the optimal solution x is such that x k ∈ [ ( h ∗ k ) ′ ( − F ( { k } )) , ( h ∗ k ) ′ ( F ( V \{ k } ) − F ( V )) ] . We therefore set the initial search range to λ min = min k ∈ V ( h ∗ k ) ′ ( − F ( { k } )) and λ max = max k ∈ V ( h ∗ k ) ′ ( F ( V \{ k } ) − F ( V )) . The algorithm relies on the following facts (see [1] for a proof). For all three propositions, we assume that the h i are strictly convex, continuously differentiable functions on R such that sup λ ∈ R h ′ ( λ ) = + ∞ and inf λ ∈ R h ′ ( λ ) = −∞ . Proposition 1 (Monotonicity of optimizing sets) . Let α < β and S α be any minimizer of F ( S ) + h ′ ( α )( S ) and S β any minimizer of F ( S ) + h ′ ( β )( S ) . Then S β ⊆ S α . Proposition 2 (Characterization of x ∗ ) . The coordinates x ∗ j ( j ∈ V ) of the unique optimal solution x ∗ of Problem 17 are x ∗ j = max { λ | j ∈ S λ } , where S λ is any minimizer of F ( S ) + h ′ ( λ )( S ) . Propositions 1 and 2 imply that the level sets of x ∗ form a chain ∅ = S 0 ⊂ S 1 ⊂ . . . ⊂ S k = V of maximal minimizers for the critical values of λ (which are the entries of x ∗ ). (Each S i = S λ for some λ = x ∗ j .) Proposition 3 (Splits) . Let T = S i be a level set of x ∗ and let y ∈ R T , z ∈ R V \ T be the minimizers of the subproblems y = argmin x f T ( x ) + ∑ i ∈ T h i ( x i ) z = argmin x f T ( x ) + ∑ i / ∈ T h i ( x i ) Then x ∗ j = y j for j ∈ T and x ∗ j = z j for j ∈ V \ T . The algorithm uses Proposition 3 recursively. Proof. Let λ be the value in x ∗ defining S i = S λ . It is easy to see that the restriction F T and the contraction F T are both submodular. Hence, Propositions 1 and 2 hold for them. Since the restriction on T is equivalent to the original function for any S ⊆ T , F ( S ) + h ( λ )( S ) = F T ( S ) + h T ( λ )( S ) for any S ⊆ T . With this, Propositions 1 and 2 imply that for any α > λ , F ( S ) + h ( λ )( S ) = F T ( S ) + h T ( λ )( S ) and therefore x ∗ j = y j for j ∈ T . Similarly, for any S ∈ V \ T , it holds that F ( S ∪ T ) + h ( λ )( S ∪ T ) = F T ( S ) + ( h ′ ( λ )) T ( S ) + F ( T ) + h ′ ( λ )( T ) . Due to the monotonicity property of the optimizing sets, S α ⊇ T for all α < λ , and therefore the maximal minimizer U α of F T ( S ) + ( h ′ ( α )) T ( S ) satisfies U α ∪ T = S α (the terms F ( T ) + h ′ ( λ )( T ) are constant with respect to U ). Hence Poposition 2 implies that x ∗ j = z j for j ∈ V \ T . These propositions imply that there is a set of at most n values of λ = α that define the level sets S α of the optimal solution x ∗ . If we know these break point values, then we know x ∗ . 
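These propositions already yield the classical O(n)-call divide-and-conquer strategy [21, 23] mentioned in Section 2. As a concrete illustration, here is a minimal sketch (ours, not Algorithm 1) for the quadratic case h_i(x_i) = x_i^2/2, following the main-text convention that the level sets {x* > μ} minimize F(S) + μ|S|; it assumes a hypothetical oracle sfm that returns the inclusion-minimal minimizer of F(T) + μ|T| over the given ground set. Algorithm 1 interleaves such unbalanced splits with the bisection ("balanced") steps described below.

```python
def prox_via_divide_and_conquer(F, V, sfm):
    """Compute x* = argmin_x f(x) + 0.5 ||x||^2, returned as a dict {element: value}.

    F   : set-function oracle F(frozenset) -> float, with F(empty set) = 0
    V   : iterable of ground-set elements
    sfm : oracle returning the inclusion-minimal minimizer of F(T) + mu*|T|, T subset of V
    """
    V = frozenset(V)
    if not V:
        return {}
    mu = -F(V) / len(V)              # candidate constant value of x* on this block
    S = sfm(F, mu, V)                # minimal minimizer of F(T) + mu*|T| over T in V
    if not S:                        # empty set optimal -> x* is constant on this block
        return {i: mu for i in V}
    # Otherwise S is a proper level set {x* > mu}: recurse on the restriction
    # of F to S and on the contraction of F by S (Proposition 3).
    F_restricted = lambda T: F(T)
    F_contracted = lambda T: F(T | S) - F(S)
    x = prox_via_divide_and_conquer(F_restricted, S, sfm)
    x.update(prox_via_divide_and_conquer(F_contracted, V - S, sfm))
    return x
```

Each recursion strictly splits the ground set, so the routine makes O(n) calls to the submodular minimization oracle; the thresholds {x* > μ} then solve all μ-parameterized discrete problems at once, as discussed in Section 2.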
Algorithm 1 interleaves an unbalanced split strategy that may split the seach interval in an unbalanced way but converges in O ( n ) recursive calls, and a balanced split strategy that always halves the search inter- vals but is not finitely convergent. 14 B.3 Proof of convergence We now prove the convergence rate for Algorithm 1. Proposition 4. The minimum of f ( x ) + ∑ n i =1 h i ( x i ) may be obtained up to coordinate-wise accu- racy ǫ within O ( min { n, log 1 ǫ } ) (20) submodular function minimizations. If h i ( x i ) = 1 2 x 2 i , then ǫ = ∆ min n 2 ℓ 0 is sufficient to recover the exact solution, where ∆ min = min {| F ( S | T ) | | S ⊆ V \ T, F ( S | T ) 6 = 0 } and ℓ 0 is the length of the initial interval [ λ min , λ max ] . Proof. The proof relies on Propositions 1, 2 and 3. We first argue for the correctness of the balanced splitting strategy. Propositions 1 and 2 imply that for any λ ∈ R , if S is a minimizer of F ( S ) + h ′ ( λ )( S ) , then the unique minimum of f ( x ) + ∑ n i =1 h i ( x i ) satisfies that for all k ∈ S , x k > h ′ k ( λ ) and for all k ∈ V \ S, x k 6 h ′ k ( λ ) . In particular, if S = ∅ , then this means that for all k ∈ V , x k 6 h ′ k ( λ ) . Similarly, if S = V , then for all k ∈ V , x k > h ′ k ( λ ) . The limits of the interval are set accordingly. The correctness of the recursive call follows from Proposition 3. In each iteration, the size of the search interval [ λ min , λ max ] for any break point is at least halved. Hence, within d recursions, the length of each interval is at most 2 − d ℓ 0 . The choice of λ in the unbalanced splitting strategy corresponds to solving a simplified version of the dual problem. Indeed, by convex duality, the following two problems are dual to each other: max y − ∑ i h ∗ i ( − y i ) s.t. y ( V ) = F ( V ) (21) min λ ∈ R ∑ i ∈ V h i ( λ ) − λF ( V ) . (22) Problem (21) replaces the constraint that y ∈ B ( F ) by y ( V ) = F ( V ) , dropping the constraint that y ( S ) ≤ F ( S ) for all S ⊆ V . Testing whether y satisfies all constraints of (19), i.e., y ∈ B ( F ) is equivalent to testing whether F ( S ) − y ( S ) ≥ 0 . We do this implicitly by our choice of λ : Convex duality implies that the the optimal solutions of Problems (21) and (22) satisfy y i = − h ′ i ( λ ) . This holds in particular for the chosen (unique optimal) λ in the algorithm. Let T be a minimizer of F ( S ) + h ′ ( λ )( S ) = F ( S ) − y ( S ) . If T = ∅ or T = V , then y ∈ B ( F ) and an optimal solution for the full dual problem (19). Hence, y and x = λ 1 V = ( h ∗ ) ′ ( − y ) form a primal/dual optimal pair for (19). If ∅ ⊂ T ⊂ V and F ( T ) − y ( T ) < 0 , then y / ∈ B ( F ) , and we perform a split with the same argumentation as above. This splitting strategy is exactly that of [1, 23] and splits at most n times. Hence, this strategy yields the global optimum (to machine precision) in the time of O ( n ) times solving a submodular minimization on V . If n is large, this may be computationally expensive. If we only do balanced splits, we end up approaching the break points more and more closely (but typically never exactly). Unbalanced splits always find an exact break point, but with potentially little progress in reducing the intervals. Algorithm 1 thus interleaves both strategies where we store intervals of allowed values for subsets of components of A . At step d there are at most min { n, 2 d } different intervals (as there cannot be more intervals than elements of V ). 
To split these intervals, submodular function minimization problems have to be solved on each of these intervals, with total complexity less than a single submodular function optimization problem on the full set. At each iteration, intervals corresponding to a singleton may be trivially completely solved, and components which are already found are discarded. Hence, at each recursive level, the total computation time is bounded above by τ ( V ) . While balanced splits always substantially shrink the intervals, they are not finitely convergent. Un- balanced splits converge after at most n recursions. Following the argumentation of Tarjan et al. [45], who considered the special case of flows, alternating the two types of splits gives the best of both worlds: (a) all components are estimated up to precision ℓ 0 2 d/ 2 , and (b) the algorithm is finitely con- vergent, and will stop when ℓ 0 2 d/ 2 is less than the minimal distance between two different components of x . 15 Finally, we adress the precision for the special case that h i ( x i ) = 1 2 x 2 i for all i ∈ V . If the interval lengths are smaller than the smallest gap between any two break points (components of x ∗ ), then each interval contains at most one break point and the algorithm converges after at most two unbal- anced splits. Hence, we here consider ǫ to be one half times the smallest gap between any two break points. Let ∅ = S 0 ⊂ S 1 ⊂ . . . ⊂ S k = V be the chain of level sets of x ∗ . By the optimality conditions discussed above for unbalanced splits, any constant part T = S i \ S i − 1 of x ∗ takes value λ 1 = − y j 1 ( j ∈ T ), where y ( T ) = F S i − 1 ( T ) , and hence λ = − F ( S i \ S i − 1 | S i − 1 ) | S i \ S i − 1 | . (23) Therefore, the (absolute) difference between any two such values is loosely lower bounded by min i ∣ ∣ ∣ ∣ F ( S i \ S i − 1 | S i − 1 ) | S i \ S i − 1 | − F ( S i +1 \ S i | S i ) | S i +1 \ S i | ∣ ∣ ∣ ∣ ≥ ∆ min ( 2 n − 1 − 2 n ) ≥ 2∆ min n 2 . (24) This implies O (log( ℓ 0 n 2 / ∆ min )) iterations. Note that in the case of flows, the algorithm is not exactly equivalent to the flow algorithm of [45], which updates flows directly. C BCD and proximal Dykstra We consider the best approximation problem min 1 2 ‖ x − y ‖ 2 2 s.t. x ∈ C 1 ∩ C 2 ∩ · · · ∩ C m . Let us show the details for only the two block case. The general case follows similarly. Consider the more general problem min 1 2 ‖ x − y ‖ 2 2 + f ( x ) + h ( x ) . (25) Clearly, this problem contains the two-block best approximation problem as a special case (by setting f and h to be suitable indicator functions). Now introduce two variables z, w that equal x ; then the corresponding Lagrangian is L ( x, z, w, ν, μ ) := 1 2 ‖ x − y ‖ 2 2 + f ( z ) + h ( w ) + ν T ( x − z ) + μ T ( x − w ) . From this Lagrangian, a brief calculation yields the dual optimization problem min g ( ν, μ ) := 1 2 ‖ ν + μ − y ‖ 2 2 + f ∗ ( ν ) + h ∗ ( μ ) . We solve this dual problem via BCD, which has the updates ν k +1 = argmin ν g ( ν, μ k ) , μ k +1 = argmin μ g ( ν k +1 , μ ) . Thus, 0 ∈ ν k +1 + μ k − y + ∂f ∗ ( ν k +1 ) and 0 ∈ ν k +1 + μ k +1 − y + ∂h ∗ ( μ k +1 ) . The first optimality condition may be rewritten as y − μ k ∈ ν k +1 + ∂f ∗ ( ν k +1 ) = ⇒ ν k +1 = prox f ∗ ( y − μ k ) = ⇒ ν k +1 = y − μ k − prox f ( y − μ k ) . Similarly, we second condition yields μ k +1 = y − ν k +1 − prox h ( y − ν k +1 ) . 
Now use Lagrangian stationarity x = y − ν − μ = ⇒ y − μ = x + ν to rewrite BCD using primal and dual variables to obtain the so-called proximal-Dykstra method: t k ← prox f ( x k + ν k ) ν k +1 ← x k + ν k − t k x k +1 ← prox h ( μ k + t k ) μ k +1 ← μ k + t k − x k +1 We discussed the more general problem (25) because it contains the smoothed primal as a special case, namely with y = 0 in (25), f = f 1 , and h = f 2 , we obtain min f 1 ( x ) + f 2 ( x ) + 1 2 ‖ x ‖ 2 2 , for which BCD yields the proximal-Dykstra method that was previously used in [3] for two- dimensional TV optimization. 16 D Recipe: Submodular minimization via reflections To be precise, we summarize here how to solve Problem (17) via reflections. As we showed above, the dual is of the form min λ,y ‖ y − λ ‖ 2 2 s.t. λ ∈ A = { ( λ 1 , . . . , λ r ) ∈ H r | ∑ r j =1 λ j = 0 } , y ∈ B , ∏ r j =1 B ( F j ) . (26) The vector y consists of r parts y j ∈ B ( F j ) . We first solve the dual by starting with any z (0) ∈ H r , and iterate z ( k +1) = 1 2 ( z k + R A R B ( z ( k ) )) . (27) Upon convergence to a point z ∗ , we extract the components y j = Π B ( F j ) ( z ∗ j ) . (28) The final primal solution is x = − ∑ j y j ∈ R n . 17
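For concreteness, here is a minimal sketch of this recipe (ours, not the released implementation), assuming each Euclidean projection onto B(F_j) is available as a black-box routine (for instance a fast total-variation/chain solver, or the divide-and-conquer routine of Appendix B). Notation follows (26)-(28): A is the subspace {Σ_j λ_j = 0} and B is the product of the base polytopes.

```python
import numpy as np

def submodular_min_by_reflections(project_Bj, n, num_iter=100):
    """project_Bj: list of r callables, each mapping R^n onto B(F_j).
    Returns the primal point x = -sum_j y_j."""
    r = len(project_Bj)
    z = np.zeros((r, n))                        # z^(0) in H_r

    def proj_A(w):                              # projection onto {sum_j lambda_j = 0}
        return w - w.mean(axis=0, keepdims=True)

    def proj_B(w):                              # projection onto prod_j B(F_j), blockwise
        return np.stack([project_Bj[j](w[j]) for j in range(r)])

    refl_A = lambda w: 2 * proj_A(w) - w        # reflection operators R_A, R_B
    refl_B = lambda w: 2 * proj_B(w) - w

    for _ in range(num_iter):
        z = 0.5 * (z + refl_A(refl_B(z)))       # averaged reflection step (27)
    y = proj_B(z)                               # extract y_j = Pi_{B(F_j)}(z_j), cf. (28)
    return -y.sum(axis=0)                       # primal solution x = -sum_j y_j
```

By Section 2, thresholding the returned x (e.g., taking the level set {i : x_i > 0}) yields a minimizer of the discrete problem (1); in practice one would monitor the smooth and discrete duality gaps of Section 5 rather than fixing the number of iterations in advance.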