Journal of Computational Mathematics Vol.35, No.4, 2017, 383–398. http://www.global-sci.org/jcm doi:10.4208/jcm.1606-m2015-0452 EXTRAPUSH FOR CONVEX SMOOTH DECENTRALIZED OPTIMIZATION OVER DIRECTED NETWORKS* Jinshan Zeng College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China Email: jinshanzeng@jxnu.edu.cn Wotao Yin Department of Mathematics, University of California, Los Angeles, CA 90095, USA Email: wotaoyin@math.ucla.edu Abstract In this note, we extend the algorithms Extra [13] and subgradient-push [10] to a new algorithm ExtraPush for consensus optimization with convex differentiable objective func- tions over a directed network. When the stationary distribution of the network can be computed in advance, we propose a simplified algorithm called Normalized ExtraPush. Just like Extra, both ExtraPush and Normalized ExtraPush can iterate with a fixed step size. But unlike Extra, they can take a column-stochastic mixing matrix, which is not necessarily doubly stochastic. Therefore, they remove the undirected-network restriction of Extra. Subgradient-push, while also works for directed networks, is slower on the same type of problem because it must use a sequence of diminishing step sizes. We present preliminary analysis for ExtraPush under a bounded sequence assump- tion. For Normalized ExtraPush, we show that it naturally produces a bounded, linearly convergent sequence provided that the objective function is strongly convex. In our numerical experiments, ExtraPush and Normalized ExtraPush performed simi- larly well. They are significantly faster than subgradient-push, even when we hand-optimize the step sizes for the latter. Mathematics subject classification: 90C25, 90C30. Key words: Decentralized optimization, Directed graph, Consensus, Non-doubly stochas- tic, Extra. 1. Introduction We consider the following consensus optimization problem defined on a directed, strongly connected network of n agents: minimize x∈Rp f(x) ≜ n X i=1 fi(x), (1.1) where fi is a proper, closed, convex, differentiable function only known to the agent i. The model (1.1) finds applications in decentralized averaging, learning, estimation, and control. For a stationary network with bi-directional communication, the existing algorithms include the (sub)gradient methods [2,5,8,9,13,19], and the primal-dual domain methods such as the decentralized alternating direction method of multipliers (DADMM) [11,12]. * Received November 12, 2015 / Revised version received June 2, 2016 / Accepted June 27, 2016 / arXiv:1511.02942v4 [math.OC] 30 Jan 2019 384 J.S. ZENG AND W.T. YIN This note focuses on a directed network (with directional communication), where the re- search of decentralized optimization is pioneered by the works [15–17]. When communication is bi-directional, algorithms can use a symmetric and doubly-stochastic mixing matrix to ob- tain a consensual solution; however, once the communication is directional, the mixing matrix becomes generally asymmetric and only column-stochastic. Also consider the setting where each agent broadcasts its information to its neighbors, yet an agent may not receive the in- formation from a neighbor. An agent can weigh its information (both from itself and received from its neighbors) so that the total weights add up to 1, but an agent cannot ensure that its broadcasted information receives weights that precisely add up to exactly 1. Therefore, only each column of the mixing matrix sums to 1. In the column-stochastic setting, the push-sum protocol [6] can be used to obtain a stationary distribution for the mixing matrix. In the symmetric and doubly-stochastic setting, if the objective is Lipschitz-differentiable, the gradient-based algorithm Extra [13] converges at the rate of O(1/t), where t is the iteration number. In the column-stochastic setting, the best rate is O(ln t/ √ t) from the subgradient- based algorithm [10]. We address the open question of how to take advantage of the gradient of a Lipschitz-differentiable objective. We make an attempt in this note to combine ideas in [10,13] and present our preliminary results. Specifically, we propose ExtraPush, which is a two-step iteration like Extra and incorporates the push-sum protocol. At each iteration, the Extra variables are approximately normalized by the current push-sum variables. When the stationary distribution of the network can be easily computed, we propose to first apply the push-sum protocol to obtain the stationary distribution and then run the two-step iteration Normalized ExtraPush. At each iteration, its running variables are normalized by the stationary distribution. Our algorithms are essentially the same as found in the recent work by Xi and Khan [18]. They attempted to prove convergence for a strongly convex objective function. They noticed that a certain matrix that is important to the analysis (as a part of their convergence metric) is positive semi-definite. Our analysis also uses this property. However, their analysis breaks down due to incorrect assumptions. More specifically, each function fi is assumed in [18] to be strongly convex and also has a bounded and Lipschitz gradient (i.e., its gradient is bounded and Lipschitz continuous). However, no function can satisfy these assumptions simultaneously since gradients of a strongly convex are strictly increasing and unbounded. It is worth noting that our algorithm can be applied to a time-varying directed network after a straightforward modification; our convergence proof, however, will need a significant change. The rest of this note is organized as follows. Section 2 introduces the problem setup and preliminaries. Section 3 develops ExtraPush and Normalized ExtraPush. Section 4 establishes the optimality conditions for ExtraPush and shows its convergence under the boundedness assumption. Section 5 assumes that the objective is strongly convex and shows that Normal- ized ExtraPush produces a bounded sequence that converges linearly. Section 6 presents our numerical simulation results. We conclude this paper in Section 7. Notation: Let In denote an identity matrix with the size n × n, and 1n×p ∈Rn×p denote the matrix with all entries equal to 1. We also use 1n ∈Rn as a vector of all 1’s. For any vector x, we let xi denote its ith component and diag(x) denote the diagonal matrix generated by x. For any matrix X, XT denotes its transpose, Xij denotes its (i, j)th component, and ∥X∥≜ p ⟨X, X⟩= qP i,j X2 ij denotes its Frobenius norm. The largest and smallest eigenvalues of matrix X are denoted as λmax(X) and λmin(X), respectively. For any matrix B ∈Rm×n, null(B) ≜{x ∈Rn|Bx = 0} is the null space of B. Given a matrix B ∈Rm×n, by Z ∈null(B), ExtraPush for Convex Smooth Decentralized Optimization 385 we mean that each column of Z lies in null(B). The smallest nonzero eigenvalue of a symmetric positive semidefinite matrix X ̸= 0 is denoted as ˜λmin(X), which is strictly positive. For any positive semidefinite matrix G ∈Rn×n (not necessarily symmetric in this paper), we use the notion ∥X∥2 G ≜⟨X, GX⟩for a matrix X ∈Rn×p. 2. Problem Reformulation 2.1. Network Consider a directed network G = {V, E}, where V is the vertex set and E is the edge set. Any edge (i, j) ∈E represents a directed arc from node i to node j. The sets of in-neighbors and out-neighbors of node i are N in i ≜{j : (j, i) ∈E} ∪{i}, N out i ≜{j : (i, j) ∈E} ∪{i}, respectively. Let di ≜|N out i | be the out-degree of node i. In G, each node i can only send information to its out-neighbors, not vice versa. To illustrate a mixing matrix for a directed network, consider A ∈Rn×n where  Aij > 0, if j ∈N in i Aij = 0, otherwise. (2.1) The entries Aij satisfy that, for each node j, P i∈V Aij = 1. An example is the following mixing matrix Aij =  1/dj, if j ∈N in i 0, otherwise, (2.2) i, j = 1, . . . , n, which is used in the subgradient-push method [10]. See Fig. 2.1 for a directed graph G and an example of its mixing matrix A. The matrix A is column stochastic and asymmetric in general. 1 4 5 2 3 A =         1 4 1 4 0 1 2 0 1 4 1 4 0 0 1 3 1 4 0 1 2 0 1 3 0 1 4 0 1 2 0 1 4 1 4 1 2 0 1 3         Fig. 2.1. A directed graph G (left) and its mixing matrix A (right). Assumption 1. The graph G is strongly connected. Property 1. Under Assumption 1, the followings hold (parts (i) and (iv) are results in [10, Corollary 2]): (i) Let At = t z }| { A × A · · · A for any t ∈N. Then At →φ1T n geometrically fast as t →∞, (2.3) for some stationary distribution vector φ, i.e., φi ≥0 and Pn i φi = 1. 386 J.S. ZENG AND W.T. YIN (ii) null(In −φ1T n) = null(In −A). (iii) Aφ = φ. (iv) The quantity ξ ≜inft min1≤i≤n(At1n)i ≥ 1 nn > 0. Proof. Part (iii) is obvious from (ii) since φ ∈null(In −φ1T n) and Pn i φi = 1. Next, we show part (ii). First, let z ∈null(In −φ1T n), which means z = φ1T nz and thus Az = Aφ1T nz. By (2.3), it is obvious that Aφ1T n = φ1T n. Therefore, Az = φ1T nz = z and hence null(In −φ1T n) ⊆ null(In −A). On the other hand, any z ∈null(In −A), equivalently, z = Az, obeys z = Atz for any t ≥1. Letting t →∞, it holds that z = φ1T nz, that is, z ∈null(In −φ1T n). Therefore, part (ii) holds. □ 2.2. Problem Given in the Matrix Notation Let x(i) ∈Rp denote the local copy of x at node i, and xt (i) denote its value at the tth iteration. Throughout the note, we use the following equivalent form of the problem (1.1) using local copies of the variable x: minimizex 1T nf(x) ≜ n X i=1 fi(x(i)), subject to x(i) = x(j), ∀i, j ∈E, (2.4) where 1n ∈Rn denotes the vector with all its entries equal to 1, x ≜       — xT (1) — — xT (2) — ... — xT (n) —       ∈Rn×p, f(x) ≜      f1(x(1)) f2(x(2)) ... fn(x(n))     ∈Rn. In addition, the gradient of f(x) is ∇f(x) ≜      — ∇f1(x(1))T — — ∇f2(x(2))T — ... — ∇fn(x(n))T —     ∈Rn×p. The ith rows of the above matrices x and ∇f(x), and vector f(x), correspond to agent i. For simplicity, one can treat p = 1 throughout this paper. For a vector ¯x ∈Rn, let ¯xave ≜1 n(Pn i=1 ¯xi) ∈R. A special case of (2.4) is the well-known average consensus problem, where fi(x(i)) = 1 2(x(i) −¯xi)2 for each node i and the solution is x(i) = ¯xave for all i. 3. Proposed Algorithms 3.1. Reviews of Extra and subgradient-push Extra [13] is a “two-step” iterative algorithm for solving (2.4) over an undirected network. Let W ∈Rn×n be a symmetric and doubly stochastic mixing matrix, and ¯W ≜In+W 2 . The Extra iteration is xt+2 = (In + W)xt+1 −¯Wxt −α(∇f(xt+1) −∇f(xt)), t = 0, 1, · · · , (3.1) ExtraPush for Convex Smooth Decentralized Optimization 387 which starts with x1 = x0 −α∇f(x0), any x0 ∈Rn×p and uses a properly bounded step size α > 0. Extra converges at a rate o( 1 t ), measured by the best running violation to the first-order optimality condition, provided that f is Lipschitz differentiable. It improves to a linear rate of convergence if f is also (restricted) strongly convex. The subgradient-push algorithm [10] is proposed to solve the decentralized optimization problem (1.1) over a time-varying directed graph. It is a combination of the subgradient method and the push-sum protocol [1,6,7]. Let A(t) be the mixing matrix at the tth iteration as defined in (2.2) for a time-varying directed network. The iteration of subgradient-push is    zt+1 = A(t)zt −αt∇f(xt), wt+1 = A(t)wt, xt+1 = diag(wt+1)−1zt+1, (3.2) where αt > 0 is the step size at the tth iteration that decays as follows: P∞ t=1 αt = ∞, P∞ t=1 α2 t < ∞, and αt ≤αs for all t > s ≥1. It is shown in [10] that the convergence rate of subgradient-push algorithm is O(ln t/ √ t). 3.2. Proposed: ExtraPush ExtraPush combines the above two algorithms. Specifically, set arbitrary z0 and w0 = 1n; set x0 = z0; for t = 1, set w1 = Aw0, z1 = Az0 −α∇f(x0), and x1 = diag(w1)−1z1. Letting ¯A ≜In+A 2 , for t = 2, 3, . . . , perform    zt = (A + In)zt−1 −¯Azt−2 −α(∇f(xt−1) −∇f(xt−2)), wt = Awt−1, xt = diag(wt)−1zt. (3.3) By the structure of A, each node i broadcasts its z(i) to its out-neighbors at each ExtraPush iteration. The step size α > 0 needs to be properly set. The iteration (3.3) of ExtraPush can be implemented at each agent i as follows:        zt (i) = zt−1 (i) + P j∈N in i Aijzt−1 (j) −P j∈N in i ¯Aijzt−2 (j) −α(∇fi(xt−1 (i) ) −∇fi(xt−2 (i) )), wt i = P j∈N in i Aijwt−1 j , xt (i) = zt (i) wt i , where ¯Aij is the (i, j)th component of ¯A, and wt i is the ith component of wt, for all i, j. 3.3. Proposed: Normalized ExtraPush Normalized ExtraPush first computes the stationary distribution φ of A and saves each φi at node i. Next, in the main iteration, the w-step from (3.3) is removed, and n · φ instead of wt is used to obtain xt. As such, the main iteration of Normalized ExtraPush simplifies (3.3). Letting, D ≜ndiag(φ). the iteration of Normalized ExtraPush proceeds as follows: set arbitrary z0 and x0 = D−1z0; for t = 1, set z1 = Az0 −α∇f(x0) and x1 = D−1z1. For t = 2, 3, . . . , perform  zt = (A + In)zt−1 −¯Azt−2 −α(∇f(xt−1) −∇f(xt−2)), xt = D−1zt. (3.4) 388 J.S. ZENG AND W.T. YIN At each agent i, the iterate (3.4) of Normalized ExtraPush can be implemented as follows: ( zt (i) = zt−1 (i) + P j∈N in i Aijzt−1 (j) −P j∈N in i ¯Aijzt−2 (j) −α(∇fi(xt−1 (i) ) −∇fi(xt−2 (i) )), xt (i) = zt (i) nφi . Next, we present two equivalent forms of Normalized ExtraPush. Letting fφ(z) ≜Df(D−1z), we have ∇fφ(z) = ∇f(D−1z). Substituting the x-step of (3.4) into its z-step yields the single- value iteration: zt = (A + In)zt−1 −¯Azt−2 −α(∇fφ(zt−1) −∇fφ(zt−2)). (3.5) Upon stopping, one shall return xt = D−1zt. The iteration (3.5) is nearly identical to the Extra iteration (3.1) except that (3.1) must use a doubly-stochastic matrix. Letting Aφ ≜D−1AD and ¯Aφ ≜ 1 2(In + Aφ), which are row stochastic matrices, gives another equivalent form of Normalized ExtraPush xt = (Aφ + In)xt−1 −¯Aφxt−2 −αD−1(∇f(xt−1) −∇f(xt−2)), (3.6) which, compared to the Extra iteration (3.1), has the extra diagonal matrix D−1. Indeed, this iteration generalizes Extra to use row-stochastic matrices Aφ and ¯A. 4. Preliminary Analysis of ExtraPush In this section, we first develop the first-order optimality conditions for the problem (2.4) and then provide the convergence of ExtraPush under the boundedness assumption. Theorem 1. (first-order optimality conditions) Suppose that graph G is strongly con- nected. Then x∗is consensual and x∗ (1) ≡x∗ (2) ≡· · · ≡x∗ (n) is an optimal solution of (1.1) if and only if, for some α > 0, there exist z∗∈null(In −A) and y∗∈null(1T n) such that the following conditions hold  y∗+ α∇f(x∗) = 0, x∗= D−1z∗. (4.1) (We let L∗denote the set of triples (z∗, y∗, x∗) satisfying the above conditions.) Proof. Assume that x∗is consensual and x∗ (1) ≡x∗ (2) ≡· · · ≡x∗ (n) is optimal. Let z∗= ndiag(φ)x∗= n(φx∗T (1)). Then φ1T nz∗= φ1T nnφx∗T (1) = nφx∗T (1) = z∗. It implies that z∗∈null(I− φ1T n). By Property 1 (ii), it follows that z∗∈null(In −A). Moreover, letting y∗= −α∇f(x∗), it holds that 1T ny∗= −α1T n∇f(x∗) = 0, that is, y∗∈null(1T n). On the other hand, assume (4.1) holds. By Property 1 (ii), it follows that z∗= φ1T nz∗. Plugging x∗= D−1z∗gives x∗= 1 n1n1T nz∗, which implies that x∗is consensual. Moreover, by y∗+ α∇f(x∗) = 0 and y∗∈null(1T n), it holds 1T n∇f(x∗) = −1 α1T ny∗= 0, which implies that x∗is optimal. □ Introducing the sequence yt ≜ t X k=0 ( ¯A −A)zk, (4.2) ExtraPush for Convex Smooth Decentralized Optimization 389 the iteration (3.3) of ExtraPush can be rewritten as        ¯Azt+1 = ¯Azt −α∇f(xt) −yt+1, yt+1 = yt + ( ¯A −A)zt+1, wt+1 = Awt, xt+1 = diag(wt+1)−1zt+1. (4.3) Theorem 2. Suppose that the sequence {xt} generated by ExtraPush (3.3) and the sequence {yt} defined in (4.2) are bounded. Then, any limit point of {(zt, yt, xt)}, denoted by (z∗, y∗, x∗), satisfies the optimality conditions (4.1). Proof. By Property 1, {wt} is bounded. By the last update of (4.3) and the bounded- ness of both {xt} and {wt}, {zt} is bounded. Hence, there exists a convergent subsequence {(z, y, w, x)tj}∞ j=1. Let (z∗, y∗, w∗, x∗) be its limit. By (2.3), we know that w∗= nφ and thus that x∗= D−1z∗. Letting t →∞in the second equation of (4.3) gives z∗= Az∗, or equivalently z∗∈null(In −A). Similarly, letting t →∞in the first equation of (4.3) yields y∗+ α∇f(x∗) = 0. Moreover, from the definition (4.2) of yt and the facts that both A and ¯A are column stochastic, it follows that 1T ny∗= 0 and 1T n∇f(x∗) = 0. Therefore, (z∗, y∗, x∗) satisfies the optimality conditions (4.1). □ 5. Convergence of Normalized ExtraPush In this section, we show the linear convergence of Normalized ExtraPush under the smooth- ness and strong convexity assumptions of the objective function. Similar to (4.3), introducing a new sequence yt = Pt k=0( ¯A −A)zk, the iterative formula (3.4) of Normalized ExtraPush implies    ¯Azt+1 = ¯Azt −α∇f(xt) −yt+1, yt+1 = yt + ( ¯A −A)zt+1, xt+1 = D−1zt+1. (5.1) Theorem 3. Suppose that the sequence {xt} generated by Normalized ExtraPush (3.4) is bounded, and that the sequence {yt} is also bounded. Then, any limit point of {(zt, yt, xt)}∞ t=0, denoted by (z∗, y∗, x∗), satisfies the first-order optimality conditions (4.1). The proof is very similar to that of Theorem 2. It only needs to replace the sequence {wt} with its limitation nφ in the proof procedure, thus we omit it here. From Theorem 3, it shows that Normalized ExtraPush has subsequence convergence to an optimal solution of the considered optimization problem under the boundedness assumption. To obtain the linear convergence of Normalized ExtraPush, we still need the following assumptions. Assumption 2. (existence of solution) Let X ∗be the optimal solution set of problem (1.1), and assume that X ∗is nonempty. Assumption 3. For each agent i, its objective function fi satisfies the following: (i) (Lipschitz differentiability) fi is differentiable, and its gradient ∇fi is Li-Lipschitz continuous, i.e., ∥∇fi(x) −∇fi(y)∥≤Li∥x −y∥, ∀x, y ∈Rp; 390 J.S. ZENG AND W.T. YIN (ii) (quasi-strong convexity) fi is quasi-strongly convex, and there exists a positive con- stant Si such that Si∥x∗−y∥2 ≤⟨∇fi(x∗) −∇fi(y), x∗−y⟩for any y ∈Rp and some optimal value x∗∈X ∗. Following Assumption 3, there hold for any x, y ∈Rn×p and some x∗≡1n(x∗)T ∥∇f(x) −∇f(y)∥≤Lf∥x −y∥, (5.2) Sf∥x∗−y∥2 ≤⟨∇f(x∗) −∇f(y), x∗−y⟩, (5.3) where the constants Lf ≜maxi Li and Sf ≜mini Si. Assumption 4. (positive definiteness) D−1 ¯A + ¯AT D−1 ≻0. By noticing D−1 ¯A + ¯AT D−1 = D−1/2(D−1/2 ¯AD1/2 + D1/2 ¯AT D−1/2)D−1/2, we can guar- antee the positive definiteness of D−1 ¯A+ ¯AT D−1 by ensuring the matrix ¯A+ ¯AT to be positive definite. Note that ¯Aii > P j̸=i ¯Aij for each i, which means that ¯A is strictly column-diagonal dominant. To ensure the positive definiteness of ¯A + ¯AT , each node j can be “selfish” and take a sufficiently large Ajj. Before presenting the main result, we introduce the following notation. For each t, intro- ducing ut = Pt k=0 zk, then the Normalized ExtraPush iteration (3.4) reduces to    ¯Azt+1 = ¯Azt −α∇f(xt) −( ¯A −A)ut+1 ut+1 = ut + zt+1 xt+1 = D−1zt+1. (5.4) Let (z∗, y∗, x∗) ∈L∗, where x∗has been specified in (5.3). Let u∗be any matrix that satisfies ( ¯A −A)u∗= y∗. For simplicity, we introduce vt =  zt ut  , v∗=  z∗ u∗  , G =  N T 0 0 M  , S =  0 M −M T 0  , (5.5) where N = D−1 ¯A, M = D−1( ¯A −A). Let fD(z) ≜f(D−1z) and ¯f(v) ≜fD(z). Then ∇¯f(v) = [∇fD(z), 0]. By (5.2) and (5.3), there hold ∥∇¯f(v1) −∇¯f(v2)∥= ∥∇fD(z1) −∇fD(z2)∥≤¯L∥z1 −z2∥, (5.6) ¯µ∥z∗−z∥2 ≤⟨∇fD(z∗) −∇fD(z), z∗−z⟩= ⟨∇¯f(v∗) −∇¯f(v), v∗−v⟩, (5.7) where ¯L ≜ Lf σ2 min(D) and ¯µ ≜ Sf σ2max(D). By (5.4) and (5.5), the Normalized ExtraPush iteration (3.4) implies GT (vt+1 −vt) = −Svt+1 −α∇¯f(vt). (5.8) Next, we will show that G + GT is positive semidefinite, which by Assumption 4 implies that N + N T is positive definite. It is sufficient to show that M + M T is positive semidefinite. Note that M + M T = D−1(In −A) 2 + (In −A)T D−1 2 = D−1/2  In −D1/2AT D−1/2 + D−1/2AD1/2 2  D−1/2 ≜D−1/2ΛD−1/2, ExtraPush for Convex Smooth Decentralized Optimization 391 and by Property 1 (iii), nφT is the left eigenvector of AT corresponding to eigenvalue 1, and thus, Λ is the Laplacian of a certain directed graph G′ with AT being its corresponding transition probability matrix [3]. It follows that 0 = λ1(Λ) ≤λ2(Λ) ≤· · · ≤λn(Λ), where λi(Λ) denotes the ith eigenvalue of Λ. Therefore, M +M T is positive semidefinite, and the following property holds ∥x∥2 G = 1 2∥x∥2 G+GT ≥0, ∀x ∈Rn. Let c1 = λmax(MM T ) ˜λmin(M T M) , c2 = λmax( M+M T 2 ) ˜λmin(M T M) , c3 = λmax(NN T ) + 3c1λmax(N T N), and let ∆1 =  ¯µ −η 2 2 −6c1 ¯L2, ∆2 = ¯L4 4η2 −3c1 ¯L2σ  c3σ −λmin(N T + N)  for some appropriate tunable parameters η and σ. Then we describe our main result as follows. Theorem 4. Under Assumptions 1-4, if the step size parameter α satisfies ¯µ −η 2 −√∆1 3c1 ¯L2σ < α < min ( ¯µ −η 2 + √∆1 3c1 ¯L2σ , − ¯L2 2η + √∆2 3c1 ¯L2σ ) (5.9) for some appropriate η and σ as specified in (5.25) and (5.26), respectively, then the sequence {vt} defined in (5.5) satisfies ∥vt −v∗∥2 G ≥(1 + δ)∥vt+1 −v∗∥2 G, (5.10) for δ > 0 obeying 0 < δ ≤min ( −1 σ + (¯µ −η 2)α −3 2c1 ¯L2σα2 λmax( N+N T 2 ) + 3c2α2 ¯L2 , λmin( N T +N 2 ) −c3σ 2 − ¯L2α 2η −3 2c1 ¯L2σα2 3c2(λmax(N T N) + α2 ¯L2) ) . From this theorem, the sequence {vt} converges to v∗at a linear rate in the sense of “G- norm”. By the definition of v∗in (5.5), v∗is indeed defined by some optimal value (z∗, y∗, x∗). Roughly speaking, bigger δ means faster convergence rate. As specified in Theorem 4, δ is affected by many factors. Generally, δ decreases with respect to both λmax( N+N T 2 ) and λmax(N T N), which potentially implies that if all nodes are more “selfish”, that is, they hold more information for themselves than sending to their out-neighbors. Consequently, the infor- mation mixing speed of the network will get smaller, and thus the convergence of Normalized ExtraPush becomes slower. Therefore, we suggest a more democratic rule (such as the matrix A specified in (2.2)) for faster convergence in practice. To ensure δ > 0, it requires that the step size α lie in an appropriate interval. It should be pointed out that the condition (5.9) on α is sufficiently, not necessary, for the linear convergence of Normalized ExtraPush. Normalized ExtraPush algorithm may not diverge if a small α is set. In fact, in the next section, it can be observed that both ExtraPush and Normalized ExtraPush algorithms converge under small values of α. In general, a smaller α implies a slower rate of convergence. To prove Theorem 4, we need the following lemmas. 392 J.S. ZENG AND W.T. YIN Lemma 1. For any (z∗, y∗, x∗) ∈L∗, let u∗satisfy ( ¯A −A)u∗= y∗. Then there hold Mz∗= 0, (5.11) M T z∗= 0, (5.12) Sv∗+ α∇¯f(v∗) = 0. (5.13) Proof. By the optimality of (z∗, y∗, x∗), the followings hold: (i) ( ¯A −A)z∗= 0, and thus Mz∗= 0; (ii) D−1z∗= x∗is consensual; from the column stochasticity of both A and ¯A, it follows M T z∗= ( ¯A −A)T x∗= 0; (iii) Mu∗+ α∇fD(z∗) = D−1y∗+ αD−1∇f(x∗) = 0, with M T z∗= 0, which imply Sv∗+ α∇¯f(v∗) = 0. □ Lemma 2. For any t ∈N, it holds N(zt+1 −zt) = −M(ut+1 −u∗) −α[∇fD(zt) −∇fD(z∗)]. (5.14) This lemma follows from (5.4) and the fact Mu∗+ α∇fD(z∗) = 0 in the last lemma. Lemma 3. Let {vt} be a sequence generated by the iteration (5.8) and v∗be defined in (5.5). Then, it holds ∥vt+1 −v∗∥2 G −∥vt −v∗∥2 G ≤−∥vt+1 −vt∥2 G + ∥zt+1 −zt∥2 σ 2 NN T + α ¯ L2 2η In −∥z∗−zt+1∥2 (α¯µ−αη 2 −1 σ )In + σ 2 ∥u∗−ut+1∥2 MM T , (5.15) where σ, η > 0 are two tunable parameters. Proof. Note that ∥vt+1 −v∗∥2 G −∥vt −v∗∥2 G = −∥vt+1 −vt∥2 G + ⟨v∗−vt+1, G(vt −vt+1)⟩ + ⟨v∗−vt+1, GT (vt −vt+1)⟩. (5.16) In the following, we analyze the two inner-product terms: ⟨v∗−vt+1, G(vt −vt+1)⟩= ⟨z∗−zt+1, N T (zt −zt+1)⟩+ ⟨M T (u∗−ut+1), ut −ut+1⟩ (∵(5.11), Mz∗= 0) = ⟨z∗−zt+1, N T (zt −zt+1)⟩+ ⟨M T (u∗−ut+1), z∗−zt+1⟩ ≤σ 2 ∥zt −zt+1∥2 NN T + 1 σ ∥z∗−zt+1∥2 + σ 2 ∥u∗−ut+1∥2 MM T , (5.17) and ⟨v∗−vt+1, GT (vt −vt+1)⟩= ⟨v∗−vt+1, Svt+1 + α∇¯f(vt)⟩(∵(5.5)) = ⟨v∗−vt+1, S(vt+1 −v∗) + α(∇¯f(vt) −∇¯f(v∗))⟩(∵(5.13)) (∵S = −ST ) = α⟨v∗−vt+1, ∇¯f(vt) −∇¯f(v∗)⟩ = α⟨v∗−vt+1, ∇¯f(vt+1) −∇¯f(v∗)⟩ + α⟨v∗−vt+1, ∇¯f(vt) −∇¯f(vt+1)⟩ ≤−α¯µ∥zt+1 −z∗∥2 + αη 2 ∥z∗−zt+1∥2 + α¯L2 2η ∥zt −zt+1∥2. (5.18) ExtraPush for Convex Smooth Decentralized Optimization 393 Substituting (5.17) and (5.18) into (5.16), then we can conclude the lemma. □ Proof. (for Theorem 4) In order to establish (5.10) for some constant δ > 0, in light of Lemma 3, it is sufficient to show that the right-hand side of (5.15) is no more than −δ∥vt+1 −v∗∥2 G, which implies ∥zt+1 −z∗∥2 P + ∥zt+1 −zt∥2 Q ≥∥ut+1 −u∗∥2 R, (5.19) where P =  α¯µ −αη 2 −1 σ  In −δ N + N T 2 , Q = N T + N 2 −σ 2 NN T −α¯L2 2η In, R = σ 2 MM T + δ M + M T 2  . Establishing (5.19): Step 1. From Lemma 2, there holds ∥u∗−ut+1∥2 M T M = ∥M(u∗−ut+1)∥2 = ∥N(zt+1 −zt) + α[∇fD(zt+1) −∇fD(z∗)] + α[∇fD(zt) −∇fD(zt+1)]∥2 ≤3∥zt+1 −zt∥2 N T N + 3α2 ¯L2∥zt+1 −z∗∥2 + 3α2 ¯L2∥zt+1 −zt∥2 = ∥zt+1 −z∗∥2 3α2 ¯L2In + ∥zt+1 −zt∥2 3(N T N+α2 ¯L2In). (5.20) Note that ∥u∗−ut+1∥2 σ 2 MM T +δM σλmax(MM T ) 2 + δλmax( M+M T 2 ) ≤∥u∗−ut+1∥2 ≤∥u∗−ut+1∥2 M T M ˜λmin(M T M) . If the following conditions hold  P ⪰3( 1 2c1σ + c2δ)α2 ¯L2In, Q ⪰3( 1 2c1σ + c2δ)(N T N + α2 ¯L2In), (5.21) then (5.19) holds. To show (5.21), it is sufficient to prove ( (λmax( N+N T 2 ) + 3c2α2 ¯L2)δ ≤−1 σ + (¯µ −η 2)α −3 2c1 ¯L2σα2, 3c2(λmax(N T N) + α2 ¯L2)δ ≤λmin( N T +N 2 ) −c3σ 2 − ¯L2α 2η −3 2c1 ¯L2σα2. (5.22) Let c4 ≜(¯µ −η 2) + √∆1, c5 ≜ ¯L2 η , c6 ≜2c4c5+12c1 ¯L2 c2 4 , c7 ≜λ2 min(N T +N) 4c3 , c8 ≜a(c7 + 2) −(2 −c7) for some positive constant a ∈(0, 1), ∆3 ≜λ2 min(N T + N) −4c3c6. After reduction, we claim that if the following conditions hold 2 −c7 2 + c7 < a < 1, (5.23) ¯µ > r 6c1 1 −a2 + 1 c8 s 1 −a2 6c1  ¯L, (5.24) ¯µ  1 − s 1 −4¯L2 c8¯µ2  < η < min  ¯µ  1 + s 1 −4¯L2 c8¯µ2  , 2  ¯µ − r 6c1 1 −a2 ¯L  , (5.25) λmin(N T + N) −√∆3 2c3 < σ < λmin(N T + N) + √∆3 2c3 , (5.26) ¯µ −η 2 −√∆1 3c1 ¯L2σ < α < min  ¯µ −η 2 + √∆1 3c1 ¯L2σ , − ¯L2 2η + √∆2 3c1 ¯L2σ  , (5.27) then (5.22) holds for some positive constant δ. We then end the proof of this theorem. □ 394 J.S. ZENG AND W.T. YIN 6. Numerical Experiments In this section, we present the results of a series of numerical experiments that demonstrate the effectiveness of the proposed algorithms relative to the subgradient-push algorithm. The used network and its corresponding mixing matrix A are depicted in Fig. 2.1. 6.1. Decentralized Least Squares Consider the following decentralized least squares problem: x∗←argmin x∈Rp f(x) = n X i=1 fi(x), (6.1) where fi(x) = 1 2∥B(i)x −b(i)∥2 2, B(i) ∈Rmi×p, b(i) ∈Rmi for i = 1, . . . , n. The solution x∗is B†b, where B = Pn i=1 BT (i)B(i), b = Pn i=1 BT (i)b(i), and B† is the pseudo-inverse of B. In this experiment, we take n = 5, p = 256, and mi = 100 for i = 1, . . . , 5. For both ExtraPush and Normalized ExtraPush, we first choose an α in the hand-optimized manner (in this case, α = 0.1) and then take a smaller one like α = 0.02 to show the difference due to a smaller step size. The step size of the subgradient-push algorithm is handed optimized to αt = 0.8 √ t . The experiment results are illustrated in Fig. 6.1. As illustrated in Fig. 6.1, the performances of ExtraPush and Normalized ExtraPush are almost identical. Their linear convergence rates are affected by different step sizes; a smaller α leads to a slower rate, as one would expect. 6.2. Decentralized Huber-like Regression Instead of least squares, this experiment minimizes the Huber loss function, which is known to be robust to outliers: x∗←argmin x∈Rp f(x) = n X i=1 fi(x), (6.2) where fi(x) = Pmi j=1 Hξ(B(i)jx −b(i)j), B(i)j is jth row of matrix B(i) ∈Rmi×p and b(i)j is the jth entry of vector b(i) ∈Rmi for i = 1, . . . , n. The Huber loss function is defined as Hξ(a) =  1 2a2, for |a| ≤ξ (ℓ2 2 zone), ξ(|a| −1 2ξ), otherwise (ℓ1 zone). (6.3) Similar to the experimental setting in [13], we let ξ = 2 and set the solution x∗in the ℓ2 2 zone while initializing x0 (i) in the ℓ1 zone for all agents i. Similar to the last experiment, we teset two different step sizes, α = 0.1 and 0.02, where α = 0.1 is hand-optimized. The step size of the subgradient-push algorithm is hand optimized to αt = 5 √t+100. The numerical results are depicted in Fig. 6.2. As shown by Fig. 6.2, when α = 0.1, both ExtraPush and Normalized ExtraPush algorithms have the sublinear convergence in their first 500 iterations and then show linear convergence, as xt (i) for most i have entered the ℓ2 2 zone. While for α = 0.02, more iterations (about 2500) are needed before both algorithms start decaying linearly. ExtraPush for Convex Smooth Decentralized Optimization 395 7. Conclusion In this note, we propose a decentralized algorithm called ExtraPush, as well as its simplified version called Normalized ExtraPush, for solving distributed consensus optimization problems over directed graphs. The algorithms use column-stochastic mixing matrices. We show that Normalized ExtraPush converges at a linear rate if the objective function is smooth and strongly convex. In additional, we develop the first-order optimality conditions and provide the conver- gence of ExtraPush under the boundedness assumption. The convergence as well as the rate of convergence of ExtraPush should be justified in the future. Moreover, when applied to a directed time-varying network, the performance of the proposed algorithms will also be studied. Another line of future research is to generalize ExtraPush to handle the sum of smooth and proximable (possibly nonsmooth) functions as done in [14] that has generalized Extra this way. Acknowledgments. The work of W. Yin has been supported in part by the NSF grants DMS-1317602 and ECCS-1462398. The work of J. Zeng has been supported in part by the NSF grant 11501440. References [1] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis and M. Vetterli, Weighted Gossip: distributed averaging using non-doubly stochastic matrices, inProc. IEEE Conf. Decision Control, (2010), 1753-1757. [2] I. Chen, Fast distributed first-order methods, Masters thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012. [3] F. Chung, Laplancians and the cheeger inequality for directed graphs, Annals of Combinatorics, 2 (2005), 1-19. [4] J. Duchi, A. Agarwal and M. Wainwright, Dual averaging for distributed optimization: Conver- gence analysis and network scaling, IEEE Trans. Automat. Contr., 57 (2012) 592-606. [5] D. Jakovetic, J. Xavier and J. Moura, Fast distributed gradient methods, IEEE Trans. Automat. Contr., 59 (2014) 1131-1146. [6] D. Kempe, A. Dobra and J. Gehrke, Gossip-based computation of aggregate information, in 44th Annual IEEE Symposium on Foundations of Computer Science, (2003), 482-491. [7] F. Iutzeler, P. Ciblat and W. Hachem, Analysis of sum-weight-like algorithms for averaging in wireless sensor networks, IEEE Trans. Signal Process., 61(2013) 2802-2814. [8] I. Matei and J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies, IEEE J. Sel. Top. Signal Process., 5 (2011) 754-771. [9] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automat. Contr., 54 (2009) 48-61. [10] A. Nedic and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Trans. Automat. Contr., 60 (2015) 601-615. [11] I. Schizas, A. Ribeiro, and G. Giannakis, Consensus in ad hoc WSNs with noisy links-part I: Distributed estimation of deterministic signals, IEEE Trans. Signal Process., 56:1 (2008) 350-364. [12] W. Shi, Q. Ling, K. Yuan, G. Wu and W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Trans. Signal Process., 62 (2014) 1750-1761. [13] W. Shi, Q. Ling, G. Wu and W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization, SIAM J. Optimiz., 25:2 (2015) 944-966. [14] W. Shi, Q. Ling, G. Wu and W. Yin, A Proximal Gradient Algorithm for Decentralized Composite Optimization, IEEE Trans. Signal Process., 63:22 (2015) 6013-6023. 396 J.S. ZENG AND W.T. YIN [15] K. I. Tsianos, S. Lawlor and M. G. Rabbat, Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning, in Proc. 50th Allerton Conf. Commun., Control, Comp., (2012), 1543-1550. [16] K. I. Tsianos, S. Lawlor and M. G. Rabbat, Push-sum distributed dual averaging for convex optimization, in Proc. IEEE Conf. Decision Control, (2012), 5453-5458. [17] K. I. Tsianos, The role of the Network in Distributed Optimization Algorithms: Convergence Rates, Scalability, Communication/Computation Tradeoffs and Communication Delays, PhD the- sis, Dept. Elect. Comp. Eng., McGill Univ., Montreal, QC, Canada, 2013. [18] C. Xi and U.A. Khan, On the linear convergence of distributed optimization over directed graphs, preprint, arXiv:1510.02149, 2015. [19] K. Yuan, Q. Ling and W. Yin. On the convergence of decentralized gradient descent. SIAM J. Optimiz., 26:3 (2016), 1835-1854. ExtraPush for Convex Smooth Decentralized Optimization 397 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 10 −8 10 −6 10 −4 10 −2 10 0 10 2 Iteration t Iterative Error ExtraPush (α=0.1) Normlized ExtraPush (α=0.1) ExtraPush (α=0.02) Normlized ExtraPush (α=0.02) Subgradient−Push (hand optimized) Fig. 6.1. Experiment results for decentralized least squares regression. History of ∥xt −x∗∥2, where x∗ is the exact solution. The performances of ExtraPush and Normalized ExtraPush are very similar. 398 J.S. ZENG AND W.T. YIN 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 x 10 4 10 −6 10 −4 10 −2 10 0 10 2 Iteration t Iterative Error ExtraPush (α=0.1) Normlized ExtraPush (α=0.1) ExtraPush (α=0.02) Normlized ExtraPush (α=0.02) Subgradient−Push (hand optimized) Fig. 6.2. Experiment results for decentralized Huber regression. History of ∥xt −x∗∥2, where x∗is the exact solution. The performances of ExtraPush and Normalized ExtraPush are very similar.