Journal of Computational Mathematics, Vol. 35, No. 4, 2017, 383-398. http://www.global-sci.org/jcm doi:10.4208/jcm.1606-m2015-0452

EXTRAPUSH FOR CONVEX SMOOTH DECENTRALIZED OPTIMIZATION OVER DIRECTED NETWORKS*

Jinshan Zeng
College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
Email: jinshanzeng@jxnu.edu.cn

Wotao Yin
Department of Mathematics, University of California, Los Angeles, CA 90095, USA
Email: wotaoyin@math.ucla.edu

Abstract

In this note, we extend the algorithms Extra [13] and subgradient-push [10] to a new algorithm, ExtraPush, for consensus optimization with convex differentiable objective functions over a directed network. When the stationary distribution of the network can be computed in advance, we propose a simplified algorithm called Normalized ExtraPush. Like Extra, both ExtraPush and Normalized ExtraPush can iterate with a fixed step size. Unlike Extra, however, they can take a column-stochastic mixing matrix, which is not necessarily doubly stochastic; they therefore remove the undirected-network restriction of Extra. Subgradient-push, while it also works for directed networks, is slower on the same type of problem because it must use a sequence of diminishing step sizes.

We present a preliminary analysis of ExtraPush under a bounded-sequence assumption. For Normalized ExtraPush, we show that it naturally produces a bounded, linearly convergent sequence provided that the objective function is strongly convex. In our numerical experiments, ExtraPush and Normalized ExtraPush performed similarly well. Both are significantly faster than subgradient-push, even when we hand-optimize the step sizes for the latter.

Mathematics subject classification: 90C25, 90C30.
Key words: Decentralized optimization, Directed graph, Consensus, Non-doubly stochastic, Extra.

1.
Introduction

We consider the following consensus optimization problem defined on a directed, strongly connected network of $n$ agents:

$$\operatorname*{minimize}_{x\in\mathbb{R}^p}\; f(x) \triangleq \sum_{i=1}^n f_i(x), \qquad (1.1)$$

where each $f_i$ is a proper, closed, convex, differentiable function known only to agent $i$. The model (1.1) finds applications in decentralized averaging, learning, estimation, and control. For a stationary network with bi-directional communication, existing algorithms include (sub)gradient methods [2, 5, 8, 9, 13, 19] and primal-dual methods such as the decentralized alternating direction method of multipliers (DADMM) [11, 12].

* Received November 12, 2015 / Revised version received June 2, 2016 / Accepted June 27, 2016. arXiv:1511.02942v4 [math.OC] 30 Jan 2019.

This note focuses on a directed network (with directional communication), where research on decentralized optimization was pioneered by the works [15-17]. When communication is bi-directional, algorithms can use a symmetric, doubly stochastic mixing matrix to obtain a consensual solution; once communication becomes directional, however, the mixing matrix is generally asymmetric and only column stochastic. Consider the setting where each agent broadcasts its information to its out-neighbors, yet may not receive information from some of its neighbors. An agent can weigh its information (both its own and what it receives) so that its weights add up to 1, but it cannot ensure that its broadcasted information receives weights that add up to exactly 1 across the network. Therefore, only each column of the mixing matrix sums to 1. In this column-stochastic setting, the push-sum protocol [6] can be used to obtain a stationary distribution for the mixing matrix.
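To make the column-stochastic setting concrete, here is a minimal NumPy sketch of the push-sum protocol [6] applied to decentralized averaging. The 3-node mixing matrix below is an illustrative choice, not an example from this paper; each of its columns sums to 1.

```python
import numpy as np

# A hypothetical 3-node column-stochastic mixing matrix (each column sums to 1).
A = np.array([[0.5, 0.0, 0.3],
              [0.2, 0.6, 0.3],
              [0.3, 0.4, 0.4]])
assert np.allclose(A.sum(axis=0), 1.0)

vals = np.array([3.0, 6.0, 9.0])      # local values; their average is 6
z, w = vals.copy(), np.ones(3)
for _ in range(100):
    z, w = A @ z, A @ w               # push both the value-sum and the weight-sum
x = z / w                             # the de-biased ratio recovers the average
assert np.allclose(x, 6.0)
```

Both $\mathbf{z}^t = A^t\mathbf{z}^0$ and $\mathbf{w}^t = A^t\mathbf{1}$ converge to multiples of the same stationary vector, so their ratio cancels the bias introduced by the non-doubly-stochastic mixing; this is the mechanism ExtraPush reuses below.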
In the symmetric, doubly stochastic setting, if the objective is Lipschitz differentiable, the gradient-based algorithm Extra [13] converges at the rate $O(1/t)$, where $t$ is the iteration number. In the column-stochastic setting, the best known rate is $O(\ln t/\sqrt{t})$, from the subgradient-based algorithm [10]. We address the open question of how to take advantage of the gradient of a Lipschitz-differentiable objective. We make an attempt in this note to combine the ideas in [10, 13] and present our preliminary results. Specifically, we propose ExtraPush, which is a two-step iteration like Extra and incorporates the push-sum protocol. At each iteration, the Extra variables are approximately normalized by the current push-sum variables. When the stationary distribution of the network can be easily computed, we propose to first apply the push-sum protocol to obtain the stationary distribution and then run the two-step iteration Normalized ExtraPush. At each iteration, its running variables are normalized by the stationary distribution.

Our algorithms are essentially the same as those found in the recent work by Xi and Khan [18]. They attempted to prove convergence for a strongly convex objective function. They noticed that a certain matrix that is important to the analysis (as a part of their convergence metric) is positive semidefinite; our analysis also uses this property. However, their analysis breaks down due to inconsistent assumptions. More specifically, each function $f_i$ is assumed in [18] to be strongly convex and also to have a bounded, Lipschitz-continuous gradient. No function can satisfy these assumptions simultaneously, since the gradient of a strongly convex function is strictly monotone and unbounded. It is worth noting that our algorithm can be applied to a time-varying directed network after a straightforward modification; our convergence proof, however, would need significant changes.
The rest of this note is organized as follows. Section 2 introduces the problem setup and preliminaries. Section 3 develops ExtraPush and Normalized ExtraPush. Section 4 establishes the optimality conditions for ExtraPush and shows its convergence under a boundedness assumption. Section 5 assumes that the objective is strongly convex and shows that Normalized ExtraPush produces a bounded sequence that converges linearly. Section 6 presents our numerical simulation results. We conclude this paper in Section 7.

Notation: Let $I_n$ denote the $n\times n$ identity matrix, and $\mathbf{1}_{n\times p}\in\mathbb{R}^{n\times p}$ the matrix with all entries equal to 1; we also write $\mathbf{1}_n\in\mathbb{R}^n$ for the vector of all 1's. For any vector $x$, we let $x_i$ denote its $i$th component and $\operatorname{diag}(x)$ the diagonal matrix generated by $x$. For any matrix $X$, $X^T$ denotes its transpose, $X_{ij}$ its $(i,j)$th component, and $\|X\| \triangleq \sqrt{\langle X,X\rangle} = \sqrt{\sum_{i,j}X_{ij}^2}$ its Frobenius norm. The largest and smallest eigenvalues of a matrix $X$ are denoted $\lambda_{\max}(X)$ and $\lambda_{\min}(X)$, respectively. For any matrix $B\in\mathbb{R}^{m\times n}$, $\operatorname{null}(B) \triangleq \{x\in\mathbb{R}^n : Bx = 0\}$ is the null space of $B$; by $Z\in\operatorname{null}(B)$ we mean that each column of $Z$ lies in $\operatorname{null}(B)$. The smallest nonzero eigenvalue of a symmetric positive semidefinite matrix $X\neq 0$ is denoted $\tilde\lambda_{\min}(X)$, which is strictly positive. For any positive semidefinite matrix $G\in\mathbb{R}^{n\times n}$ (not necessarily symmetric in this paper), we use the notation $\|X\|_G^2 \triangleq \langle X, GX\rangle$ for a matrix $X\in\mathbb{R}^{n\times p}$.

2. Problem Reformulation

2.1. Network

Consider a directed network $\mathcal{G} = \{V, E\}$, where $V$ is the vertex set and $E$ is the edge set. An edge $(i,j)\in E$ represents a directed arc from node $i$ to node $j$. The sets of in-neighbors and out-neighbors of node $i$ are, respectively,

$$\mathcal{N}_i^{\mathrm{in}} \triangleq \{j : (j,i)\in E\}\cup\{i\}, \qquad \mathcal{N}_i^{\mathrm{out}} \triangleq \{j : (i,j)\in E\}\cup\{i\}.$$
Let $d_i \triangleq |\mathcal{N}_i^{\mathrm{out}}|$ be the out-degree of node $i$. In $\mathcal{G}$, each node $i$ can only send information to its out-neighbors, not vice versa. To illustrate a mixing matrix for a directed network, consider $A\in\mathbb{R}^{n\times n}$ where

$$A_{ij} \begin{cases} > 0, & \text{if } j\in\mathcal{N}_i^{\mathrm{in}},\\ = 0, & \text{otherwise.}\end{cases} \qquad (2.1)$$

The entries $A_{ij}$ satisfy, for each node $j$, $\sum_{i\in V} A_{ij} = 1$. An example is the following mixing matrix:

$$A_{ij} = \begin{cases} 1/d_j, & \text{if } j\in\mathcal{N}_i^{\mathrm{in}},\\ 0, & \text{otherwise,}\end{cases} \qquad i,j = 1,\dots,n, \qquad (2.2)$$

which is used in the subgradient-push method [10]. See Fig. 2.1 for a directed graph $\mathcal{G}$ and an example of its mixing matrix $A$. The matrix $A$ is column stochastic and, in general, asymmetric.

$$A = \begin{pmatrix} 1/4 & 1/4 & 0 & 1/2 & 0\\ 1/4 & 1/4 & 0 & 0 & 1/3\\ 1/4 & 0 & 1/2 & 0 & 1/3\\ 0 & 1/4 & 0 & 1/2 & 0\\ 1/4 & 1/4 & 1/2 & 0 & 1/3 \end{pmatrix}$$

Fig. 2.1. A directed graph $\mathcal{G}$ on five nodes (left; drawing not reproduced) and its mixing matrix $A$ (right).

Assumption 1. The graph $\mathcal{G}$ is strongly connected.

Property 1. Under Assumption 1, the following hold (parts (i) and (iv) are results in [10, Corollary 2]):

(i) Let $A^t = \underbrace{A\times A\cdots A}_{t\ \text{factors}}$ for any $t\in\mathbb{N}$. Then

$$A^t \to \phi\mathbf{1}_n^T \quad\text{geometrically fast as } t\to\infty, \qquad (2.3)$$

for some stationary distribution vector $\phi$, i.e., $\phi_i \ge 0$ and $\sum_{i=1}^n \phi_i = 1$.

(ii) $\operatorname{null}(I_n - \phi\mathbf{1}_n^T) = \operatorname{null}(I_n - A)$.

(iii) $A\phi = \phi$.

(iv) The quantity $\xi \triangleq \inf_t \min_{1\le i\le n}(A^t\mathbf{1}_n)_i \ge \frac{1}{n^n} > 0$.

Proof. Part (iii) is obvious from (ii), since $\phi\in\operatorname{null}(I_n - \phi\mathbf{1}_n^T)$ by $\sum_{i=1}^n\phi_i = 1$. Next, we show part (ii). First, let $z\in\operatorname{null}(I_n - \phi\mathbf{1}_n^T)$, which means $z = \phi\mathbf{1}_n^T z$ and thus $Az = A\phi\mathbf{1}_n^T z$. By (2.3), it holds that $A\phi\mathbf{1}_n^T = \phi\mathbf{1}_n^T$. Therefore, $Az = \phi\mathbf{1}_n^T z = z$, and hence $\operatorname{null}(I_n - \phi\mathbf{1}_n^T)\subseteq\operatorname{null}(I_n - A)$. On the other hand, any $z\in\operatorname{null}(I_n - A)$, equivalently $z = Az$, obeys $z = A^t z$ for any $t\ge 1$. Letting $t\to\infty$, it holds that $z = \phi\mathbf{1}_n^T z$, that is, $z\in\operatorname{null}(I_n - \phi\mathbf{1}_n^T)$. Therefore, part (ii) holds. □

2.2.
Problem in Matrix Notation

Let $x_{(i)}\in\mathbb{R}^p$ denote the local copy of $x$ at node $i$, and $x_{(i)}^t$ its value at the $t$th iteration. Throughout the note, we use the following equivalent form of problem (1.1) in terms of the local copies:

$$\operatorname*{minimize}_{\mathbf{x}}\; \mathbf{1}_n^T\mathbf{f}(\mathbf{x}) \triangleq \sum_{i=1}^n f_i(x_{(i)}), \quad \text{subject to } x_{(i)} = x_{(j)},\ \forall (i,j)\in E, \qquad (2.4)$$

where $\mathbf{1}_n\in\mathbb{R}^n$ denotes the vector with all entries equal to 1, and

$$\mathbf{x} \triangleq \begin{pmatrix} x_{(1)}^T \\ x_{(2)}^T \\ \vdots \\ x_{(n)}^T \end{pmatrix} \in\mathbb{R}^{n\times p}, \qquad \mathbf{f}(\mathbf{x}) \triangleq \begin{pmatrix} f_1(x_{(1)}) \\ f_2(x_{(2)}) \\ \vdots \\ f_n(x_{(n)}) \end{pmatrix} \in\mathbb{R}^n.$$

In addition, the gradient of $\mathbf{f}(\mathbf{x})$ is

$$\nabla\mathbf{f}(\mathbf{x}) \triangleq \begin{pmatrix} \nabla f_1(x_{(1)})^T \\ \nabla f_2(x_{(2)})^T \\ \vdots \\ \nabla f_n(x_{(n)})^T \end{pmatrix} \in\mathbb{R}^{n\times p}.$$

The $i$th rows of the matrices $\mathbf{x}$ and $\nabla\mathbf{f}(\mathbf{x})$, and the $i$th entry of the vector $\mathbf{f}(\mathbf{x})$, correspond to agent $i$. For simplicity, one can take $p = 1$ throughout this paper. For a vector $\bar{x}\in\mathbb{R}^n$, let $\bar{x}_{\mathrm{ave}} \triangleq \frac{1}{n}\sum_{i=1}^n \bar{x}_i \in\mathbb{R}$. A special case of (2.4) is the well-known average consensus problem, where $f_i(x_{(i)}) = \frac{1}{2}(x_{(i)} - \bar{x}_i)^2$ for each node $i$ and the solution is $x_{(i)} = \bar{x}_{\mathrm{ave}}$ for all $i$.

3. Proposed Algorithms

3.1. Reviews of Extra and Subgradient-Push

Extra [13] is a "two-step" iterative algorithm for solving (2.4) over an undirected network. Let $W\in\mathbb{R}^{n\times n}$ be a symmetric, doubly stochastic mixing matrix, and let $\bar{W} \triangleq \frac{I_n + W}{2}$. The Extra iteration is

$$\mathbf{x}^{t+2} = (I_n + W)\mathbf{x}^{t+1} - \bar{W}\mathbf{x}^t - \alpha(\nabla\mathbf{f}(\mathbf{x}^{t+1}) - \nabla\mathbf{f}(\mathbf{x}^t)), \qquad t = 0,1,\dots, \qquad (3.1)$$

which starts with $\mathbf{x}^1 = \mathbf{x}^0 - \alpha\nabla\mathbf{f}(\mathbf{x}^0)$ from any $\mathbf{x}^0\in\mathbb{R}^{n\times p}$ and uses a properly bounded step size $\alpha > 0$. Extra converges at a rate $o(1/t)$, measured by the best running violation of the first-order optimality condition, provided that $f$ is Lipschitz differentiable. The rate improves to linear if $f$ is also (restricted) strongly convex.
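As a quick illustration of (3.1), the sketch below runs Extra on the average-consensus special case of (2.4), for which $\nabla\mathbf{f}(\mathbf{x}) = \mathbf{x} - \bar{x}$. The 3-node symmetric doubly stochastic matrix $W$, the data, and the step size $\alpha = 0.3$ are hypothetical choices for illustration, not taken from [13].

```python
import numpy as np

# Hypothetical symmetric doubly stochastic mixing matrix for 3 nodes.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
Wbar = (np.eye(3) + W) / 2
xbar = np.array([1.0, 2.0, 6.0])    # local data; the consensus value is 3.0
grad = lambda x: x - xbar           # gradient of the consensus objective

alpha = 0.3
x_prev = np.zeros(3)                       # x^0
x = x_prev - alpha * grad(x_prev)          # x^1 = x^0 - alpha * grad f(x^0)
for _ in range(300):                       # Extra iteration (3.1)
    x, x_prev = ((np.eye(3) + W) @ x - Wbar @ x_prev
                 - alpha * (grad(x) - grad(x_prev))), x
assert np.allclose(x, 3.0, atol=1e-8)      # every node reaches the average
```

Note that with a fixed $\alpha$ the iteration reaches the exact average, which is the "exactness" property that ExtraPush carries over to directed networks.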
The subgradient-push algorithm [10] was proposed to solve the decentralized optimization problem (1.1) over a time-varying directed graph. It is a combination of the subgradient method and the push-sum protocol [1, 6, 7]. Let $A(t)$ be the mixing matrix at the $t$th iteration, defined as in (2.2) for a time-varying directed network. The subgradient-push iteration is

$$\begin{cases} \mathbf{z}^{t+1} = A(t)\mathbf{z}^t - \alpha_t\nabla\mathbf{f}(\mathbf{x}^t),\\ \mathbf{w}^{t+1} = A(t)\mathbf{w}^t,\\ \mathbf{x}^{t+1} = \operatorname{diag}(\mathbf{w}^{t+1})^{-1}\mathbf{z}^{t+1}, \end{cases} \qquad (3.2)$$

where $\alpha_t > 0$ is the step size at the $t$th iteration, which decays as follows: $\sum_{t=1}^\infty \alpha_t = \infty$, $\sum_{t=1}^\infty \alpha_t^2 < \infty$, and $\alpha_t \le \alpha_s$ for all $t > s \ge 1$. It is shown in [10] that the convergence rate of the subgradient-push algorithm is $O(\ln t/\sqrt{t})$.

3.2. Proposed: ExtraPush

ExtraPush combines the above two algorithms. Specifically, set an arbitrary $\mathbf{z}^0$ and $\mathbf{w}^0 = \mathbf{1}_n$; set $\mathbf{x}^0 = \mathbf{z}^0$. For $t = 1$, set

$$\mathbf{w}^1 = A\mathbf{w}^0, \qquad \mathbf{z}^1 = A\mathbf{z}^0 - \alpha\nabla\mathbf{f}(\mathbf{x}^0), \qquad \mathbf{x}^1 = \operatorname{diag}(\mathbf{w}^1)^{-1}\mathbf{z}^1.$$

Letting $\bar{A} \triangleq \frac{I_n + A}{2}$, for $t = 2, 3, \dots$, perform

$$\begin{cases} \mathbf{z}^t = (A + I_n)\mathbf{z}^{t-1} - \bar{A}\mathbf{z}^{t-2} - \alpha(\nabla\mathbf{f}(\mathbf{x}^{t-1}) - \nabla\mathbf{f}(\mathbf{x}^{t-2})),\\ \mathbf{w}^t = A\mathbf{w}^{t-1},\\ \mathbf{x}^t = \operatorname{diag}(\mathbf{w}^t)^{-1}\mathbf{z}^t. \end{cases} \qquad (3.3)$$

By the structure of $A$, each node $i$ broadcasts its $z_{(i)}$ to its out-neighbors at each ExtraPush iteration. The step size $\alpha > 0$ needs to be properly set. The iteration (3.3) of ExtraPush can be implemented at each agent $i$ as follows:

$$\begin{cases} z_{(i)}^t = z_{(i)}^{t-1} + \sum_{j\in\mathcal{N}_i^{\mathrm{in}}} A_{ij} z_{(j)}^{t-1} - \sum_{j\in\mathcal{N}_i^{\mathrm{in}}} \bar{A}_{ij} z_{(j)}^{t-2} - \alpha(\nabla f_i(x_{(i)}^{t-1}) - \nabla f_i(x_{(i)}^{t-2})),\\ w_i^t = \sum_{j\in\mathcal{N}_i^{\mathrm{in}}} A_{ij} w_j^{t-1},\\ x_{(i)}^t = \dfrac{z_{(i)}^t}{w_i^t}, \end{cases}$$

where $\bar{A}_{ij}$ is the $(i,j)$th component of $\bar{A}$ and $w_i^t$ is the $i$th component of $\mathbf{w}^t$, for all $i, j$.

3.3. Proposed: Normalized ExtraPush

Normalized ExtraPush first computes the stationary distribution $\phi$ of $A$ and saves each $\phi_i$ at node $i$. Next, in the main iteration, the $\mathbf{w}$-step of (3.3) is removed, and $n\cdot\phi$ is used instead of $\mathbf{w}^t$ to obtain $\mathbf{x}^t$.
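The following NumPy sketch runs ExtraPush (3.3) on the average-consensus special case of (2.4) with the Fig. 2.1 mixing matrix, and then checks that the push-sum weights $\mathbf{w}^t$ approach $n\phi$, which is exactly the quantity Normalized ExtraPush precomputes. The data vector and the iteration count are illustrative; $\alpha = 0.1$ follows the hand-tuned value used in Section 6 and is not guaranteed by the theory for this instance.

```python
import numpy as np

# Column-stochastic mixing matrix of Fig. 2.1.
A = np.array([[1/4, 1/4, 0,   1/2, 0  ],
              [1/4, 1/4, 0,   0,   1/3],
              [1/4, 0,   1/2, 0,   1/3],
              [0,   1/4, 0,   1/2, 0  ],
              [1/4, 1/4, 1/2, 0,   1/3]])
n = 5
Abar = (np.eye(n) + A) / 2
xbar = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # local data; the optimum is 3.0
grad = lambda x: x - xbar

alpha = 0.1
z_pp = xbar.copy()                       # z^0 (arbitrary); x^0 = z^0 since w^0 = 1
w = A @ np.ones(n)                       # w^1 = A w^0
z_p = A @ z_pp - alpha * grad(z_pp)      # z^1
x_pp, x_p = z_pp, z_p / w                # x^0, x^1
for _ in range(5000):                    # iteration (3.3) for t >= 2
    z = (A + np.eye(n)) @ z_p - Abar @ z_pp - alpha * (grad(x_p) - grad(x_pp))
    w = A @ w
    z_pp, z_p = z_p, z
    x_pp, x_p = x_p, z / w
assert np.allclose(x_p, 3.0, atol=1e-5)  # consensus at the network average

# The push-sum weights track n*phi, the value Normalized ExtraPush substitutes.
phi = np.linalg.matrix_power(A, 200)[:, 0]
assert np.allclose(w, n * phi)
```

Since $\mathbf{w}^t \to n\phi$, replacing $\mathbf{w}^t$ by the precomputed $n\phi$ gives the Normalized ExtraPush iteration (3.4) introduced next.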
As such, the main iteration of Normalized ExtraPush simplifies (3.3). Letting

$$D \triangleq n\operatorname{diag}(\phi),$$

the iteration of Normalized ExtraPush proceeds as follows: set an arbitrary $\mathbf{z}^0$ and $\mathbf{x}^0 = D^{-1}\mathbf{z}^0$; for $t = 1$, set $\mathbf{z}^1 = A\mathbf{z}^0 - \alpha\nabla\mathbf{f}(\mathbf{x}^0)$ and $\mathbf{x}^1 = D^{-1}\mathbf{z}^1$. For $t = 2, 3, \dots$, perform

$$\begin{cases} \mathbf{z}^t = (A + I_n)\mathbf{z}^{t-1} - \bar{A}\mathbf{z}^{t-2} - \alpha(\nabla\mathbf{f}(\mathbf{x}^{t-1}) - \nabla\mathbf{f}(\mathbf{x}^{t-2})),\\ \mathbf{x}^t = D^{-1}\mathbf{z}^t. \end{cases} \qquad (3.4)$$

At each agent $i$, the iteration (3.4) of Normalized ExtraPush can be implemented as follows:

$$\begin{cases} z_{(i)}^t = z_{(i)}^{t-1} + \sum_{j\in\mathcal{N}_i^{\mathrm{in}}} A_{ij} z_{(j)}^{t-1} - \sum_{j\in\mathcal{N}_i^{\mathrm{in}}} \bar{A}_{ij} z_{(j)}^{t-2} - \alpha(\nabla f_i(x_{(i)}^{t-1}) - \nabla f_i(x_{(i)}^{t-2})),\\ x_{(i)}^t = \dfrac{z_{(i)}^t}{n\phi_i}. \end{cases}$$

Next, we present two equivalent forms of Normalized ExtraPush. Letting $\mathbf{f}_\phi(\mathbf{z}) \triangleq D\mathbf{f}(D^{-1}\mathbf{z})$, we have $\nabla\mathbf{f}_\phi(\mathbf{z}) = \nabla\mathbf{f}(D^{-1}\mathbf{z})$. Substituting the $\mathbf{x}$-step of (3.4) into its $\mathbf{z}$-step yields the single-variable iteration

$$\mathbf{z}^t = (A + I_n)\mathbf{z}^{t-1} - \bar{A}\mathbf{z}^{t-2} - \alpha(\nabla\mathbf{f}_\phi(\mathbf{z}^{t-1}) - \nabla\mathbf{f}_\phi(\mathbf{z}^{t-2})). \qquad (3.5)$$

Upon stopping, one shall return $\mathbf{x}^t = D^{-1}\mathbf{z}^t$. The iteration (3.5) is nearly identical to the Extra iteration (3.1), except that (3.1) must use a doubly stochastic matrix. Letting $A_\phi \triangleq D^{-1}AD$ and $\bar{A}_\phi \triangleq \frac{1}{2}(I_n + A_\phi)$, which are row stochastic matrices, gives another equivalent form of Normalized ExtraPush:

$$\mathbf{x}^t = (A_\phi + I_n)\mathbf{x}^{t-1} - \bar{A}_\phi\mathbf{x}^{t-2} - \alpha D^{-1}(\nabla\mathbf{f}(\mathbf{x}^{t-1}) - \nabla\mathbf{f}(\mathbf{x}^{t-2})), \qquad (3.6)$$

which, compared to the Extra iteration (3.1), has the extra diagonal matrix $D^{-1}$. Indeed, this iteration generalizes Extra to the row-stochastic matrices $A_\phi$ and $\bar{A}_\phi$.

4. Preliminary Analysis of ExtraPush

In this section, we first develop the first-order optimality conditions for problem (2.4) and then establish the convergence of ExtraPush under a boundedness assumption.

Theorem 1 (first-order optimality conditions). Suppose that the graph $\mathcal{G}$ is strongly connected.
Then $\mathbf{x}^*$ is consensual with $x_{(1)}^* = x_{(2)}^* = \cdots = x_{(n)}^*$ an optimal solution of (1.1) if and only if, for some $\alpha > 0$, there exist $\mathbf{z}^*\in\operatorname{null}(I_n - A)$ and $\mathbf{y}^*\in\operatorname{null}(\mathbf{1}_n^T)$ such that the following conditions hold:

$$\begin{cases} \mathbf{y}^* + \alpha\nabla\mathbf{f}(\mathbf{x}^*) = \mathbf{0},\\ \mathbf{x}^* = D^{-1}\mathbf{z}^*. \end{cases} \qquad (4.1)$$

(We let $\mathcal{L}^*$ denote the set of triples $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)$ satisfying the above conditions.)

Proof. Assume that $\mathbf{x}^*$ is consensual and $x_{(1)}^* = \cdots = x_{(n)}^*$ is optimal. Let $\mathbf{z}^* = n\operatorname{diag}(\phi)\mathbf{x}^* = n(\phi x_{(1)}^{*T})$. Then $\phi\mathbf{1}_n^T\mathbf{z}^* = \phi\mathbf{1}_n^T n\phi x_{(1)}^{*T} = n\phi x_{(1)}^{*T} = \mathbf{z}^*$, which implies that $\mathbf{z}^*\in\operatorname{null}(I_n - \phi\mathbf{1}_n^T)$. By Property 1 (ii), it follows that $\mathbf{z}^*\in\operatorname{null}(I_n - A)$. Moreover, letting $\mathbf{y}^* = -\alpha\nabla\mathbf{f}(\mathbf{x}^*)$, it holds that $\mathbf{1}_n^T\mathbf{y}^* = -\alpha\mathbf{1}_n^T\nabla\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$, that is, $\mathbf{y}^*\in\operatorname{null}(\mathbf{1}_n^T)$.

On the other hand, assume that (4.1) holds. By Property 1 (ii), it follows that $\mathbf{z}^* = \phi\mathbf{1}_n^T\mathbf{z}^*$. Plugging in $\mathbf{x}^* = D^{-1}\mathbf{z}^*$ gives $\mathbf{x}^* = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\mathbf{z}^*$, which implies that $\mathbf{x}^*$ is consensual. Moreover, from $\mathbf{y}^* + \alpha\nabla\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$ and $\mathbf{y}^*\in\operatorname{null}(\mathbf{1}_n^T)$, it holds that $\mathbf{1}_n^T\nabla\mathbf{f}(\mathbf{x}^*) = -\frac{1}{\alpha}\mathbf{1}_n^T\mathbf{y}^* = \mathbf{0}$, which implies that $\mathbf{x}^*$ is optimal. □

Introducing the sequence

$$\mathbf{y}^t \triangleq \sum_{k=0}^t (\bar{A} - A)\mathbf{z}^k, \qquad (4.2)$$

the iteration (3.3) of ExtraPush can be rewritten as

$$\begin{cases} \bar{A}\mathbf{z}^{t+1} = \bar{A}\mathbf{z}^t - \alpha\nabla\mathbf{f}(\mathbf{x}^t) - \mathbf{y}^{t+1},\\ \mathbf{y}^{t+1} = \mathbf{y}^t + (\bar{A} - A)\mathbf{z}^{t+1},\\ \mathbf{w}^{t+1} = A\mathbf{w}^t,\\ \mathbf{x}^{t+1} = \operatorname{diag}(\mathbf{w}^{t+1})^{-1}\mathbf{z}^{t+1}. \end{cases} \qquad (4.3)$$

Theorem 2. Suppose that the sequence $\{\mathbf{x}^t\}$ generated by ExtraPush (3.3) and the sequence $\{\mathbf{y}^t\}$ defined in (4.2) are bounded. Then any limit point of $\{(\mathbf{z}^t, \mathbf{y}^t, \mathbf{x}^t)\}$, denoted by $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)$, satisfies the optimality conditions (4.1).

Proof. By Property 1, $\{\mathbf{w}^t\}$ is bounded. By the last update of (4.3) and the boundedness of both $\{\mathbf{x}^t\}$ and $\{\mathbf{w}^t\}$, $\{\mathbf{z}^t\}$ is bounded. Hence, there exists a convergent subsequence $\{(\mathbf{z}, \mathbf{y}, \mathbf{w}, \mathbf{x})^{t_j}\}_{j=1}^\infty$.
Let $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{w}^*, \mathbf{x}^*)$ be its limit. By (2.3), we know that $\mathbf{w}^* = n\phi$ and thus that $\mathbf{x}^* = D^{-1}\mathbf{z}^*$. Letting $t\to\infty$ in the second equation of (4.3) gives $\mathbf{z}^* = A\mathbf{z}^*$, or equivalently $\mathbf{z}^*\in\operatorname{null}(I_n - A)$. Similarly, letting $t\to\infty$ in the first equation of (4.3) yields $\mathbf{y}^* + \alpha\nabla\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$. Moreover, from the definition (4.2) of $\mathbf{y}^t$ and the fact that both $A$ and $\bar{A}$ are column stochastic, it follows that $\mathbf{1}_n^T\mathbf{y}^* = \mathbf{0}$ and $\mathbf{1}_n^T\nabla\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$. Therefore, $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)$ satisfies the optimality conditions (4.1). □

5. Convergence of Normalized ExtraPush

In this section, we show the linear convergence of Normalized ExtraPush under smoothness and strong convexity assumptions on the objective function. Similarly to (4.3), introducing the sequence $\mathbf{y}^t = \sum_{k=0}^t (\bar{A} - A)\mathbf{z}^k$, the iteration (3.4) of Normalized ExtraPush implies

$$\begin{cases} \bar{A}\mathbf{z}^{t+1} = \bar{A}\mathbf{z}^t - \alpha\nabla\mathbf{f}(\mathbf{x}^t) - \mathbf{y}^{t+1},\\ \mathbf{y}^{t+1} = \mathbf{y}^t + (\bar{A} - A)\mathbf{z}^{t+1},\\ \mathbf{x}^{t+1} = D^{-1}\mathbf{z}^{t+1}. \end{cases} \qquad (5.1)$$

Theorem 3. Suppose that the sequence $\{\mathbf{x}^t\}$ generated by Normalized ExtraPush (3.4) and the sequence $\{\mathbf{y}^t\}$ are bounded. Then any limit point of $\{(\mathbf{z}^t, \mathbf{y}^t, \mathbf{x}^t)\}_{t=0}^\infty$, denoted by $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)$, satisfies the first-order optimality conditions (4.1).

The proof is very similar to that of Theorem 2; one only needs to replace the sequence $\{\mathbf{w}^t\}$ with its limit $n\phi$, so we omit it here. Theorem 3 shows that, under the boundedness assumption, Normalized ExtraPush has a subsequence converging to an optimal solution of the considered optimization problem. To obtain the linear convergence of Normalized ExtraPush, we need the following additional assumptions.

Assumption 2 (existence of solution). Let $\mathcal{X}^*$ be the optimal solution set of problem (1.1), and assume that $\mathcal{X}^*$ is nonempty.

Assumption 3.
For each agent $i$, its objective function $f_i$ satisfies the following:

(i) (Lipschitz differentiability) $f_i$ is differentiable, and its gradient $\nabla f_i$ is $L_i$-Lipschitz continuous, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i\|x - y\|$ for all $x, y\in\mathbb{R}^p$;

(ii) (quasi-strong convexity) $f_i$ is quasi-strongly convex: there exists a positive constant $S_i$ such that $S_i\|x^* - y\|^2 \le \langle\nabla f_i(x^*) - \nabla f_i(y), x^* - y\rangle$ for any $y\in\mathbb{R}^p$ and some optimal point $x^*\in\mathcal{X}^*$.

Under Assumption 3, for any $\mathbf{x}, \mathbf{y}\in\mathbb{R}^{n\times p}$ and $\mathbf{x}^* \equiv \mathbf{1}_n(x^*)^T$, there hold

$$\|\nabla\mathbf{f}(\mathbf{x}) - \nabla\mathbf{f}(\mathbf{y})\| \le L_f\|\mathbf{x} - \mathbf{y}\|, \qquad (5.2)$$

$$S_f\|\mathbf{x}^* - \mathbf{y}\|^2 \le \langle\nabla\mathbf{f}(\mathbf{x}^*) - \nabla\mathbf{f}(\mathbf{y}), \mathbf{x}^* - \mathbf{y}\rangle, \qquad (5.3)$$

where the constants $L_f \triangleq \max_i L_i$ and $S_f \triangleq \min_i S_i$.

Assumption 4 (positive definiteness). $D^{-1}\bar{A} + \bar{A}^TD^{-1} \succ 0$.

Noticing that

$$D^{-1}\bar{A} + \bar{A}^TD^{-1} = D^{-1/2}\big(D^{-1/2}\bar{A}D^{1/2} + D^{1/2}\bar{A}^TD^{-1/2}\big)D^{-1/2},$$

we can guarantee the positive definiteness of $D^{-1}\bar{A} + \bar{A}^TD^{-1}$ by ensuring that the matrix $\bar{A} + \bar{A}^T$ is positive definite. Note that $\bar{A}_{jj} > \sum_{i\neq j}\bar{A}_{ij}$ for each $j$, i.e., $\bar{A}$ is strictly column-diagonally dominant. To ensure the positive definiteness of $\bar{A} + \bar{A}^T$, each node $j$ can be "selfish" and take a sufficiently large $A_{jj}$.

Before presenting the main result, we introduce the following notation. For each $t$, introducing $\mathbf{u}^t = \sum_{k=0}^t\mathbf{z}^k$, the Normalized ExtraPush iteration (3.4) reduces to

$$\begin{cases} \bar{A}\mathbf{z}^{t+1} = \bar{A}\mathbf{z}^t - \alpha\nabla\mathbf{f}(\mathbf{x}^t) - (\bar{A} - A)\mathbf{u}^{t+1},\\ \mathbf{u}^{t+1} = \mathbf{u}^t + \mathbf{z}^{t+1},\\ \mathbf{x}^{t+1} = D^{-1}\mathbf{z}^{t+1}. \end{cases} \qquad (5.4)$$

Let $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)\in\mathcal{L}^*$, where $\mathbf{x}^*$ is as specified in (5.3), and let $\mathbf{u}^*$ be any matrix satisfying $(\bar{A} - A)\mathbf{u}^* = \mathbf{y}^*$. For simplicity, we introduce

$$\mathbf{v}^t = \begin{pmatrix}\mathbf{z}^t\\ \mathbf{u}^t\end{pmatrix}, \quad \mathbf{v}^* = \begin{pmatrix}\mathbf{z}^*\\ \mathbf{u}^*\end{pmatrix}, \quad G = \begin{pmatrix}N^T & 0\\ 0 & M\end{pmatrix}, \quad S = \begin{pmatrix}0 & M\\ -M^T & 0\end{pmatrix}, \qquad (5.5)$$

where $N = D^{-1}\bar{A}$ and $M = D^{-1}(\bar{A} - A)$. Let $\mathbf{f}_D(\mathbf{z}) \triangleq \mathbf{f}(D^{-1}\mathbf{z})$ and $\bar{\mathbf{f}}(\mathbf{v}) \triangleq \mathbf{f}_D(\mathbf{z})$. Then $\nabla\bar{\mathbf{f}}(\mathbf{v}) = [\nabla\mathbf{f}_D(\mathbf{z});\, 0]$.
By (5.2) and (5.3), there hold

$$\|\nabla\bar{\mathbf{f}}(\mathbf{v}_1) - \nabla\bar{\mathbf{f}}(\mathbf{v}_2)\| = \|\nabla\mathbf{f}_D(\mathbf{z}_1) - \nabla\mathbf{f}_D(\mathbf{z}_2)\| \le \bar{L}\|\mathbf{z}_1 - \mathbf{z}_2\|, \qquad (5.6)$$

$$\bar{\mu}\|\mathbf{z}^* - \mathbf{z}\|^2 \le \langle\nabla\mathbf{f}_D(\mathbf{z}^*) - \nabla\mathbf{f}_D(\mathbf{z}), \mathbf{z}^* - \mathbf{z}\rangle = \langle\nabla\bar{\mathbf{f}}(\mathbf{v}^*) - \nabla\bar{\mathbf{f}}(\mathbf{v}), \mathbf{v}^* - \mathbf{v}\rangle, \qquad (5.7)$$

where $\bar{L} \triangleq \frac{L_f}{\sigma_{\min}^2(D)}$ and $\bar{\mu} \triangleq \frac{S_f}{\sigma_{\max}^2(D)}$. By (5.4) and (5.5), the Normalized ExtraPush iteration (3.4) implies

$$G^T(\mathbf{v}^{t+1} - \mathbf{v}^t) = -S\mathbf{v}^{t+1} - \alpha\nabla\bar{\mathbf{f}}(\mathbf{v}^t). \qquad (5.8)$$

Next, we show that $G + G^T$ is positive semidefinite. Since Assumption 4 states exactly that $N + N^T$ is positive definite, it suffices to show that $M + M^T$ is positive semidefinite. Note that

$$M + M^T = \frac{D^{-1}(I_n - A)}{2} + \frac{(I_n - A)^TD^{-1}}{2} = D^{-1/2}\Big(I_n - \frac{D^{1/2}A^TD^{-1/2} + D^{-1/2}AD^{1/2}}{2}\Big)D^{-1/2} \triangleq D^{-1/2}\Lambda D^{-1/2}.$$

By Property 1 (iii), $\phi^T$ is a left eigenvector of $A^T$ corresponding to the eigenvalue 1, and thus $\Lambda$ is the Laplacian of a certain directed graph $\mathcal{G}'$ with $A^T$ as its corresponding transition probability matrix [3]. It follows that $0 = \lambda_1(\Lambda) \le \lambda_2(\Lambda) \le \cdots \le \lambda_n(\Lambda)$, where $\lambda_i(\Lambda)$ denotes the $i$th eigenvalue of $\Lambda$. Therefore, $M + M^T$ is positive semidefinite, and the following property holds:

$$\|\mathbf{x}\|_G^2 = \tfrac{1}{2}\|\mathbf{x}\|_{G+G^T}^2 \ge 0, \qquad \forall \mathbf{x}.$$

Let

$$c_1 = \frac{\lambda_{\max}(MM^T)}{\tilde\lambda_{\min}(M^TM)}, \qquad c_2 = \frac{\lambda_{\max}\big(\frac{M+M^T}{2}\big)}{\tilde\lambda_{\min}(M^TM)}, \qquad c_3 = \lambda_{\max}(NN^T) + 3c_1\lambda_{\max}(N^TN),$$

and let

$$\Delta_1 = \Big(\bar{\mu} - \frac{\eta}{2}\Big)^2 - 6c_1\bar{L}^2, \qquad \Delta_2 = \frac{\bar{L}^4}{4\eta^2} - 3c_1\bar{L}^2\sigma\big(c_3\sigma - \lambda_{\min}(N^T + N)\big)$$

for some appropriate tunable parameters $\eta$ and $\sigma$. We can now state our main result.

Theorem 4.
Under Assumptions 1-4, if the step size $\alpha$ satisfies

$$\frac{\bar{\mu} - \frac{\eta}{2} - \sqrt{\Delta_1}}{3c_1\bar{L}^2\sigma} < \alpha < \min\Bigg\{\frac{\bar{\mu} - \frac{\eta}{2} + \sqrt{\Delta_1}}{3c_1\bar{L}^2\sigma},\; \frac{-\frac{\bar{L}^2}{2\eta} + \sqrt{\Delta_2}}{3c_1\bar{L}^2\sigma}\Bigg\} \qquad (5.9)$$

for some appropriate $\eta$ and $\sigma$ as specified in (5.25) and (5.26), respectively, then the sequence $\{\mathbf{v}^t\}$ defined in (5.5) satisfies

$$\|\mathbf{v}^t - \mathbf{v}^*\|_G^2 \ge (1 + \delta)\|\mathbf{v}^{t+1} - \mathbf{v}^*\|_G^2 \qquad (5.10)$$

for any $\delta > 0$ obeying

$$0 < \delta \le \min\Bigg\{\frac{-\frac{1}{\sigma} + \big(\bar{\mu} - \frac{\eta}{2}\big)\alpha - \frac{3}{2}c_1\bar{L}^2\sigma\alpha^2}{\lambda_{\max}\big(\frac{N+N^T}{2}\big) + 3c_2\alpha^2\bar{L}^2},\; \frac{\lambda_{\min}\big(\frac{N^T+N}{2}\big) - \frac{c_3\sigma}{2} - \frac{\bar{L}^2\alpha}{2\eta} - \frac{3}{2}c_1\bar{L}^2\sigma\alpha^2}{3c_2\big(\lambda_{\max}(N^TN) + \alpha^2\bar{L}^2\big)}\Bigg\}.$$

By this theorem, the sequence $\{\mathbf{v}^t\}$ converges to $\mathbf{v}^*$ at a linear rate in the sense of the "$G$-norm". By its definition in (5.5), $\mathbf{v}^*$ is determined by an optimal triple $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)$. Roughly speaking, a bigger $\delta$ means a faster convergence rate. As specified in Theorem 4, $\delta$ is affected by many factors. Generally, $\delta$ decreases as $\lambda_{\max}\big(\frac{N+N^T}{2}\big)$ and $\lambda_{\max}(N^TN)$ increase. This suggests that if all nodes are more "selfish", that is, if each node keeps more weight on its own information than it sends to its out-neighbors, then the information-mixing speed of the network decreases and the convergence of Normalized ExtraPush becomes slower. Therefore, we suggest a more democratic rule (such as the matrix $A$ specified in (2.2)) for faster convergence in practice.

To ensure $\delta > 0$, the step size $\alpha$ must lie in an appropriate interval. It should be pointed out that the condition (5.9) on $\alpha$ is sufficient, not necessary, for the linear convergence of Normalized ExtraPush; the algorithm need not diverge when a smaller $\alpha$ is used. In fact, as observed in the next section, both ExtraPush and Normalized ExtraPush converge under small values of $\alpha$. In general, a smaller $\alpha$ implies a slower rate of convergence.

To prove Theorem 4, we need the following lemmas.

Lemma 1.
For any $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)\in\mathcal{L}^*$, let $\mathbf{u}^*$ satisfy $(\bar{A} - A)\mathbf{u}^* = \mathbf{y}^*$. Then there hold

$$M\mathbf{z}^* = \mathbf{0}, \qquad (5.11)$$
$$M^T\mathbf{z}^* = \mathbf{0}, \qquad (5.12)$$
$$S\mathbf{v}^* + \alpha\nabla\bar{\mathbf{f}}(\mathbf{v}^*) = \mathbf{0}. \qquad (5.13)$$

Proof. By the optimality of $(\mathbf{z}^*, \mathbf{y}^*, \mathbf{x}^*)$, the following hold: (i) $(\bar{A} - A)\mathbf{z}^* = \mathbf{0}$, and thus $M\mathbf{z}^* = \mathbf{0}$; (ii) $D^{-1}\mathbf{z}^* = \mathbf{x}^*$ is consensual, and from the column stochasticity of both $A$ and $\bar{A}$ it follows that $M^T\mathbf{z}^* = (\bar{A} - A)^T\mathbf{x}^* = \mathbf{0}$; (iii) $M\mathbf{u}^* + \alpha\nabla\mathbf{f}_D(\mathbf{z}^*) = D^{-1}\mathbf{y}^* + \alpha D^{-1}\nabla\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$, which together with $M^T\mathbf{z}^* = \mathbf{0}$ implies $S\mathbf{v}^* + \alpha\nabla\bar{\mathbf{f}}(\mathbf{v}^*) = \mathbf{0}$. □

Lemma 2. For any $t\in\mathbb{N}$, it holds that

$$N(\mathbf{z}^{t+1} - \mathbf{z}^t) = -M(\mathbf{u}^{t+1} - \mathbf{u}^*) - \alpha[\nabla\mathbf{f}_D(\mathbf{z}^t) - \nabla\mathbf{f}_D(\mathbf{z}^*)]. \qquad (5.14)$$

This lemma follows from (5.4) and the identity $M\mathbf{u}^* + \alpha\nabla\mathbf{f}_D(\mathbf{z}^*) = \mathbf{0}$ established in the proof of Lemma 1.

Lemma 3. Let $\{\mathbf{v}^t\}$ be the sequence generated by the iteration (5.8) and let $\mathbf{v}^*$ be defined in (5.5). Then it holds that

$$\|\mathbf{v}^{t+1} - \mathbf{v}^*\|_G^2 - \|\mathbf{v}^t - \mathbf{v}^*\|_G^2 \le -\|\mathbf{v}^{t+1} - \mathbf{v}^t\|_G^2 + \|\mathbf{z}^{t+1} - \mathbf{z}^t\|_{\frac{\sigma}{2}NN^T + \frac{\alpha\bar{L}^2}{2\eta}I_n}^2 - \|\mathbf{z}^* - \mathbf{z}^{t+1}\|_{(\alpha\bar{\mu} - \frac{\alpha\eta}{2} - \frac{1}{\sigma})I_n}^2 + \frac{\sigma}{2}\|\mathbf{u}^* - \mathbf{u}^{t+1}\|_{MM^T}^2, \qquad (5.15)$$

where $\sigma, \eta > 0$ are two tunable parameters.

Proof. Note that

$$\|\mathbf{v}^{t+1} - \mathbf{v}^*\|_G^2 - \|\mathbf{v}^t - \mathbf{v}^*\|_G^2 = -\|\mathbf{v}^{t+1} - \mathbf{v}^t\|_G^2 + \langle\mathbf{v}^* - \mathbf{v}^{t+1}, G(\mathbf{v}^t - \mathbf{v}^{t+1})\rangle + \langle\mathbf{v}^* - \mathbf{v}^{t+1}, G^T(\mathbf{v}^t - \mathbf{v}^{t+1})\rangle.$$
(5.16)

In the following, we analyze the two inner-product terms. First,

$$\langle\mathbf{v}^* - \mathbf{v}^{t+1}, G(\mathbf{v}^t - \mathbf{v}^{t+1})\rangle = \langle\mathbf{z}^* - \mathbf{z}^{t+1}, N^T(\mathbf{z}^t - \mathbf{z}^{t+1})\rangle + \langle M^T(\mathbf{u}^* - \mathbf{u}^{t+1}), \mathbf{u}^t - \mathbf{u}^{t+1}\rangle$$
$$= \langle\mathbf{z}^* - \mathbf{z}^{t+1}, N^T(\mathbf{z}^t - \mathbf{z}^{t+1})\rangle + \langle M^T(\mathbf{u}^* - \mathbf{u}^{t+1}), \mathbf{z}^* - \mathbf{z}^{t+1}\rangle \quad (\because\ \mathbf{u}^t - \mathbf{u}^{t+1} = -\mathbf{z}^{t+1} \text{ and } (5.11),\ M\mathbf{z}^* = \mathbf{0})$$
$$\le \frac{\sigma}{2}\|\mathbf{z}^t - \mathbf{z}^{t+1}\|_{NN^T}^2 + \frac{1}{\sigma}\|\mathbf{z}^* - \mathbf{z}^{t+1}\|^2 + \frac{\sigma}{2}\|\mathbf{u}^* - \mathbf{u}^{t+1}\|_{MM^T}^2, \qquad (5.17)$$

and second,

$$\langle\mathbf{v}^* - \mathbf{v}^{t+1}, G^T(\mathbf{v}^t - \mathbf{v}^{t+1})\rangle = \langle\mathbf{v}^* - \mathbf{v}^{t+1}, S\mathbf{v}^{t+1} + \alpha\nabla\bar{\mathbf{f}}(\mathbf{v}^t)\rangle \quad (\because\ (5.8))$$
$$= \langle\mathbf{v}^* - \mathbf{v}^{t+1}, S(\mathbf{v}^{t+1} - \mathbf{v}^*) + \alpha(\nabla\bar{\mathbf{f}}(\mathbf{v}^t) - \nabla\bar{\mathbf{f}}(\mathbf{v}^*))\rangle \quad (\because\ (5.13))$$
$$= \alpha\langle\mathbf{v}^* - \mathbf{v}^{t+1}, \nabla\bar{\mathbf{f}}(\mathbf{v}^t) - \nabla\bar{\mathbf{f}}(\mathbf{v}^*)\rangle \quad (\because\ S = -S^T)$$
$$= \alpha\langle\mathbf{v}^* - \mathbf{v}^{t+1}, \nabla\bar{\mathbf{f}}(\mathbf{v}^{t+1}) - \nabla\bar{\mathbf{f}}(\mathbf{v}^*)\rangle + \alpha\langle\mathbf{v}^* - \mathbf{v}^{t+1}, \nabla\bar{\mathbf{f}}(\mathbf{v}^t) - \nabla\bar{\mathbf{f}}(\mathbf{v}^{t+1})\rangle$$
$$\le -\alpha\bar{\mu}\|\mathbf{z}^{t+1} - \mathbf{z}^*\|^2 + \frac{\alpha\eta}{2}\|\mathbf{z}^* - \mathbf{z}^{t+1}\|^2 + \frac{\alpha\bar{L}^2}{2\eta}\|\mathbf{z}^t - \mathbf{z}^{t+1}\|^2. \qquad (5.18)$$

Substituting (5.17) and (5.18) into (5.16) concludes the lemma. □

Proof of Theorem 4. To establish (5.10) for some constant $\delta > 0$, in light of Lemma 3 it suffices to show that the right-hand side of (5.15) is no more than $-\delta\|\mathbf{v}^{t+1} - \mathbf{v}^*\|_G^2$, which reduces to

$$\|\mathbf{z}^{t+1} - \mathbf{z}^*\|_P^2 + \|\mathbf{z}^{t+1} - \mathbf{z}^t\|_Q^2 \ge \|\mathbf{u}^{t+1} - \mathbf{u}^*\|_R^2, \qquad (5.19)$$

where

$$P = \Big(\alpha\bar{\mu} - \frac{\alpha\eta}{2} - \frac{1}{\sigma}\Big)I_n - \delta\frac{N + N^T}{2}, \quad Q = \frac{N^T + N}{2} - \frac{\sigma}{2}NN^T - \frac{\alpha\bar{L}^2}{2\eta}I_n, \quad R = \frac{\sigma}{2}MM^T + \delta\frac{M + M^T}{2}.$$

Establishing (5.19). Step 1: From Lemma 2, there holds

$$\|\mathbf{u}^* - \mathbf{u}^{t+1}\|_{M^TM}^2 = \|M(\mathbf{u}^* - \mathbf{u}^{t+1})\|^2 = \|N(\mathbf{z}^{t+1} - \mathbf{z}^t) + \alpha[\nabla\mathbf{f}_D(\mathbf{z}^{t+1}) - \nabla\mathbf{f}_D(\mathbf{z}^*)] + \alpha[\nabla\mathbf{f}_D(\mathbf{z}^t) - \nabla\mathbf{f}_D(\mathbf{z}^{t+1})]\|^2$$
$$\le 3\|\mathbf{z}^{t+1} - \mathbf{z}^t\|_{N^TN}^2 + 3\alpha^2\bar{L}^2\|\mathbf{z}^{t+1} - \mathbf{z}^*\|^2 + 3\alpha^2\bar{L}^2\|\mathbf{z}^{t+1} - \mathbf{z}^t\|^2 = \|\mathbf{z}^{t+1} - \mathbf{z}^*\|_{3\alpha^2\bar{L}^2 I_n}^2 + \|\mathbf{z}^{t+1} - \mathbf{z}^t\|_{3(N^TN + \alpha^2\bar{L}^2 I_n)}^2. \qquad (5.20)$$

Note also that

$$\frac{\|\mathbf{u}^* - \mathbf{u}^{t+1}\|_R^2}{\frac{\sigma}{2}\lambda_{\max}(MM^T) + \delta\lambda_{\max}\big(\frac{M+M^T}{2}\big)} \le \|\mathbf{u}^* - \mathbf{u}^{t+1}\|^2 \le \frac{\|\mathbf{u}^* - \mathbf{u}^{t+1}\|_{M^TM}^2}{\tilde\lambda_{\min}(M^TM)}.$$
If the following conditions hold:

$$\begin{cases} P \succeq 3\big(\tfrac{1}{2}c_1\sigma + c_2\delta\big)\alpha^2\bar{L}^2 I_n,\\[2pt] Q \succeq 3\big(\tfrac{1}{2}c_1\sigma + c_2\delta\big)(N^TN + \alpha^2\bar{L}^2 I_n), \end{cases} \qquad (5.21)$$

then (5.19) holds. To show (5.21), it is sufficient to prove

$$\begin{cases} \big(\lambda_{\max}\big(\frac{N+N^T}{2}\big) + 3c_2\alpha^2\bar{L}^2\big)\,\delta \le -\frac{1}{\sigma} + \big(\bar{\mu} - \frac{\eta}{2}\big)\alpha - \frac{3}{2}c_1\bar{L}^2\sigma\alpha^2,\\[2pt] 3c_2\big(\lambda_{\max}(N^TN) + \alpha^2\bar{L}^2\big)\,\delta \le \lambda_{\min}\big(\frac{N^T+N}{2}\big) - \frac{c_3\sigma}{2} - \frac{\bar{L}^2\alpha}{2\eta} - \frac{3}{2}c_1\bar{L}^2\sigma\alpha^2. \end{cases} \qquad (5.22)$$

Let $c_4 \triangleq \big(\bar{\mu} - \frac{\eta}{2}\big) + \sqrt{\Delta_1}$, $c_5 \triangleq \frac{\bar{L}^2}{\eta}$, $c_6 \triangleq \frac{2c_4c_5 + 12c_1\bar{L}^2}{c_4^2}$, $c_7 \triangleq \frac{\lambda_{\min}^2(N^T + N)}{4c_3}$, and $c_8 \triangleq a(c_7 + 2) - (2 - c_7)$ for some positive constant $a\in(0,1)$; set $\Delta_3 \triangleq \lambda_{\min}^2(N^T + N) - 4c_3c_6$. After some reduction, we claim that if the following conditions hold:

$$\frac{2 - c_7}{2 + c_7} < a < 1, \qquad (5.23)$$

$$\bar{\mu} > \Bigg(\sqrt{\frac{6c_1}{1 - a^2}} + \frac{1}{c_8}\sqrt{\frac{1 - a^2}{6c_1}}\Bigg)\bar{L}, \qquad (5.24)$$

$$\bar{\mu}\Bigg(1 - \sqrt{1 - \frac{4\bar{L}^2}{c_8\bar{\mu}^2}}\Bigg) < \eta < \min\Bigg\{\bar{\mu}\Bigg(1 + \sqrt{1 - \frac{4\bar{L}^2}{c_8\bar{\mu}^2}}\Bigg),\; 2\Bigg(\bar{\mu} - \sqrt{\frac{6c_1}{1 - a^2}}\,\bar{L}\Bigg)\Bigg\}, \qquad (5.25)$$

$$\frac{\lambda_{\min}(N^T + N) - \sqrt{\Delta_3}}{2c_3} < \sigma < \frac{\lambda_{\min}(N^T + N) + \sqrt{\Delta_3}}{2c_3}, \qquad (5.26)$$

$$\frac{\bar{\mu} - \frac{\eta}{2} - \sqrt{\Delta_1}}{3c_1\bar{L}^2\sigma} < \alpha < \min\Bigg\{\frac{\bar{\mu} - \frac{\eta}{2} + \sqrt{\Delta_1}}{3c_1\bar{L}^2\sigma},\; \frac{-\frac{\bar{L}^2}{2\eta} + \sqrt{\Delta_2}}{3c_1\bar{L}^2\sigma}\Bigg\}, \qquad (5.27)$$

then (5.22) holds for some positive constant $\delta$. This completes the proof of the theorem. □

6. Numerical Experiments

In this section, we present the results of a series of numerical experiments that demonstrate the effectiveness of the proposed algorithms relative to the subgradient-push algorithm. The network used and its corresponding mixing matrix $A$ are depicted in Fig. 2.1.

6.1. Decentralized Least Squares

Consider the following decentralized least squares problem:

$$x^* \leftarrow \operatorname*{argmin}_{x\in\mathbb{R}^p} f(x) = \sum_{i=1}^n f_i(x), \qquad (6.1)$$

where $f_i(x) = \frac{1}{2}\|B_{(i)}x - b_{(i)}\|_2^2$ with $B_{(i)}\in\mathbb{R}^{m_i\times p}$ and $b_{(i)}\in\mathbb{R}^{m_i}$, for $i = 1,\dots,n$. The solution is $x^* = B^\dagger b$, where $B = \sum_{i=1}^n B_{(i)}^TB_{(i)}$, $b = \sum_{i=1}^n B_{(i)}^Tb_{(i)}$, and $B^\dagger$ is the pseudo-inverse of $B$.
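A small self-contained check of the closed-form solution of (6.1); the dimensions are scaled down from the paper's $n = 5$, $p = 256$, $m_i = 100$ to keep the check cheap, and the random data are, of course, not the paper's.

```python
import numpy as np

# Synthetic instance of (6.1): n agents, each holding (B_(i), b_(i)).
rng = np.random.default_rng(0)
n, p, m = 5, 8, 20
Bs = [rng.standard_normal((m, p)) for _ in range(n)]
bs = [rng.standard_normal(m) for _ in range(n)]

B = sum(Bi.T @ Bi for Bi in Bs)                 # B = sum_i B_(i)^T B_(i)
b = sum(Bi.T @ bi for Bi, bi in zip(Bs, bs))    # b = sum_i B_(i)^T b_(i)
x_star = np.linalg.pinv(B) @ b                  # x* = B^dagger b

# x* zeroes the gradient of f(x) = sum_i ||B_(i) x - b_(i)||^2 / 2.
grad = sum(Bi.T @ (Bi @ x_star - bi) for Bi, bi in zip(Bs, bs))
assert np.allclose(grad, 0.0, atol=1e-8)
```

This centralized $x^*$ serves as the ground truth against which the decentralized iterates are compared in the experiments.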
In this experiment, we take $n = 5$, $p = 256$, and $m_i = 100$ for $i = 1,\dots,5$. For both ExtraPush and Normalized ExtraPush, we first choose $\alpha$ in a hand-optimized manner (in this case, $\alpha = 0.1$) and then take a smaller value, $\alpha = 0.02$, to show the effect of a smaller step size. The step size of the subgradient-push algorithm is hand-optimized to $\alpha_t = \frac{0.8}{\sqrt{t}}$. The experimental results are illustrated in Fig. 6.1. As shown there, the performances of ExtraPush and Normalized ExtraPush are almost identical. Their linear convergence rates are affected by the step size: a smaller $\alpha$ leads to a slower rate, as one would expect.

6.2. Decentralized Huber-like Regression

Instead of least squares, this experiment minimizes the Huber loss, which is known to be robust to outliers:

$$x^* \leftarrow \operatorname*{argmin}_{x\in\mathbb{R}^p} f(x) = \sum_{i=1}^n f_i(x), \qquad (6.2)$$

where $f_i(x) = \sum_{j=1}^{m_i} H_\xi(B_{(i)j}x - b_{(i)j})$, $B_{(i)j}$ is the $j$th row of the matrix $B_{(i)}\in\mathbb{R}^{m_i\times p}$, and $b_{(i)j}$ is the $j$th entry of the vector $b_{(i)}\in\mathbb{R}^{m_i}$, for $i = 1,\dots,n$. The Huber loss function is defined as

$$H_\xi(a) = \begin{cases} \frac{1}{2}a^2, & \text{for } |a| \le \xi \quad (\ell_2^2\text{ zone}),\\ \xi\big(|a| - \frac{1}{2}\xi\big), & \text{otherwise} \quad (\ell_1\text{ zone}). \end{cases} \qquad (6.3)$$

Similarly to the experimental setting in [13], we let $\xi = 2$ and place the solution $x^*$ in the $\ell_2^2$ zone while initializing $x_{(i)}^0$ in the $\ell_1$ zone for all agents $i$. As in the last experiment, we test two step sizes, $\alpha = 0.1$ and $\alpha = 0.02$, where $\alpha = 0.1$ is hand-optimized. The step size of the subgradient-push algorithm is hand-optimized to $\alpha_t = \frac{5}{\sqrt{t} + 100}$. The numerical results are depicted in Fig. 6.2. As shown there, when $\alpha = 0.1$, both ExtraPush and Normalized ExtraPush converge sublinearly in their first 500 iterations and then show linear convergence, as $x_{(i)}^t$ for most $i$ have entered the $\ell_2^2$ zone. For $\alpha = 0.02$, more iterations (about 2500) are needed before both algorithms start decaying linearly.
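For reference, a direct NumPy transcription of the Huber loss (6.3) with $\xi = 2$, checked at one point in each zone and at the boundary, where the quadratic and linear pieces agree in value and slope.

```python
import numpy as np

def huber(a, xi=2.0):
    """Huber loss (6.3): quadratic for |a| <= xi, linear (slope xi) beyond."""
    a = np.abs(a)
    return np.where(a <= xi, 0.5 * a**2, xi * (a - 0.5 * xi))

assert huber(1.0) == 0.5                  # l2^2 zone: a^2 / 2
assert huber(2.0) == 2.0                  # boundary: xi^2 / 2
assert huber(5.0) == 2.0 * (5.0 - 1.0)    # l1 zone: xi * (|a| - xi/2) = 8
```

Because the loss is quadratic near the solution and linear far from it, iterates initialized in the $\ell_1$ zone first behave like subgradient iterations and only exhibit the linear rate once they enter the $\ell_2^2$ zone, matching the two-phase behavior in Fig. 6.2.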
7. Conclusion

In this note, we propose a decentralized algorithm called ExtraPush, as well as a simplified version called Normalized ExtraPush, for solving distributed consensus optimization problems over directed graphs. The algorithms use column-stochastic mixing matrices. We show that Normalized ExtraPush converges at a linear rate if the objective function is smooth and strongly convex. In addition, we develop the first-order optimality conditions and establish the convergence of ExtraPush under a boundedness assumption. The convergence of ExtraPush, as well as its rate, remains to be justified in future work. The performance of the proposed algorithms on directed time-varying networks also remains to be studied. Another line of future research is to generalize ExtraPush to handle the sum of smooth and proximable (possibly nonsmooth) functions, as [14] has done for Extra.

Acknowledgments. The work of W. Yin has been supported in part by the NSF grants DMS-1317602 and ECCS-1462398. The work of J. Zeng has been supported in part by the NSF grant 11501440.

References

[1] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis and M. Vetterli, Weighted Gossip: distributed averaging using non-doubly stochastic matrices, in Proc. IEEE Conf. Decision Control, (2010), 1753-1757.
[2] I. Chen, Fast distributed first-order methods, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012.
[3] F. Chung, Laplacians and the Cheeger inequality for directed graphs, Annals of Combinatorics, 2 (2005), 1-19.
[4] J. Duchi, A. Agarwal and M. Wainwright, Dual averaging for distributed optimization: Convergence analysis and network scaling, IEEE Trans. Automat. Contr., 57 (2012) 592-606.
[5] D. Jakovetic, J. Xavier and J. Moura, Fast distributed gradient methods, IEEE Trans. Automat. Contr.
, 59 (2014) 1131-1146.
[6] D. Kempe, A. Dobra and J. Gehrke, Gossip-based computation of aggregate information, in 44th Annual IEEE Symposium on Foundations of Computer Science, (2003), 482-491.
[7] F. Iutzeler, P. Ciblat and W. Hachem, Analysis of sum-weight-like algorithms for averaging in wireless sensor networks, IEEE Trans. Signal Process., 61 (2013) 2802-2814.
[8] I. Matei and J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies, IEEE J. Sel. Top. Signal Process., 5 (2011) 754-771.
[9] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automat. Contr., 54 (2009) 48-61.
[10] A. Nedic and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Trans. Automat. Contr., 60 (2015) 601-615.
[11] I. Schizas, A. Ribeiro and G. Giannakis, Consensus in ad hoc WSNs with noisy links - part I: Distributed estimation of deterministic signals, IEEE Trans. Signal Process., 56:1 (2008) 350-364.
[12] W. Shi, Q. Ling, K. Yuan, G. Wu and W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Trans. Signal Process., 62 (2014) 1750-1761.
[13] W. Shi, Q. Ling, G. Wu and W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization, SIAM J. Optimiz., 25:2 (2015) 944-966.
[14] W. Shi, Q. Ling, G. Wu and W. Yin, A proximal gradient algorithm for decentralized composite optimization, IEEE Trans. Signal Process., 63:22 (2015) 6013-6023.
[15] K. I. Tsianos, S. Lawlor and M. G. Rabbat, Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning, in Proc. 50th Allerton Conf. Commun., Control, Comp., (2012), 1543-1550.
[16] K. I. Tsianos, S. Lawlor and M. G. Rabbat, Push-sum distributed dual averaging for convex optimization, in Proc. IEEE Conf. Decision Control, (2012), 5453-5458.
[17] K. I.
Tsianos, The Role of the Network in Distributed Optimization Algorithms: Convergence Rates, Scalability, Communication/Computation Tradeoffs and Communication Delays, PhD thesis, Dept. Elect. Comp. Eng., McGill Univ., Montreal, QC, Canada, 2013.
[18] C. Xi and U. A. Khan, On the linear convergence of distributed optimization over directed graphs, preprint, arXiv:1510.02149, 2015.
[19] K. Yuan, Q. Ling and W. Yin, On the convergence of decentralized gradient descent, SIAM J. Optimiz., 26:3 (2016), 1835-1854.

Fig. 6.1. Experimental results for decentralized least-squares regression: history of ‖x^t − x^*‖_2, where x^* is the exact solution, for ExtraPush and Normalized ExtraPush (α = 0.1 and α = 0.02) and hand-optimized Subgradient-Push. The performances of ExtraPush and Normalized ExtraPush are very similar.

Fig. 6.2. Experimental results for decentralized Huber regression: history of ‖x^t − x^*‖_2, where x^* is the exact solution, for the same five methods as in Fig. 6.1. The performances of ExtraPush and Normalized ExtraPush are very similar.