Optimization for Data Science — Alessandro Carella

This project explores advanced optimization techniques for solving minimum cost flow problems with quadratic, convex, and separable objective functions. The work focuses on implementing and comparing different algorithmic approaches to network flow optimization.

The core problem addresses the challenge of finding optimal flow allocation in directed networks where transportation costs follow a quadratic relationship. Mathematically, the problem is formulated as minimizing x^TQx + qx subject to flow conservation constraints Ex = b and capacity constraints 0 ≤ x ≤ u, where Q is a positive semidefinite diagonal matrix representing quadratic costs, E is the node-arc incidence matrix, and b represents supply/demand at each node.
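To make the notation concrete, the following sketch builds a node-arc incidence matrix for a hypothetical 4-node, 4-arc network and evaluates the objective and the constraints for one candidate flow. The toy data, the function names, and the sign convention (-1 at the tail of an arc, +1 at its head, which is what the closed-form arc solution used later in this write-up implies) are assumptions for illustration, not the project's actual code.

```python
import numpy as np

def incidence_matrix(n_nodes, arcs):
    """Node-arc incidence matrix; assumed convention: -1 at the tail, +1 at the head."""
    E = np.zeros((n_nodes, len(arcs)))
    for k, (i, j) in enumerate(arcs):
        E[i, k] = -1.0
        E[j, k] = 1.0
    return E

def objective(x, Q_diag, q):
    """x^T Q x + q x, with the diagonal of Q stored as a vector."""
    return float(np.sum(Q_diag * x**2) + q @ x)

def is_feasible(x, E, b, u, tol=1e-9):
    """Check flow conservation Ex = b and the capacity bounds 0 <= x <= u."""
    conserved = np.allclose(E @ x, b, atol=tol)
    bounded = np.all(x >= -tol) and np.all(x <= u + tol)
    return bool(conserved and bounded)

# Hypothetical toy instance: send 2 units from node 0 to node 3.
arcs = [(0, 1), (0, 2), (1, 3), (2, 3)]
E = incidence_matrix(4, arcs)
b = np.array([-2.0, 0.0, 0.0, 2.0])      # under this convention, sources have negative b
Q_diag = np.array([1.0, 0.0, 2.0, 0.0])  # zeros on the diagonal: Q is only semidefinite
q = np.array([1.0, 3.0, 1.0, 1.0])
u = np.array([2.0, 2.0, 2.0, 2.0])

x = np.array([1.0, 1.0, 1.0, 1.0])       # one unit along each of the two paths
print(is_feasible(x, E, b, u), objective(x, Q_diag, q))
```

Storing only the diagonal of Q keeps the objective evaluation linear in the number of arcs, which is the separability the rest of the approach relies on.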

This problem has practical applications in logistics, resource allocation, and network design, where costs often increase non-linearly with flow volumes. Our investigation implements three distinct solution approaches: a Lagrangian dual-based algorithm with smoothing, Nesterov's Fast Gradient Method with dynamic parameters, and the CVXPY optimization library with the SCS solver.

This project provided insights into the practical challenges of convex optimization and the significant gap between theoretical guarantees and real-world performance. I learned that the smoothing parameter μ plays a critical role in algorithm convergence: a value that is too small leads to numerical instability and tiny step sizes, while a value that is too large distorts the original problem excessively.

The implementation revealed that dynamic parameter adjustment dramatically improves performance over fixed parameters. Our variant with dynamic μ converged in approximately 1,676 iterations on small matrices, compared to 28,784 iterations with constant parameters. Knowing the optimal value a priori provides a further substantial advantage on hard instances, reducing the relative error from 4.1 to 0.082 (that is, from roughly 407% to 8.2%) on the most challenging one.

I discovered the importance of problem conditioning in optimization algorithms. Fast Gradient methods performed well on well-conditioned problems but struggled with ill-conditioned matrices, while the SCS solver maintained consistent performance across all test cases. The Lagrangian relaxation approach proved powerful for decomposing the problem into parallelizable sub-problems, exploiting the separable structure of the objective function.

Perhaps most importantly, I learned that specialized solvers like SCS, built on years of research and optimization, significantly outperform custom implementations in terms of both speed (25-2925 iterations vs. 150,000) and robustness. This underscored the value of understanding when to build custom solutions versus using existing tools, and highlighted the challenges of translating theoretical convergence rates into practical algorithms.

The Lagrangian dual-based algorithm forms the theoretical foundation of our optimization approach. By relaxing the equality constraints Ex = b using Lagrange multipliers λ, we transform the constrained primal problem into an unconstrained dual maximization problem: max_λ φ(λ), where φ(λ) = min_x {x^TQx + qx + λ(Ex - b) : 0 ≤ x ≤ u}.

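A critical challenge arises because the positive semidefinite matrix Q may contain zeros on its diagonal: for those arcs the inner problem is linear rather than strictly convex, so its minimizer need not be unique and the dual function is not differentiable everywhere. To address this, we introduce a smoothing regularization term μ||x||₂², creating the smoothed problem: min_x {x^TQx + qx + λ(Ex - b) + μ||x||₂² : 0 ≤ x ≤ u}. This makes the inner objective strongly convex, so its minimizer over the box is unique and the resulting dual function is differentiable everywhere.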

The separable structure of the Lagrangian allows decomposition into m independent one-dimensional problems, one for each arc (i,j). For each arc, the optimal flow x*_ij can be computed analytically as max(0, min(u_ij, -(q_ij - λ_i + λ_j)/(2(Q_ij + μ)))). This closed-form solution enables efficient parallel computation and makes each iteration computationally tractable even for large networks.
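The closed-form expression above can be sketched in vectorized form as follows. The function name `arc_flows`, the `tails`/`heads` index arrays, and the numeric values are hypothetical and chosen only so that both clipping branches are exercised.

```python
import numpy as np

def arc_flows(lam, Q_diag, q, u, tails, heads, mu):
    """Per-arc minimizer of the smoothed Lagrangian over the box [0, u]."""
    reduced_cost = q - lam[tails] + lam[heads]      # q_ij - lambda_i + lambda_j
    x_free = -reduced_cost / (2.0 * (Q_diag + mu))  # stationary point, ignoring bounds
    return np.clip(x_free, 0.0, u)                  # max(0, min(u_ij, x_free))

# Hypothetical data for a 4-node, 4-arc network.
tails = np.array([0, 0, 1, 2])
heads = np.array([1, 2, 3, 3])
lam = np.array([6.0, 2.0, 1.0, 0.0])
Q_diag = np.array([1.0, 0.0, 2.0, 0.0])  # arcs 1 and 3 have no quadratic cost
q = np.array([1.0, 3.0, 1.0, 2.0])
u = np.array([2.0, 1.5, 2.0, 2.0])

x_star = arc_flows(lam, Q_diag, q, u, tails, heads, mu=0.5)
print(x_star)
```

In this example arc 1 is clipped at its capacity and arc 3 at zero, while arcs 0 and 2 take their unconstrained stationary values. Because every arc is handled by the same elementwise expression, the computation needs no loop and parallelizes trivially.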

The gradient of the dual function φ(λ) with respect to λ equals Ex* - b, representing the violation of flow conservation constraints. This gradient has a clear physical interpretation: when Ex* = b, flows satisfy conservation laws and the gradient vanishes; otherwise, the gradient magnitude indicates how severely the constraints are violated. The Lipschitz constant of this gradient is L = ||E||₂²/μ, which grows as μ decreases, creating a fundamental trade-off between accuracy and convergence speed.
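A minimal sketch of the gradient and of the Lipschitz estimate, assuming a dense incidence matrix with -1 at the tail and +1 at the head of each arc (the convention implied by the closed-form arc solution). The function names and the toy data are hypothetical.

```python
import numpy as np

def dual_gradient(lam, E, b, Q_diag, q, u, mu):
    """Return (E x* - b, x*), where x* minimizes the smoothed Lagrangian at lam."""
    # E.T @ lam equals -lambda_i + lambda_j for an arc (i, j) under the assumed convention.
    x = np.clip(-(q + E.T @ lam) / (2.0 * (Q_diag + mu)), 0.0, u)
    return E @ x - b, x

def lipschitz_constant(E, mu):
    """L = ||E||_2^2 / mu, with ||E||_2 the largest singular value of E."""
    return np.linalg.norm(E, 2) ** 2 / mu

# Hypothetical 4-node, 4-arc network: arcs (0,1), (0,2), (1,3), (2,3).
E = np.array([
    [-1.0, -1.0,  0.0,  0.0],
    [ 1.0,  0.0, -1.0,  0.0],
    [ 0.0,  1.0,  0.0, -1.0],
    [ 0.0,  0.0,  1.0,  1.0],
])
b = np.array([-2.0, 0.0, 0.0, 2.0])
Q_diag = np.array([1.0, 0.0, 2.0, 0.0])
q = np.array([1.0, 3.0, 1.0, 2.0])
u = np.array([2.0, 1.5, 2.0, 2.0])
lam = np.array([6.0, 2.0, 1.0, 0.0])
mu = 0.5

grad, x = dual_gradient(lam, E, b, Q_diag, q, u, mu)
L = lipschitz_constant(E, mu)
print(grad, np.linalg.norm(grad), L)
```

Halving μ doubles L and therefore halves the largest admissible step size 1/L, which is the trade-off described above.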

Nesterov's Fast Gradient Method provides optimal worst-case convergence rates for smooth convex optimization. Unlike standard gradient descent, which evaluates the gradient at the current point λ_k, Nesterov's method introduces momentum by computing gradients at an extrapolated point y_k = (1 - θ_k)λ_k + θ_k v_k, where v_k represents a velocity term and θ_k = 2/(k+2) provides the acceleration schedule.

The algorithm maintains three sequences: the main iterates λ_k, the extrapolation points y_k, and auxiliary variables v_k. At each iteration, we compute the gradient at y_k, take the step λ_{k+1} = y_k + t∇φ(y_k) with step size t ∈ (0, 1/L] (an ascent step, since the dual is maximized; the sign flips if one instead minimizes -φ), and update the velocity v_{k+1} = λ_k + (1/θ_k)(λ_{k+1} - λ_k). This momentum mechanism allows the algorithm to achieve an O(1/k²) convergence rate, compared to O(1/k) for standard gradient descent.
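The three-sequence update can be sketched generically as below. This is an illustrative version, not the code from the project notebooks: it is written as an ascent method because the dual is maximized (flip the sign of the step to minimize instead), and it is exercised on a hypothetical concave quadratic rather than on the flow problem.

```python
import numpy as np

def fast_gradient_ascent(grad, lam0, t, n_iter):
    """Nesterov-style accelerated ascent with theta_k = 2 / (k + 2)."""
    lam = np.asarray(lam0, dtype=float).copy()
    v = lam.copy()                              # auxiliary (velocity) sequence
    for k in range(n_iter):
        theta = 2.0 / (k + 2)
        y = (1.0 - theta) * lam + theta * v     # extrapolated point
        lam_next = y + t * grad(y)              # gradient step taken from y
        v = lam + (lam_next - lam) / theta      # velocity update
        lam = lam_next
    return lam

# Hypothetical test function: phi(lam) = -0.5 * sum(a * (lam - c)^2), maximized at c.
a = np.array([1.0, 10.0])
c = np.array([1.0, -2.0])
grad_phi = lambda lam: -a * (lam - c)

n_iter = 200
t = 1.0 / a.max()                               # step size 1/L with L = max(a)
lam_hat = fast_gradient_ascent(grad_phi, np.zeros(2), t, n_iter)
gap = 0.5 * float(np.sum(a * (lam_hat - c) ** 2))   # phi* - phi(lam_hat), with phi* = 0
print(lam_hat, gap)
```

On this toy function the optimality gap after 200 iterations stays below the O(1/k²) bound 2||λ₀ - λ*||₂²/((k+1)²t), which is a convenient sanity check when porting the update to the dual function.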

For our problem, theoretical convergence is guaranteed by the bound φ* - φ(λ_k) ≤ (2/((k+1)²t))||λ₀ - λ*||₂², where φ* is the optimal dual value (the gap is written in this order because the dual is maximized). However, evaluating this bound requires the distance from λ₀ to the optimal point λ*, which is typically unavailable. In practice, convergence behavior depends heavily on problem conditioning and the choice of smoothing parameter μ.
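As a rough worked consequence of this bound, one can ask how many iterations are needed before its right-hand side drops below a tolerance ε. The derivation below assumes the largest admissible step t = 1/L with L = ||E||₂²/μ, and, in the last line, the smoothing choice μ = ε/(2||u||₂²) used in the experiments; it is a back-of-the-envelope estimate for the smoothed dual, not a statement about measured performance.

```latex
\frac{2}{(k+1)^2\, t}\,\|\lambda_0 - \lambda^*\|_2^2 \le \varepsilon
\quad\Longleftrightarrow\quad
k \ge \sqrt{\frac{2}{t\,\varepsilon}}\;\|\lambda_0 - \lambda^*\|_2 - 1 .

\text{With } t = \frac{1}{L} = \frac{\mu}{\|E\|_2^2}:\qquad
k \ge \|E\|_2\,\|\lambda_0 - \lambda^*\|_2\,\sqrt{\frac{2}{\mu\,\varepsilon}} - 1 .

\text{With } \mu = \frac{\varepsilon}{2\|u\|_2^2}:\qquad
k \ge \frac{2\,\|E\|_2\,\|u\|_2\,\|\lambda_0 - \lambda^*\|_2}{\varepsilon} - 1 .
```

Once μ is tied to the tolerance, the O(1/k²) rate therefore translates into an iteration count that grows like 1/ε, which helps explain why high-precision solutions are expensive for this scheme.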

Our implementation tested three variants: constant step size with fixed μ, dynamic μ using knowledge of the optimal value f*, and dynamic μ without requiring f*. The constant step size approach struggled with larger problems, requiring over 28,000 iterations for small matrices and failing to converge for larger ones. The dynamic approaches showed dramatic improvements, with the variant using f* converging in under 2,000 iterations and achieving relative errors below 10^-4 on well-conditioned problems.

The implementation consists of multiple Python Jupyter notebooks organized by algorithm variant. The fastGradientMethod.ipynb contains the baseline implementation with constant parameters, while fastGradientMethodWithOpt.ipynb and fastGradientMethodWithoutOpt.ipynb implement dynamic smoothing strategies. Supporting scripts include getProblemData.py for parsing DMX instance files, gradientUtils.py with shared optimization functions, and plots.py for visualization.

Test instances were generated using the netgen network generator to create base linear cost flow problems, to which we added randomized quadratic costs. The matrix Q was constructed by centering random values around q_ij/u_ij within intervals controlled by parameter α = 100. To simulate positive semidefinite matrices with zero diagonal elements, we introduced parameter ρ = 0.3, randomly setting 30% of quadratic costs to zero.

The smoothing parameter μ was computed as μ = ε/(2||u||₂²), where ε = 10^-3 is the tolerance. Since 0 ≤ x ≤ u, the added term satisfies μ||x||₂² ≤ μ||u||₂² = ε/2, so the approximation error introduced by smoothing remains bounded. For the dynamic variant without optimal value, we employed a progressive reduction strategy: μ_k = max(10^{-⌊k/1000⌋}, ε/(2K)), with K denoting the bound ||u||₂² from the formula above, gradually decreasing μ by a factor of 10 every 1000 iterations while respecting the theoretical minimum. This heuristic balances initial progress with eventual accuracy.
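The schedule can be written in a few lines. This sketch assumes K = ||u||₂², which is how the "theoretical minimum" is read here; the function names and the capacity vector are hypothetical.

```python
import numpy as np

def mu_floor(u, eps=1e-3):
    """Theoretical smoothing value eps / (2 * ||u||_2^2); assumes K = ||u||_2^2."""
    return eps / (2.0 * float(np.dot(u, u)))

def mu_schedule(k, floor):
    """mu_k = max(10^(-floor(k / 1000)), floor): a tenfold cut every 1000 iterations."""
    return max(10.0 ** (-(k // 1000)), floor)

u = np.array([2.0, 1.5, 2.0, 2.0])   # hypothetical arc capacities
floor = mu_floor(u)

for k in (0, 999, 1000, 2500, 4000, 5000, 20000):
    print(k, mu_schedule(k, floor))
```

For these capacities the floor is about 3.5e-5, so the schedule stops decreasing after iteration 5000. On larger networks ||u||₂² grows and the floor drops, which means more reduction stages before the schedule settles.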

We compared our implementations against CVXPY using the SCS (Splitting Conic Solver) with indirect method settings. The solver automatically transforms our problem into standard conic form. Performance metrics included iteration count, final objective value, gradient norm at termination, and relative error compared to the solver solution. Because of computational time constraints, all experiments were capped at 150,000 iterations and restricted to smaller instances (fewer than 500 nodes and 3,000 arcs).

The experimental results reveal stark performance differences across methods and problem instances. On the small test matrix (4 nodes, 4 arcs), the Fast Gradient method with optimal value knowledge converged in 509 iterations to within 7.6×10^-5 relative error, while the variant without optimal value required 1,680 iterations achieving 1.2×10^-4 error. In contrast, SCS solved the same problem in just 25 iterations with comparable accuracy.

For the rmfgen 4-4-500 instance, both Fast Gradient variants reached their 150,000 iteration limit. The method with optimal value achieved objective value 365,912.49 (2.0% relative error), while the variant without optimal value reached 368,963.11 (1.1% error) compared to SCS's reference solution of 373,214.42 obtained in only 675 iterations. Interestingly, the method without optimal value performed better here, suggesting that access to f* isn't universally advantageous and may depend on problem characteristics.
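The percentages quoted above follow directly from the reported objective values. The short check below assumes the relative error is defined as |f - f_ref| / |f_ref| with the SCS value as reference; the helper name is hypothetical.

```python
def relative_error(value, reference):
    """Relative deviation of `value` from `reference`."""
    return abs(value - reference) / abs(reference)

scs_reference = 373_214.42   # SCS objective value reported above
with_f_star = 365_912.49     # Fast Gradient variant using f*
without_f_star = 368_963.11  # Fast Gradient variant without f*

err_with = relative_error(with_f_star, scs_reference)
err_without = relative_error(without_f_star, scs_reference)
print(f"with f*: {err_with:.1%}, without f*: {err_without:.1%}")
```

Under that definition the two values round to 2.0% and 1.1%, matching the figures in the text.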

The most challenging instance, rmfgen 4-16-500, exposed the limitations of Fast Gradient methods. With optimal value, the method achieved 8.2% relative error after 150,000 iterations, while without it the error exploded to 407%, indicating complete failure to approximate the solution. SCS required 2,925 iterations, still maintaining high efficiency. This demonstrates that problem conditioning and dimensionality severely impact gradient method performance.

The gradient norm evolution plots reveal oscillatory behavior in the dynamic μ variants, in contrast with the smooth decay of the theoretical O(1/k²) bound. These oscillations correspond to reductions in μ every 1000 iterations, which temporarily disrupt convergence but enable eventual progress toward finer-grained solutions. The relative gap plots show that the methods can approach optimal values but struggle to achieve high precision, suggesting that these first-order methods are better suited to obtaining approximate solutions quickly than to producing high-accuracy results.

This project demonstrates that while Lagrangian dual methods with fast gradient acceleration provide elegant theoretical frameworks for convex optimization, practical implementation faces substantial challenges. The fundamental tension between smoothing for computational tractability and accuracy of the original problem solution requires careful parameter management. Our dynamic smoothing strategies show promise but remain sensitive to problem conditioning and scale.

The superior performance of the SCS solver, typically 50-1000× fewer iterations than our implementations, highlights the value of specialized optimization software incorporating decades of algorithmic refinements, adaptive parameter selection, preconditioning techniques, and numerical stability enhancements. For practitioners, this suggests custom implementations are valuable for understanding and research but production systems benefit from mature solver libraries.