COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization

remove overhead of round-based resource allocation -> sequence of interconnected problems

Featured image

Venue: SOSP 2025
Link: ACM DL

Topic: Let’s not do unnecessary steps. Let’s reuse history. Remove unneccesary overhead(round-based method) in Large-Scale resource allocation.


Summary

There is only small changes between rounds -> remove overhead of round-based resource allocation -> treat those as sequence of interconnected problems Let’s not do unnecessary steps. Let’s reuse history.


Background

Resource allocation needs mathmatical programming / optimization-based methods

But Mathmatical programming / optimization-based methods are not scalable

Prior Approaches’ limitation : Mitigate solver costs

  1. They either degrade allocation quality or fail to meet real-time scheduling deadlines.
    • scalling solvers to large clusters often can’t make frequent, globally-optimal scheduling -> exist systems stick to coarse-grained optimization (hourly, daily)
  2. The allocation problem in each round is treated as an independent problem instance. and this makes work expensive.
    • fundamental mismatch: the modeling of round-based resource-allocation problems vs. many large-scale systems evolve slowly across rounds, so actually we can just reuse info from fore rounds. (most jobs from the previous round continue executing without much change to their allocations)

Prior Approaches examples

  1. Warm-starting: not universally applicable
  2. Solver computation: not efficiently parallelized.
  3. GPU acceleration : not always applicable, due to the sparse nature of the solver compute. @why solver compute has sparse nature?
  4. Forcibly terminating solvers early by setting a compute/time budget: can return sub-optimal and/or infeasible solutions that may be meaningless. @ But is sub-optimal that bad??? i don’t think so… it doesn’t need to be the BEST solution isn’t it?
  5. Partitions a large problem into small pieces and solves them in parallel (POP): allocation quality <-> runtime(responsiveness) tradeoff.
    Generally, allocations are reconsidered every few minutes (a round) by formulating and solving a new optimization problem.

What if we adapt existing solvers to reuse computational effort from prior rounds? : not straightforward.

  1. small change to the input problem -> require reconstructing solver-specific problem representations = significant setup cost for even the smallest change @the root cause : matrix is sparse?
  2. Solvers need to reconstruct key internel transformations, discarding all prior computational effort.
  3. for MILP resource-allocation problems, further exacerbates the scalability challenge for large problems.

Key Idea

Observation

Optimization-based resource allocation problems often only change by small amounts across successive rounds.

Action

continual optimization: remove overhead of round-based resource allocation -> sequence of interconnected problems

Target

COpter provides a method for continual optimization of

How

  1. an efficient-to-update problem representation for incremental changes (reflect small changes).
  2. Exploit the slow evolution of the optimal solution to efficiently warm-start subsequent problem solves.
  3. lightweight heuristics for mixed-integer problems that recover feasible integer solutions with negligible quality loss instead of expensive combinatorial searches entirely.

Goal

Continual optimization: provide both low dynamic regret and runtimes that scale favorably with the amount of change in the optimal solution over time, especially for slowly evolving resource-allocation problems.


Design

  1. an efficient-to-update problem representation for incremental changes (reflect small changes).
    • API : add-request(request ID, num variables), remove-request(request ID), add/remove-resources(resource ID, count)
  2. Exploit the slow evolution of the optimal solution to efficiently warm-start subsequent problem solves.
    • Don’t clean up former state. there is only few changes. start from former state.
    • use PPA(Proximal Point Algorithms) to maintain an easily recomputable internal state. It supports warm-starting and exhibits practical, finite-step convergence for LPs.
  3. Lightweight integerization for MILPs: priortize feasibility over optimality
    instead of heavy & guarantee optimality of the integer-feasible solution, choose heuristic & lightweight way
    • After solving the LP relaxation of the MILP, techniques like branch-and-cut are used to fix integer infeasibilities.
    • Since this integer resolution stage frequently dominates the overall solution time, it creates a significant scalability bottleneck.
    • branch-and-cut guarantees optimality of the integer-feasible solution, this is a bottleneck step
    • COpter introduces lightweight post-processing, using shims, to produce integer-feasible solutions from the optimal solution to the LP-relaxation.

Evaluation


Related Works

Resource-allocation as Optimization Problems

Cluster Scheduling

  1. Plan-ahead schedulers/ Tetri-Sched, Shockwave: solve one MILP to determine allocations for multiple rounds
    • pros: robust to missing scheduler deadlines in large clusters
    • cons: increased queue time & reduced resource utilization(because it needs to conservative).
  2. Solves MILPs hourly & use 2-stage decomposition/ RAS:
    • requires problem-specific optimizations & heuristics to scale the MILP stage to large scales. @then why COpter doesn’t have this problem?
  3. Multilevel scheduling for scalability/ Tetris: lacks global optimality. @why?
  4. Gavel:supports time-slicing DNN jobs across heterogeneous GPUs using LP formulations for many scheduling objectives
    • but runs into solver bottlenecks if jobs request space-sharing leading to missed scheduler deadlines in large clusters. @why?
  5. express and compile scheduling policies into MILPs/ Rebalancer
    • for small scales: The MILPs are then solved using off-the-shelf solvers
    • for large scales: use a faster, sub-optimal local search algorithm on a graph representation

COpter complements Rebalancer.

  1. potentially extending the size of problems directly solvable as MILPs @what?
  2. using COpter’s outputs as initial guesses to initialize the local-search algorithm at scale for better quality solutions. @what?

Solving Optimizaiton Problems @then which side COpter belongs?: Online I think.

Offline optimization

Online optimization


Inputs


Meeting Notes