Training-Budget Sensitivity of Method Rankings in Safe Reinforcement Learning: A Case Study on Quadrotor Control Under Wind Disturbance
Abstract
Empirical comparisons in safe reinforcement learning (RL) are usually reported at a single, fixed training budget, and the resulting method ranking is then treated as a property of the algorithms. We show that on a quadrotor hover task under wind disturbance this ranking is instead strongly budget-dependent. We analyze two runs of an identical five-method pipeline: an unconstrained PPO baseline and four Lagrangian variants spanning a two-by-two factorial of error signal (mean cost versus CVaR) and dual-update rule (gradient ascent versus PID). In calm air, the family of PID-controlled Lagrangian methods ranks last at 500K steps and first at 2M steps, while the unconstrained baseline moves in the opposite direction. Quantified by Spearman rank correlation between the two budgets, the ordering is negatively correlated at moderate wind (ρ = −0.9, p = 0.037) and, pooled across all five wind conditions, significantly negative overall (ρ = −0.4 over 25 method-condition cells, p = 0.048): methods that look better at 500K tend to look worse at 2M. Because the two runs differ in budget, seed count, and, for the CVaR cells, an adaptive cost-limit calibration introduced between them, we isolate the budget effect in two ways that are immune to the calibration confound: (i) the periodic in-training logs of the 2M run, which compare 500K and 2M on the same 20 seeds with calibration held constant, and (ii) the mean-cost PID-Lag method, which uses no CVaR calibration yet still rises from the bottom at 500K (58.8) to the top at 2M (257.7). Bootstrap confidence intervals confirm the reversal. We further show that more compute is not uniformly better: under moderate and strong wind several methods lose return between 500K and 2M, and the safety (violation-rate) ranking shifts with budget as well.
Finally, the two top methods at 2M are statistically indistinguishable (paired Wilcoxon, n = 20, corrected p = 1.0; effect size 0 at the strongest wind), so the phenomenon is a property of the PID family, not of any single method. We argue that safe-RL benchmarks should report convergence evidence (learning curves) and budget sensitivity before publishing a ranking, and we give a concrete checklist. The 2M-step, 20-seed run re-analyzed here is the converged study reported in our companion paper (Chaouli and Elbar 2026b); all findings are recomputed from raw logs and no new training was performed.
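The Spearman statistics quoted in the abstract can be sanity-checked with a short, dependency-free Python sketch. It rests on two assumptions that the abstract does not state: that the p-values come from the common Student-t approximation with n − 2 degrees of freedom, and that there are no ties among the ranked scores. The function names and the two toy score lists are hypothetical and are not the study's data; they are chosen only so that five methods give a rank correlation of −0.9.

```python
import math


def ranks(values):
    """Rank scores from best (1) to worst (n); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def spearman_p_two_sided(rho, n, steps=2000):
    """Two-sided p-value from the t approximation (assumed, not confirmed by the source).

    Requires |rho| < 1 and an even number of Simpson steps.
    """
    df = n - 2
    t = abs(rho) * math.sqrt(df / (1 - rho * rho))
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * central_mass


# Toy scores for five methods at two budgets (illustrative placeholders only).
toy_scores_500k = [5, 4, 3, 2, 1]
toy_scores_2m = [1, 2, 3, 5, 4]
toy_rho = spearman_rho(toy_scores_500k, toy_scores_2m)

p_moderate_wind = spearman_p_two_sided(-0.9, 5)  # five methods, one condition
p_pooled = spearman_p_two_sided(-0.4, 25)        # 25 method-condition cells

print(f"toy rho = {toy_rho:.2f}")
print(f"p(rho=-0.9, n=5)  = {p_moderate_wind:.3f}")
print(f"p(rho=-0.4, n=25) = {p_pooled:.3f}")
```

Under these assumptions the two p-values should land close to the reported 0.037 and 0.048. With only five methods per condition the t approximation is coarse, so an exact permutation test could give a different value; the sketch does not attempt that.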