Stochastic optimization is a widely used approach for optimization under uncertainty, where uncertain input parameters are modeled by random variables. However, a significant limitation of this approach is that it requires full knowledge of the underlying probability distributions. This paper addresses the question of whether good (approximation) algorithms can be obtained when these distributions are unknown and learned through repeated interactions.
We present a generic online learning algorithm for a large class of 'monotone' stochastic problems, achieving $\sqrt{T\log T}$ regret relative to the best approximation algorithm under known distributions. Importantly, our algorithm operates in a semi-bandit setting, where it only observes samples from the random variables that were actually probed. Our results also extend to settings with censored and binary feedback, where the policy only observes truncated or thresholded versions of the probed variables.
This framework applies to several fundamental problems such as prophet inequality, Pandora’s box, stochastic knapsack, single-resource revenue management, and sequential posted pricing.