Risk-Aware Reinforcement Learning with Bandit-Based Adaptation for Quadrupedal Locomotion

University of California, Los Angeles

Video

Abstract

In this work, we study risk-aware reinforcement learning for quadrupedal locomotion. Our approach trains a family of risk-conditioned policies using a Conditional Value-at-Risk (CVaR) constrained policy optimization technique that provides improved stability and sample efficiency. At deployment, we adaptively select the best-performing policy from this family using a multi-armed bandit framework that relies only on observed episodic returns, requires no privileged environment information, and adapts to unknown conditions on the fly. In short, we train quadrupedal locomotion policies at various levels of robustness using CVaR and select the appropriate level of robustness online to maintain performance in unknown environments. We evaluate our method in simulation across eight unseen settings (varying dynamics, contacts, sensing noise, and terrain) and on a Unitree Go2 robot in previously unseen terrains. Our risk-aware policy attains nearly twice the mean and tail performance of the baselines in unseen environments, and our bandit-based adaptation selects the best-performing risk-aware policy in unknown terrain within two minutes of operation.

Methodology

We train multiple policies using CVaR-constrained policy optimization, where each critic focuses on a different tail of the return distribution, resulting in policies with varying levels of risk awareness. An upper confidence bound (UCB) bandit is then used to select the appropriate risk level online. As the robot repeatedly interacts with the environment, the bandit increasingly selects the best-performing policy.

System diagram: CVaR policy training and bandit-based online adaptation.
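To make the role of the risk level $\alpha$ concrete, the following is a minimal sketch of an empirical CVaR estimate over a batch of episodic returns. It assumes the common lower-tail convention, in which $\text{CVaR}_\alpha$ is the mean of the worst $\alpha$-fraction of returns, so a smaller $\alpha$ corresponds to a more conservative objective. The function name `empirical_cvar` and the sample returns are hypothetical and used only for illustration; this is not the training objective or code from the paper.

```python
import math


def empirical_cvar(returns, alpha):
    """Estimate lower-tail CVaR: the mean of the worst alpha-fraction of returns.

    Assumption: higher returns are better, so the "tail" is the low end.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)  # ascending, so the worst returns come first
    # Number of samples in the tail; the small tolerance guards against
    # floating-point error in alpha * n (e.g. 0.05 * 20).
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    tail = ordered[:k]
    return sum(tail) / k


# Hypothetical episodic returns, for illustration only.
sample_returns = [10.0, 12.0, -5.0, 8.0, 11.0, 9.0, -20.0, 13.0, 7.0, 10.0,
                  12.0, 11.0, 9.0, 8.0, 10.0, 6.0, 14.0, 11.0, 10.0, 9.0]

for alpha in (0.05, 0.25, 1.0):
    print(f"alpha={alpha}: CVaR estimate = {empirical_cvar(sample_returns, alpha)}")
```

With $\alpha = 1$ the estimate reduces to the ordinary mean return, while small values of $\alpha$ are driven entirely by the rare, poor episodes. This is one way to read why policies trained at different $\alpha$ trade off speed against caution.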

Results

Risk Awareness

We commanded the robot to walk at full speed ($v=1\,\text{m/s}$) on both ramp and grass terrains. The $\alpha=0.05$ policy did not advance despite forward commands, either when climbing the ramp or when a foot became trapped in a soccer cone hidden beneath the grass. In contrast, the $\alpha=0.25$ policy adopted a stable gait with larger steps, enabling the robot to step over hidden cones. The PPO baseline moved faster but was more prone to losing balance, particularly when descending the ramp or when one of its feet hit a soccer cone.

Experiment snapshot: comparing risk-aware policies on ramp and grass terrains.

Bandit Selection

The plot shows the cumulative probability of selecting each policy in one ramp experiment. The $\alpha = 0.25$ policy comes to dominate the selection after 2000 timesteps, which is roughly 1 minute of wall-clock time.

Cumulative policy selection probability over time: the α=0.25 policy dominates within ~1 minute of operation.
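The selection behavior described above can be illustrated with a minimal UCB1 sketch, in which each arm is one pre-trained policy and the only feedback is an episodic return. Everything specific here is an assumption made for illustration: the class name `UCB1PolicySelector`, the exploration constant, the normalization of returns to $[0, 1]$, and the simulated return distributions. The exact UCB variant used on the robot may differ.

```python
import math
import random


class UCB1PolicySelector:
    """UCB1 over a fixed set of pre-trained policies (one arm per policy)."""

    def __init__(self, num_arms, exploration=2.0):
        self.counts = [0] * num_arms   # times each policy has been run
        self.means = [0.0] * num_arms  # running mean return per policy
        self.exploration = exploration  # assumed constant (standard UCB1 uses 2)

    def select(self):
        # Run every policy once before relying on confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, episodic_return):
        # Incremental update of the running mean for the chosen policy.
        self.counts[arm] += 1
        self.means[arm] += (episodic_return - self.means[arm]) / self.counts[arm]


def run_demo(episodes=3000, seed=0):
    """Simulate selection among three policies with hypothetical returns."""
    rng = random.Random(seed)
    # Hypothetical mean normalized returns; the second policy is the best.
    true_means = [0.2, 0.8, 0.5]
    selector = UCB1PolicySelector(len(true_means))
    for _ in range(episodes):
        arm = selector.select()
        # Noisy return, clipped to [0, 1] to match the assumed normalization.
        reward = min(1.0, max(0.0, rng.gauss(true_means[arm], 0.1)))
        selector.update(arm, reward)
    return selector.counts


counts = run_demo()
fractions = [c / sum(counts) for c in counts]
print("selection fractions per policy:", fractions)
```

The printed fractions play the same role as the cumulative selection probabilities in the figure: the arm with the highest average return accumulates most of the selections, while the other arms are still sampled occasionally because their confidence bonus grows when they are left untried.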

BibTeX

@article{zeng2025risk,
  title={Risk-Aware Reinforcement Learning with Bandit-Based Adaptation for Quadrupedal Locomotion},
  author={Zeng, Yuanhong and Dixit, Anushri},
  journal={2026 IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026},
}