Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Date: 2026-02-27

Fetched: 2026-02-28T01:46:40.445633+00:00

Authors

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang

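Reinforcement learning with verifiable rewards (RLVR) penalizes all errors uniformly, which reduces reasoning diversity. A confidence-aware asymmetric error penalty addresses this by dynamically modulating the advantage of each rollout according to that rollout's confidence.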
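
The summary above does not give the exact penalty formula, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a group-normalized advantage (reward minus group mean, divided by group standard deviation), a rollout confidence defined as the geometric-mean token probability, and a hypothetical scaling rule `1 + lam * confidence` applied only to rollouts with negative advantage. The function names and the `lam` parameter are illustrative and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def rollout_confidence(token_logprobs):
    """Assumed confidence measure: geometric-mean token probability, in (0, 1].

    Expects a non-empty list of per-token log-probabilities.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def asymmetric_advantages(rewards, confidences, lam=1.0):
    """Scale only negative advantages by (1 + lam * confidence).

    Hypothetical rule: a confidently wrong rollout is penalized more than
    an uncertain wrong one; positive advantages are left unchanged.
    """
    adjusted = []
    for adv, conf in zip(group_advantages(rewards), confidences):
        if adv < 0:
            adjusted.append(adv * (1.0 + lam * conf))
        else:
            adjusted.append(adv)
    return adjusted
```

With this rule, two incorrect rollouts in the same group receive different penalties when their confidences differ, while correct rollouts keep their standard advantage. Setting `lam=0` recovers the uniform penalty.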