[RL/Preference-based] DPO
Tags: Reinforcement Learning, Building Blocks, Policy Optimization

DPO (Direct Preference Optimization) is an alignment method that needs no separately trained reward model: the model learns human preferences directly, becoming more likely to output the answers humans prefer.
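A minimal sketch of the per-pair DPO loss under these ideas. It assumes you already have summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and a frozen reference model; the function name and the `beta` value are illustrative, not from the original note.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are per-sequence log-probabilities under the trained policy
    and a frozen reference model. beta controls how strongly the policy
    is pushed away from the reference.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios
    # between the chosen and rejected responses.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(logits)): small when the chosen response is favored.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# The loss shrinks as the policy favors the chosen answer more than
# the reference model does:
no_margin = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # no preference learned
favored   = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # chosen clearly favored
```

Note that the reference model appears only through log-ratios, which is what lets DPO skip reward-model training entirely.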
[RL/Policy Optimization] PPO
Tags: Reinforcement Learning, Building Blocks, Policy Optimization

Symbol definitions

| Symbol | Meaning |
| --- | --- |
| $\pi_\theta(a\|s)$ | policy probability with parameters $\theta$ |
| $\theta_{old}$ | the old policy used for sampling |
| $r_t(\theta)$ | probability ratio, borrowing the idea of importance sampling |
| $A_t$ | advantage function, measuring how good an action is relative to a baseline |
| $\epsilon$ | clipping threshold |

Objective function

Define the single-step surrogate objective:
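The clipped surrogate that these symbols build toward can be sketched numerically. This is an illustrative single-step version, assuming log-probabilities of the sampled action under the new and old policies; the function name and `epsilon=0.2` default are assumptions, not from the original note.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, epsilon=0.2):
    """Single-step PPO clipped surrogate (illustrative sketch).

    r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed from
    log-probabilities for numerical stability. Taking the min with the
    clipped ratio removes any incentive to push r_t(theta) outside
    [1 - epsilon, 1 + epsilon].
    """
    ratio = math.exp(logp_new - logp_old)                    # r_t(theta)
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))  # clip(r_t, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the objective saturates once the ratio
# exceeds 1 + epsilon, capping how far a single update can move the policy:
capped = ppo_clip_objective(0.5, 0.0, 1.0)    # ratio ~1.65 -> clipped to 1.2
pessimistic = ppo_clip_objective(-0.5, 0.0, -1.0)
```

In practice this objective is averaged over a batch of timesteps and maximized with gradient ascent; the pointwise form above is just the term inside that expectation.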