[RL/Basic] MDP

Markov Decision Process

$$J=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
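For reference, the standard textbook formulation of an MDP (not specific to these notes) is the tuple

$$\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma),\qquad P(s'\mid s,a)=\Pr\big(s_{t+1}=s'\mid s_t=s,\,a_t=a\big),$$

and the objective $J$ above is the expected discounted return under a policy $\pi$, i.e. $J(\pi)=\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty}\gamma^t r_t\big]$.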

[RL/Preference-based] DPO

DPO (Direct Preference Optimization) is an alignment method that does not require training a reward model: the model learns human preferences directly, so it becomes more likely to produce the answers humans prefer.
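For completeness, the per-pair loss from the original DPO paper (Rafailov et al., 2023), where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\text{ref}}$ is the frozen reference policy, and $\beta$ scales the implicit KL penalty:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$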

[RL/Policy Gradient] Actor-Critic

Introduces a baseline $V(s)$ to reduce the variance of the policy gradient.
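As a sketch of how the baseline enters (the standard one-step TD form, nothing specific to these notes): the actor follows the policy gradient with the advantage in place of the raw return, while the critic fits $V(s)$ by regressing on the TD target $r_t+\gamma V(s_{t+1})$:

$$\nabla_\theta J \approx \mathbb{E}\big[\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A_t\big],\qquad A_t \approx r_t+\gamma V(s_{t+1})-V(s_t)$$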

[RL/Policy Optimization] PPO

Notation

| Symbol | Meaning |
| --- | --- |
| $\pi_\theta(a \mid s)$ | Policy probability with parameters $\theta$ |
| $\theta_{old}$ | The old policy used for sampling |
| $r_t(\theta)$ | Probability ratio, following the idea of importance sampling |
| $A_t$ | Advantage function, measuring how good an action is relative to the baseline |
| $\epsilon$ | Clipping threshold |

Objective Function

Define the per-step surrogate objective:
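Using the symbols above, this is the clipped surrogate from the PPO paper (Schulman et al., 2017):

$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)},\qquad L_t^{CLIP}(\theta)=\min\!\Big(r_t(\theta)A_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)A_t\Big)$$

and training maximizes $\mathbb{E}_t\big[L_t^{CLIP}(\theta)\big]$.

A minimal PyTorch sketch of this loss, assuming the caller supplies per-sample log-probabilities under the new and old policies plus advantage estimates (the function name and signature are illustrative, not from any particular codebase):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate, averaged over the batch (to be minimized)."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic bound: element-wise minimum of the unclipped and clipped surrogates.
    return -torch.min(unclipped, clipped).mean()
```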

[RL/Value-based] Double Q-Learning

Reduces the overestimation bias of standard Q-Learning by decoupling action selection from action evaluation.
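A sketch of the tabular update from van Hasselt (2010): keep two estimates $Q_A$ and $Q_B$, and on each step update one of them at random, e.g. for $Q_A$:

$$Q_A(s_t,a_t)\leftarrow Q_A(s_t,a_t)+\alpha\Big[r_t+\gamma\,Q_B\big(s_{t+1},\arg\max_{a'}Q_A(s_{t+1},a')\big)-Q_A(s_t,a_t)\Big]$$

so one estimate selects the greedy action while the other evaluates it (and symmetrically when $Q_B$ is updated).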

[RL/Value-based] Q-Learning

Off-policy; learns the optimal $Q$:
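The standard tabular update with learning rate $\alpha$:

$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\Big[r_t+\gamma\max_{a'}Q(s_{t+1},a')-Q(s_t,a_t)\Big]$$

The $\max$ over next actions is what makes it off-policy: the target does not depend on the action actually taken at $s_{t+1}$.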

[RL/Value-based] SARSA

On-policy; updates by following the current policy:
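The corresponding tabular update, which uses the action $a_{t+1}$ actually selected by the current policy at $s_{t+1}$:

$$Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha\Big[r_t+\gamma\,Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\Big]$$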

Generative Reinforcement Learning Content

| Category | Representative methods | Mathematical essence | Strengths | Weaknesses / caveats |
| --- | --- | --- | --- | --- |
| Flow-based (Flow Matching) | Q-Flow, Value Flows, FM-RL | ODE-based deterministic mapping with exact likelihood | Strong interpretability, stable training, natural propagation of continuous values | Weaker at modeling noise, backpropagation is complicated |
| Diffusion-based | Diffusion Policy, Diffusion Reward Model, DDPM-RL | Stochastic SDE, denoising likelihood | Strong robustness to noise, can generate multimodal rewards | High training cost, slow inference |
| VAE-based | RewardVAE, ValueVAE | Latent variable model, amortized inference | Simple structure, fast approximation of the reward distribution | Hard to model a high-fidelity reward landscape, prone to mode collapse |
| Energy-based Models (EBM) | Value EBM, Reward EBM | Unnormalized density modeling | Can express complex energy landscapes, fits naturally with RL theory | Sampling is difficult, training requires MCMC / contrastive losses |
| GAN-based | GAIL, AIRL, RewardGAN | Implicit generative model | Classic adversarial IRL framework | Unstable reward signal, no explicit likelihood |
| Normalizing Flows (NF) | RealNVP, MAF, Glow-RL | Invertible deterministic mapping | Exact log-likelihood, usable for reward density estimation | Constrained by the Jacobian, ill-suited to complex distributions |
| Score-based models | Score Matching, EDM, Denoising Score RL | Learn $\nabla\log p(x)$ (energy gradients) | Natural fit for modeling reward gradients | Complex training, requires SDE derivations |

Value Flows

Paper Reading: