| Flow-based (Flow Matching) | Q-Flow, Value Flows, FM-RL | ODE-based deterministic mapping with exact likelihood (sketch below) | Highly interpretable; stable training; continuous value propagation arises naturally | Weaker at modeling noise; backpropagating gradients through the ODE is involved |
| Diffusion-based | Diffusion Policy, Diffusion Reward Model, DDPM-RL | Stochastic SDE, denoising likelihood (sketch below) | Robust to noise; can generate multimodal rewards | High training cost; slow inference |
| VAE-based | RewardVAE, ValueVAE | Latent-variable model, amortized inference (sketch below) | Simple architecture; fast approximation of the reward distribution | Hard to model a high-fidelity reward landscape; prone to mode collapse |
| Energy-based Models (EBM) | Value EBM, Reward EBM | Unnormalized density modeling (sketch below) | Can express complex energy landscapes; fits naturally with RL theory | Sampling is difficult; training requires MCMC or contrastive losses |
| GAN-based | GAIL, AIRL, RewardGAN | Implicit generative model (sketch below) | Classic adversarial IRL framework | Unstable reward signal; no explicit likelihood |
| Normalizing Flows (NF) | RealNVP, MAF, Glow-RL | Invertible deterministic mapping (sketch below) | Exact log-likelihood; usable for reward density estimation | Constrained by invertibility/Jacobian structure; less suited to complex distributions |
| Score-based models | Score Matching, EDM, Denoising Score RL | Learn ∇log p(x), the negative energy gradient (sketch below) | Natural fit for modeling reward gradients | Complex training; requires SDE-based derivations |
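The sketches below illustrate the core training objective behind each row of the table; they are minimal, hypothetical examples rather than implementations of the cited methods. For the flow-based row, this is a conditional flow matching loss on a straight-line path (rectified-flow style): the network names (`VelocityField`), MLP sizes, and dimensions are assumptions for illustration only.

```python
# Minimal sketch, assuming a small MLP velocity field and a linear interpolation path.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """v_theta(x_t, t): predicts the velocity of the probability-flow ODE."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: VelocityField, x1: torch.Tensor) -> torch.Tensor:
    """Regress v_theta(x_t, t) onto the target velocity (x1 - x0) of the linear path."""
    x0 = torch.randn_like(x1)               # noise sample
    t = torch.rand(x1.shape[0], 1)          # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1             # point on the straight-line path
    target_v = x1 - x0                      # constant velocity along that path
    return ((model(x_t, t) - target_v) ** 2).mean()

# Usage: x1 would be samples of the quantity being modelled (e.g. return or reward targets).
model = VelocityField(dim=4)
loss = flow_matching_loss(model, torch.randn(64, 4))
loss.backward()
```

Sampling then integrates the learned ODE from noise to data deterministically, which is why likelihoods are tractable and the mapping is interpretable.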
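For the diffusion row, a sketch of the standard DDPM noise-prediction objective (the simple surrogate of the denoising likelihood bound). The schedule, `NoisePredictor` MLP, and dimensions are illustrative assumptions, not the architecture of any cited method.

```python
# Minimal DDPM-style sketch, assuming a linear beta schedule and an MLP epsilon-predictor.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product \bar{alpha}_t

class NoisePredictor(nn.Module):
    """epsilon_theta(x_t, t); real systems use larger conditional networks."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))

def ddpm_loss(model: NoisePredictor, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # closed-form forward noising
    return ((model(x_t, t) - eps) ** 2).mean()           # simple denoising objective

model = NoisePredictor(dim=4)
loss = ddpm_loss(model, torch.randn(64, 4))
loss.backward()
```

Inference reverses the noising chain step by step, which is where the slow-sampling cost in the table comes from.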
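For the VAE row, a sketch of a latent-variable reward model trained with the usual ELBO (reconstruction plus KL). The class name `RewardVAE` is borrowed from the table purely as a label; the encoder/decoder shapes are assumptions.

```python
# Minimal ELBO sketch, assuming a Gaussian posterior with the reparameterisation trick.
import torch
import torch.nn as nn

class RewardVAE(nn.Module):
    def __init__(self, obs_dim: int, z_dim: int = 8, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim + 1, hidden), nn.SiLU(),
                                     nn.Linear(hidden, 2 * z_dim))
        self.decoder = nn.Sequential(nn.Linear(obs_dim + z_dim, hidden), nn.SiLU(),
                                     nn.Linear(hidden, 1))

    def forward(self, obs, reward):
        mu, log_var = self.encoder(torch.cat([obs, reward], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterisation
        recon = self.decoder(torch.cat([obs, z], -1))
        return recon, mu, log_var

def elbo_loss(model, obs, reward):
    recon, mu, log_var = model(obs, reward)
    recon_term = ((recon - reward) ** 2).mean()                        # Gaussian NLL up to a constant
    kl_term = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()  # KL(q(z|x) || N(0, I))
    return recon_term + kl_term

model = RewardVAE(obs_dim=4)
loss = elbo_loss(model, torch.randn(64, 4), torch.randn(64, 1))
loss.backward()
```

Amortized inference keeps training cheap, but the unimodal Gaussian posterior is one reason such models tend toward mode collapse on complex reward landscapes.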
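For the EBM row, a sketch of an unnormalized energy model trained with a simple noise-contrastive objective; negatives are drawn from a Gaussian proposal here for brevity, whereas practical systems often use MCMC/Langevin negatives as the table notes. Names and sizes are illustrative.

```python
# Minimal contrastive EBM sketch, assuming Gaussian proposal negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Energy(nn.Module):
    """Scalar energy E_theta(x); p(x) ∝ exp(-E_theta(x)) is left unnormalised."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def contrastive_loss(energy: Energy, x_pos: torch.Tensor) -> torch.Tensor:
    x_neg = torch.randn_like(x_pos)                           # proposal negatives
    logits = torch.cat([-energy(x_pos), -energy(x_neg)])      # low energy = high score
    labels = torch.cat([torch.ones(len(x_pos)), torch.zeros(len(x_neg))])
    return F.binary_cross_entropy_with_logits(logits, labels)

energy = Energy(dim=4)
loss = contrastive_loss(energy, torch.randn(64, 4))
loss.backward()
```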
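For the GAN row, a sketch of the adversarial IRL recipe behind GAIL-style methods: a discriminator separates expert from policy transitions, and the reward is read off its output (implicitly, with no explicit likelihood ever formed). The `Discriminator` architecture and dimensions are assumptions.

```python
# Minimal GAIL-style sketch, assuming an MLP discriminator over (state, action) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], -1)).squeeze(-1)   # logit of "is expert"

def discriminator_loss(disc, expert_obs, expert_act, policy_obs, policy_act):
    logits_e = disc(expert_obs, expert_act)
    logits_p = disc(policy_obs, policy_act)
    return (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e)) +
            F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))

def gail_reward(disc, obs, act):
    # r(s, a) = -log(1 - D(s, a)) = -logsigmoid(-logit); only defined through the discriminator.
    with torch.no_grad():
        return -F.logsigmoid(-disc(obs, act))

disc = Discriminator(obs_dim=4, act_dim=2)
loss = discriminator_loss(disc, torch.randn(32, 4), torch.randn(32, 2),
                          torch.randn(32, 4), torch.randn(32, 2))
loss.backward()
reward = gail_reward(disc, torch.randn(32, 4), torch.randn(32, 2))
```

Because the reward depends on a discriminator that keeps moving during training, the signal is non-stationary, which is the instability the table points to.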
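For the normalizing-flow row, a sketch of exact log-likelihood through a single RealNVP-style affine coupling layer via the change-of-variables formula; a usable density model would stack several layers with alternating masks. Layer sizes are illustrative assumptions.

```python
# Minimal coupling-layer sketch: log p(x) = log N(z; 0, I) + log |det dz/dx|.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Transforms the second half of x conditioned on the first half; invertible by construction."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x_a).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales well-conditioned
        z_b = x_b * log_s.exp() + t
        log_det = log_s.sum(dim=-1)          # triangular Jacobian => sum of log-scales
        return torch.cat([x_a, z_b], dim=-1), log_det

def log_likelihood(layer: AffineCoupling, x: torch.Tensor) -> torch.Tensor:
    z, log_det = layer(x)
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)   # standard normal prior
    return log_pz + log_det

layer = AffineCoupling(dim=4)
nll = -log_likelihood(layer, torch.randn(64, 4)).mean()
nll.backward()
```

The same structural trick that makes the log-determinant cheap is also the expressiveness bottleneck noted in the table.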
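For the score-based row, a sketch of denoising score matching: the network is regressed onto the score of Gaussian-perturbed data, -(x̃ - x)/σ², which is the negative energy gradient the row refers to. The `ScoreNet` architecture and the fixed noise level are assumptions made for brevity.

```python
# Minimal denoising score matching sketch, assuming a single fixed noise level sigma.
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """s_theta(x, sigma): approximates the score ∇_x log p_sigma(x)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x, sigma):
        return self.net(torch.cat([x, sigma], dim=-1))

def dsm_loss(model: ScoreNet, x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target_score = -noise / sigma ** 2                 # ∇ log N(x_tilde; x, sigma^2 I)
    sigma_in = torch.full((x.shape[0], 1), sigma)
    return ((model(x_tilde, sigma_in) - target_score) ** 2).mean()

model = ScoreNet(dim=4)
loss = dsm_loss(model, torch.randn(64, 4))
loss.backward()
```

Full score-based pipelines anneal over many noise levels and sample by reversing an SDE/ODE, which is the derivation overhead listed in the table.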