[RL/Value-based] Q-Learning

Off-policy: learns the optimal $Q$, regardless of the behavior policy that generates the data.

$$Q(s,a)\leftarrow Q(s,a)+\alpha[r+\gamma \max_{a'} Q(s',a')-Q(s,a)]$$
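A minimal sketch of this update on a tabular $Q$, assuming discrete states and actions indexed by integers. The function name `q_learning_update`, the list-of-lists table, and the `done` flag for terminal transitions are illustrative assumptions, not part of the original note.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a hypothetical table stored as a list of lists: Q[state][action].
    """
    # The target bootstraps from the greedy value at s_next (the max over a'),
    # whatever action the behavior policy takes next. That is the off-policy part.
    # Assumption: on a terminal transition there is no next-state value.
    target = r if done else r + gamma * max(Q[s_next])
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Toy table with 3 states and 2 actions, initialised to zero.
Q = [[0.0, 0.0] for _ in range(3)]
Q[1] = [0.5, 2.0]

# One transition (s=0, a=1, r=1.0, s'=1): target = 1.0 + 0.9 * 2.0 = 2.8
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

With $\alpha=0.5$, the entry `Q[0][1]` moves halfway from 0 toward the target 2.8.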

Idea: directly approximate the optimal action-value function $Q^\star(s,a)$.
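
For reference, $Q^\star$ is the fixed point of the Bellman optimality equation, written here assuming the expectation is taken over the next state $s'$ and reward $r$ under the environment dynamics (notation not in the original note):

$$Q^\star(s,a)=\mathbb{E}\left[r+\gamma \max_{a'} Q^\star(s',a')\mid s,a\right]$$

The Q-learning target $r+\gamma \max_{a'} Q(s',a')$ is a single-sample estimate of the right-hand side with $Q$ in place of $Q^\star$, so each update nudges $Q(s,a)$ toward this fixed point.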