Q-network vs Policy Network in Reinforcement Learning

Q-network (aka Deep Q-network)

  • Input: State (Env+Agent), Action
  • Output:
    • Single neuron
    • Q-value (immediate reward plus discounted future return)
  • Characteristics:
    • Find the max action by comparing Q-values, and train the network toward the label q(s, a)
    • Closely linked with Q-learning concept
    • Explore: Random
    • Exploit: use the current Q-network to pick the action with the maximum Q-value
    • Training: Bellman target Q-value (a minimal code sketch follows this list)
      • Update Q-values one at a time: q(s, a) = r + γ · max_a' q(s', a'), with discount factor γ
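
A minimal sketch of this Q-network setup in PyTorch, assuming a one-hot action encoding; the names (QNet, best_action, bellman_target, etc.) and the discount factor value are illustrative assumptions, not taken from the original notes.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Takes a state concatenated with a one-hot action; outputs a single Q-value."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single neuron: Q(s, a)
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def best_action(qnet, state, valid_actions, action_dim):
    """Exploit: evaluate Q(s, a) for every valid action and take the argmax."""
    with torch.no_grad():
        qs = [qnet(state, torch.eye(action_dim)[a]) for a in valid_actions]
    return valid_actions[int(torch.stack(qs).argmax())]

def choose_action(qnet, state, valid_actions, action_dim, epsilon=0.1):
    """Explore with probability epsilon (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    return best_action(qnet, state, valid_actions, action_dim)

def bellman_target(qnet, reward, next_state, next_valid_actions, action_dim,
                   gamma=0.99, done=False):
    """Target q = r + gamma * max_a' Q(s', a'); just r at episode end."""
    if done or not next_valid_actions:
        return torch.tensor(float(reward))
    with torch.no_grad():
        next_qs = [qnet(next_state, torch.eye(action_dim)[a])
                   for a in next_valid_actions]
    return reward + gamma * torch.stack(next_qs).max()

def train_step(qnet, optimizer, state, action_onehot, target):
    """One gradient step pulling Q(s, a) toward the Bellman target."""
    loss = nn.functional.mse_loss(qnet(state, action_onehot), target.reshape(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```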

Policy Network (applicable only when the number of actions is fixed)

  • Input: State (Env+Agent)
  • Output:
    • One softmax neuron per action; argmax over the softmax identifies the max action
    • No Q-values for the actions, only probabilities
  • Characteristics:
    • Find the action with the max q(s, a) and train toward that action as the label (max_action)
    • Not fundamental in theory; used only in some situations
    • Explore: Random
    • Exploit: use the current policy network to pick the max action via softmax and argmax
    • Training: no Bellman target (a sketch follows this list)
      • For the current max action, play with the policy network until the episode ends and accumulate the rewards
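
A sketch of the policy network described above, again in PyTorch. The training function is the standard REINFORCE (Monte Carlo return) update, shown here as one common way to "play until episode end to accumulate rewards"; it is an assumed concrete form, not necessarily the exact scheme these notes have in mind.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """One softmax output per action; works only when the number of actions is fixed."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # action probabilities

def exploit(policy, state):
    """Exploit: argmax over the softmax output picks the current best action."""
    with torch.no_grad():
        return int(policy(state).argmax())

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """episode: list of (state, action, reward) tuples from one full rollout."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):  # accumulate the discounted return backwards
        g = r + gamma * g
        returns.insert(0, g)
    loss = 0.0
    for (state, action, _), g in zip(episode, returns):
        loss = loss - torch.log(policy(state)[action]) * g  # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```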

Pros and cons:

  • Q-network (Popular, flexible):
    • Works with a variable (unfixed) number of actions
    • Q-values for all actions
    • Invalid actions are never fed to the network, so it cannot return wrong actions
  • Policy Network
    • Works only with a fixed number of actions
    • Produces outputs (probabilities) for all actions, but no Q-values
    • When the policy network is wrong, invalid actions may appear in the output with high probabilities (see the masking sketch after this list)
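
To make the invalid-action point concrete: with the Q-network above, only valid (state, action) pairs are ever evaluated, so an invalid action can never be selected; with the policy network, the softmax covers every action and invalid ones must be masked out explicitly. The sketch below assumes a valid_actions list of currently legal action indices.

```python
import torch

def pick_valid_action(policy, state, valid_actions):
    """Zero out invalid actions in the softmax output before taking the argmax."""
    with torch.no_grad():
        probs = policy(state)
    mask = torch.zeros_like(probs)
    mask[valid_actions] = 1.0
    return int((probs * mask).argmax())  # argmax over valid actions only
```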

Bottom line:

  • The Q-network should be the default choice, not the policy network.
