Q-network vs Policy Network in Reinforcement Learning
July 4, 2023
Q-network (aka Deep Q-network)
- Input: State (Env+Agent), Action
- Output:
- Single neuron
- Q-value (immediate reward plus discounted future return)
- Characteristics:
- Pick the action with the maximum Q-value, and train the output toward the target label = q(s,a)
- Closely linked with Q-learning concept
- Explore: Random
- Exploit: Use the current Q-network and pick the action with the maximum Q-value
- Training: Bellman target Q-value (see the sketch after this list)
- Update Q-values one (s, a) pair at a time: q(s, a) = r + γ · max_a' q(s', a')
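A minimal PyTorch sketch of the Q-network idea above. The two-layer architecture, the state/action sizes, the discount factor, and the epsilon value are illustrative assumptions, not taken from the note:

```python
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 3, 0.99   # illustrative sizes and discount factor

class QNetwork(nn.Module):
    """Input: state + one-hot action; output: a single neuron, the Q-value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

def q_values(q_net, state):
    """Score every action for one state (needed for the max over actions)."""
    actions = torch.eye(N_ACTIONS)              # one one-hot vector per action
    states = state.expand(N_ACTIONS, -1)        # repeat the state for each action
    return q_net(states, actions)               # shape: (N_ACTIONS,)

def select_action(q_net, state, epsilon=0.1):
    """Explore: random action. Exploit: argmax over the current Q-values."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return q_values(q_net, state).argmax().item()

def bellman_target(q_net, reward, next_state, done):
    """q = r + gamma * max_a' q(s', a'); no bootstrap on terminal states."""
    with torch.no_grad():
        future = 0.0 if done else q_values(q_net, next_state).max().item()
    return reward + GAMMA * future
```

Training would then regress the network output for the taken (state, action) pair toward `bellman_target` with, for example, an MSE loss.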
Policy Network (applicable to a fixed number of actions only)
- Input: State (Env+Agent)
- Output:
- One softmax output neuron per action; argmax identifies the best action
- No Q-values are output, only action probabilities
- Characteristics:
- Find the best action and train the softmax output toward it (label = max_action)
- Not as fundamental in theory; used only in some situations
- Explore: Random
- Exploit: Use the current policy network and pick the action with the highest softmax probability (argmax)
- Training: No Bellman target (see the sketch after this list)
- For the chosen action, play with the policy network until the episode ends and accumulate the rewards
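The note's "play until episode end to accumulate rewards" reads like a Monte Carlo return; the standard form of that training rule is REINFORCE, sketched below. The network sizes, the optimizer, and the sampling-based exploration are my assumptions, not details from the note:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 3, 0.99   # illustrative sizes and discount factor

class PolicyNetwork(nn.Module):
    """Input: state; output: softmax probabilities over a fixed set of actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

def select_action(policy, state, explore=False):
    """Explore: sample from the softmax. Exploit: argmax of the softmax."""
    with torch.no_grad():
        probs = policy(state)
    if explore:
        return torch.multinomial(probs, 1).item()
    return probs.argmax().item()

def reinforce_update(policy, optimizer, states, actions, rewards):
    """After the episode ends, push up the log-probability of each taken action,
    weighted by the discounted return accumulated from that step onward."""
    returns, g = [], 0.0
    for r in reversed(rewards):               # Monte Carlo returns, no Bellman bootstrap
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()

    loss = torch.tensor(0.0)
    for s, a, g in zip(states, actions, returns):
        log_prob = torch.log(policy(s)[a])
        loss = loss - g * log_prob             # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Here `optimizer` would be something like `torch.optim.Adam(policy.parameters(), lr=1e-3)`, called once per finished episode with the lists of visited states, taken actions, and per-step rewards.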
Pros and cons:
- Q-network (Popular, flexible):
- Works with a variable number of actions
- Q-values for all actions
- Invalid actions are never fed to the network, so it cannot select them (see the sketch after this list)
- Policy Network:
- Works only with a fixed number of actions
- Outputs a score for every action too, but as softmax probabilities rather than Q-values
- When the policy network is wrong, invalid actions may appear in the output with high probabilities
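To make the invalid-actions point concrete, here is a small sketch reusing `QNetwork`, `N_ACTIONS`, and the one-hot encoding from the first sketch (all illustrative assumptions): only the currently legal actions are scored, so an illegal one can never win the argmax.

```python
import torch

def best_valid_action(q_net, state, valid_actions):
    """Evaluate q(s, a) only for actions that are legal in the current state."""
    onehots = torch.eye(N_ACTIONS)[valid_actions]    # one-hot rows for legal actions only
    states = state.expand(len(valid_actions), -1)    # repeat the state for each legal action
    q = q_net(states, onehots)
    return valid_actions[q.argmax().item()]

# e.g. if only actions 0 and 2 are legal in this state:
# best_valid_action(q_net, state, valid_actions=[0, 2])
```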
Bottom line:
- The Q-network should be the choice, not the policy network.