Q-network vs Policy Network in Reinforcement Learning

Q-network (aka Deep Q-network)

  • Input: State (Env+Agent), Action
  • Output:
    • Single neuron
    • Q-value (immediate reward plus discounted future return)
  • Characteristics:
    • Find the max action by comparing Q-values, and train the network toward the label q(s, a)
    • Closely linked with Q-learning concept
    • Explore: Random
    • Exploit: use the current Q-network to pick the action with the maximum Q-value
    • Training: Bellman target Q-value (a minimal code sketch follows this list)
      • Update Q-values one at a time: q(s, a) = r + γ · max_a' q(s', a'), with discount factor γ
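
A minimal sketch of this Q-network setup in PyTorch, assuming a one-hot action encoding; the names (QNet, best_action, bellman_target, etc.) and the discount factor value are illustrative assumptions, not taken from the original notes.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Takes a state concatenated with a one-hot action; outputs a single Q-value."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single neuron: Q(s, a)
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def best_action(qnet, state, valid_actions, action_dim):
    """Exploit: evaluate Q(s, a) for every valid action and take the argmax."""
    with torch.no_grad():
        qs = [qnet(state, torch.eye(action_dim)[a]) for a in valid_actions]
    return valid_actions[int(torch.stack(qs).argmax())]

def choose_action(qnet, state, valid_actions, action_dim, epsilon=0.1):
    """Explore with probability epsilon (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    return best_action(qnet, state, valid_actions, action_dim)

def bellman_target(qnet, reward, next_state, next_valid_actions, action_dim,
                   gamma=0.99, done=False):
    """Target q = r + gamma * max_a' Q(s', a'); just r at episode end."""
    if done or not next_valid_actions:
        return torch.tensor(float(reward))
    with torch.no_grad():
        next_qs = [qnet(next_state, torch.eye(action_dim)[a])
                   for a in next_valid_actions]
    return reward + gamma * torch.stack(next_qs).max()

def train_step(qnet, optimizer, state, action_onehot, target):
    """One gradient step pulling Q(s, a) toward the Bellman target."""
    loss = nn.functional.mse_loss(qnet(state, action_onehot), target.reshape(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```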

Policy Network (applicable only when the number of actions is fixed)

  • Input: State (Env+Agent)
  • Output:
    • One softmax neuron per action; argmax over the softmax identifies the max action
    • No Q-values for the actions, only probabilities
  • Characteristics:
    • Find the action with the max q(s, a) and train toward that action as the label (max_action)
    • Not fundamental in theory; used only in some situations
    • Explore: Random
    • Exploit: use the current policy network to pick the max action via softmax and argmax
    • Training: no Bellman target (a sketch follows this list)
      • For the current max action, play with the policy network until the episode ends and accumulate the rewards
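
A sketch of the policy network described above, again in PyTorch. The training function is the standard REINFORCE (Monte Carlo return) update, shown here as one common way to "play until episode end to accumulate rewards"; it is an assumed concrete form, not necessarily the exact scheme these notes have in mind.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """One softmax output per action; works only when the number of actions is fixed."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # action probabilities

def exploit(policy, state):
    """Exploit: argmax over the softmax output picks the current best action."""
    with torch.no_grad():
        return int(policy(state).argmax())

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """episode: list of (state, action, reward) tuples from one full rollout."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):  # accumulate the discounted return backwards
        g = r + gamma * g
        returns.insert(0, g)
    loss = 0.0
    for (state, action, _), g in zip(episode, returns):
        loss = loss - torch.log(policy(state)[action]) * g  # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```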

Pros and cons:

  • Q-network (Popular, flexible):
    • Works with a variable (unfixed) number of actions
    • Q-values for all actions
    • Invalid actions are never fed to the network, so it cannot return wrong actions
  • Policy Network
    • Works only with a fixed number of actions
    • Produces outputs (probabilities) for all actions, but no Q-values
    • When the policy network is wrong, invalid actions may appear in the output with high probabilities (see the masking sketch after this list)
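
To make the invalid-action point concrete: with the Q-network above, only valid (state, action) pairs are ever evaluated, so an invalid action can never be selected; with the policy network, the softmax covers every action and invalid ones must be masked out explicitly. The sketch below assumes a valid_actions list of currently legal action indices.

```python
import torch

def pick_valid_action(policy, state, valid_actions):
    """Zero out invalid actions in the softmax output before taking the argmax."""
    with torch.no_grad():
        probs = policy(state)
    mask = torch.zeros_like(probs)
    mask[valid_actions] = 1.0
    return int((probs * mask).argmax())  # argmax over valid actions only
```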

Bottom line:

  • The Q-network should be the default choice, not the policy network.
