Deep Q Network (DQN) Architectures

Basic Q Network

  • state to a vector of Q values, one for each action (see the sketch below), or
  • state and action to a scalar Q(s, a) value
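
The following is a minimal sketch of the first variant, assuming a small PyTorch multilayer perceptron; the layer sizes, state dimension, and action count are illustrative choices only. The second variant would instead take the concatenated state and action as input and output a single scalar.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a vector of Q values, one per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)

# Greedy action selection: take the argmax over the Q value vector.
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
action = q_net(state).argmax(dim=1)
```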

Basic DQN Algorithm

Deep Q-learning with Experience Replay (Google DeepMind, 2013)
Human-level Control through Deep Reinforcement Learning (Nature, 2015)

  1. Execute a policy step and store the experienced data point \([s, a, r, s']\) in an experience replay buffer (see the buffer sketch after this list).
  2. Sample a batch of uncorrelated experience data points from the replay buffer and use it to optimize (train) the policy network.
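
Below is a minimal sketch of such an experience buffer, assuming each data point is stored as a (s, a, r, s', done) tuple; the capacity and batch size are illustrative, not values prescribed by the papers.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        # A bounded deque: once full, the oldest transitions are dropped first.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions of the same episode.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```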

The fact that the experienced data is not used to optimize the policy directly, but only through sampling of uncorrelated batches from the experience buffer, stabilizes the policy network behavior and the policy learning process. One such optimization step is sketched below.
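
The sketch assumes the standard temporal-difference loss; the function name, hyperparameters, and the use of a mean-squared-error loss are illustrative assumptions, and the `q_net` and `buffer` objects are taken from the sketches above. In this basic form the same network also provides the bootstrap target; the next section replaces that with a separate target network.

```python
import torch
import torch.nn.functional as F

def train_step(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    s, a, r, s_next, done = buffer.sample(batch_size)
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    # Q(s, a) for the actions that were actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # TD target: r + gamma * max_a' Q(s', a'), truncated at episode ends.
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```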

Double DQN Architecture

Deep Reinforcement Learning with Double Q-learning (Google DeepMind)

  • double (target and behavior networks)
    • prevents “tail chasing” / oscillating policy
    • stabilizes the learning process

To stabilize the training process, two independent networks are used, one for policy execution and one for policy optimization. Once in a while, the algorithm either switches the roles of the two networks or copies the optimized policy parameters into the policy-execution network, so that both stay up to date. A minimal sketch of this setup follows below.
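
In the sketch, the helper names (`make_q_net`, `double_dqn_target`, `sync_target`) and the synchronization period are illustrative assumptions. The Double Q-learning target uses the optimized (online) network to select the next action and the second (target) network to evaluate it.

```python
import copy

import torch
import torch.nn as nn

def make_q_net(state_dim=4, num_actions=2, hidden=128):
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, num_actions),
    )

online_net = make_q_net()               # optimized on every training step
target_net = copy.deepcopy(online_net)  # kept fixed between synchronizations

def double_dqn_target(r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # Action selection with the online network ...
        a_next = online_net(s_next).argmax(dim=1, keepdim=True)
        # ... action evaluation with the target network.
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)
    return r + gamma * (1.0 - done) * q_next

def sync_target(step, period=1_000):
    # Once in a while, copy the optimized parameters into the other network.
    if step % period == 0:
        target_net.load_state_dict(online_net.state_dict())
```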

Dueling Network Architecture

Dueling Network Architectures for Deep Reinforcement Learning (Google DeepMind)

  • dueling (value / advantage decomposition)
    • helps the network learn the state value and the action advantages independently, which is often simpler
    • makes independent generalization of the state value and the action advantages possible
    • uses an aggregation module to combine the state value and the action advantages into Q values (see the sketch below)
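
A minimal sketch of the dueling head is shown below, assuming a shared feature layer followed by separate value and advantage streams; the aggregation module subtracts the mean advantage so that the decomposition into V(s) and A(s, a) stays identifiable.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                # state value V(s)
        self.advantage = nn.Linear(hidden, num_actions)  # advantages A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        v = self.value(h)        # shape: (batch, 1)
        a = self.advantage(h)    # shape: (batch, num_actions)
        # Aggregation module: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return v + a - a.mean(dim=1, keepdim=True)
```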