The Q Learning Algorithm
- the Bellman optimality equation can be adapted to work as an update rule that optimizes the policy
- model-free algorithm: it learns from sampled transitions and needs no model of the environment's dynamics
- each update samples a reward by following the behavior policy and adds the discounted maximum Q value of the new state; that maximum may not be equal to the Q value of the action the behavior policy actually chooses in the new state
- because of this discrepancy, Q learning is called an off-policy algorithm
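The discrepancy described above can be made concrete with a small numeric sketch. All values here (`qNext`, `r`, `gamma`, `alpha`, `qOld`, `actionTakenNext`) are made-up numbers for illustration, not taken from the Gridworld examples. The on-policy target is shown only for contrast; it is the kind of target an on-policy method such as SARSA would use.

```javascript
// Hypothetical Q values of the new state s', one entry per action.
const qNext = { up: 0.2, right: 0.8, down: 0.1, left: 0.4 };

const r = -1;       // sampled reward (assumed value)
const gamma = 0.9;  // discount factor (assumed value)
const alpha = 0.1;  // learning rate (assumed value)
const qOld = 0.5;   // current estimate Q(s, a) (assumed value)

// Action an exploring behavior policy happened to pick in s'.
const actionTakenNext = "left";

// Q learning target: uses the maximum Q value of s', whatever the
// behavior policy does next. Roughly -1 + 0.9 * 0.8 = -0.28.
const qLearningTarget = r + gamma * Math.max(...Object.values(qNext));

// On-policy target: uses the Q value of the action actually chosen.
// Roughly -1 + 0.9 * 0.4 = -0.64.
const onPolicyTarget = r + gamma * qNext[actionTakenNext];

// Q learning update: blend the old estimate with the target.
// Roughly 0.9 * 0.5 + 0.1 * (-0.28) = 0.422.
const qUpdated = (1 - alpha) * qOld + alpha * qLearningTarget;

console.log({ qLearningTarget, onPolicyTarget, qUpdated });
```

The two targets differ whenever the behavior policy explores, which is exactly the case where Q learning evaluates a different (greedy) policy than the one generating the data.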
Q Learning Example Implementation
Please see my Svelte TD Learning Repository for the complete code and the interactive Gridworld Examples for more information.
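// One Q learning update of the Q table for the transition (state, a, r, stateNext).
const QLearningQTableUpdate = (state, a, r, stateNext) => {
  // TD target: only the reward if the next state is terminal, otherwise
  // the reward plus the discounted maximum Q value of the next state.
  let g;
  if (mazeComp.isTerminal(stateNext)) {
    g = r;
  } else {
    g = r + gamma * mazeComp.getMaxQValue(stateNext);
  }
  // Move the old estimate toward the target by the learning rate alpha.
  let q = (1.0 - alpha) * mazeComp.getQValue(state, a) + alpha * g;
  mazeComp.setQValue(state, a, q);
};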
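// Runs one step of the current episode, then schedules the next step.
const runQLearningEpisodeStep = (state) => {
  let stateNext;
  let a, r;
  if (mazeComp.isTerminal(state)) {
    runEpisode();  // run next episode (calls runQLearningEpisode)
  } else {
    // Each step is deferred through setTimeout so that a long episode
    // does not block the event loop.
    stepTimer = setTimeout(() => {
      // Behavior policy: epsilon-greedy with respect to the current Q table.
      a = mazeComp.getEpsilonGreedyAction(state, epsilon);
      [stateNext, r] = mazeComp.step(state, a);
      QLearningQTableUpdate(state, a, r, stateNext);
      state = [...stateNext];  // copy the next state before continuing
      runQLearningEpisodeStep(state);
    }, 0);
  }
};
// Starts a new episode from a random start state.
const runQLearningEpisode = () => {
  let state = mazeComp.getRandomStartState();
  runQLearningEpisodeStep(state);
};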
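The excerpt above depends on `mazeComp` and other state from the Svelte component, so it does not run on its own. As a rough, self-contained sketch of the same update and episode loop, the following uses a hypothetical one-dimensional corridor instead of the Gridworld. The environment, the hyperparameter values, the seeded random number generator, and all helper names (`step`, `epsilonGreedyAction`, `qUpdate`, and so on) are assumptions made for this illustration and are not part of the repository. The loop is synchronous rather than timer-driven to keep the example short.

```javascript
// Hypothetical corridor: states 0..4, state 4 is terminal (the goal).
const N_STATES = 5;
const GOAL = N_STATES - 1;
const ACTIONS = [-1, 1]; // index 0 = move left, index 1 = move right
const alpha = 0.5, gamma = 0.9, epsilon = 0.2; // assumed hyperparameters

// Small seeded linear congruential generator so runs are reproducible.
let seed = 42;
const rng = () => {
  seed = (seed * 1664525 + 1013904223) % 4294967296;
  return seed / 4294967296;
};

// Q table: one row per state, one column per action, initialized to 0.
const Q = Array.from({ length: N_STATES }, () => [0, 0]);

const isTerminal = (s) => s === GOAL;
const maxQ = (s) => Math.max(...Q[s]);

// Environment step: move, clamp at the left wall, reward 1 on reaching the goal.
const step = (s, a) => {
  const sNext = Math.min(GOAL, Math.max(0, s + ACTIONS[a]));
  return [sNext, sNext === GOAL ? 1 : 0];
};

// Epsilon-greedy behavior policy; ties are broken at random.
const epsilonGreedyAction = (s) => {
  if (rng() < epsilon || Q[s][0] === Q[s][1]) return rng() < 0.5 ? 0 : 1;
  return Q[s][1] > Q[s][0] ? 1 : 0;
};

// Same update as QLearningQTableUpdate, written against the local Q table.
const qUpdate = (s, a, r, sNext) => {
  const g = isTerminal(sNext) ? r : r + gamma * maxQ(sNext);
  Q[s][a] = (1 - alpha) * Q[s][a] + alpha * g;
};

for (let episode = 0; episode < 500; episode++) {
  let s = Math.floor(rng() * GOAL); // random non-terminal start state
  while (!isTerminal(s)) {
    const a = epsilonGreedyAction(s);
    const [sNext, r] = step(s, a);
    qUpdate(s, a, r, sNext);
    s = sNext;
  }
}

// Greedy policy read off the learned Q table for the non-terminal states.
const greedyPolicy = Q.slice(0, GOAL).map((q) => (q[1] > q[0] ? "right" : "left"));
console.log(greedyPolicy);
```

In this toy setting the greedy policy should end up moving right in every state, since that is the only way to reach the rewarded goal, even though the behavior policy keeps exploring with probability `epsilon`.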
Berkeley CS 294-112: Deep Reinforcement Learning (Advanced Q Learning)