The Q Learning Algorithm
- the Bellman equation can be adapted into a policy optimization (update) rule
- Q-learning is a model-free algorithm: it needs no transition or reward model of the environment
- it samples a reward by following the behavior policy, then bootstraps with the maximum Q-value of the next state, which may differ from the Q-value of the action the behavior policy would actually choose next
- because of this discrepancy between the behavior policy and the greedy update target, Q-learning is called an off-policy algorithm
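Written out, the update the bullets describe is (with learning rate alpha and discount factor gamma, matching the implementation below):

    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))

where the max runs over all actions a' in the next state s', regardless of which action the epsilon-greedy behavior policy will actually take there.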
Q Learning Example Implementation
See my Svelte TD Learning Repository for the complete code, and the interactive Gridworld Examples for more information.
const QLearningQTableUpdate = (state, a, r, stateNext) => {
  let g;
  if (mazeComp.isTerminal(stateNext)) {
    g = r; // terminal states carry no future value
  } else {
    // off-policy target: bootstrap with the greedy (max) Q-value of the next state
    g = r + gamma * mazeComp.getMaxQValue(stateNext);
  }
  let q = (1.0 - alpha) * mazeComp.getQValue(state, a) + alpha * g;
  mazeComp.setQValue(state, a, q);
};
const runQLearningEpisodeStep = (state) => {
  let stateNext;
  let a, r;
  if (mazeComp.isTerminal(state)) {
    runEpisode(); // run next episode (calls runQLearningEpisode)
  } else {
    // defer each step with setTimeout so the UI can re-render between steps
    stepTimer = setTimeout(() => {
      a = mazeComp.getEpsilonGreedyAction(state, epsilon); // behavior policy
      [stateNext, r] = mazeComp.step(state, a);
      QLearningQTableUpdate(state, a, r, stateNext);
      state = [...stateNext];
      runQLearningEpisodeStep(state);
    }, 0);
  }
};
const runQLearningEpisode = () => {
  let state = mazeComp.getRandomStartState();
  runQLearningEpisodeStep(state);
};
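The functions above rely on the repository's mazeComp helpers and setTimeout scheduling for animation. The following self-contained sketch applies the same update to a toy 1-D corridor of my own invention (not the Gridworld from the repository), with stand-in helpers for the mazeComp methods, so it can be run and checked on its own:

```javascript
// Toy corridor: states 0..4, moving right from state 3 reaches the terminal
// state 4 with reward +1; every other step gives reward 0.
// All names here are illustrative stand-ins, not the repository's API.
const N = 5;
const actions = [-1, +1]; // move left / move right
const alpha = 0.5, gamma = 0.9, epsilon = 0.1;
const Q = Array.from({ length: N }, () => [0, 0]);

const isTerminal = (s) => s === N - 1;
const step = (s, aIdx) => {
  const sNext = Math.min(Math.max(s + actions[aIdx], 0), N - 1);
  return [sNext, isTerminal(sNext) ? 1 : 0];
};
const getMaxQValue = (s) => Math.max(Q[s][0], Q[s][1]);
const getEpsilonGreedyAction = (s) =>
  Math.random() < epsilon
    ? Math.floor(Math.random() * 2)            // explore
    : (Q[s][1] >= Q[s][0] ? 1 : 0);            // exploit

// Same update rule as QLearningQTableUpdate above.
const update = (s, a, r, sNext) => {
  const g = isTerminal(sNext) ? r : r + gamma * getMaxQValue(sNext);
  Q[s][a] = (1 - alpha) * Q[s][a] + alpha * g;
};

for (let ep = 0; ep < 500; ep++) {
  let s = 0;
  while (!isTerminal(s)) {
    const a = getEpsilonGreedyAction(s);
    const [sNext, r] = step(s, a);
    update(s, a, r, sNext);
    s = sNext;
  }
}

// Q(s, right) converges to gamma^(steps-to-goal - 1): 0.73 0.81 0.90 1.00
console.log(Q.map((q) => q[1].toFixed(2)).join(" "));
```

Because the max over next-state actions is used in the target, the learned values are those of the greedy policy even though the exploring epsilon-greedy policy generated the data, which is exactly the off-policy property described above.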
Berkeley CS 294-112: Deep Reinforcement Learning (Advanced Q Learning)