An interactive course on reinforcement learning — from multi-armed bandits and Markov decision processes to Q-learning, deep RL, RLHF, and the agentic systems built on top. Every idea is a gridworld or figure you can drive, not an equation you skim.
Dynamic programming, live — the engine behind Section 3.
Reinforcement learning is the study of learning by consequence: an agent acts, the world responds with a reward, and over time the agent learns to act better.
It is how a program learned to play Atari from raw pixels, how AlphaGo beat the world champion, and how the assistant you talk to was tuned to be helpful. This course builds that idea from the ground up — from a single slot machine to the algorithms behind modern agents.
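To make the "single slot machine" starting point concrete, here is a minimal sketch of the act, reward, learn loop on a multi-armed bandit. It is an illustration, not the course's own code: the function name `run_bandit`, the Gaussian reward noise, the arm means, and the epsilon-greedy rule with `epsilon=0.1` are all assumptions chosen for the example.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Gaussian bandit (hypothetical example setup)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running estimate of each arm's mean reward
    counts = [0] * n_arms       # number of times each arm has been pulled

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # The world responds with a noisy reward (assumed unit-variance Gaussian).
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental sample-average update of the chosen arm's estimate.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts

# Three slot machines with assumed mean payouts; the agent does not see these.
estimates, counts = run_bandit([0.1, 0.5, 1.5])
print(estimates, counts)
```

With enough pulls, the agent should spend most of its time on the arm with the highest mean, while the occasional random pull keeps its estimates of the other arms from going stale. That tension is the explore-exploit dilemma in miniature.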
The agent–environment loop, reward and return, value functions and policies, and the explore–exploit dilemma — first on bandits, then on gridworlds.
Markov decision processes and the Bellman equations, dynamic programming (value & policy iteration), and model-free learning — temporal-difference, SARSA, and Q-learning.
Policy gradients and PPO, deep RL (DQN to AlphaZero), the RLHF that aligns language models, and the agentic systems built on top.
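As a taste of the dynamic programming mentioned above, the following is a minimal sketch of value iteration on a small deterministic gridworld. Everything specific here is an assumption made for illustration: the 3×4 grid, the goal in the top-right corner, a reward of 1 for entering the goal, the discount `gamma=0.9`, and the name `value_iteration`. Each sweep applies the Bellman optimality backup, V(s) = max over actions of [r + gamma · V(s')], until the values stop changing.

```python
def value_iteration(n_rows=3, n_cols=4, goal=(0, 3), gamma=0.9, tol=1e-8):
    """Value iteration on a deterministic gridworld (hypothetical example setup)."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    V = {(r, c): 0.0 for r in range(n_rows) for c in range(n_cols)}

    def step(state, move):
        # Moving off the grid leaves the agent where it is.
        r, c = state[0] + move[0], state[1] + move[1]
        nxt = (r, c) if 0 <= r < n_rows and 0 <= c < n_cols else state
        reward = 1.0 if nxt == goal else 0.0  # assumed reward: 1 on reaching the goal
        return nxt, reward

    while True:
        delta = 0.0
        for s in V:
            if s == goal:
                continue  # terminal state: its value stays 0
            # Bellman optimality backup: best one-step lookahead over all moves.
            best = max(rew + gamma * V[nxt]
                       for nxt, rew in (step(s, m) for m in moves))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
for r in range(3):
    print([round(V[(r, c)], 3) for c in range(4)])
```

In this setup the value of a cell depends only on how many steps it is from the goal: cells next to the goal are worth 1, and each extra step multiplies the value by `gamma`. A greedy policy that always moves toward the highest-valued neighbour then follows a shortest path.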
All ten sections are live — 53 units and 182 interactive lessons, from your first slot machine to designing an agent.
Who it's for
The foundations need nothing but curiosity: bandits and gridworlds are built to be played with. The later, deep-RL sections lean on neural networks and gradient descent; if those are new, Finguard ML is the place to pick them up.
No account, no install. Progress saves automatically in your browser, separate from your other courses.