Finguard RL & Agents · decision-making track

Teach a machine to decide.

An interactive course on reinforcement learning — from multi-armed bandits and Markov decision processes to Q-learning, deep RL, RLHF, and the agentic systems built on top. Every idea is a gridworld or figure you can drive, not an equation you skim.

10 sections · 53 units · Gridworlds & interactive
Interactive figure · value iteration
Each sweep, value flows one step outward from the goal (+1) and away from the trap (−1). The arrows are the policy the values imply.

Dynamic programming, live — the engine behind Section 3.
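For readers who want to see the sweep in code, here is a minimal sketch of value iteration. It is not the course's own implementation. The 3×4 grid, the goal and trap positions, the deterministic moves, and the discount of 0.9 are all assumptions chosen for illustration, and every name in it is hypothetical.

```python
from typing import Dict, Tuple

State = Tuple[int, int]

# Assumed layout: 3 rows x 4 columns, goal top-right, trap just below it.
ROWS, COLS = 3, 4
GOAL, TRAP = (0, 3), (1, 3)
GAMMA = 0.9  # assumed discount factor
ACTIONS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}


def step(state: State, action: str) -> Tuple[State, float]:
    """Deterministic move; walking off the grid leaves the agent in place."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state
    nxt = (r, c)
    reward = 1.0 if nxt == GOAL else -1.0 if nxt == TRAP else 0.0
    return nxt, reward


def backup(values: Dict[State, float], state: State, action: str) -> float:
    """One-step lookahead: immediate reward plus discounted next-state value."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def sweep(values: Dict[State, float]) -> Dict[State, float]:
    """One synchronous Bellman optimality backup over all non-terminal states."""
    new = dict(values)
    for s in values:
        if s in (GOAL, TRAP):
            continue  # terminal states keep value 0; reward is paid on entry
        new[s] = max(backup(values, s, a) for a in ACTIONS)
    return new


def value_iteration(sweeps: int) -> Dict[State, float]:
    """Start from all-zero values and apply the given number of sweeps."""
    values = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(sweeps):
        values = sweep(values)
    return values


def greedy_policy(values: Dict[State, float]) -> Dict[State, str]:
    """The arrows: in each state, the action with the best one-step lookahead."""
    return {
        s: max(ACTIONS, key=lambda a: backup(values, s, a))
        for s in values
        if s not in (GOAL, TRAP)
    }
```

After one sweep only the cell next to the goal has a nonzero value; each further sweep pushes value one cell farther out, which is the behaviour the figure animates. In this deterministic sketch the trap shows up as an action the greedy policy never picks rather than as negative value spreading outward.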

Reinforcement learning is the study of learning by consequence: an agent acts, the world responds with a reward, and over time it learns to act better.

It is how a program learned to play Atari from raw pixels, how AlphaGo beat a world champion at Go, and how the assistant you talk to was tuned to be helpful. This course builds that idea from the ground up, from a single slot machine to the algorithms behind modern agents.
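The act, reward, improve loop can be sketched on the simplest case, a row of slot machines. The example below is an illustrative ε-greedy bandit, not the course's code. The function name, the Gaussian payouts with unit standard deviation, and the default ε of 0.1 are all assumptions.

```python
import random
from typing import List, Tuple


def run_bandit(
    true_means: List[float],
    steps: int = 2000,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Tuple[List[float], List[int]]:
    """Epsilon-greedy agent on a k-armed bandit with Gaussian payouts (assumed)."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # the agent's running estimate of each arm's payout
    counts = [0] * k       # how many times each arm has been pulled

    for _ in range(steps):
        # Act: explore a random arm with probability epsilon, else exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])

        # The world responds with a noisy reward.
        reward = rng.gauss(true_means[arm], 1.0)

        # Learn: move the estimate toward the observed reward (incremental mean).
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts
```

Calling `run_bandit([0.0, 0.5, 3.0], steps=5000)` returns the learned estimates and the pull counts per arm; with a gap that wide, the agent should end up spending most of its pulls on the last arm.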

What you'll learn

From one slot machine to AlphaGo.

01

The foundations

The agent–environment loop, reward and return, value functions and policies, and the explore–exploit dilemma — first on bandits, then on gridworlds.

02

The core algorithms

Markov decision processes and the Bellman equations, dynamic programming (value & policy iteration), and model-free learning — temporal-difference, SARSA, and Q-learning.

03

Modern RL & agents

Policy gradients and PPO, deep RL (DQN to AlphaZero), the RLHF that aligns language models, and the agentic systems built on top.

The curriculum

Ten sections, one through-line.

All ten sections are live — 53 units and 182 interactive lessons, from your first slot machine to designing an agent.

01
Reinforcement Learning Foundations
The agent–environment loop, reward and the goal, returns and discounting, policies and value functions, and exploration vs exploitation.
Available now
02
Multi-Armed Bandits & Exploration
The k-armed bandit, regret, ε-greedy, optimistic initial values, and upper confidence bounds — with a bandit you pull yourself.
Available now
03
Markov Decision Processes & Dynamic Programming
MDPs and the Markov property, the Bellman equations, policy evaluation, value iteration, and policy iteration — on a live gridworld.
Available now
04
Temporal-Difference & Q-Learning
Learning without a model, the TD error and bootstrapping, SARSA and Q-learning, and an agent that learns the gridworld over episodes.
Available now
05
Policy Gradient Methods
Why optimize the policy directly, the policy gradient theorem and REINFORCE, baselines and variance, actor-critic, and A2C/A3C/GAE.
Available now
06
Deep Reinforcement Learning
Function approximation and the deadly triad, DQN (replay & target nets), the DQN family, trust regions and PPO, and continuous control.
Available now
07
Model-Based RL & Planning
Learning a model, Dyna, Monte Carlo tree search, and the AlphaGo–AlphaZero–MuZero line of work.
Available now
08
RL for Language Models (RLHF)
Why RL for language models, reward models from preferences, optimizing with PPO and a KL leash, DPO, and reward hacking.
Available now
09
Agents & Agentic Systems
From RL to agents, the plan-act-observe loop, tools and environments, memory, planning, multi-agent systems, and reliability.
Available now
10
Frontiers & Capstones
Offline RL, advanced exploration, safe and aligned RL, and two capstones — solve a gridworld and design an agent.
Available now
Before you start

Curiosity first, math as you go.

The foundations need nothing but curiosity — bandits and gridworlds are built to be played with. The later, deep-RL sections lean on neural networks and gradient descent; if those are new, Finguard ML is the place to pick them up.

Who it's for

ML learners going into RL · Engineers building agents · Game & robotics tinkerers · Researchers & students
Begin

Watch a machine learn to act.

No account, no install. Progress saves automatically in your browser, separate from your other courses.