Finguard RL & Agents — Reinforcement learning, from bandits to deep RL

Reinforcement learning is the study of learning by consequence: an agent acts, the world responds with a reward, and over time the agent learns to act better.

It is how a program learned to play Atari games from raw pixels, how AlphaGo beat a world champion at Go, and how the assistant you talk to was tuned to be helpful. This course builds that idea from the ground up, from a single slot machine to the algorithms behind modern agents.

What you'll learn

From one slot machine to AlphaGo.

The foundations

The agent–environment loop, reward and return, value functions and policies, and the explore–exploit dilemma — first on bandits, then on gridworlds.

The core algorithms

Markov decision processes and the Bellman equations, dynamic programming (value & policy iteration), and model-free learning — temporal-difference, SARSA, and Q-learning.

Modern RL & agents

Policy gradients and PPO, deep RL (DQN to AlphaZero), the RLHF that aligns language models, and the agentic systems built on top.

The curriculum

Ten sections, one through-line.

All ten sections are live — 53 units and 182 interactive lessons, from your first slot machine to designing an agent.

Reinforcement Learning Foundations

The agent–environment loop, reward and the goal, returns and discounting, policies and value functions, and exploration vs exploitation.

Available now
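
To give a flavor of "returns and discounting", here is a minimal sketch in Python. It is not taken from the course material; the function name `discounted_return` and the default `gamma` are illustrative assumptions. It computes the discounted sum of a reward sequence by working backwards from the last reward.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list.

    `gamma` is the discount factor (an assumed default, not a course value).
    """
    g = 0.0
    # Work backwards so each step is just: g = r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A smaller `gamma` makes the agent care more about immediate reward; a `gamma` close to 1 makes distant rewards count almost as much as near ones.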

Multi-Armed Bandits & Exploration

The k-armed bandit, regret, ε-greedy, optimistic initial values, and upper confidence bounds — with a bandit you pull yourself.

Available now
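
As a rough illustration of ε-greedy on a k-armed bandit, the sketch below simulates arms with Gaussian rewards. Everything here is an assumption made for the example (the function name `run_epsilon_greedy`, unit-variance noise, the default step count and ε); it is not the bandit used in the course.

```python
import random


def run_epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on a bandit whose arms pay N(mean, 1) rewards.

    Returns (estimates, counts, total_reward). All parameters are
    illustrative assumptions, not values from the course.
    """
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # running average reward per arm
    counts = [0] * k       # number of pulls per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the highest current estimate
            arm = max(range(k), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental mean: new = old + (sample - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


if __name__ == "__main__":
    estimates, counts, total = run_epsilon_greedy([0.1, 0.5, 0.9])
    print("pulls per arm:", counts)
```

With ε = 0 the agent can lock onto whichever arm looked good first; a small positive ε keeps sampling the others so the estimates can correct themselves.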

Markov Decision Processes & Dynamic Programming

MDPs and the Markov property, the Bellman equations, policy evaluation, value iteration, and policy iteration — on a live gridworld.

Available now
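
Value iteration can be sketched on a toy problem much smaller than a gridworld. The example below assumes a hypothetical one-dimensional corridor: states are numbered left to right, the rightmost state is a terminal goal, moves are deterministic, and entering the goal pays a reward of 1. The function name and all parameters are illustrative assumptions.

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-9):
    """Value iteration on a hypothetical corridor with a goal at the right end.

    Actions are 'step left' and 'step right'; walls clamp movement.
    Entering the goal gives reward 1, every other move gives 0.
    """
    goal = n_states - 1
    values = [0.0] * n_states  # the terminal goal keeps value 0

    while True:
        delta = 0.0
        for s in range(goal):  # sweep over non-terminal states
            best = float("-inf")
            for step in (-1, 1):
                nxt = min(max(s + step, 0), goal)
                reward = 1.0 if nxt == goal else 0.0
                # Bellman optimality backup for a deterministic move
                best = max(best, reward + gamma * values[nxt])
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update
        if delta < tol:
            return values


if __name__ == "__main__":
    print(value_iteration())
```

Each state's value ends up as `gamma` raised to the number of steps between it and the state next to the goal, which is the discounting idea showing up in the optimal values.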

Temporal-Difference & Q-Learning

Learning without a model, the TD error and bootstrapping, SARSA and Q-learning, and an agent that learns the gridworld over episodes.

Available now
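
The TD error and the Q-learning update fit in a few lines. The sketch below assumes a tabular Q-function stored as a dict of dicts; the function name `q_learning_update`, the state and action labels, and the default `alpha` and `gamma` are hypothetical choices made for illustration.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha=0.5, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error.

    `q` maps state -> {action: value}. Defaults are illustrative assumptions.
    """
    # Bootstrap from the best action in the next state (0 if the episode ended)
    best_next = 0.0 if terminal else max(q[next_state].values())
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


if __name__ == "__main__":
    q = {
        "s0": {"left": 0.0, "right": 0.0},
        "s1": {"left": 0.0, "right": 2.0},
    }
    err = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1")
    print(err, q["s0"]["right"])
```

SARSA differs only in the bootstrap term: it uses the value of the action the agent actually takes next rather than the maximum over actions.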

Policy Gradient Methods

Why optimize the policy directly, the policy gradient theorem and REINFORCE, baselines and variance, actor-critic, and A2C/A3C/GAE.

Available now

Deep Reinforcement Learning

Function approximation and the deadly triad, DQN (replay & target nets), the DQN family, trust regions and PPO, and continuous control.

Available now

Model-Based RL & Planning

Learning a model, Dyna, Monte Carlo tree search, and the AlphaGo–AlphaZero–MuZero line of work.

Available now

RL for Language Models (RLHF)

Why RL for language models, reward models from preferences, optimizing with PPO and a KL leash, DPO, and reward hacking.

Available now

Agents & Agentic Systems

From RL to agents, the plan-act-observe loop, tools and environments, memory, planning, multi-agent systems, and reliability.

Available now

Frontiers & Capstones

Offline RL, advanced exploration, safe and aligned RL, and two capstones — solve a gridworld and design an agent.

Available now

Before you start

Curiosity first, math as you go.

The foundations need nothing but curiosity — bandits and gridworlds are built to be played with. The later, deep-RL sections lean on neural networks and gradient descent; if those are new, Finguard ML is the place to pick them up.

Who it's for

ML learners going into RL
Engineers building agents
Game & robotics tinkerers
Researchers & students

Begin

Watch a machine learn to act.

No account, no install. Progress saves automatically in your browser, separate from your other courses.
