From Thorndike's cats to the Free Energy Principle — a hands-on curriculum bridging Reinforcement Learning and Active Inference
In 1898, Thorndike's cats learned to escape puzzle boxes by trial and error. In 1948, Tolman showed that rats build cognitive maps. In 1997, Schultz revealed that dopamine neurons compute prediction errors — the same signal as the TD learning algorithm proposed a decade earlier.
This curriculum traces that intellectual lineage from animal behavior experiments through modern RL into Active Inference and the Free Energy Principle. You will implement every algorithm, see every correspondence, and understand why these two frameworks — seemingly so different — are solving the same problem.
Modeled on OpenAI's Spinning Up in RL, and designed as a pre-course introduction to Neuromatch Academy's computational neuroscience curriculum.
There are excellent RL courses already. This one exists because none of them do what we do.
| Spinning Up (OpenAI) | Deep RL Course (HuggingFace) | Neuromatch Academy | This curriculum | |
|---|---|---|---|---|
| RL algorithms | VPG, TRPO, PPO, DDPG, TD3, SAC | Q-learning, DQN, PPO, A2C | Multi-armed bandits, Q-learning, model-based (1 day) | Q-learning, SARSA, Dyna, REINFORCE, A2C, DQN, PPO, SAC |
| Active Inference | -- | -- | -- | Full AIF track: POMDP generative models, EFE, deep AIF, hierarchical |
| Free Energy Principle | -- | -- | -- | VFE, ELBO decomposition, surprise minimization, FEP as unifying principle |
| RL-AIF bridge | -- | -- | -- | Formal Rosetta Stone: Q = -G, softmax equivalence, same-environment proof |
| Neuroscience grounding | -- | -- | Bayesian decisions, HMMs, Kalman filters | Thorndike, Pavlov, Tolman, Schultz (dopamine = TD error), successor representations |
| Animal behavior history | -- | -- | Implicit | Explicit: ethology, reinforcement schedules, cognitive maps, Tinbergen's 4 questions |
| Bayesian brain | -- | -- | W3D1-D2 (inference, HMMs) | Helmholtz, variational inference, generative models, belief propagation |
| Differentiable inference | -- | -- | -- | jax.grad through the forward algorithm, learning A/B matrices from data |
| GPU-accelerated AIF | -- | -- | -- | JAX jit/vmap/grad, 1000 batch agents on GPU |
| Multi-agent / social | -- | Unit 7 (broken) | -- | Ostrom's commons, Concordia social simulation, experiment ladder L0-L7 |
| Embodied robotics | -- | -- | -- | PyBullet arm, continuous AIF, proprioceptive inference (Module 17) |
| Primary language | Python/TF | Python/PyTorch | Python/NumPy | Python/JAX |
| Environments | MuJoCo | Atari, Unity, VizDoom, Godot | Custom tutorials | neuro-nav grid worlds, PyBullet (neuroscience-native) |
See also: HuggingFace Robotics Course (LeRobot, classical robotics + imitation learning + foundation models) — complementary to our Module 17.
The gap we fill: No existing resource teaches RL and Active Inference side-by-side, in the same environments, with escalating complexity, grounded in the animal behavior research that motivated both frameworks. Spinning Up teaches RL but not AIF. HuggingFace teaches practical deep RL but no theory of mind. Neuromatch covers Bayesian inference and one day of RL, but never mentions the Free Energy Principle. We bridge all three.
How we complement Neuromatch: Neuromatch W3D1 (Bayesian Decisions) and W3D4 (Reinforcement Learning) are the two days most relevant to our curriculum. Students who complete our Modules 1-8 before Neuromatch will arrive with deep intuition for why Bayesian inference and RL are connected — making W3 substantially more meaningful. Our Module 13 (Rosetta Stone) then formalizes what Neuromatch leaves implicit.
- Graduate students in computational neuroscience, cognitive science, or AI
- RL practitioners curious about Active Inference
- Neuroscientists who want to understand computational models of behavior
- Anyone who finds the Free Energy Principle intriguing but impenetrable
- Students preparing for Neuromatch Academy who want deeper RL/Bayesian foundations
Prerequisites: Python, NumPy, basic probability (Bayes' rule). No prior RL or AIF experience required.
The behavioral science that motivated both RL and Active Inference.
| Module | What you'll learn | Key figures | |
|---|---|---|---|
| 01 | The Behaving Animal | Stimulus-response vs. cognitive maps. Your first neuro-nav agent. | Thorndike, Pavlov, Skinner, Tolman, Tinbergen |
| 02 | Bellman and Value | MDPs, the Bellman equation, value iteration from scratch. | Bellman |
| 03 | Dopamine and TD Learning | Prediction errors, the RPE hypothesis, successor representations. | Rescorla-Wagner, Schultz, Dayan |
Classical and modern RL, building toward the bridge to Active Inference.
| Module | What you'll learn | Algorithms | |
|---|---|---|---|
| 04 | Exploration & Exploitation | Epsilon-greedy, softmax, UCB. Why AIF dissolves the dilemma. | SARSA, Q-learning, UCB |
| 05 | Model-Based RL | Dyna, planning as inference. The T-matrix IS the B-matrix. | DynaQ, MBV |
| 06 | Policy Gradients | REINFORCE, actor-critic, entropy regularization = free energy. | REINFORCE, A2C |
| 07 | Deep RL | Function approximation, DQN, PPO, SAC. Representations. | DQN, PPO, SAC |
Everything changes. We start from a model of the world, not a reward signal.
| Module | What you'll learn | Key equations | |
|---|---|---|---|
| 08 | Generative Models & FEP | Bayesian brain, variational inference, free energy. | F = KL + E[ln p], ELBO |
| 09 | Your First AIF Agent | A/B/C/D matrices, the T-maze, belief updating. | P(s|o), EFE |
| 10 | EFE Decomposition | Pragmatic vs. epistemic value. Why curiosity is built in. | G = -pragmatic - epistemic |
| 11 | Learning the World | Differentiable HMMs, gradient-based parameter recovery in JAX. | Forward algorithm, jax.grad |
| 12 | Deep Active Inference | Neural network generative models, encoder collapse. | Amortized inference |
The two frameworks converge. Under the right conditions, they produce identical behavior.
| Module | What you'll learn | Key insight | |
|---|---|---|---|
| 13 | The Rosetta Stone | Formal RL-AIF correspondences. Same environment, two agents, same policy. | Q(s,a) = -G(a) when A=I |
| 14 | JAX Scaling | jit, vmap, grad — run 1000 agents in parallel. |
Sub-linear batch scaling |
From hierarchical models to multi-agent commons to physical bodies.
| Module | What you'll learn | Application | |
|---|---|---|---|
| 15 | Hierarchical AIF | Context-dependent perception, temporal abstraction, cross-level info gain. | Context-dependent T-maze |
| 16 | Multi-Agent Commons | Ostrom's design principles, FEP meets social simulation, experiment ladder L0-L7. | Concordia social simulation |
| 17 | Embodied Active Inference | Continuous-state AIF, proprioceptive inference, PD vs RL vs AIF control, perturbation robustness. | PyBullet 2-link arm reaching |
Every module includes Dual Perspective boxes that show the same concept through both lenses:
|
RL Perspective The TDQ agent learns Q(s,a) — cached stimulus-response mappings. It never builds a model. This is Thorndike's Law of Effect as an algorithm. |
Active Inference Perspective The AIF agent builds transition matrices B(s'|s,a) and plans by simulating consequences. Like Tolman's rats taking shortcuts they've never walked. |
The central result of the curriculum:
| RL | Active Inference | When equivalent |
|---|---|---|
| V(s) | -G(pi) | Epistemic value = 0 |
| R(s) | ln C(o) | Log-preference = log-reward |
| Q(s,a) | -G(a) | Fully observable (A = I) |
| pi(a|s) = softmax(Q/tau) | pi(a) = softmax(-gamma*G) | tau = 1/gamma |
| T(s'|s,a) | B(s'|s,a) | Always identical |
| Epsilon-greedy | Epistemic value | AIF's is principled, RL's is bolted on |
# Clone
git clone https://github.com/m9h/spinning-up-alf.git
cd spinning-up-alf
# Create environment (requires uv: https://docs.astral.sh/uv/)
uv venv .venv --python 3.12
source .venv/bin/activate
# Core dependencies
uv pip install numpy matplotlib seaborn jupyter ipywidgets scikit-learn
# JAX with GPU support (use jax[cpu] if no CUDA)
uv pip install jax[cuda12]
# RL environments and agents
git clone https://github.com/awjuliani/neuro-nav.git ../neuro-nav
uv pip install -e ../neuro-nav
# Active Inference library
git clone https://github.com/m9h/alf.git ../alf
uv pip install -e ../alf
# Optional: PGMax (factor graphs)
git clone https://github.com/m9h/PGMax.git ../PGMax
uv pip install -e ../PGMax
# Optional: Concordia (Module 16 capstone only)
git clone https://github.com/m9h/concordia.git ../concordia
# Install Jupyter kernel
python -m ipykernel install --user --name spinning-up-alf
# Launch
jupyter notebookThis curriculum is designed to work alongside — not replace — other excellent courses:
| Resource | What it's best for | Take it... |
|---|---|---|
| INRIA scikit-learn MOOC | Classical ML pipelines, model selection, evaluation | Before or alongside — builds general ML fluency |
| HuggingFace Deep RL Course | Practical deep RL training (Atari, Unity, VizDoom) | After our Part 2 — you'll understand why the algorithms work |
| OpenAI Spinning Up | Reference implementations of PPO, SAC, TD3 | As a reference during our Modules 6-7 |
| Neuromatch Comp Neuro | Full computational neuroscience (3 weeks, live) | After our Modules 1-8 — deep RL/Bayesian intuition for their W3 |
| Neuromatch Deep Learning | PyTorch, CNNs, VAEs, diffusion, transformers, RL | Before or alongside — their W2D4 (VAEs) connects to our Module 8, their W3D4-D5 (RL) to our Parts 1-2 |
| Neuromatch NeuroAI | Comparing biological and artificial networks, representations | After our full curriculum — our Rosetta Stone (Module 13) is NeuroAI applied to decision-making |
| Lovelace | Computational cognitive science, probabilistic models of cognition | Alongside our Part 3 — same Bayesian brain ideas, different formalism |
Our Modules 9-12 (the core AIF track) cover territory that no general course touches — but several specialized resources overlap with parts of it. Here is how they compare and where to use them:
| Resource | What it covers | Module overlap | Use it for... |
|---|---|---|---|
| Active Inference Institute | Textbook groups, courses, livestreams, internship program. Physics as Information Processing (Chris Fields), Active Inference for Social Sciences. | 8-12 (broad) | Theoretical depth, community, ongoing discussion groups |
| pymdp | Python AIF library (JAX-first as of v1.0.0). Equinox Agent, rollout() API, MCTS, pybefit model fitting. Excellent tutorials: T-maze, epistemic chaining, cue chaining. | 9-11 | The community standard for AIF in Python. ALF builds cognitive extensions (HGF, DDM, metacognition) on top of pymdp's core. |
| CPC Zurich | Annual 5-day computational psychiatry course (ETH/UZH). Practical session on Active Inference using pymdp in Colab. | 9-10 | Clinical/psychiatric context for AIF. 3 ECTS credits. |
| Smith, Friston & Whyte tutorial | Step-by-step MATLAB tutorial: A/B/C/D matrices, T-maze, planning, fitting to empirical data. Paper. | 9-10 | The definitive reference — our Modules 9-10 follow this paper closely |
| Fundamentals of Active Inference (Namjoshi, MIT Press 2026) | Engineering-focused textbook. Worked examples, simulations, no heavy proofs. AII textbook groups running now. | 8-12 | Accessible textbook companion for the full AIF track |
| Active Inference Tutor | Interactive browser-based tutorials: Bayesian updating, belief evolution, grid world EFE agent. | 8-10 | Visual intuition before diving into code |
| SPM/DEM toolbox | Friston's MATLAB toolbox. Continuous-time AIF, generalized filtering, POMDP inversion. CPC 2018 tutorial. | 8-11 | The original implementation — mathematically dense |
| RxInfer.jl | Julia reactive message passing. Key examples: EFE minimization via message passing (gridworld AIF), Active Inference Mountain Car (continuous control), HMM training. | 8-11 | Factor graph formalism for AIF. The EFE example recasts EFE as VFE on augmented graphs — deep connection to our Module 10. |
| AII ActInf_RxInfer guide | Active Inference Institute's bridge between AIF theory and RxInfer.jl implementation. | 9-11 | Julia pathway for students who want factor-graph-native AIF |
| LAIF T-maze in RxInfer | POMDP AIF agent in Julia/RxInfer. Flattened T-maze with generalized free energy, discrete actions. | 9-10 | The Julia equivalent of our Module 9 — same T-maze, different framework |
| ActiveInference.jl | Julia re-implementation of pymdp. T-maze, parameter estimation. Paper. | 9-10 | Julia ecosystem alternative |
| AII Courses | 14 courses spanning K-PhD, 464+ modules. 4-unit core: Philosophy, Cognitive Science, Mathematics, Computer Science — each with 8 topics (Systems, Agents, Perception, Cognition, Action, Learning, Communication, Planning). Parr et al. textbook group. | 8-12 (broad) | The most comprehensive AIF curriculum. Complements our hands-on approach with philosophical and cognitive science grounding |
What's unique about our Modules 9-12:
- RL↔AIF correspondence throughout — pymdp (now also JAX-first as of v1.0.0) and other resources teach AIF standalone. We are the only resource that systematically shows the RL correspondence via Dual Perspective boxes in every section, grounded in the same neuroscience-native environments
- Module 9-10 have good alternatives (pymdp, CPC Zurich, Smith et al., RxInfer EFE example) but none present the RL↔AIF bridge as a central organizing principle
- Module 11 (Learning) — ALF teaches differentiable learning via a lightweight
jax.gradthrough the forward algorithm. pymdp v1.0.0 now also supports JAX-native learning via pybefit + NumPyro. Our pedagogical approach (explicit forward-filter NLL, no external deps) remains clearer for teaching - Module 12 (Deep AIF) has essentially no educational materials anywhere — only research papers (Fountas et al. NeurIPS 2020, Millidge 2019). Our notebook is the first pedagogical treatment of neural network generative models for AIF with explicit encoder collapse analysis
- Cognitive extensions — ALF provides HGF, DDM, metacognition, hierarchical AIF, and normative modeling modules that no other AIF library offers. These are used in Modules 15-16 and the appendices
- The AII Courses (14 courses, 464+ modules) are the broadest AIF curriculum but focus on conceptual/philosophical grounding; we focus on runnable code + RL bridge
Our curriculum was designed as preparation for Neuromatch Academy. Here is where the specific modules align:
| Our module | Neuromatch Comp Neuro | Neuromatch Deep Learning | Neuromatch NeuroAI |
|---|---|---|---|
| 01-03 (Animal behavior, Bellman, TD) | W3D4 RL (bandits, Q-learning, model-based) | W3D4-D5 (RL fundamentals, RL for games) | W1D2 (contrastive and RL for generalization) |
| 04 (Exploration) | W3D1 (Bayesian decisions) | -- | -- |
| 05 (Model-based RL) | W3D3 (Optimal control) | -- | W2D1 (macrocircuits, modularity) |
| 06-07 (Policy gradients, Deep RL) | -- | W3D4-D5 (RL, MCTS) | -- |
| 08 (Generative models, FEP) | W3D1-D2 (Bayesian inference, HMMs, Kalman) | W2D4 (VAEs, diffusion models) | -- |
| 09-10 (AIF agent, EFE) | -- | -- | -- |
| 11 (Learning generative models) | W3D2 (EM algorithm) | W2D4 (VAEs) | W2D3 (microlearning) |
| 13 (Rosetta Stone) | -- | -- | W1D3 (comparing artificial and biological networks) |
| 14 (JAX scaling) | -- | -- | -- |
| 15-16 (Hierarchical, multi-agent) | -- | -- | W2D1-D2 (macrocircuits, neurosymbolic) |
The cells marked bold are the strongest correspondences. Our Modules 9-12 (the core AIF track) have no Neuromatch equivalent — that gap is filled by the Active Inference Institute, pymdp, and CPC Zurich (see AIF resources table above).
Module 16 connects to a broader ecosystem of multi-agent tools. Here is the landscape:
| Tool | What it does | Relation to our curriculum |
|---|---|---|
| Gymnasium (Farama Foundation) | Successor to OpenAI Gym. The standard single-agent RL API. | Our Modules 1-7 use neuro-nav which wraps Gym. Future: migrate to Gymnasium. |
| PettingZoo | Gymnasium for multi-agent RL. Standard MARL API. | Natural API for Module 16's multi-agent experiments. |
| JaxMARL | GPU-accelerated MARL environments and algorithms in JAX. 12,500x speedup via vmap. | Direct complement to our Module 14 (JAX scaling). Run 1000s of multi-agent episodes in parallel. |
| SocialJax | JAX suite for sequential social dilemmas: Commons Harvest, Clean Up, Territory. | The JAX-native version of our Module 16 scenarios. 50x faster than Melting Pot. ICLR 2026. |
| Melting Pot (DeepMind) | 50+ multi-agent substrates testing cooperation, competition, commons tragedies. | The benchmark our Module 16 (Ostrom's commons) prepares students to understand. |
| Concordia (DeepMind) | LLM-powered generative agent social simulation. | Our Module 16 builds on this directly — AIF agents in Concordia's SustainHub game. |
The progression: Gymnasium (single agent) -> PettingZoo (multi-agent API) -> JaxMARL/SocialJax (GPU-accelerated MARL with social dilemmas) -> Melting Pot (rich evaluation) -> Concordia (LLM + AIF social simulation). Our curriculum teaches the theory (Modules 1-15) that makes these tools meaningful, and Module 16 lands in the Concordia layer where AIF meets LLM-powered social agents.
Module 16's multi-agent commons simulation connects to a deep tradition in computational economics. The AI Economist (Salesforce/Harvard, Science Advances 2022) pioneered two-level (bi-level) deep MARL for policy design: inner-level agents optimize their own utility while an outer-level social planner learns optimal tax policy. This Stackelberg structure — where the planner accounts for agents adapting to policy — is exactly the tension Ostrom studied in real commons, and what our Module 16 explores with AIF agents in Concordia's SustainHub and Collective Innovation games.
Agent-based computational economics (ACE) has modeled these dynamics since the 1990s. What's new is coupling ABMs with deep RL and LLMs:
| Framework | What it does | Relevance to Module 16 |
|---|---|---|
| AI Economist | Two-level deep RL for tax policy design (Gym-style) | The bi-level RL template: agents + planner, both learning |
| BeforeIT.jl | Bank of Italy ABM for macro forecasting. First ABM matching DSGE accuracy. Parametrized with real data (Austria, Italy). | Real-data-grounded ABM — shows institutional adoption |
| Mesa | Python ABM framework (500+ papers, Mesa 4 in development) | General-purpose ABM; Concordia's social layer is richer |
| ABCE | ABM for economics (trade, production, consumption) | Classical ACE toolkit |
| EconAgent | LLM-powered agents for macroeconomic simulation (ACL 2024) | Same GABM paradigm as Concordia, applied to macro |
| ABIDES-Economist | RL agents in economic simulation via Gym interface | Bridges RL and econ simulation |
Central banks and governments also publish open-source models, though these are typically DSGE (equation-based) rather than agent-based:
| Model | Institution | Notes |
|---|---|---|
| DSGE.jl | New York Fed | Julia. Now extending to heterogeneous-agent (HANK) models |
| FRB/US | Federal Reserve Board | The Fed's main policy model |
| OpenSourcedMacroModels | Collection | 20+ central bank models (Bank of Finland, French MOF, etc.) |
| ABCredit.jl | Bank of Italy | Credit/macro ABM |
Why this matters for our curriculum: The GABM wave — LLM agents replacing hand-coded behavioral rules — is exactly what Concordia enables. Our Module 16 shows how Active Inference provides principled decision-making (EFE decomposition into productivity vs. community learning) where pure RL agents would need ad-hoc reward shaping. The experiment ladder (L0 pure RL -> L7 LLM+AIF) demonstrates this progression. The Bank of Italy's BeforeIT.jl shows that ABMs are being taken seriously by institutions — the question is whether AIF-grounded agents can do better than rule-based ones in these policy-relevant settings.
Module 17 introduces embodied AIF with PyBullet. Here is how it connects to the broader robotics learning ecosystem:
HuggingFace Robotics Course — Free, self-paced course built on LeRobot (PyTorch). Covers classical robotics (FK/IK, Jacobians, feedback control), RL, imitation learning, and foundation models for robotics. Where our Module 17 teaches embodied AIF on a 2-link arm, the HF course teaches the same classical robotics foundations (their Unit 2) plus modern learning-based approaches. The two are complementary: our Module 17 shows why an AIF controller is robust to perturbation; their course shows how to scale to real hardware with LeRobot datasets and foundation models.
Hugging Science — Open community for AI-enabled scientific discovery across physics, biology, chemistry, and neuroscience. Active Discord with paper discussions and collaborative projects. A natural home for curriculum discussions and community contributions.
The frontier is JAX-native physics — keeping simulation, inference, and learning on the GPU:
MuJoCo XLA (MJX) — Google's JAX port of MuJoCo. Physics and neural networks in the same framework = no CPU-GPU transfer. vmap over hundreds of bodies simultaneously, millions of sim steps per second. The natural extension of our Module 14 (JAX scaling) to embodied agents.
Virtual Rodent / MIMIC-MJX (DeepMind) — A biomechanically accurate rat trained via RL in MJX. Locomotion, turning, obstacle avoidance — all on GPU. This is where Module 17's 2-link arm leads: biologically plausible morphology + differentiable physics + AIF control.
Ludobots (Josh Bongard, UVM) — The world's only MOOC taught from reddit. Students evolve neural-network-controlled robots with arbitrary body plans in pyrosim (ODE) and pybullet. The evolutionary robotics counterpart to our AIF approach: where we optimize controllers, Bongard co-optimizes bodies and brains. His work on xenobots (living robots) is where this leads.
HBP Neurorobotics Platform (EBRAINS) — Connects spiking neural networks (NEST) to robot physics (Gazebo) via a Closed-Loop Engine. The EPFL Neurorobotics MOOC teaches SARSA on embodied robots. Where our curriculum uses discrete POMDPs and JAX inference, the NRP uses continuous spiking networks — but the question is the same.
RatInABox — Lightweight spatial navigation simulator with community JAX wrappers. Bridges our neuro-nav grid worlds (Modules 1-7) to embodied hippocampal place/grid cell models.
- Sutton & Barto (2018). Reinforcement Learning: An Introduction — The RL bible
- Parr, Pezzulo & Friston (2022). Active Inference — The AIF textbook
- Smith, Friston & Whyte (2022). A step-by-step tutorial on active inference — Clearest AIF tutorial
- Millidge, Tschantz & Buckley (2020). Whence the expected free energy? — The RL-AIF equivalence proof
- Sajid et al. (2021). Active inference: Demystified and compared — AIF vs. everything else
- Schultz, Dayan & Montague (1997). A neural substrate of prediction and reward — Dopamine = TD error
- Tolman (1948). Cognitive maps in rats and men — The original model-based argument
- Ostrom (1990). Governing the Commons — Design principles for Module 16
This curriculum builds on:
- ALF — Active inference/Learning Framework: JAX-native, differentiable, GPU-accelerated
- neuro-nav — Neuroscience-inspired RL environments and agents
- RatInABox — Continuous-space locomotion and hippocampal neural activity (eLife 2024). Demos referenced in Modules 3 (successor features), 6 (actor-critic), 8 (generative model inversion), 17 (egocentric perception, path integration)
- Concordia — Multi-agent social simulation
Pedagogical improvements planned, learning from the scikit-learn MOOC design:
- Colab badges — One-click execution for every notebook
- Glossary — RL-AIF terminology Rosetta Stone (the jargon barrier is real)
- Wrap-up quizzes — MCQ comprehension checks at module boundaries
- Separate exercise notebooks — So students can't peek at solutions
- Module 0 — Pre-requisite refresher (probability, linear algebra, JAX basics)
Apache 2.0