Chapter 47: Soft Actor-Critic (SAC)

Learning objectives: Implement SAC (Soft Actor-Critic) for HalfCheetah: two Q-networks (min for target), a policy that maximizes \(Q - \alpha \log \pi\), and automatic temperature tuning so \(\alpha\) tracks a desired entropy. Train and compare sample efficiency with PPO (same env, same or similar compute).

Concept and real-world RL: SAC combines maximum-entropy RL with actor-critic: the critic learns two Q-functions (taking the min for the target to reduce overestimation); the actor maximizes \(\mathbb{E}[ Q(s,a) - \alpha \log \pi(a|s) ]\); and \(\alpha\) is updated to keep the policy entropy near a target (e.g. \(-\dim(a)\)). SAC is off-policy (replay buffer), so it is often more sample-efficient than PPO on continuous control. In robot control (HalfCheetah, Hopper, Walker), SAC is a standard baseline; in recommendation and trading, off-policy max-ent methods can improve exploration and stability. ...
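The two distinctive updates (clipped double-Q targets and temperature tuning) can be sketched numerically. This is a minimal numpy sketch, not a full training loop; the batch values are illustrative toy numbers, not outputs of a real run:

```python
import numpy as np

def sac_targets(r, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Soft Bellman target: min of the two target Q-values,
    minus the entropy term alpha * log pi(a'|s')."""
    min_q = np.minimum(q1_next, q2_next)  # clipped double-Q curbs overestimation
    return r + gamma * (1.0 - done) * (min_q - alpha * logp_next)

def alpha_gradient(logp, target_entropy):
    """Gradient of the temperature loss J(alpha) = E[-alpha (log pi + H_target)]:
    positive (alpha decreases under gradient descent) when entropy exceeds
    the target, negative otherwise."""
    return -np.mean(logp + target_entropy)

# Toy batch of two transitions (illustrative numbers)
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q1n = np.array([10.0, 5.0])
q2n = np.array([9.0, 6.0])
logp = np.array([-1.2, -0.8])
y = sac_targets(r, done, q1n, q2n, logp, alpha=0.2)
```

For a 1-D action space the target entropy would be \(-1\), so `alpha_gradient(logp, -1.0)` here is positive and \(\alpha\) would shrink, reducing the entropy bonus.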

March 10, 2026 · 3 min · 519 words · codefrydev

Chapter 48: SAC vs. PPO

Learning objectives: Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d). Compare final performance, sample efficiency (return vs. env steps), and wall-clock time. Discuss when to choose one over the other (sample efficiency, stability, tuning, off-policy vs. on-policy).

Concept and real-world RL: SAC is off-policy (replay buffer) and maximizes entropy; PPO is on-policy (rollouts) and uses a clipped objective. SAC often achieves higher sample efficiency (fewer env steps to reach good performance) but can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah), both are standard; in game AI and RLHF, PPO is more common. The choice depends on data cost (can we afford many env steps?), the need for off-policy learning (e.g. using logged data), and engineering preference. ...

March 10, 2026 · 3 min · 481 words · codefrydev

Chapter 49: Custom Gym Environments (Part 2)

Learning objectives: Create a custom Gym environment: a 2D point mass that must navigate to a goal while avoiding an obstacle. Define a continuous action (e.g. force in x and y) and a reward function (e.g. distance to goal, penalties for the obstacle or boundary). Test the environment with a SAC (or PPO) agent and verify that the agent can learn to reach the goal.

Concept and real-world RL: Custom environments let you model robot navigation, recommendation (state = user, action = item), or trading (state = market, action = trade). A 2D point mass is a minimal continuous control task: state = (x, y, vx, vy), action = (fx, fy), reward = -distance to goal + penalties. In robot control, similar point-mass or particle models are used for planning and RL; in game AI, custom envs are used for prototyping. Implementing the Gym interface (reset, step, observation_space, action_space) and testing with a known algorithm (SAC) validates the design. ...
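A minimal sketch of such an environment, assuming simple Euler integration on the unit square; the constants (goal position, obstacle radius, reward weights) are illustrative choices, and a real version would subclass `gymnasium.Env` and declare `observation_space`/`action_space`:

```python
import numpy as np

class PointMassEnv:
    """Gym-style 2D point mass: state (x, y, vx, vy), action (fx, fy).
    All constants here are illustrative, not prescribed by the chapter."""

    def __init__(self, goal=(0.8, 0.8), obstacle=(0.4, 0.4),
                 obstacle_radius=0.1, dt=0.05, max_force=1.0):
        self.goal = np.array(goal)
        self.obstacle = np.array(obstacle)
        self.obstacle_radius = obstacle_radius
        self.dt = dt
        self.max_force = max_force
        self.state = None

    def reset(self):
        self.state = np.zeros(4)  # start at the origin, at rest
        return self.state.copy()

    def step(self, action):
        f = np.clip(np.asarray(action, dtype=float),
                    -self.max_force, self.max_force)
        x, y, vx, vy = self.state
        vx += f[0] * self.dt              # simple Euler integration
        vy += f[1] * self.dt
        x = np.clip(x + vx * self.dt, 0.0, 1.0)   # stay inside the unit square
        y = np.clip(y + vy * self.dt, 0.0, 1.0)
        self.state = np.array([x, y, vx, vy])
        dist_goal = np.linalg.norm(self.state[:2] - self.goal)
        reward = -dist_goal               # dense shaping toward the goal
        if np.linalg.norm(self.state[:2] - self.obstacle) < self.obstacle_radius:
            reward -= 1.0                 # obstacle penalty
        done = dist_goal < 0.05
        return self.state.copy(), reward, done, {}
```

Rolling a trained SAC policy through `reset`/`step` and checking that episodes end near the goal is the validation step the objectives describe.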

March 10, 2026 · 3 min · 525 words · codefrydev

Chapter 50: Advanced Hyperparameter Tuning

Learning objectives: Use Weights & Biases (or similar) to run a hyperparameter sweep for SAC on your custom environment (or a standard one). Sweep over learning rate, entropy coefficient (or the auto-\(\alpha\) target), and network size (hidden dims). Visualize the effect on final return and learning speed (e.g. steps to reach a threshold).

Concept and real-world RL: Hyperparameter tuning is essential for getting the best from RL algorithms; sweeps (grid or random search over learning rate, network size, etc.) are standard in research and industry. Weights & Biases (wandb) logs metrics and supports sweep configs; similar tools include MLflow, Optuna, and Ray Tune. In robot control and game AI, tuning the learning rate and entropy (or clip range for PPO) often has the largest impact. Automating sweeps saves time and makes results reproducible. ...
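A sweep over the three quantities named above, in the dictionary format wandb accepts, might look like the sketch below; the metric name `final_return`, the project name, and the swept ranges are illustrative assumptions, not prescriptions:

```python
# Hypothetical wandb sweep config; the metric name ("final_return") and
# the swept ranges are illustrative choices, not from the chapter.
sweep_config = {
    "method": "random",  # random search; "grid" and "bayes" are also supported
    "metric": {"name": "final_return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 1e-4, "max": 1e-3},
        "target_entropy": {"values": [-1.0, -2.0, -4.0]},  # auto-alpha target
        "hidden_dim": {"values": [64, 128, 256]},          # network size
    },
}

# With wandb installed, launching would look roughly like:
#   sweep_id = wandb.sweep(sweep_config, project="sac-pointmass")
#   wandb.agent(sweep_id, function=train)  # train() reads wandb.config
```

Random search with a log-uniform learning-rate range is a common default when you can only afford a few dozen runs.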

March 10, 2026 · 3 min · 473 words · codefrydev

Chapter 51: Model-Free vs. Model-Based RL

Learning objectives: Compare model-free (e.g. PPO) and model-based (e.g. Dreamer) RL in terms of sample efficiency on a continuous control task like Walker. Explain why model-based methods can achieve more reward per real environment step (use of imagined rollouts). Identify the trade-offs: model bias, computation, and implementation complexity.

Concept and real-world RL: Model-free methods learn a policy or value function directly from experience; model-based methods learn a dynamics model and use it for planning or imagined rollouts. Model-based RL can be more sample-efficient because each real transition can be reused many times in the model (short rollouts, planning). In robot navigation and trading, where real data is expensive, sample efficiency matters; in game AI, model-based methods (e.g. MuZero) combine learning and planning. The downside is model error (compounding over long rollouts) and extra computation. ...

March 10, 2026 · 3 min · 446 words · codefrydev

Chapter 52: Learning World Models

Learning objectives: Collect random trajectories from CartPole and train a neural network to predict the next state given (state, action). Evaluate prediction accuracy over 1, 5, and 10 steps; observe the error compounding as the horizon grows. Relate model error to the limitations of long-horizon model-based rollouts.

Concept and real-world RL: A world model (or dynamics model) predicts \(s_{t+1}\) from \(s_t, a_t\). We can train it on collected data (e.g. with an MSE loss). Errors compound over multi-step rollouts: a small 1-step error becomes large after many steps. In robot navigation, learned models are used for short-horizon planning; in game AI (e.g. Dreamer), models are used in latent space to reduce dimensionality and control rollouts. Understanding compounding error is key to designing model-based algorithms. ...
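The compounding-error experiment can be miniaturized with a toy linear system standing in for CartPole and a least-squares fit standing in for the neural network; the dynamics matrices, noise scale, and horizons below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear dynamics s' = A s + B a + noise, standing in for CartPole
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])

def true_step(s, a):
    return A @ s + B @ a + rng.normal(scale=0.01, size=2)

# Collect random transitions, as the objectives describe
S, Acts, S_next = [], [], []
s = np.zeros(2)
for _ in range(500):
    a = rng.uniform(-1, 1, size=1)
    s2 = true_step(s, a)
    S.append(s); Acts.append(a); S_next.append(s2)
    s = s2

# Fit a linear "world model" by least squares (MSE loss in closed form)
X = np.hstack([np.array(S), np.array(Acts)])   # inputs (s, a)
Y = np.array(S_next)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def model_step(s, a):
    return np.concatenate([s, a]) @ W

def rollout_error(horizon, trials=50):
    """Mean gap between a real trajectory and an open-loop model rollout
    after `horizon` steps; grows with the horizon as errors compound."""
    errs = []
    for _ in range(trials):
        s_true = rng.uniform(-1, 1, size=2)
        s_model = s_true.copy()
        for _ in range(horizon):
            a = rng.uniform(-1, 1, size=1)
            s_true = true_step(s_true, a)
            s_model = model_step(s_model, a)
        errs.append(np.linalg.norm(s_true - s_model))
    return float(np.mean(errs))
```

Comparing `rollout_error(1)`, `rollout_error(5)`, and `rollout_error(10)` reproduces the qualitative finding: multi-step error is substantially larger than 1-step error even when the 1-step fit is good.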

March 10, 2026 · 3 min · 442 words · codefrydev

Chapter 53: Planning with Known Models

Learning objectives: Implement a planner using breadth-first search (BFS) for a gridworld with known deterministic dynamics. Recover the optimal policy (path to goal) and compare with dynamic programming (value iteration) in terms of computation and result. Relate BFS to shortest-path planning in robot navigation.

Concept and real-world RL: When the model is known and deterministic, we can plan without learning: BFS finds the shortest path from start to goal; value iteration computes optimal values for all states. In robot navigation (grid or graph), BFS is used for pathfinding; DP is used when we need values everywhere (e.g. for reward shaping). Both assume the model is correct; in RL we often learn the model or the value function from data. ...
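A minimal BFS planner over such a gridworld might look like this; the grid layout is an illustrative example:

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path on a 4-connected grid; grid[r][c] == 1 marks an obstacle.
    Returns the list of cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}        # visited set doubling as path memory
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:          # reconstruct path by walking parents back
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = bfs_path(grid, (0, 0), (2, 0))  # must detour around the wall in row 1
```

Value iteration on the same grid would visit every state repeatedly until convergence; BFS touches each reachable state once, which is why it wins on computation when only one start-goal path is needed.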

March 10, 2026 · 3 min · 443 words · codefrydev

Chapter 54: Monte Carlo Tree Search (MCTS)

Learning objectives: Implement MCTS for a small game (e.g. tic-tac-toe): selection (UCT), expansion, simulation (rollout), backpropagation. Use UCT (Upper Confidence bounds applied to Trees) for node selection: \(\frac{Q(s,a)}{N(s,a)} + c \sqrt{\frac{\log N(s)}{N(s,a)}}\). Evaluate the win rate against a random opponent.

Concept and real-world RL: MCTS builds a search tree by repeatedly selecting a leaf (UCT), expanding it, doing a random rollout to the end, and backpropagating the result. It does not require a learned value function (though it can use one, as in AlphaZero). In game AI (chess, Go, tic-tac-toe), MCTS is used for planning and action selection; it balances exploration (trying undervisited moves) and exploitation (favoring good moves). ...
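The UCT selection rule can be sketched directly from the formula above; the constant \(c = 1.4 \approx \sqrt{2}\) is a common default, and the child statistics below are toy numbers:

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCT: mean value plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """children: list of (total_value, visits) stats under one parent node.
    Returns the index of the child the selection phase would descend into."""
    parent_visits = sum(v for _, v in children)
    scores = [uct_score(q, v, parent_visits) for q, v in children]
    return scores.index(max(scores))
```

With two children sharing the same mean value, UCT picks the less-visited one: the exploration bonus dominates, which is exactly the exploration/exploitation balance the excerpt describes.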

March 10, 2026 · 3 min · 444 words · codefrydev

Chapter 55: AlphaZero Architecture

Learning objectives: Implement a simplified AlphaZero for tic-tac-toe: a neural network that outputs a policy (move probabilities) and a value (expected outcome). Use the network inside MCTS: the policy as a prior during expansion, the value for leaf evaluation (replacing random rollouts). Train via self-play: generate games, train the network on (state, policy target, value target), and repeat.

Concept and real-world RL: AlphaZero combines MCTS with a neural network: the network provides a prior over moves and a value for leaf states, so MCTS does not need random rollouts. Training is self-play: the current network plays against itself; the MCTS visit distribution and the game outcome become the network's training targets. In game AI (chess, Go, shogi), AlphaZero achieves superhuman play. The same idea (planning with a learned model/value) appears in robot planning and dialogue. ...
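AlphaZero replaces plain UCT with a prior-weighted selection rule, often written \(Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \sqrt{N(s)} / (1 + N(s,a))\), where \(P\) comes from the policy head. A sketch with toy child statistics (the constant and numbers are illustrative):

```python
import math

def puct_score(q, visits, prior, parent_visits, c_puct=1.5):
    """AlphaZero-style PUCT: the network prior steers exploration,
    and the value head's Q replaces random-rollout returns."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

# children: (mean value Q, visit count N, network prior P) -- toy numbers
children = [(0.1, 10, 0.6), (0.3, 2, 0.3), (0.0, 0, 0.1)]
parent_n = sum(n for _, n, _ in children)
best = max(range(len(children)),
           key=lambda i: puct_score(*children[i], parent_n))
```

Note the contrast with plain UCT: a child with a tiny prior gets little exploration bonus even when unvisited, so the network's policy head prunes the search before any value estimate exists.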

March 10, 2026 · 3 min · 460 words · codefrydev

Chapter 56: MuZero Intuition

Learning objectives: Read a MuZero paper summary and explain how MuZero learns a model in latent space without access to the true environment dynamics. Explain how MuZero handles reward and value prediction in the latent space. Contrast with AlphaZero (which uses the true game rules).

Concept and real-world RL: MuZero learns a latent dynamics model: instead of predicting the raw next state, it predicts the next latent state together with the reward, value, and policy. The “model” is thus learned end-to-end for the purpose of planning; it does not need to match the true state. This lets MuZero work in video games and other domains where the rules are unknown. In game AI, MuZero achieves strong results on Atari and board games without hand-coded dynamics. ...

March 10, 2026 · 3 min · 468 words · codefrydev