Chapter 46: Maximum Entropy RL

Learning objectives
- Derive or state the maximum entropy objective: maximize \(\mathbb{E}\big[ \sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \big]\) (or an equivalent form), where \(\mathcal{H}\) is entropy.
- Explain how the entropy term encourages exploration: higher entropy means a more uniform action distribution, so the policy tries more actions.
- Contrast with standard expected-return maximization (no entropy bonus).

Concept and real-world RL
Maximum entropy RL adds an entropy bonus to the objective so the agent maximizes both return and policy entropy. The optimal policy under this objective is more stochastic (it explores more) and is often easier to learn (multiple modes, robustness). In robot control, SAC (Soft Actor-Critic) uses this idea with automatic temperature tuning; in game AI and recommendation, entropy regularization (e.g. in PPO) keeps the policy from becoming too deterministic too fast. The temperature \(\alpha\) (or an equivalent coefficient) controls the trade-off between return and entropy. ...
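The per-step objective above is just expected return plus an entropy bonus, which can be checked numerically. A minimal sketch (function names `entropy` and `soft_objective` are my own, not from any library): a uniform policy earns the largest entropy bonus, a near-deterministic one almost none.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + 1e-12)))

def soft_objective(expected_return, probs, alpha):
    """Per-step maximum entropy objective: E[r] + alpha * H(pi)."""
    return expected_return + alpha * entropy(probs)

# Uniform over 4 actions has maximal entropy (log 4 ~ 1.386);
# a peaked policy has near-zero entropy, so with equal expected
# return the uniform policy scores higher under the soft objective.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
```

With \(\alpha = 0\) the soft objective reduces to the standard expected-return objective, which is exactly the contrast the objectives above ask for.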

March 10, 2026 · 3 min · 500 words · codefrydev

Chapter 47: Soft Actor-Critic (SAC)

Learning objectives
- Implement SAC (Soft Actor-Critic) for HalfCheetah: two Q-networks (take the min for the target), a policy that maximizes \(Q - \alpha \log \pi\), and automatic temperature tuning so \(\alpha\) targets a desired entropy.
- Train and compare sample efficiency with PPO (same env, same or similar compute).

Concept and real-world RL
SAC combines maximum entropy RL with actor-critic: the critic learns two Q-functions (taking the min for the target to reduce overestimation); the actor maximizes \(\mathbb{E}[ Q(s,a) - \alpha \log \pi(a|s) ]\); and \(\alpha\) is updated to keep the policy entropy near a target (e.g. \(-\dim(a)\), the negative action dimension). SAC is off-policy (replay buffer), so it is often more sample-efficient than PPO on continuous control. In robot control (HalfCheetah, Hopper, Walker), SAC is a standard baseline; in recommendation and trading, off-policy max-ent methods can improve exploration and stability. ...
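The two SAC-specific computations, the soft TD target and the temperature gradient, can be sketched without a deep learning framework. A minimal numerical sketch (the function names and the `log_alpha` parameterization are my own conventions, not from any particular codebase):

```python
import numpy as np

def soft_td_target(r, done, q1_next, q2_next, log_pi_next, alpha, gamma=0.99):
    """SAC critic target: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).
    The min over the twin target Q-networks reduces overestimation."""
    soft_value = np.minimum(q1_next, q2_next) - alpha * log_pi_next
    return r + gamma * (1.0 - done) * soft_value

def alpha_gradient(log_pi, log_alpha, target_entropy):
    """Gradient of the temperature loss J(alpha) = -alpha * E[log pi + H_target]
    w.r.t. log_alpha. Descending it raises alpha when policy entropy
    (-E[log pi]) is below the target, and lowers alpha when it is above."""
    return -np.exp(log_alpha) * np.mean(log_pi + target_entropy)
```

In a real implementation these scalars are batched tensors and the gradient step on `log_alpha` is taken by the optimizer; the signs and the twin-Q min are the parts worth checking by hand.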

March 10, 2026 · 3 min · 519 words · codefrydev

Chapter 48: SAC vs. PPO

Learning objectives
- Run SAC and PPO on the same continuous control tasks (e.g. Hopper, Walker2d).
- Compare final performance, sample efficiency (return vs env steps), and wall-clock time.
- Discuss when to choose one over the other (sample efficiency, stability, tuning, off-policy vs on-policy).

Concept and real-world RL
SAC is off-policy (replay buffer) and maximizes entropy; PPO is on-policy (rollouts) and uses a clipped objective. SAC often achieves higher sample efficiency (fewer env steps to reach good performance) but can be sensitive to hyperparameters and replay buffer size; PPO is more robust and easier to tune in many settings. In robot control benchmarks (Hopper, Walker2d, HalfCheetah), both are standard; in game AI and RLHF, PPO is more common. The choice depends on data cost (can we afford many env steps?), the need for off-policy learning (e.g. using logged data), and engineering preference. ...
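The two objectives being contrasted can be written side by side per sample. A sketch (function names are my own; PPO's ratio is \(r = \pi_\theta(a|s)/\pi_{\text{old}}(a|s)\)):

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO per-sample surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    The min makes the objective pessimistic: large policy-ratio moves
    never get extra credit, so updates stay near the old policy."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def sac_actor_objective(q, log_pi, alpha):
    """SAC per-sample actor objective: Q(s,a) - alpha * log pi(a|s).
    There is no ratio or clipping: off-policy data from the replay
    buffer is used directly, with entropy as the regularizer."""
    return q - alpha * log_pi
```

This is the structural difference behind the trade-offs above: PPO constrains how far each update moves from the data-collecting policy, while SAC relies on the replay buffer and entropy bonus instead.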

March 10, 2026 · 3 min · 481 words · codefrydev

Chapter 49: Custom Gym Environments (Part 2)

Learning objectives
- Create a custom Gym environment: a 2D point mass that must navigate to a goal while avoiding an obstacle.
- Define a continuous action (e.g. force in x and y) and a reward function (e.g. distance to goal, with penalties for the obstacle or boundary).
- Test the environment with a SAC (or PPO) agent and verify that the agent can learn to reach the goal.

Concept and real-world RL
Custom environments let you model robot navigation, recommendation (state = user, action = item), or trading (state = market, action = trade). A 2D point mass is a minimal continuous control task: state = (x, y, vx, vy), action = (fx, fy), reward = -distance to goal + penalties. In robot control, similar point-mass or particle models are used for planning and RL; in game AI, custom envs are used for prototyping. Implementing the Gym interface (reset, step, observation_space, action_space) and testing with a known algorithm (SAC) validates the design. ...
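A dependency-free sketch of the point-mass dynamics and reward described above, following the Gymnasium-style `reset`/`step` interface (goal, obstacle, and penalty values are illustrative choices; a real version would subclass `gymnasium.Env` and declare `observation_space`/`action_space` as `spaces.Box`):

```python
import numpy as np

class PointMassEnv:
    """2D point mass: state (x, y, vx, vy), action (fx, fy) in [-1, 1]^2,
    reward = -distance to goal, minus a penalty inside a circular obstacle."""

    def __init__(self, goal=(1.0, 1.0), obstacle=(0.5, 0.5),
                 obstacle_radius=0.15, dt=0.05, max_steps=200):
        self.goal = np.array(goal)
        self.obstacle = np.array(obstacle)
        self.obstacle_radius = obstacle_radius
        self.dt = dt
        self.max_steps = max_steps

    def reset(self, seed=None):
        rng = np.random.default_rng(seed)
        self.state = np.zeros(4)
        self.state[:2] = rng.uniform(-0.1, 0.1, size=2)  # jittered start
        self.t = 0
        return self.state.copy(), {}

    def step(self, action):
        fx, fy = np.clip(action, -1.0, 1.0)
        x, y, vx, vy = self.state
        vx, vy = vx + fx * self.dt, vy + fy * self.dt   # force integrates to velocity
        x, y = x + vx * self.dt, y + vy * self.dt       # velocity integrates to position
        self.state = np.array([x, y, vx, vy])
        self.t += 1
        dist_goal = np.linalg.norm(self.state[:2] - self.goal)
        reward = -dist_goal
        if np.linalg.norm(self.state[:2] - self.obstacle) < self.obstacle_radius:
            reward -= 1.0  # obstacle penalty
        terminated = bool(dist_goal < 0.05)
        truncated = self.t >= self.max_steps
        return self.state.copy(), float(reward), terminated, truncated, {}
```

A quick sanity loop (random actions, check reward signs and episode termination) is worth running before handing the env to SAC.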

March 10, 2026 · 3 min · 525 words · codefrydev

Chapter 50: Advanced Hyperparameter Tuning

Learning objectives
- Use Weights & Biases (or a similar tool) to run a hyperparameter sweep for SAC on your custom environment (or a standard one).
- Sweep over learning rate, entropy coefficient (or the auto-\(\alpha\) target), and network size (hidden dims).
- Visualize the effect on final return and learning speed (e.g. steps to reach a threshold).

Concept and real-world RL
Hyperparameter tuning is essential for getting the best out of RL algorithms; sweeps (grid or random search over learning rate, network size, etc.) are standard in research and industry. Weights & Biases (wandb) logs metrics and supports sweep configs; similar tools include MLflow, Optuna, and Ray Tune. In robot control and game AI, tuning the learning rate and entropy coefficient (or clip range for PPO) often has the largest impact. Automating sweeps saves time and makes results reproducible. ...
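A sweep over the three quantities named above can be expressed as a config dict in the shape `wandb.sweep` accepts (the metric name `final_return` and the parameter names are placeholders for whatever your training script actually logs and reads):

```python
# Random-search sweep over SAC hyperparameters, wandb-style config.
# "final_return" must match a metric your script logs via wandb.log;
# parameter names must match what your script reads from wandb.config.
sweep_config = {
    "method": "random",
    "metric": {"name": "final_return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values", "min": 1e-4, "max": 1e-2,
        },
        "target_entropy": {"values": [-1.0, -2.0, -4.0]},
        "hidden_dim": {"values": [64, 128, 256]},
    },
}
```

Learning rates are swept on a log scale because their effect spans orders of magnitude; discrete `values` lists keep the entropy target and network size interpretable in the sweep dashboard.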

March 10, 2026 · 3 min · 473 words · codefrydev

Chapter 58: Model-Based Policy Optimization (MBPO)

Learning objectives
- Implement MBPO: learn an ensemble of dynamics models, generate short rollouts from real states, add the imagined transitions to the replay buffer, and train SAC on the combined buffer.
- Compare sample efficiency with SAC alone (same number of real env steps).
- Explain why short rollouts (e.g. 1–5 steps) help avoid compounding model error.

Concept and real-world RL
MBPO (Model-Based Policy Optimization) uses learned dynamics to augment the replay buffer: from a real state, roll the model out for a few steps and add the imagined (s, a, r, s') transitions to the buffer. SAC (or another off-policy method) then trains on real + imagined data. Short rollouts keep model error manageable. In robot control and trading, MBPO can significantly reduce the number of real steps needed to reach good performance. ...
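The rollout-generation step can be sketched independently of how the models are trained. A minimal sketch (the function name and the callable signatures `model(s, a) -> (s', r)` and `policy(s) -> a` are my own conventions, not from the MBPO codebase):

```python
import numpy as np

def imagined_rollouts(model_ensemble, policy, real_states, horizon=3, rng=None):
    """MBPO-style augmentation: from each real state, roll a randomly chosen
    ensemble member forward `horizon` steps under the current policy,
    collecting imagined (s, a, r, s') transitions. Short horizons (1-5)
    keep compounding model error small; sampling the member per step
    lets ensemble disagreement show up as diversity in the data."""
    if rng is None:
        rng = np.random.default_rng(0)
    buffer = []
    for s in real_states:
        for _ in range(horizon):
            a = policy(s)
            model = model_ensemble[rng.integers(len(model_ensemble))]
            s_next, r = model(s, a)
            buffer.append((s, a, r, s_next))
            s = s_next  # branch continues from the imagined state
    return buffer
```

In full MBPO these imagined transitions go into a separate model buffer and SAC samples a mixture of real and imagined data each update.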

March 10, 2026 · 3 min · 475 words · codefrydev

Chapter 71: The Offline RL Problem

Learning objectives
- Collect a dataset of transitions (state, action, reward, next_state, done) from a random policy (or a fixed behavior policy) in the Hopper environment.
- Train a standard SAC agent offline (no environment interaction) on this dataset and observe the overestimation of Q-values for out-of-distribution (OOD) actions.
- Explain why naive off-policy methods fail in offline RL: the policy is trained to maximize Q, but Q is only trained on in-distribution actions; for OOD actions Q can be overestimated.
- Identify the distributional shift between the behavior policy (which collected the data) and the learned policy.
- Relate the offline RL problem to recommendation and healthcare, where data comes from logs or historical trials.

Concept and real-world RL ...
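The OOD-overestimation failure mode can be reproduced in a few lines without any neural network: fit a critic only on in-distribution actions, then let policy improvement search over all actions. A toy sketch (the linear fit stands in for a function approximator; true values and ranges are made up for illustration):

```python
import numpy as np

# True action value is -a^2, but the behavior policy only tried a in [-0.4, 0.4].
a_data = np.linspace(-0.4, 0.4, 9)
true_q = lambda a: -a**2
coefs = np.polyfit(a_data, true_q(a_data), deg=1)  # "critic" fit on data only

# Policy improvement greedily maximizes the fitted Q over ALL actions,
# including ones the critic never saw.
a_grid = np.linspace(-2.0, 2.0, 401)
q_hat = np.polyval(coefs, a_grid)
greedy_a = a_grid[np.argmax(q_hat)]

# The linear fit extrapolates a near-flat value to a = 2, where the true
# value is -4: a large overestimate, and the greedy action is far OOD.
overestimate = np.polyval(coefs, 2.0) - true_q(2.0)
```

This is distributional shift in miniature: the critic is only accurate where the behavior policy put data, yet the learned policy is pushed exactly toward the regions where the critic is wrong.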

March 10, 2026 · 4 min · 723 words · codefrydev

Chapter 79: Offline-to-Online Finetuning

Learning objectives
- Pretrain a SAC (or similar) agent offline on a fixed dataset (e.g. from a mix of policies, or from Chapter 71).
- Finetune the agent online by continuing training with environment interaction.
- Compare the learning curve (return vs steps) of finetuning from offline pretraining vs training from scratch.
- Implement a Q-filter: when updating the policy, avoid or downweight updates that use actions for which Q is below a threshold (to avoid reinforcing "bad" actions that could destabilize the policy).
- Relate offline-to-online finetuning to recommendation (pretrain on logs, then A/B test) and healthcare (pretrain on historical data, then cautious online updates).

Concept and real-world RL ...
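One simple instantiation of the Q-filter idea is a binary mask on the policy update. A sketch under my own assumptions (the threshold rule and the weighted objective below are one illustrative choice, not a canonical definition):

```python
import numpy as np

def q_filter_weights(q_values, threshold):
    """Mask out policy-update terms for actions whose critic value falls
    below `threshold`, so online finetuning does not reinforce low-value
    actions that could destabilize the pretrained policy."""
    q_values = np.asarray(q_values, dtype=float)
    return (q_values >= threshold).astype(float)

def filtered_policy_loss(log_pi, q_values, threshold):
    """Negative of a masked, Q-weighted log-likelihood objective (sketch):
    maximize mean over kept samples of log pi(a|s) * Q(s,a)."""
    w = q_filter_weights(q_values, threshold)
    denom = max(w.sum(), 1.0)  # avoid division by zero if all filtered out
    return -float(np.sum(w * log_pi * q_values) / denom)
```

In practice the threshold might be a batch quantile of Q rather than a fixed constant, and the mask can be softened into a downweighting; the point is that filtered samples contribute no gradient pushing the policy toward low-Q actions.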

March 10, 2026 · 4 min · 756 words · codefrydev