Learning objectives
- Use Weights & Biases (or similar) to run a hyperparameter sweep for SAC on your custom environment (or a standard one).
- Sweep over learning rate, entropy coefficient (or auto-\(\alpha\) target), and network size (hidden dims).
- Visualize the effect on final return and learning speed (e.g. steps to reach a threshold).
Concept and real-world RL
Hyperparameter tuning is essential for getting the best from RL algorithms; sweeps (grid or random search over learning rate, network size, etc.) are standard in research and industry. Weights & Biases (wandb) logs metrics and supports sweep configs; similar tools include MLflow, Optuna, and Ray Tune. In robot control and game AI, tuning learning rate and entropy (or clip range for PPO) often has the largest impact. Automating sweeps saves time and makes results reproducible.
Where you see this in practice: Papers and codebases report sweep ranges; W&B and Optuna are common in RL projects.
Illustration (hyperparameter sweep): different learning rates yield different final returns. The accompanying chart plots mean final return (over 3 seeds) for 4 learning-rate values.
Exercise: Use Weights & Biases to sweep over learning rate, entropy coefficient, and network size for SAC on your custom environment. Visualize the effect on final return and learning speed.
Professor’s hints
- Define a sweep config (YAML or dict): e.g. method “grid” or “random”, metric “eval/mean_return”, parameters lr (log uniform 1e-4 to 1e-2), network_size (values [64, 128, 256]), etc. Run multiple agents (one per config).
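As a sketch, the sweep config from the hint could look like the dict below. The metric name `eval/mean_return` and the parameter names are assumptions from the hint; match them to whatever keys your training script actually logs.

```python
# Hypothetical W&B sweep config (dict form) following the hint above.
# "log_uniform_values" samples lr log-uniformly between min and max.
sweep_config = {
    "method": "random",  # or "grid"
    "metric": {"name": "eval/mean_return", "goal": "maximize"},
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-4, "max": 1e-2},
        "hidden_size": {"values": [64, 128, 256]},
        "ent_coef": {"values": ["auto", 0.05, 0.1, 0.2]},
    },
}
```

You would register this with `wandb.sweep(sweep_config, project="sac-sweep")` and launch runs with `wandb.agent(...)`, each agent training one sampled config.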
- Log to wandb: `wandb.init(project="sac-sweep", config=config)`, then `wandb.log({"return": mean_return}, step=step)` at each evaluation. Use `wandb sweep` (and `wandb agent`) to launch the runs.
- Visualize: parallel-coordinates plot (each run = line, color = return), or scatter (lr vs. final return). Identify which lr and network size work best.
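Once the runs finish, picking the winner is just an aggregation over seeds. A minimal sketch, assuming you have exported `(config, seed, final_return)` records (the numbers below are illustrative, not real results):

```python
from collections import defaultdict
from statistics import mean, stdev

# Illustrative records: (config, seed, final_return). In practice you would
# pull these from the sweep's run history (e.g. via the wandb API).
records = [
    ({"lr": 1e-4, "hidden": 64}, 0, 210.0),
    ({"lr": 1e-4, "hidden": 64}, 1, 190.0),
    ({"lr": 3e-4, "hidden": 256}, 0, 340.0),
    ({"lr": 3e-4, "hidden": 256}, 1, 310.0),
]

# Group returns by config (dicts are unhashable, so key on sorted items).
by_config = defaultdict(list)
for cfg, _seed, ret in records:
    by_config[tuple(sorted(cfg.items()))].append(ret)

# Mean and std per config, then the best config by mean final return.
summary = {k: (mean(v), stdev(v)) for k, v in by_config.items()}
best = max(summary, key=lambda k: summary[k][0])
print(dict(best), summary[best])
```

Reporting mean and std per config (rather than the single best run) is what makes the comparison robust to seed luck.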
Common pitfalls
- Too few runs per config: Run at least 2–3 seeds per config so you see variance; otherwise one lucky seed can mislead.
- Sweep too large: Start with 2–3 key hyperparameters (lr, entropy/alpha, hidden size); add more only if needed.
Worked solution (warm-up: hyperparameter tuning)
Key idea: For tuning, fix the rest and vary one (or two) key hyperparameters: e.g. learning rate, \(\alpha\) or entropy coefficient, clip range for PPO, or network size. Run a few seeds per setting and compare mean final return. Use a small grid first (e.g. 3 values); expand only if the best is at the boundary. Document the best setting and the metric (e.g. mean return over last 100 episodes).
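The "expand only if the best is at the boundary" rule can be made concrete. A sketch with a stubbed training function (`train_sac` is a hypothetical stand-in for a real SAC run, with a synthetic return curve just to exercise the logic):

```python
import math
from statistics import mean

def train_sac(lr, seed):
    """Hypothetical stand-in for a real SAC run; returns a fake final return.
    The synthetic curve peaks near lr = 1e-3 purely for illustration."""
    return 300 - 100 * abs(math.log10(lr) - math.log10(1e-3)) + seed

grid = [1e-4, 3e-4, 1e-3]   # small grid first (3 values)
seeds = [0, 1, 2]           # a few seeds per setting

# Mean final return per lr, averaged over seeds.
results = {lr: mean(train_sac(lr, s) for s in seeds) for lr in grid}
best_lr = max(results, key=results.get)

# If the winner sits at the edge of the grid, widen the grid before
# trusting it; otherwise document the setting and the metric used.
at_boundary = best_lr in (grid[0], grid[-1])
print(best_lr, at_boundary)  # here 1e-3 wins and is at the boundary
```

In this synthetic case the best value lands on the grid edge, which is exactly the signal to extend the grid (e.g. add 3e-3) rather than stop.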
Extra practice
- Warm-up: Why is it important to run multiple seeds when comparing hyperparameters?
- Coding: Run a small grid: lr in [1e-4, 3e-4], hidden size in [64, 256]. For each of the 4 configs, run 2 seeds for 100k steps. Report mean and std of final return. Which config wins?
- Challenge: Use Bayesian optimization (e.g. Optuna or wandb sweep with bayes) to suggest the next hyperparameters given past results. Compare with random search after 20 runs.
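As a baseline for the challenge, random search over log-spaced learning rates fits in a few lines. The objective below is synthetic, standing in for mean final return; a Bayesian optimizer (Optuna's `create_study`/`suggest_float`, or wandb's "bayes" method) would instead propose each new lr using the scores of past trials.

```python
import math
import random

def objective(lr):
    """Synthetic stand-in for mean final return; peaks near lr = 3e-4."""
    return -(math.log10(lr) - math.log10(3e-4)) ** 2

rng = random.Random(0)  # fixed seed for reproducibility
trials = []
for _ in range(20):  # 20-run budget, as in the challenge
    # Sample lr log-uniformly in [1e-5, 1e-2].
    lr = 10 ** rng.uniform(-5, -2)
    trials.append((objective(lr), lr))

best_score, best_lr = max(trials)
print(best_lr)
```

Random search is a strong baseline precisely because log-uniform sampling covers the lr range densely; the comparison in the challenge is whether Bayesian suggestions reach the peak with fewer of the 20 trials.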