Deep Reinforcement Learning in Continuous Action Spaces: Lunar Lander

PyTorch · Gymnasium · Pygame

This study evaluates continuous deep reinforcement learning (RL) algorithms for control in a physics-based simulation using the LunarLander-v3 environment. Two actor-critic methods, Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), are benchmarked to examine how algorithmic design choices, exploration strategies, and stability mechanisms affect policy learning in continuous action spaces.

Environment and Problem Setup

LunarLander-v3 (in continuous mode) is a physics-based simulation in which the agent controls a lunar module and must land it safely on a designated pad using continuous thrust commands. The observation is an 8-dimensional vector capturing position, velocity, angle, angular velocity, and leg contact flags. The action space is a 2D continuous vector representing the throttle of the main engine and the lateral boosters.
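A minimal sketch of instantiating the environment in continuous mode with Gymnasium and inspecting its spaces (this uses the standard Gymnasium API and assumes the Box2D extras are installed):

```python
import gymnasium as gym

# Requires gymnasium[box2d]. continuous=True exposes a Box(2,) action space:
# [main engine throttle, lateral booster throttle], each in [-1, 1].
env = gym.make("LunarLander-v3", continuous=True)

print(env.observation_space)  # Box(8,): x, y, vx, vy, angle, angular velocity, leg contacts
print(env.action_space)       # Box(2,): continuous thrust commands

obs, info = env.reset(seed=0)
action = env.action_space.sample()  # random continuous action for illustration
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```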

The agent receives the following rewards: +100 for a successful landing, −100 for crashing, +10 for each leg in contact with the ground, and −0.3 per frame of main engine use. The episode terminates upon safe landing, crashing, or exceeding 1000 simulation steps. A mean reward of at least 200 over 100 episodes defines task success.
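A small sketch of the success criterion as stated above, applied to a running list of episode returns (the helper name is illustrative, not part of the original implementation):

```python
import numpy as np

def is_solved(episode_returns, threshold=200.0, window=100):
    """Task counts as solved when the mean return over the
    last `window` episodes reaches the threshold."""
    if len(episode_returns) < window:
        return False
    return float(np.mean(episode_returns[-window:])) >= threshold
```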

Algorithm Overview

DDPG is a deterministic actor-critic algorithm that extends Q-learning to continuous action spaces. It maintains separate actor and critic networks, each with a slowly updated target copy, and uses a replay buffer for stability. Exploration is handled by injecting noise into the actor's output actions.
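A condensed sketch of one DDPG update step, assuming the actor, critic, their target copies, optimizers, and a sampled replay batch already exist (all names are illustrative, not the study's exact code):

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99):
    state, action, reward, next_state, done = batch

    # Critic: regress Q(s, a) toward a bootstrapped target built from the target networks.
    with torch.no_grad():
        next_action = actor_target(next_state)
        target_q = reward + gamma * (1 - done) * critic_target(next_state, next_action)
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, actor(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```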

TD3 improves on DDPG by addressing its instability and Q-value overestimation bias. The main enhancements are clipped double Q-learning (targets use the minimum of two critic estimates), delayed actor updates, and target policy smoothing via clipped Gaussian noise added to target actions.
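A sketch of how these three mechanisms appear in the TD3 target computation and update schedule (variable names and default coefficients are illustrative; the structure follows the published TD3 algorithm, not necessarily this project's exact code):

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, next_state, reward, done,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        next_det = actor_t(next_state)
        noise = (torch.randn_like(next_det) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_det + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: take the minimum of the two target critics.
        q1 = critic1_t(next_state, next_action)
        q2 = critic2_t(next_state, next_action)
        return reward + gamma * (1 - done) * torch.min(q1, q2)

# Delayed actor updates: the actor and target networks are refreshed only
# every `policy_delay` critic updates, e.g.
#   if step % policy_delay == 0: update_actor_and_targets()
```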

Both algorithms were implemented in PyTorch, with three-layer fully connected networks (64 units per hidden layer) for both actor and critic. Layer normalization was used for stable training and a tanh output activation for action bounding. Adam optimizers with default parameters were employed.
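One possible realization of the described architecture (three fully connected layers, 64 hidden units, layer normalization, and a tanh output to bound actions); the exact layer ordering and activations in the original implementation may differ:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, hidden=64, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounds actions to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```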

The agents were trained with experience replay and soft target-network updates. Exploration noise was modeled as Gaussian noise added to the actor output during training. Evaluation performance was measured by averaging cumulative rewards across 100 noise-free test rollouts. Actor and critic losses were logged for analysis, along with 10-step simple moving averages of training rewards.
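Sketches of the soft target update, Gaussian exploration noise, and noise-free evaluation described above (helper names and the Polyak coefficient are illustrative assumptions):

```python
import torch

def soft_update(target, source, tau=0.005):
    # Polyak averaging of target parameters toward the online network.
    for t_param, param in zip(target.parameters(), source.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)

def select_action(actor, state, noise_std=0.1, max_action=1.0):
    # Training-time action: deterministic actor output plus Gaussian exploration noise.
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32))
    action = action + noise_std * torch.randn_like(action)
    return action.clamp(-max_action, max_action).numpy()

def evaluate(actor, env, episodes=100):
    # Noise-free rollouts; report the mean cumulative reward.
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = select_action(actor, obs, noise_std=0.0)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)
```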

Summary of Findings

DDPG solved the task but showed high reward variance and strong sensitivity to hyperparameters. It often crossed the success threshold only to regress afterward due to unstable updates. The best performance came from a decaying exploration noise schedule, though convergence typically required 1800–2000 episodes.

TD3 consistently achieved higher rewards with faster convergence, reaching stability in 1000–1200 episodes. It maintained lower loss variance and more stable evaluation performance. While the dual critics slightly increased per-episode runtime, the overall training time was reduced thanks to earlier convergence.

TD3 Parameter Sensitivity

The study also examined TD3's sensitivity to the actor update delay, exploration noise, and target policy smoothing. Performance was most stable when the actor was updated every two critic steps: more frequent updates increased variance, while less frequent updates slowed learning. A decaying exploration noise schedule with a large initial value balanced early exploration with later stability, outperforming fixed noise levels. Moderate policy smoothing noise best regularized the Q-value estimates, while heavier smoothing introduced bias and reduced rewards.
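A sketch of the kind of decaying noise schedule and TD3 settings this tuning points toward (the specific values here are illustrative assumptions, not the study's exact configuration):

```python
def noise_schedule(episode, sigma_start=0.3, sigma_end=0.05, decay_episodes=800):
    """Linearly decay exploration noise from sigma_start to sigma_end."""
    frac = min(episode / decay_episodes, 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

# Illustrative TD3 settings reflecting the sensitivity results:
td3_config = {
    "policy_delay": 2,                    # actor updated every two critic steps
    "policy_noise": 0.2,                  # moderate target policy smoothing
    "noise_clip": 0.5,                    # clip smoothing noise to limit bias
    "exploration_sigma": noise_schedule,  # decaying schedule, large initial value
}
```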

Key Takeaways

This benchmark highlights the importance of stability mechanisms in continuous actor-critic RL. While DDPG can achieve strong results with careful tuning, TD3's enhancements, namely twin critics, delayed actor updates, and target policy smoothing, consistently delivered higher performance in fewer episodes.

These results position TD3 as a robust baseline for continuous control environments, emphasizing the value of targeted noise strategies and controlled update frequencies in stabilizing deep RL training.

Published August 2025

