A Step-by-Step PyTorch Implementation of DDPG Reinforcement Learning

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy deep reinforcement learning algorithm inspired by Deep Q-Network (DQN). It is an Actor-Critic method based on policy gradients. This article provides a complete implementation and walkthrough in PyTorch.

The key components of DDPG are:

  • Replay Buffer
  • Actor-Critic neural network
  • Exploration Noise
  • Target network
  • Soft Target Updates for Target Network

Below we implement them one by one:

Replay Buffer

DDPG uses a Replay Buffer to store the transitions and rewards (Sₜ, Aₜ, Rₜ, Sₜ₊₁) sampled while exploring the environment. The Replay Buffer plays a crucial role in helping the agent learn faster and in keeping DDPG stable:

  • Minimizes correlation between samples: storing past experiences in the Replay Buffer lets the agent learn from a wide variety of experiences.
  • Enables off-policy learning: the agent can sample transitions from the replay buffer instead of sampling them only from the current policy.
  • Efficient sampling: keeping past experiences in the buffer allows the agent to learn from the same experiences multiple times.
import numpy as np

class Replay_buffer():
    '''
    Code based on:
    https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
    Expects tuples of (state, next_state, action, reward, done)
    '''
    def __init__(self, max_size=capacity):
        """Create Replay buffer.

        Parameters
        ----------
        max_size: int
            Max number of transitions to store in the buffer. When the buffer
            overflows, the old memories are dropped.
        """
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def push(self, data):
        # Overwrite the oldest transition once the buffer is full
        if len(self.storage) == self.max_size:
            self.storage[int(self.ptr)] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        """Sample a batch of experiences.

        Parameters
        ----------
        batch_size: int
            How many transitions to sample.

        Returns
        -------
        state: np.array
            batch of states or observations
        action: np.array
            batch of actions executed given a state
        reward: np.array
            rewards received as results of executing action
        next_state: np.array
            next states or observations seen after executing action
        done: np.array
            done[i] = 1 if executing action[i] resulted in
            the end of an episode and 0 otherwise.
        """
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        state, next_state, action, reward, done = [], [], [], [], []

        for i in ind:
            st, n_st, act, rew, dn = self.storage[i]
            state.append(np.array(st, copy=False))
            next_state.append(np.array(n_st, copy=False))
            action.append(np.array(act, copy=False))
            reward.append(np.array(rew, copy=False))
            done.append(np.array(dn, copy=False))

        return np.array(state), np.array(next_state), np.array(action), np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1)
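To make the interface concrete, here is a minimal, hypothetical usage sketch of the Replay_buffer class above; the observation/action dimensions and the capacity value are illustrative assumptions, not part of the original article.

import numpy as np

# Illustrative usage sketch of the Replay_buffer class defined above.
capacity = 1000000                            # assumed here; defined as a hyperparameter later in the article
buffer = Replay_buffer(max_size=capacity)

# Push a few fake transitions of the form (state, next_state, action, reward, done)
for _ in range(128):
    s = np.random.randn(2)                    # e.g. a 2-dimensional observation
    s2 = np.random.randn(2)
    a = np.random.uniform(-1, 1, size=1)      # a 1-dimensional continuous action
    r = np.random.randn()
    d = 0.0
    buffer.push((s, s2, a, r, d))

# Sample a mini-batch; shapes: (64, 2), (64, 2), (64, 1), (64, 1), (64, 1)
state, next_state, action, reward, done = buffer.sample(batch_size=64)
print(state.shape, action.shape, reward.shape, done.shape)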

Actor-Critic Neural Network

This is the PyTorch implementation of the Actor-Critic networks. The code defines two neural network models: an Actor and a Critic.

The Actor model takes the environment state as input and outputs an action with continuous values.

The Critic model takes the environment state and an action as input and outputs a Q-value, the expected total reward for the current state-action pair.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """
    The Actor model takes in a state observation as input and outputs an action,
    which is a continuous value. It consists of four fully connected linear layers
    with ReLU activation functions, and a final output layer selects one single
    optimized action for the state.
    """
    def __init__(self, n_states, action_dim, hidden1):
        super(Actor, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, action_dim)   # one output per action dimension
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """
    The Critic model takes in both a state observation and an action as input and
    outputs a Q-value, which estimates the expected total reward for the current
    state-action pair. It consists of four linear layers with ReLU activation
    functions; state and action inputs are concatenated before being fed into the
    first linear layer. The output layer has a single output, representing the Q-value.
    """
    def __init__(self, n_states, action_dim, hidden2):
        super(Critic, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + action_dim, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, 1)            # a single Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat((state, action), 1))
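As a quick sanity check, the sketch below instantiates the two networks with dummy dimensions and runs a forward pass; the batch size and dimensions are assumptions made only for this example.

import torch

# Hypothetical shape check for the Actor and Critic defined above.
n_states, action_dim = 2, 1                  # e.g. MountainCarContinuous-v0 dimensions
actor = Actor(n_states, action_dim, hidden1=20)
critic = Critic(n_states, action_dim, hidden2=64)

state = torch.randn(8, n_states)             # a batch of 8 dummy states
action = actor(state)                        # -> shape (8, action_dim)
q_value = critic(state, action)              # -> shape (8, 1)
print(action.shape, q_value.shape)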

Exploration Noise

Adding noise to the actions selected by the Actor is a technique used in DDPG to encourage exploration and improve the learning process.

Either Gaussian noise or Ornstein-Uhlenbeck noise can be used. Gaussian noise is simple and easy to implement, while Ornstein-Uhlenbeck noise generates temporally correlated noise, which can help the agent explore the action space more effectively. Compared with Gaussian noise, however, Ornstein-Uhlenbeck noise fluctuates more smoothly and is less random.

import numpy as np
import random
import copy

class OU_Noise(object):
    """Ornstein-Uhlenbeck process.

    code from:
    https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab

    The OU_Noise class has four attributes:
        size: the size of the noise vector to be generated
        mu: the mean of the noise, set to 0 by default
        theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
        sigma: the volatility of the noise, controlling the magnitude of fluctuations
    """
    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.seed = random.seed(seed)
        self.reset()

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update the internal state and return it as a noise sample.
        This method uses the current state of the noise and generates the next sample.
        """
        dx = self.theta * (self.mu - self.state) + self.sigma * np.array([np.random.normal() for _ in range(len(self.state))])
        self.state += dx
        return self.state

To use Gaussian noise in DDPG, simply add Gaussian noise directly to the agent's action-selection process, as shown in the sketch below.
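For reference, here is a minimal sketch of that Gaussian-noise variant, assuming an `agent` with a select_action() method like the DDPG class below and a `max_action` bound taken from the environment; the noise scale is an illustrative choice.

import numpy as np

# Minimal sketch: Gaussian exploration noise added at action-selection time.
def select_action_with_gaussian_noise(agent, state, action_dim, max_action, noise_std=0.1):
    action = agent.select_action(state)                         # deterministic action from the Actor
    noise = np.random.normal(0, noise_std * max_action, size=action_dim)
    return (action + noise).clip(-max_action, max_action)       # keep the action within the valid range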

DDPG

DDPG (Deep Deterministic Policy Gradient) uses two sets of Actor-Critic neural networks for function approximation. In DDPG, the target network is also an Actor-Critic pair, with the same architecture and parameterization as the main Actor-Critic network.

During training, the agent interacts with the environment using its Actor-Critic network and stores the experience tuples (Sₜ, Aₜ, Rₜ, Sₜ₊₁) in the Replay Buffer. The agent then samples from the Replay Buffer and uses the data to update the Actor-Critic network. Instead of updating the target network weights by copying them directly from the Actor-Critic network, the DDPG algorithm updates them slowly through a process called soft target updates.

In a soft target update, only a small fraction of the Actor-Critic network weights, determined by the target update rate (τ), is transferred to the target network.

The soft target update rule is:

θ_target ← τ·θ + (1 − τ)·θ_target

where θ denotes the weights of the Actor (or Critic) network and θ_target the weights of the corresponding target network.

Using this soft target update technique greatly improves the stability of learning.
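As a standalone illustration of the formula, a soft update can be written as a small helper that blends a fraction tau of the online network's weights into the target network; this mirrors the loop inside the DDPG.update() method shown below.

import torch

@torch.no_grad()
def soft_update(net, target_net, tau=0.001):
    """Blend a small fraction (tau) of the online network's weights into the target network."""
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)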

import torch
import torch.optim as optim
import torch.nn.functional as F

# Set Hyperparameters
# Hyperparameters adapted for performance
capacity = 1000000
batch_size = 64
update_iteration = 200
tau = 0.001        # tau for soft updating
gamma = 0.99       # discount factor
directory = './'
hidden1 = 20       # hidden layer size for the actor
hidden2 = 64       # hidden layer size for the critic

class DDPG(object):
    def __init__(self, state_dim, action_dim):
        """
        Initializes the DDPG agent.
        Takes two arguments:
            state_dim, the dimensionality of the state space, and
            action_dim, the dimensionality of the action space.
        Creates a replay buffer, the actor-critic networks and their corresponding
        target networks. It also initializes the optimizers for both the actor and
        critic networks, along with counters to track the number of training iterations.
        """
        self.replay_buffer = Replay_buffer()

        self.actor = Actor(state_dim, action_dim, hidden1).to(device)
        self.actor_target = Actor(state_dim, action_dim, hidden1).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)

        self.critic = Critic(state_dim, action_dim, hidden2).to(device)
        self.critic_target = Critic(state_dim, action_dim, hidden2).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)  # learning rate

        self.num_critic_update_iteration = 0
        self.num_actor_update_iteration = 0
        self.num_training = 0

    def select_action(self, state):
        """
        Takes the current state as input and returns an action to take in that state.
        It uses the actor network to map the state to an action.
        """
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def update(self):
        """
        Updates the actor and critic networks using a batch of samples from the replay buffer.
        For each sample in the batch, it computes the target Q value using the target critic
        network and the target actor network. It then computes the current Q value using the
        critic network and the action taken by the actor network.

        It computes the critic loss as the mean squared error between the target Q value and
        the current Q value, and updates the critic network using gradient descent.

        It then computes the actor loss as the negative mean Q value using the critic network
        and the actor network, and updates the actor network using gradient ascent.

        Finally, it updates the target networks using soft updates, where a small fraction of
        the actor and critic network weights are transferred to their target counterparts.
        This process is repeated for a fixed number of iterations.
        """
        for it in range(update_iteration):
            # For each sample in the replay buffer batch
            state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
            state = torch.FloatTensor(state).to(device)
            action = torch.FloatTensor(action).to(device)
            next_state = torch.FloatTensor(next_state).to(device)
            done = torch.FloatTensor(1 - done).to(device)
            reward = torch.FloatTensor(reward).to(device)

            # Compute the target Q value
            target_Q = self.critic_target(next_state, self.actor_target(next_state))
            target_Q = reward + (done * gamma * target_Q).detach()

            # Get the current Q estimate
            current_Q = self.critic(state, action)

            # Compute the critic loss
            critic_loss = F.mse_loss(current_Q, target_Q)

            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Compute the actor loss as the negative mean Q value using the critic network and the actor network
            actor_loss = -self.critic(state, self.actor(state)).mean()

            # Optimize the actor
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Update the frozen target models using soft updates, where tau, a small fraction
            # of the actor and critic network weights, is transferred to the target counterparts.
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            self.num_actor_update_iteration += 1
            self.num_critic_update_iteration += 1

    def save(self):
        """Saves the state dictionaries of the actor and critic networks to files."""
        torch.save(self.actor.state_dict(), directory + 'actor.pth')
        torch.save(self.critic.state_dict(), directory + 'critic.pth')

    def load(self):
        """Loads the state dictionaries of the actor and critic networks from files."""
        self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
        self.critic.load_state_dict(torch.load(directory + 'critic.pth'))

Training DDPG

Here we use OpenAI Gym's "MountainCarContinuous-v0" environment to train our DDPG RL model. The environment provides continuous action and observation spaces, and the goal is to get the car to the top of the mountain as quickly as possible.

Below we define the various parameters of the algorithm, such as the maximum number of training episodes, the exploration noise, and the logging interval. Fixing the random seeds makes the process reproducible.

import gym

# create the environment
env_name = 'MountainCarContinuous-v0'
env = gym.make(env_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Define different parameters for training the agent
max_episode = 100
max_time_steps = 5000
ep_r = 0
total_step = 0
score_hist = []

# for rendering the environment
render = True
render_interval = 10

# for reproducibility
env.seed(0)
torch.manual_seed(0)
np.random.seed(0)

# Environment action and state dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])
min_Val = torch.tensor(1e-7).float().to(device)

# Exploration Noise
exploration_noise = 0.1
exploration_noise = 0.1 * max_action

Create an instance of the DDPG agent class and train it for the specified number of episodes. At the end of each episode, the agent's update() method is called to update its parameters, and every ten episodes the save() method writes the agent's parameters to a file.

# Create a DDPG instance
agent = DDPG(state_dim, action_dim)

# Train the agent for max_episode episodes
for i in range(max_episode):
    total_reward = 0
    step = 0
    state = env.reset()
    for t in range(max_time_steps):
        action = agent.select_action(state)
        # Add Gaussian noise to the action for exploration
        action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
        # action += ou_noise.sample()
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        if render and i >= render_interval:
            env.render()
        agent.replay_buffer.push((state, next_state, action, reward, float(done)))
        state = next_state
        if done:
            break
        step += 1

    score_hist.append(total_reward)
    total_step += step + 1
    print("Episode: \t{}  Total Reward: \t{:0.2f}".format(i, total_reward))
    agent.update()
    if i % 10 == 0:
        agent.save()

env.close()

Testing DDPG

from itertools import count

test_iteration = 100

for i in range(test_iteration):
    state = env.reset()
    for t in count():
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(np.float32(action))
        ep_r += reward
        print(reward)
        env.render()
        if done:
            print("reward{}".format(reward))
            print("Episode \t{}, the episode reward is \t{:0.2f}".format(i, ep_r))
            ep_r = 0
            env.render()
            break
        state = next_state

We used the following parameter changes to get the model to converge:

  • Sample the noise from a standard normal distribution instead of sampling it randomly.
  • Change the polyak constant (tau) from 0.99 to 0.001.
  • Change the hidden layer sizes of the Critic network to [64, 64], and remove the ReLU activation after the second layer of the Critic, giving (Linear, ReLU, Linear, Linear); see the sketch after this list.
  • Change the maximum buffer size to 1,000,000.
  • Change the batch_size from 128 to 64.
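For clarity, here is a hypothetical sketch of the modified Critic described in the list above: two hidden layers of size 64 with the ReLU after the second linear layer removed, i.e. (Linear, ReLU, Linear, Linear). It sketches the described change and is not code from the original article.

import torch
import torch.nn as nn

# Hypothetical sketch of the modified Critic: hidden sizes [64, 64],
# no ReLU after the second linear layer, single Q-value output.
class ModifiedCritic(nn.Module):
    def __init__(self, n_states, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat((state, action), dim=1))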

Here is the result after training for 75 episodes:

Summary

The DDPG algorithm is a model-free, off-policy Actor-Critic algorithm inspired by the Deep Q-Network (DQN) algorithm. It combines the strengths of policy-gradient methods and Q-learning to learn a deterministic policy for continuous action spaces.

Like DQN, it uses a replay buffer to store past experiences and target networks for training, which improves the stability of the training process.

The DDPG algorithm requires careful hyperparameter tuning to achieve the best performance. The hyperparameters include the learning rates, batch size, target network update rate, and exploration noise parameters; small changes in these hyperparameters can have a significant impact on the algorithm's performance.

The parameters above come from:

https://avoid.overfit.cn/post/9951ac196ec84629968ce7168215e461

Author: Renu Khandelwal
