0. Background
I was trying to train an AC-based agent for a task with a large observation space. The task is similar to a huge cliff walk: the agent starts at a random point in a 20 * 20 * 5 * 9 * 9 grid world and needs to reach a target point. Every grid point is associated with a score in [0, 1], and the target points are defined as those with a score higher than 0.2.
I first tried PPO, but the agent could not learn anything, and the loss curves looked quite strange to me. So I wondered whether something was wrong with my algorithm, and I tried vanilla Actor-Critic on a small cliff walking task.
1. Task settings
The cliff walk environment is a 3 * 3 grid world, with FROM the starting point, DEAD the cliff, and DEST the destination.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
episodes = 1000
steps = 200
FREE = 0
FROM = 1
DEAD = 2
DEST = 3
cliff_env = [
[FREE, FREE, FREE],
[FREE, FREE, FREE],
[FROM, DEAD, DEST],
]
env = dict(
type="CliffWalkActorCriticEnv",
env=cliff_env,
device=DEVICE,
)
agent_cfg = dict(
type="CliffWalkActorCritic",
device=DEVICE,
gamma=0.98,
actor_cfg=dict(
device=DEVICE,
state_dim=9,
hidden_dim=128,
action_dim=4,
),
critic_cfg=dict(
device=DEVICE,
state_dim=9,
hidden_dim=128,
),
actor_lr=1e-4,
critic_lr=1e-3,
entropy_loss_coef=0.01,
advantage_coef=0.95,
)
I encode the states (observations) as one-hot vectors. The actor and critic networks are both 3-layer fully connected networks.
import torch
from torch import Tensor
from torch.nn import Linear, Module
import torch.nn.functional as F


class Actor(Module):
    def __init__(self, device, state_dim, hidden_dim, action_dim):
        super().__init__()
        self.device = torch.device(device)
        self.linear = Linear(state_dim, hidden_dim, device=self.device)
        self.mid = Linear(hidden_dim, hidden_dim, device=self.device)
        self.out = Linear(hidden_dim, action_dim, device=self.device)

    def forward(self, input: Tensor):
        # input holds observation indices; turn them into one-hot vectors
        obs_indices = input.type(torch.int64)
        x = (
            F.one_hot(obs_indices, num_classes=self.linear.in_features)
            .type(torch.float)
            .view(-1, self.linear.in_features)
        )
        x = F.relu(self.linear(x))
        x = F.relu(self.mid(x))
        out = F.softmax(self.out(x), dim=1)
        return out


class Critic(Module):
    def __init__(self, device, state_dim, hidden_dim):
        super().__init__()
        self.device = torch.device(device)
        self.linear = Linear(state_dim, hidden_dim, device=self.device)
        self.mid = Linear(hidden_dim, hidden_dim, device=self.device)
        self.out = Linear(hidden_dim, 1, device=self.device)

    def forward(self, input: Tensor):
        obs_indices = input.type(torch.int64)
        x = (
            F.one_hot(obs_indices, num_classes=self.linear.in_features)
            .type(torch.float)
            .view(-1, self.linear.in_features)
        )
        x = F.relu(self.linear(x))
        x = F.relu(self.mid(x))
        out = self.out(x)
        return out
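Just to document the expected input layout (this snippet is not part of my original code, only a quick check assuming the classes above): the networks take a column of observation indices as floats and one-hot them internally.

import torch

actor = Actor(device="cpu", state_dim=9, hidden_dim=128, action_dim=4)
critic = Critic(device="cpu", state_dim=9, hidden_dim=128)

# two observation indices for the 3 * 3 grid, shaped (-1, 1) as in take_action()
obs = torch.tensor([6.0, 3.0]).view(-1, 1)
print(actor(obs).shape)   # torch.Size([2, 4]); each row is a softmax over 4 actions
print(critic(obs).shape)  # torch.Size([2, 1]); one value estimate per observation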
Some tricks are used in the agent: an entropy loss is added to restrict the shape of the action distributions, and GAE is used to estimate the critic target in order to decrease the bias.
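For reference (not in my original post, just the standard definitions I am aiming for, with lambda the advantage_coef and c_H the entropy_loss_coef):

\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t = \sum_{l \ge 0} (\gamma \lambda)^l \, \delta_{t+l}
\]
\[
\mathcal{L} = -\mathbb{E}\!\left[\log \pi(a_t \mid s_t)\, \hat{A}_t\right]
+ 0.5\,\mathbb{E}\!\left[\left(V(s_t) - V^{\text{target}}_t\right)^2\right]
- c_H\, \mathbb{E}\!\left[\mathcal{H}\!\left(\pi(\cdot \mid s_t)\right)\right]
\]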
@AGENTS.register_module()
class CliffWalkActorCritic(BaseAgent):
    FREE = 0
    FROM = 1
    DEAD = 2
    DEST = 3

    def __init__(
        self,
        device: str,
        gamma: float,
        actor_cfg: Dict,
        critic_cfg: Optional[Dict] = None,
        actor_lr: float = 1e-4,
        critic_lr: float = 1e-3,
        entropy_loss_coef: float = 0.01,
        advantage_coef: float = 0.95,
        test_mode: bool = False,
    ):
        super().__init__(device, gamma)
        self.test_mode = test_mode
        self.actor: Module = Actor(**actor_cfg).to(self.device)
        self.entropy_loss_coef = entropy_loss_coef
        self.advantage_coef = advantage_coef
        self.critic: Module = Critic(**critic_cfg).to(self.device)
        self.lr = [actor_lr, critic_lr]
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=self.lr[0])
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=self.lr[1])
        self.start_episode = 0
        self.test_mode = test_mode

    def take_action(self, observation, **kwargs):
        if self.test_mode:
            return self.test_take_action(observation, **kwargs)
        else:
            return self.train_take_action(observation, **kwargs)

    def train_take_action(self, observation, **kwargs):
        with torch.no_grad():
            obs: Tensor = torch.tensor(
                observation, dtype=torch.float, device=self.device
            ).view(-1, 1)
            action_probs: Tensor = self.actor(obs).squeeze(0)
            action_dist = torch.distributions.Categorical(action_probs)
            action_index: int = action_dist.sample().type(torch.int64).item()
            return action_index

    def test_take_action(self, observation, **kwargs):
        with torch.no_grad():
            obs: Tensor = torch.tensor(
                observation, dtype=torch.float, device=self.device
            ).view(-1, 1)
            action_dist: Tensor = self.actor(obs).squeeze(0)
            action_index: int = action_dist.argmax().type(torch.int64).item()
            return action_index

    def update(self, transitions: dict, **kwargs):
        cur_observations: Tensor = torch.tensor(
            transitions["cur_observation"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        cur_actions: Tensor = torch.tensor(
            transitions["cur_action"], dtype=torch.int64, device=self.device
        ).view(-1, 1)
        next_observations: Tensor = torch.tensor(
            transitions["next_observation"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        rewards: Tensor = torch.tensor(
            transitions["reward"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        terminated: Tensor = torch.tensor(
            transitions["terminated"], dtype=torch.float, device=self.device
        ).view(-1, 1)

        td_target: Tensor = rewards + self.gamma * self.critic(next_observations) * (
            1 - terminated
        )
        td_error: Tensor = td_target - self.critic(cur_observations)
        advantages: Tensor = _compute_advantage(
            self.gamma, self.advantage_coef, td_error, self.device
        )
        log_probs: Tensor = torch.log(
            torch.gather(
                self.actor(cur_observations),
                dim=1,
                index=cur_actions,
            )
        )
        actor_loss: Tensor = torch.mean(-log_probs * advantages.detach())
        critic_loss: Tensor = torch.mean(
            F.mse_loss(self.critic(cur_observations), advantages.detach())
        )
        entropy_loss: Tensor = torch.mean(
            torch.distributions.Categorical(
                self.actor(cur_observations)
            ).entropy()
        )
        loss: Tensor = (
            actor_loss + 0.5 * critic_loss - self.entropy_loss_coef * entropy_loss
        )

        self.actor_opt.zero_grad()
        self.critic_opt.zero_grad()
        loss.backward()
        self.actor_opt.step()
        self.critic_opt.step()
        return actor_loss.item(), critic_loss.item(), entropy_loss.item()
This is how I implemented GAE computation.
def _compute_advantage(
    gamma: float, advantage_coef: float, td_error: Tensor, device: torch.cuda.device
):
    td_error = td_error.clone().detach().cpu().numpy()
    advantage_list = []
    advantage = 0.0
    # accumulate deltas backwards: A_t = delta_t + gamma * lambda * A_{t+1}
    for delta in td_error[::-1]:
        advantage = gamma * advantage_coef * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(
        np.array(advantage_list), dtype=torch.float, device=device
    )
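A quick sanity check (not in the original code; it assumes _compute_advantage as defined above): with advantage_coef = 0 the output should simply equal the TD errors, and with advantage_coef > 0 earlier entries accumulate discounted future deltas.

import torch

td_error = torch.tensor([[1.0], [2.0], [3.0]])

# lambda = 0: GAE degenerates to the one-step TD error
print(_compute_advantage(0.98, 0.0, td_error, "cpu"))
# tensor([[1.], [2.], [3.]])

# lambda = 0.95: each entry adds gamma * lambda times the next advantage
print(_compute_advantage(0.98, 0.95, td_error, "cpu"))
# the last entry stays 3.0; the earlier ones are larger than their raw deltas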
I wrote the environment myself. Every step gets a -1 reward, and falling off the cliff gets a -500 reward. The episode ends when 200 steps have been taken, the agent falls off the cliff, or it reaches DEST. During training the agent starts at a random non-DEAD point, and during evaluation it starts at FROM.
@ENVIRONMENTS.register_module()
class CliffWalkActorCriticEnv:
    INT = torch.int64
    FLOAT = torch.float
    FREE = 0
    FROM = 1
    DEAD = 2
    DEST = 3

    def __init__(
        self,
        env: List[int],
        device: Optional[str],
        save_path: str = None,
    ):
        self.device = torch.device(device)
        self.env: array = np.array(env, dtype=np.int64)
        assert (
            len(np.transpose(np.nonzero(self.env == self.FROM))) == 1
        ), "Multiple start points found."
        self.env_count: array = np.zeros_like(self.env)

    @property
    def env_shape(self):
        return list(self.env.shape)

    @property
    def start_point_index(self):
        coord = np.transpose(np.nonzero(self.env == self.FROM))[0]
        obs_index = _coord2obs_index(coord, self.env_shape)
        return obs_index

    def reset(self, test_mode: bool = False) -> array:
        if test_mode:
            return self.start_point_index
        coords = np.transpose(np.nonzero(self.env != self.DEAD))
        obs_index = _coord2obs_index(
            coords[np.random.choice(len(coords))], self.env_shape
        )
        return obs_index

    def step(
        self,
        observation: array,
        action: array,
    ) -> Dict[str, array | int | bool]:
        cur_obs_coord = _obs_index2_coord(observation, self.env_shape)
        self.env_count[cur_obs_coord[0], cur_obs_coord[1]] += 1
        movement = _action_index2movement(action, 2)
        env_shape: array = np.array(self.env_shape, dtype=np.int64)
        upper_bound: array = env_shape - 1
        lower_bound: array = np.zeros_like(upper_bound, dtype=np.int64)
        next_obs_coord: array = np.clip(
            cur_obs_coord + movement, lower_bound, upper_bound, dtype=np.int64
        )
        next_pos_state = self.env[next_obs_coord[0], next_obs_coord[1]]
        if next_pos_state == self.DEST or next_pos_state == self.DEAD:
            self.env_count[next_obs_coord[0], next_obs_coord[1]] += 1
        if next_pos_state == self.DEAD:
            reward = -500
        else:
            reward = -1
        transition = dict(
            cur_observation=observation,
            cur_action=action,
            next_observation=_coord2obs_index(next_obs_coord, self.env_shape),
            reward=reward,
            terminated=(next_pos_state == self.DEST or next_pos_state == self.DEAD),
        )
        return transition
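The helpers _coord2obs_index, _obs_index2_coord, and _action_index2movement are omitted from the post; they are simple index/coordinate conversions. A minimal sketch (simplified, not the exact code I use; the action ordering matches the printed distribution below) looks like:

import numpy as np

def _coord2obs_index(coord, env_shape):
    # row-major flattening, e.g. (2, 0) in a 3 * 3 grid -> 6
    return int(np.ravel_multi_index(tuple(int(c) for c in coord), tuple(env_shape)))

def _obs_index2_coord(obs_index, env_shape):
    # inverse mapping, e.g. 6 in a 3 * 3 grid -> array([2, 0])
    return np.array(np.unravel_index(int(obs_index), tuple(env_shape)), dtype=np.int64)

def _action_index2movement(action, ndim):
    # 0: up, 1: down, 2: left, 3: right (for a 2-D grid)
    movements = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]], dtype=np.int64)
    return movements[int(action)]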
The training process is handled by a runner; here is its train() method.
# runner
def train(self):
    cur_observation = None
    for episode in range(self.start_episode, self.episodes):
        cur_observation = self.env.reset(test_mode=False)
        transitions = dict(
            cur_observation=[],
            cur_action=[],
            next_observation=[],
            reward=[],
            terminated=[],
        )
        episode_return: Dict[str, float] = dict(
            actor_loss=0,
            critic_loss=0,
            entropy_loss=0,
            reward=0,
            coverage=0,
        )
        terminated = False
        for step in range(self.steps):
            cur_action = self.agent.take_action(
                cur_observation,
                episode_index=episode,
                step_index=step,
                logger=self.logger,
                save_dir=self.logger.save_dir,
            )
            cur_transition = self.env.step(cur_observation, cur_action)
            cur_observation = cur_transition["next_observation"]
            terminated = cur_transition["terminated"]
            truncated = step == self.steps - 1
            episode_return["reward"] += cur_transition["reward"]
            for key, item in transitions.items():
                item.append(cur_transition[key])
            if terminated or truncated:
                break
        (
            episode_return["actor_loss"],
            episode_return["critic_loss"],
            episode_return["entropy_loss"],
        ) = self.agent.update(
            transitions, self.logger
        )
2. Experiment results
The following results are for AC with entropy loss, and for AC with entropy loss using GAE.
2.1 AC with entropy loss
2.2 AC with entropy loss using GAE
Agents in both experiments barely learned anything after 300+ episodes, and their loss curves behaved similarly. The entropy loss kept dropping. The fluctuation range of the critic loss shrank, but afterwards it stayed near 1000 and never dropped any further. The actor loss just kept bouncing up and down even after the critic loss became stable.
During evaluation, I printed out the agent's strategy. Both experiments ended up with an all-goes-up strategy:
^ ^ ^
^ ^ ^
^ x ^
and the action distribution is like:
^ v < >
68.7691% 5.5671% 9.6719% 15.9920%
3. My questions
My experiments with AC behaved quite similarly to those with PPO, so I wonder whether anything is wrong with my Actor-Critic structure. As can be seen above, I've tried adding an entropy loss and using GAE instead of the TD error for critic learning, but the action distributions keep ending up as all-goes-up.
Any suggestions are appreciated!
1 Answer
I just solved the problem. I mistakenly set critic_loss to be
critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),
        advantages.detach(),  # notice this line
    )
)
but it should be
critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),
        td_target.detach(),  # notice this line
    )
)
After correcting the loss expression, the agent converged to the safer path after 2000 episodes. In hindsight the fix makes sense: the critic approximates the state value, so it should regress toward the TD target, while the roughly zero-centered GAE advantage is only meant to weight the actor's policy-gradient term.
==== strategy ====
> > v
^ > v
^ x ^