Actor-Critic behaved strangely on cliff walking

0. Background

I was trying to train an AC-based agent for a task with a large observation space. The task is similar to a huge cliff-walk task: the agent starts at a random point on a 20 * 20 * 5 * 9 * 9 grid world and needs to reach a target point. Every grid point is associated with a score in [0, 1], and the target points are defined to be those with a score higher than 0.2.

I first tried PPO, but the agent could not learn anything and the loss curves looked quite strange to me. To check whether something was wrong with my algorithm, I tried vanilla Actor-Critic on a small cliff-walking task.

1. Task settings

The cliff-walk environment is a 3 * 3 grid world, where FROM is the starting point, DEAD is the cliff, and DEST is the destination.

import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

episodes = 1000
steps = 200

FREE = 0
FROM = 1
DEAD = 2
DEST = 3

cliff_env = [
    [FREE, FREE, FREE],
    [FREE, FREE, FREE],
    [FROM, DEAD, DEST],
]

env = dict(
    type="CliffWalkActorCriticEnv",
    env=cliff_env,
    device=DEVICE,
)
agent_cfg = dict(
    type="CliffWalkActorCritic",
    device=DEVICE,
    gamma=0.98,
    actor_cfg=dict(
        device=DEVICE,
        state_dim=9,
        hidden_dim=128,
        action_dim=4,
    ),
    critic_cfg=dict(
        device=DEVICE,
        state_dim=9,
        hidden_dim=128,
    ),
    actor_lr=1e-4,
    critic_lr=1e-3,
    entropy_loss_coef=0.01,
    advantage_coef=0.95,
)

I encode the states (observations) as one-hot vectors. The actor and critic networks are both 3-layer fully connected networks.
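For example, state index 6 on the 3 * 3 grid becomes a length-9 one-hot vector (a minimal illustration, not part of my code):

import torch
import torch.nn.functional as F

obs_index = torch.tensor([6])
one_hot = F.one_hot(obs_index, num_classes=9).float()
# tensor([[0., 0., 0., 0., 0., 0., 1., 0., 0.]])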

import torch
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Linear, Module


class Actor(Module):
    def __init__(self, device, state_dim, hidden_dim, action_dim):
        super().__init__()
        self.device = torch.device(device)

        self.linear = Linear(state_dim, hidden_dim, device=self.device)
        self.mid = Linear(hidden_dim, hidden_dim, device=self.device)
        self.out = Linear(hidden_dim, action_dim, device=self.device)

    def forward(self, input: Tensor):
        obs_indices = input.type(torch.int64)
        x = (
            F.one_hot(obs_indices, num_classes=self.linear.in_features)
            .type(torch.float)
            .view(-1, self.linear.in_features)
        )

        x = F.relu(self.linear(x))
        x = F.relu(self.mid(x))

        out = F.softmax(self.out(x), dim=1)

        return out


class Critic(Module):
    def __init__(self, device, state_dim, hidden_dim):
        super().__init__()
        self.device = torch.device(device)

        self.linear = Linear(state_dim, hidden_dim, device=self.device)
        self.mid = Linear(hidden_dim, hidden_dim, device=self.device)
        self.out = Linear(hidden_dim, 1, device=self.device)

    def forward(self, input: Tensor):
        obs_indices = input.type(torch.int64)
        x = (
            F.one_hot(obs_indices, num_classes=self.linear.in_features)
            .type(torch.float)
            .view(-1, self.linear.in_features)
        )

        x = F.relu(self.linear(x))
        x = F.relu(self.mid(x))

        out = self.out(x)

        return out

A couple of tricks are used in the agent. An entropy loss is added to regularize the shape of the action distributions, and GAE is used to estimate the critic target in order to reduce bias.
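For reference, the GAE estimate I intend to compute (with $\gamma$ = gamma and $\lambda$ = advantage_coef) is the standard truncated one:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta_{t+l}$$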

@AGENTS.register_module()
class CliffWalkActorCritic(BaseAgent):
    FREE = 0
    FROM = 1
    DEAD = 2
    DEST = 3

    def __init__(
        self,
        device: str,
        gamma: float,
        actor_cfg: Dict,
        critic_cfg: Optional[Dict] = None,
        actor_lr: float = 1e-4,
        critic_lr: float = 1e-3,
        entropy_loss_coef: float = 0.01,
        advantage_coef: float = 0.95,
        test_mode: bool = False,
    ):
        super().__init__(device, gamma)

        self.test_mode = test_mode
        self.actor: Module = Actor(**actor_cfg).to(self.device)

        self.entropy_loss_coef = entropy_loss_coef
        self.advantage_coef = advantage_coef

        self.critic: Module = Critic(**critic_cfg).to(self.device)

        self.lr = [actor_lr, critic_lr]
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=self.lr[0])
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=self.lr[1])

        self.start_episode = 0

    def take_action(self, observation, **kwargs):
        if self.test_mode:
            return self.test_take_action(observation, **kwargs)
        else:
            return self.train_take_action(observation, **kwargs)

    def train_take_action(self, observation, **kwargs):
        with torch.no_grad():
            obs: Tensor = torch.tensor(
                observation, dtype=torch.float, device=self.device
            ).view(-1, 1)

            action_probs: Tensor = self.actor(obs).squeeze(0)
            action_dist = torch.distributions.Categorical(action_probs)

            action_index: int = action_dist.sample().type(torch.int64).item()

        return action_index

    def test_take_action(self, observation, **kwargs):
        with torch.no_grad():
            obs: Tensor = torch.tensor(
                observation, dtype=torch.float, device=self.device
            ).view(-1, 1)

            action_dist: Tensor = self.actor(obs).squeeze(0)

            action_index: int = action_dist.argmax().type(torch.int64).item()

        return action_index

    def update(self, transitions: dict, **kwargs):
        cur_observations: Tensor = torch.tensor(
            transitions["cur_observation"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        cur_actions: Tensor = torch.tensor(
            transitions["cur_action"], dtype=torch.int64, device=self.device
        ).view(-1, 1)
        next_observations: Tensor = torch.tensor(
            transitions["next_observation"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        rewards: Tensor = torch.tensor(
            transitions["reward"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        terminated: Tensor = torch.tensor(
            transitions["terminated"], dtype=torch.float, device=self.device
        ).view(-1, 1)

        td_target: Tensor = rewards + self.gamma * self.critic(next_observations) * (
            1 - terminated
        )
        td_error: Tensor = td_target - self.critic(cur_observations)
        advantages: Tensor = _compute_advantage(
            self.gamma, self.advantage_coef, td_error, self.device
        )

        log_probs: Tensor = torch.log(
            torch.gather(
                self.actor(cur_observations),
                dim=1,
                index=cur_actions,
            )
        )

        actor_loss: Tensor = torch.mean(-log_probs * advantages.detach())
        critic_loss: Tensor = torch.mean(
            F.mse_loss(self.critic(cur_observations), advantages.detach())
        )
        entropy_loss: Tensor = torch.mean(
            torch.distributions.Categorical(
                self.actor(cur_observations)
            ).entropy()
        )

        loss: Tensor = (
            actor_loss + 0.5 * critic_loss - self.entropy_loss_coef * entropy_loss
        )

        self.actor_opt.zero_grad()
        self.critic_opt.zero_grad()
        loss.backward()
        self.actor_opt.step()
        self.critic_opt.step()

        return actor_loss.item(), critic_loss.item(), entropy_loss.item()

This is how I implemented the GAE computation.

import numpy as np
import torch
from torch import Tensor


def _compute_advantage(
    gamma: float, advantage_coef: float, td_error: Tensor, device: torch.device
):
    td_error = td_error.clone().detach().cpu().numpy()
    advantage_list = []
    advantage = 0.0

    for delta in td_error[::-1]:
        advantage = gamma * advantage_coef * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()

    return torch.tensor(
        np.array(advantage_list), dtype=torch.float, device=device
    )
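As a quick sanity check (toy values, not from my experiments), the recursion behaves as expected:

import torch

gamma, lam = 0.98, 0.95
deltas = torch.tensor([[1.0], [1.0], [1.0]])

adv = _compute_advantage(gamma, lam, deltas, torch.device("cpu"))

# A_t = delta_t + gamma * lam * A_{t+1}, accumulated back to front
expected = torch.tensor([
    [1.0 + gamma * lam * (1.0 + gamma * lam * 1.0)],
    [1.0 + gamma * lam * 1.0],
    [1.0],
])
assert torch.allclose(adv, expected)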

I wrote the environment myself. Every step gets a -1 reward, and falling off the cliff gets -500. An episode ends when 200 steps have been taken, the agent falls off the cliff, or it reaches DEST. During training the agent starts at a random non-DEAD point; during evaluation it starts at FROM.

@ENVIRONMENTS.register_module()
class CliffWalkActorCriticEnv:
    INT = torch.int64
    FLOAT = torch.float

    FREE = 0
    FROM = 1
    DEAD = 2
    DEST = 3

    def __init__(
        self,
        env: List[List[int]],
        device: Optional[str],
        save_path: Optional[str] = None,
    ):
        self.device = torch.device(device)
        self.env: array = np.array(env, dtype=np.int64)
        assert (
            len(np.transpose(np.nonzero(self.env == self.FROM))) == 1
        ), "Multiple start points found."
        self.env_count: array = np.zeros_like(self.env)

    @property
    def env_shape(self):
        return list(self.env.shape)

    @property
    def start_point_index(self):
        coord = np.transpose(np.nonzero(self.env == self.FROM))[0]

        obs_index = _coord2obs_index(coord, self.env_shape)

        return obs_index

    def reset(self, test_mode: bool = False) -> array:
        if test_mode:
            return self.start_point_index

        coords = np.transpose(np.nonzero(self.env != self.DEAD))

        obs_index = _coord2obs_index(
            coords[np.random.choice(len(coords))], self.env_shape
        )

        return obs_index

    def step(
        self,
        observation: array,
        action: array,
    ) -> Dict[str, array | int | bool]:
        cur_obs_coord = _obs_index2_coord(observation, self.env_shape)
        self.env_count[cur_obs_coord[0], cur_obs_coord[1]] += 1
        movement = _action_index2movement(action, 2)

        env_shape: array = np.array(self.env_shape, dtype=np.int64)
        upper_bound: array = env_shape - 1
        lower_bound: array = np.zeros_like(upper_bound, dtype=np.int64)

        next_obs_coord: array = np.clip(
            cur_obs_coord + movement, lower_bound, upper_bound, dtype=np.int64
        )
        next_pos_state = self.env[next_obs_coord[0], next_obs_coord[1]]
        if next_pos_state == self.DEST or next_pos_state == self.DEAD:
            self.env_count[next_obs_coord[0], next_obs_coord[1]] += 1

        if next_pos_state == self.DEAD:
            reward = -500
        else:
            reward = -1

        transition = dict(
            cur_observation=observation,
            cur_action=action,
            next_observation=_coord2obs_index(next_obs_coord, self.env_shape),
            reward=reward,
            terminated=(next_pos_state == self.DEST or next_pos_state == self.DEAD),
        )

        return transition
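The helpers _coord2obs_index, _obs_index2_coord, and _action_index2movement are used above but not shown. A minimal sketch of what they are assumed to do (row-major flattening of the grid, and an up/down/left/right movement table) is:

import numpy as np

def _coord2obs_index(coord, env_shape):
    # Flatten a (row, col) coordinate into a single observation index (row-major).
    return int(np.ravel_multi_index(coord, env_shape))

def _obs_index2_coord(obs_index, env_shape):
    # Inverse of _coord2obs_index.
    return np.array(np.unravel_index(int(obs_index), env_shape), dtype=np.int64)

def _action_index2movement(action, ndim):
    # Map an action index to a movement vector; the order (up, down, left, right)
    # matches the arrow order used later in this post and is an assumption.
    assert ndim == 2
    movements = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]], dtype=np.int64)
    return movements[int(action)]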

The training process is done by a runner, and here is the train() method.

# runner

    def train(self):
        cur_observation = None

        for episode in range(self.start_episode, self.episodes):
            cur_observation = self.env.reset(test_mode=False)

            transitions = dict(
                cur_observation=[],
                cur_action=[],
                next_observation=[],
                reward=[],
                terminated=[],
            )

            episode_return: Dict[str, float] = dict(
                actor_loss=0,
                critic_loss=0,
                entropy_loss=0,
                reward=0,
                coverage=0,
            )
            terminated = False

            for step in range(self.steps):
                cur_action = self.agent.take_action(
                    cur_observation,
                    episode_index=episode,
                    step_index=step,
                    logger=self.logger,
                    save_dir=self.logger.save_dir,
                )

                cur_transition = self.env.step(cur_observation, cur_action)

                cur_observation = cur_transition["next_observation"]
                terminated = cur_transition["terminated"]
                truncated = step == self.steps - 1

                episode_return["reward"] += cur_transition["reward"]

                for key, item in transitions.items():
                    item.append(cur_transition[key])

                if terminated or truncated:
                    break

            (
                episode_return["actor_loss"],
                episode_return["critic_loss"],
                episode_return["entropy_loss"],
            ) = self.agent.update(transitions, logger=self.logger)

2. Experiment results

The following results are for AC with entropy loss (2.1) and AC with entropy loss plus GAE (2.2).

2.1 AC with entropy loss

2.2 AC with entropy loss and GAE

The agents in both experiments barely learned anything after 300+ episodes, and their loss curves behaved similarly. The entropy loss kept dropping. The fluctuation range of the critic loss shrank, after which it stayed near 1000 and never dropped further. The actor loss just bounced up and down, even after the critic loss became stable.

During evaluation, I printed out the agent's strategy. Both experiments ended up with an all-goes-up strategy:

^  ^  ^  
^  ^  ^  
^  x  ^ 

and the action distribution looks like this:

    ^         v         <         >    
68.7691%  5.5671%  9.6719%  15.9920%  
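For reference, the strategy grid is produced with roughly the following sketch; the exact printing code is not shown above, and the arrow order and the _coord2obs_index helper are assumptions consistent with the rest of the post:

import numpy as np

ARROWS = ["^", "v", "<", ">"]

def print_strategy(env, agent):
    rows = []
    for i in range(env.env_shape[0]):
        row = []
        for j in range(env.env_shape[1]):
            if env.env[i, j] == env.DEAD:
                row.append("x")
                continue
            obs = _coord2obs_index(np.array([i, j]), env.env_shape)
            row.append(ARROWS[agent.test_take_action(obs)])
        rows.append("  ".join(row))
    print("==== strategy ====")
    print("\n".join(rows))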

3. My questions

My AC experiments behaved quite similarly to the PPO ones, so I wonder whether anything is wrong with my Actor-Critic setup. As shown above, I've tried adding an entropy loss and using GAE instead of the TD error for critic learning, but the action distributions just keep ending up as all-goes-up.

Any suggestions are appreciated!


1 Answer

I just solved the problem. I mistakenly set critic_loss to be

critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),
        advantages.detach(),  # notice this line
    )
)

but it should be

critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),
        td_target.detach(),  # notice this line
    )
)
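In other words, the critic approximates the state value, so its regression target should be the (detached) TD target computed above, not the GAE advantage estimate:

$$y_t = r_t + \gamma \, (1 - d_t) \, V_\phi(s_{t+1}), \qquad \mathcal{L}_{\text{critic}} = \big(V_\phi(s_t) - y_t\big)^2$$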

After correcting the loss expression, the agent converged to the safer path after 2000 episodes.

==== strategy ====
>  >  v  
^  >  v  
^  x  ^  
