Actor-Critic behaved strangely on cliff walking

0. Background

I was trying to train an AC-based agent for a task with a large observation space. The task is similar to a huge cliff-walk task: the agent starts at a random point on a 20 * 20 * 5 * 9 * 9 grid world and needs to reach a target point. Every grid point is associated with a score in [0, 1], and the target points are defined to be those with a score higher than 0.2.

I first tried PPO, but the agent could not learn anything and the loss curves looked quite strange to me. To check whether something was wrong with my algorithm, I tried vanilla Actor-Critic on a small cliff-walking task.

1. Task settings

The cliff-walk environment is a 3 * 3 grid world, where FROM is the starting point, DEAD is the cliff, and DEST is the destination.

import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

episodes = 1000
steps = 200

FREE = 0
FROM = 1
DEAD = 2
DEST = 3

cliff_env = [
    [FREE, FREE, FREE],
    [FREE, FREE, FREE],
    [FROM, DEAD, DEST],
]

env = dict(
    type="CliffWalkActorCriticEnv",
    env=cliff_env,
    device=DEVICE,
)
agent_cfg = dict(
    type="CliffWalkActorCritic",
    device=DEVICE,
    gamma=0.98,
    actor_cfg=dict(
        device=DEVICE,
        state_dim=9,
        hidden_dim=128,
        action_dim=4,
    ),
    critic_cfg=dict(
        device=DEVICE,
        state_dim=9,
        hidden_dim=128,
    ),
    actor_lr=1e-4,
    critic_lr=1e-3,
    entropy_loss_coef=0.01,
    advantage_coef=0.95,
)

I encode the states (observations) as one-hot vectors. The actor and critic networks are both 3-layer fully connected networks.
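For example, state index 6 on the 3 * 3 grid becomes a length-9 one-hot vector (a minimal illustration, not part of my code):

import torch
import torch.nn.functional as F

obs_index = torch.tensor([6])
one_hot = F.one_hot(obs_index, num_classes=9).float()
# tensor([[0., 0., 0., 0., 0., 0., 1., 0., 0.]])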

import torch
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Linear, Module


class Actor(Module):
    def __init__(self, device, state_dim, hidden_dim, action_dim):
        super().__init__()
        self.device = torch.device(device)

        self.linear = Linear(state_dim, hidden_dim, device=self.device)
        self.mid = Linear(hidden_dim, hidden_dim, device=self.device)
        self.out = Linear(hidden_dim, action_dim, device=self.device)

    def forward(self, input: Tensor):
        obs_indices = input.type(torch.int64)
        x = (
            F.one_hot(obs_indices, num_classes=self.linear.in_features)
            .type(torch.float)
            .view(-1, self.linear.in_features)
        )

        x = F.relu(self.linear(x))
        x = F.relu(self.mid(x))

        out = F.softmax(self.out(x), dim=1)

        return out


class Critic(Module):
    def __init__(self, device, state_dim, hidden_dim):
        super().__init__()
        self.device = torch.device(device)

        self.linear = Linear(state_dim, hidden_dim, device=self.device)
        self.mid = Linear(hidden_dim, hidden_dim, device=self.device)
        self.out = Linear(hidden_dim, 1, device=self.device)

    def forward(self, input: Tensor):
        obs_indices = input.type(torch.int64)
        x = (
            F.one_hot(obs_indices, num_classes=self.linear.in_features)
            .type(torch.float)
            .view(-1, self.linear.in_features)
        )

        x = F.relu(self.linear(x))
        x = F.relu(self.mid(x))

        out = self.out(x)

        return out

A couple of tricks are used in the agent. An entropy loss is added to regularize the shape of the action distributions, and GAE is used to estimate the critic target in order to reduce bias.
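For reference, the GAE estimate I intend to compute (with $\gamma$ = gamma and $\lambda$ = advantage_coef) is the standard truncated one:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta_{t+l}$$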

@AGENTS.register_module()
class CliffWalkActorCritic(BaseAgent):
    FREE = 0
    FROM = 1
    DEAD = 2
    DEST = 3

    def __init__(
        self,
        device: str,
        gamma: float,
        actor_cfg: Dict,
        critic_cfg: Optional[Dict] = None,
        actor_lr: float = 1e-4,
        critic_lr: float = 1e-3,
        entropy_loss_coef: float = 0.01,
        advantage_coef: float = 0.95,
        test_mode: bool = False,
    ):
        super().__init__(device, gamma)

        self.test_mode = test_mode
        self.actor: Module = Actor(**actor_cfg).to(self.device)

        self.entropy_loss_coef = entropy_loss_coef
        self.advantage_coef = advantage_coef

        self.critic: Module = Critic(**critic_cfg).to(self.device)

        self.lr = [actor_lr, critic_lr]
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=self.lr[0])
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=self.lr[1])

        self.start_episode = 0

    def take_action(self, observation, **kwargs):
        if self.test_mode:
            return self.test_take_action(observation, **kwargs)
        else:
            return self.train_take_action(observation, **kwargs)

    def train_take_action(self, observation, **kwargs):
        with torch.no_grad():
            obs: Tensor = torch.tensor(
                observation, dtype=torch.float, device=self.device
            ).view(-1, 1)

            action_probs: Tensor = self.actor(obs).squeeze(0)
            action_dist = torch.distributions.Categorical(action_probs)

            action_index: int = action_dist.sample().type(torch.int64).item()

        return action_index

    def test_take_action(self, observation, **kwargs):
        with torch.no_grad():
            obs: Tensor = torch.tensor(
                observation, dtype=torch.float, device=self.device
            ).view(-1, 1)

            action_dist: Tensor = self.actor(obs).squeeze(0)

            action_index: int = action_dist.argmax().type(torch.int64).item()

        return action_index

    def update(self, transitions: dict, **kwargs):
        cur_observations: Tensor = torch.tensor(
            transitions["cur_observation"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        cur_actions: Tensor = torch.tensor(
            transitions["cur_action"], dtype=torch.int64, device=self.device
        ).view(-1, 1)
        next_observations: Tensor = torch.tensor(
            transitions["next_observation"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        rewards: Tensor = torch.tensor(
            transitions["reward"], dtype=torch.float, device=self.device
        ).view(-1, 1)
        terminated: Tensor = torch.tensor(
            transitions["terminated"], dtype=torch.float, device=self.device
        ).view(-1, 1)

        td_target: Tensor = rewards + self.gamma * self.critic(next_observations) * (
            1 - terminated
        )
        td_error: Tensor = td_target - self.critic(cur_observations)
        advantages: Tensor = _compute_advantage(
            self.gamma, self.advantage_coef, td_error, self.device
        )

        log_probs: Tensor = torch.log(
            torch.gather(
                self.actor(cur_observations),
                dim=1,
                index=cur_actions,
            )
        )

        actor_loss: Tensor = torch.mean(-log_probs * advantages.detach())
        critic_loss: Tensor = torch.mean(
            F.mse_loss(self.critic(cur_observations), advantages.detach())
        )
        entropy_loss: Tensor = torch.mean(
            torch.distributions.Categorical(
                self.actor(cur_observations)
            ).entropy()
        )

        loss: Tensor = (
            actor_loss + 0.5 * critic_loss - self.entropy_loss_coef * entropy_loss
        )

        self.actor_opt.zero_grad()
        self.critic_opt.zero_grad()
        loss.backward()
        self.actor_opt.step()
        self.critic_opt.step()

        return actor_loss.item(), critic_loss.item(), entropy_loss.item()

This is how I implemented the GAE computation.

import numpy as np
import torch
from torch import Tensor


def _compute_advantage(
    gamma: float, advantage_coef: float, td_error: Tensor, device: torch.device
):
    td_error = td_error.clone().detach().cpu().numpy()
    advantage_list = []
    advantage = 0.0

    for delta in td_error[::-1]:
        advantage = gamma * advantage_coef * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()

    return torch.tensor(
        np.array(advantage_list), dtype=torch.float, device=device
    )
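As a quick sanity check (toy values, not from my experiments), the recursion behaves as expected:

import torch

gamma, lam = 0.98, 0.95
deltas = torch.tensor([[1.0], [1.0], [1.0]])

adv = _compute_advantage(gamma, lam, deltas, torch.device("cpu"))

# A_t = delta_t + gamma * lam * A_{t+1}, accumulated back to front
expected = torch.tensor([
    [1.0 + gamma * lam * (1.0 + gamma * lam * 1.0)],
    [1.0 + gamma * lam * 1.0],
    [1.0],
])
assert torch.allclose(adv, expected)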

I wrote the environment myself. Every step gets a -1 reward, and falling off the cliff gets -500. An episode ends when 200 steps have been taken, the agent falls off the cliff, or it reaches DEST. During training the agent starts at a random non-DEAD point; during evaluation it starts at FROM.

@ENVIRONMENTS.register_module()
class CliffWalkActorCriticEnv:
    INT = torch.int64
    FLOAT = torch.float

    FREE = 0
    FROM = 1
    DEAD = 2
    DEST = 3

    def __init__(
        self,
        env: List[List[int]],
        device: Optional[str],
        save_path: Optional[str] = None,
    ):
        self.device = torch.device(device)
        self.env: array = np.array(env, dtype=np.int64)
        assert (
            len(np.transpose(np.nonzero(self.env == self.FROM))) == 1
        ), "Multiple start points found."
        self.env_count: array = np.zeros_like(self.env)

    @property
    def env_shape(self):
        return list(self.env.shape)

    @property
    def start_point_index(self):
        coord = np.transpose(np.nonzero(self.env == self.FROM))[0]

        obs_index = _coord2obs_index(coord, self.env_shape)

        return obs_index

    def reset(self, test_mode: bool = False) -> array:
        if test_mode:
            return self.start_point_index

        coords = np.transpose(np.nonzero(self.env != self.DEAD))

        obs_index = _coord2obs_index(
            coords[np.random.choice(len(coords))], self.env_shape
        )

        return obs_index

    def step(
        self,
        observation: array,
        action: array,
    ) -> Dict[str, array | int | bool]:
        cur_obs_coord = _obs_index2_coord(observation, self.env_shape)
        self.env_count[cur_obs_coord[0], cur_obs_coord[1]] += 1
        movement = _action_index2movement(action, 2)

        env_shape: array = np.array(self.env_shape, dtype=np.int64)
        upper_bound: array = env_shape - 1
        lower_bound: array = np.zeros_like(upper_bound, dtype=np.int64)

        next_obs_coord: array = np.clip(
            cur_obs_coord + movement, lower_bound, upper_bound, dtype=np.int64
        )
        next_pos_state = self.env[next_obs_coord[0], next_obs_coord[1]]
        if next_pos_state == self.DEST or next_pos_state == self.DEAD:
            self.env_count[next_obs_coord[0], next_obs_coord[1]] += 1

        if next_pos_state == self.DEAD:
            reward = -500
        else:
            reward = -1

        transition = dict(
            cur_observation=observation,
            cur_action=action,
            next_observation=_coord2obs_index(next_obs_coord, self.env_shape),
            reward=reward,
            terminated=(next_pos_state == self.DEST or next_pos_state == self.DEAD),
        )

        return transition
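The helpers _coord2obs_index, _obs_index2_coord, and _action_index2movement are used above but not shown. A minimal sketch of what they are assumed to do (row-major flattening of the grid, and an up/down/left/right movement table) is:

import numpy as np

def _coord2obs_index(coord, env_shape):
    # Flatten a (row, col) coordinate into a single observation index (row-major).
    return int(np.ravel_multi_index(coord, env_shape))

def _obs_index2_coord(obs_index, env_shape):
    # Inverse of _coord2obs_index.
    return np.array(np.unravel_index(int(obs_index), env_shape), dtype=np.int64)

def _action_index2movement(action, ndim):
    # Map an action index to a movement vector; the order (up, down, left, right)
    # matches the arrow order used later in this post and is an assumption.
    assert ndim == 2
    movements = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]], dtype=np.int64)
    return movements[int(action)]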

The training process is done by a runner, and here is the train() method.

# runner

    def train(self):
        cur_observation = None

        for episode in range(self.start_episode, self.episodes):
            cur_observation = self.env.reset(test_mode=False)

            transitions = dict(
                cur_observation=[],
                cur_action=[],
                next_observation=[],
                reward=[],
                terminated=[],
            )

            episode_return: Dict[str, float] = dict(
                actor_loss=0,
                critic_loss=0,
                entropy_loss=0,
                reward=0,
                coverage=0,
            )
            terminated = False

            for step in range(self.steps):
                cur_action = self.agent.take_action(
                    cur_observation,
                    episode_index=episode,
                    step_index=step,
                    logger=self.logger,
                    save_dir=self.logger.save_dir,
                )

                cur_transition = self.env.step(cur_observation, cur_action)

                cur_observation = cur_transition["next_observation"]
                terminated = cur_transition["terminated"]
                truncated = step == self.steps - 1

                episode_return["reward"] += cur_transition["reward"]

                for key, item in transitions.items():
                    item.append(cur_transition[key])

                if terminated or truncated:
                    break

            (
                episode_return["actor_loss"],
                episode_return["critic_loss"],
                episode_return["entropy_loss"],
            ) = self.agent.update(transitions, logger=self.logger)

2. Experiment results

The following results are for AC with entropy loss (2.1) and AC with entropy loss plus GAE (2.2).

2.1 AC with entropy loss

2.2 AC with entropy loss and GAE

The agents in both experiments barely learned anything after 300+ episodes, and their loss curves behaved similarly. The entropy loss kept dropping. The fluctuation range of the critic loss shrank, after which it stayed near 1000 and never dropped further. The actor loss just bounced up and down, even after the critic loss became stable.

During evaluation, I printed out the agent's strategy. Both experiments ended up with an all-goes-up strategy:

^  ^  ^  
^  ^  ^  
^  x  ^ 

and the action distribution looks like this:

    ^         v         <         >    
68.7691%  5.5671%  9.6719%  15.9920%  
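For reference, the strategy grid is produced with roughly the following sketch; the exact printing code is not shown above, and the arrow order and the _coord2obs_index helper are assumptions consistent with the rest of the post:

import numpy as np

ARROWS = ["^", "v", "<", ">"]

def print_strategy(env, agent):
    rows = []
    for i in range(env.env_shape[0]):
        row = []
        for j in range(env.env_shape[1]):
            if env.env[i, j] == env.DEAD:
                row.append("x")
                continue
            obs = _coord2obs_index(np.array([i, j]), env.env_shape)
            row.append(ARROWS[agent.test_take_action(obs)])
        rows.append("  ".join(row))
    print("==== strategy ====")
    print("\n".join(rows))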

3. My questions

My AC experiments behaved quite similarly to the PPO ones, so I wonder whether anything is wrong with my Actor-Critic setup. As shown above, I've tried adding an entropy loss and using GAE instead of the TD error for critic learning, but the action distributions just keep ending up as all-goes-up.

Any suggestions are appreciated!


1 Answer

I just solved the problem. I mistakenly set critic_loss to be

critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),
        advantages.detach(),  # notice this line
    )
)

but it should be

critic_loss: Tensor = torch.mean(
    F.mse_loss(
        self.critic(cur_observations),
        td_target.detach(),  # notice this line
    )
)
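In other words, the critic approximates the state value, so its regression target should be the (detached) TD target computed above, not the GAE advantage estimate:

$$y_t = r_t + \gamma \, (1 - d_t) \, V_\phi(s_{t+1}), \qquad \mathcal{L}_{\text{critic}} = \big(V_\phi(s_t) - y_t\big)^2$$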

After correcting the loss expression, the agent converged to the safer path after 2000 episodes.

==== strategy ====
>  >  v  
^  >  v  
^  x  ^  
