Multi-agent Papers

There is a question on my mind: does surviving in a society push agents to become more intelligent? To explore this, I’m recording my paper-reading notes on multi-agent learning here. Maybe the next step is to enable learning inside a society.

Multi-agent Reinforcement Learning in Sequential Social Dilemmas (arXiv:1702.03037)

This work trains two agents in an environment and observes whether they cooperate or defect under different environment configurations.

Two games:

Gathering: Two players move on a 2D gridworld to collect apples, each worth a reward of 1. A collected apple respawns after $N_\text{apple}$ frames, which controls scarcity. Players have 8 actions: move forward / back / left / right, rotate left / right, fire beam, stand still. The beam gives no reward, but any player hit by it twice is tagged and removed from the game for $N_\text{tagged}$ frames. Defection corresponds to aggressive beam use to monopolize apples; cooperation corresponds to peaceful gathering. The beam-use rate serves as a social metric of aggressiveness.
Wolfpack: Two wolves chase a prey on a 2D gridworld. When either wolf touches the prey, all wolves within the capture radius receive a reward. A lone capture yields $r_\text{lone}$; a joint capture (both wolves within the radius) yields the higher $r_\text{team}$. Cooperation corresponds to coordinated hunting; defection corresponds to lone-wolf capture.
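
The Wolfpack reward rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper name `wolfpack_rewards`, the parameter names `r_lone` and `r_team`, the use of Manhattan distance, the reading of "touches" as "shares a cell with", and the choice to pay each wolf in the radius the full amount are all assumptions of mine.

```python
from typing import List, Tuple

Pos = Tuple[int, int]


def manhattan(a: Pos, b: Pos) -> int:
    """Grid distance between two cells (assumed metric for the capture radius)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wolfpack_rewards(wolves: List[Pos], prey: Pos, capture_radius: int,
                     r_lone: float, r_team: float) -> List[float]:
    """Reward for each wolf on the current step (hypothetical helper)."""
    # A capture only happens when some wolf is on the prey's cell.
    if not any(w == prey for w in wolves):
        return [0.0] * len(wolves)
    in_radius = [manhattan(w, prey) <= capture_radius for w in wolves]
    # Two or more wolves inside the radius count as a joint capture.
    payout = r_team if sum(in_radius) >= 2 else r_lone
    return [payout if inside else 0.0 for inside in in_radius]
```

With made-up values `r_lone=1.0` and `r_team=5.0`, a wolf that catches the prey while its partner is far away gets 1.0 and the partner gets nothing, while two wolves within the radius each get 5.0. Raising `r_team` or `capture_radius` makes the joint outcome more attractive or easier to reach.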

The result is intuitive. In the gathering game, when resources are abundant (short apple respawn time $N_\text{apple}$) or the cost of being tagged is low (short time-out $N_\text{tagged}$), agents learn to be more cooperative. In the wolfpack game, when the advantage of joint capture is high (large $r_\text{team}$ relative to $r_\text{lone}$) or joint capture is easy (large capture radius), agents learn to be more cooperative.

This simulates the learning dynamics of social behavior even without setting an intrinsic “persona” for an agent. The emergent behavior pattern is purely driven by the environment configuration.

Some side points made by this paper

A game can be classified using the sequential social dilemma (SSD) framework. Take the gathering game as an example. We can take the policy learned in an environment with abundant resources and low tag cost as a cooperative policy $\pi_C$, and the policy learned in the contrasting environment as a defecting policy $\pi_D$. Rolling out the four pairings $(\pi_C, \pi_C)$, $(\pi_C, \pi_D)$, $(\pi_D, \pi_C)$, $(\pi_D, \pi_D)$ gives an empirical payoff matrix in the style of the prisoner’s dilemma. Three ideas here are very interesting: (1) classifying agent behavior with a metric, (2) using a learned policy as the instantiation of an agent with a specific persona, and (3) rolling out agents of different personas to obtain an empirical payoff that reveals the structure of the game.
Some hyperparameters change agent behavior. Two examples. (1) A higher reward discount factor leads to less myopic agents, which can encourage defection in the gathering game (beaming delays reward). (2) A larger replay buffer leads to more refined evasive behavior in the gathering game, since the agent has more chances to model its opponent’s beaming behavior. Maybe these things should not be treated as hyperparameters but as learnable parameters (learn to be myopic / learn to memorize more).
Games can also be classified by their default behavior. In the gathering game, cooperation is the default and defection is harder to learn. In the wolfpack game, defection is the default and cooperation is harder to learn. So the game structure provides an implicit bias on agent behavior.
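
The empirical payoff idea from the first point above can be sketched as follows. The function names and the example numbers are hypothetical. The R / S / T / P labels and the inequalities are the standard matrix-game definition of a social dilemma, which I am assuming is what the SSD framework checks on the empirical payoffs; the episode returns would come from actually rolling out the learned cooperative and defecting policies.

```python
from statistics import mean
from typing import Dict, List, Tuple


def payoff_entries(row_returns: Dict[Tuple[str, str], List[float]]) -> Dict[str, float]:
    """Average the row player's episode returns for each policy pairing."""
    return {
        "R": mean(row_returns[("C", "C")]),  # reward: both cooperate
        "S": mean(row_returns[("C", "D")]),  # sucker: I cooperate, opponent defects
        "T": mean(row_returns[("D", "C")]),  # temptation: I defect, opponent cooperates
        "P": mean(row_returns[("D", "D")]),  # punishment: both defect
    }


def is_social_dilemma(p: Dict[str, float]) -> bool:
    """Check the standard social-dilemma inequalities on the four payoffs."""
    mutual_cooperation_preferred = (
        p["R"] > p["P"] and p["R"] > p["S"] and 2 * p["R"] > p["T"] + p["S"]
    )
    greed = p["T"] > p["R"]  # exploiting a cooperator beats mutual cooperation
    fear = p["P"] > p["S"]   # mutual defection beats being exploited
    return mutual_cooperation_preferred and (greed or fear)


# Made-up returns, for illustration only.
example = {
    ("C", "C"): [3.0, 3.0],
    ("C", "D"): [0.0, 0.0],
    ("D", "C"): [5.0, 5.0],
    ("D", "D"): [1.0, 1.0],
}
print(is_social_dilemma(payoff_entries(example)))
```

The made-up numbers form a prisoner's dilemma (both greed and fear hold), so the check passes. If defection never pays against either opponent, neither greed nor fear holds and the game is not a dilemma.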

Learning algorithm (Q-Learning)

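The loss is the squared temporal-difference error:

$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^2\right]$$

A replay buffer of transitions $(s, a, r, s')$ is maintained. Actions are sampled as follows:

$$\pi(s) = \begin{cases} \arg\max_{a} Q(s, a; \theta) & \text{with probability } 1 - \epsilon \\ \mathcal{U}(\mathcal{A}) & \text{with probability } \epsilon \end{cases}$$

where $\mathcal{U}(\mathcal{A})$ is a uniform distribution over the action set, used to encourage exploration.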
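
The pieces of this section (a replay buffer, a squared TD-error loss, and action sampling that mixes a greedy choice with a uniform one) can be put together in a small sketch. It is tabular and pure Python to stay self-contained, whereas the paper uses a deep Q-network, so treat it as an illustration of the idea only. All names (`ReplayBuffer`, `epsilon_greedy`, `td_loss`, `q_update`) and the values of `gamma`, `lr`, and `epsilon` are assumptions of mine.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng):
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def epsilon_greedy(q, s, epsilon, rng):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    n_actions = len(q[s])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[s][a])


def td_loss(q, batch, gamma):
    """Mean squared TD error over a batch of transitions."""
    errors = [r + gamma * max(q[s_next]) - q[s][a] for s, a, r, s_next in batch]
    return sum(e * e for e in errors) / len(errors)


def q_update(q, batch, gamma, lr):
    """Move each Q(s, a) a step of size lr toward its TD target."""
    for s, a, r, s_next in batch:
        target = r + gamma * max(q[s_next])
        q[s][a] += lr * (target - q[s][a])


rng = random.Random(0)
q = [[0.0, 0.0], [0.0, 0.0]]      # 2 states x 2 actions, all zeros
buf = ReplayBuffer(capacity=10)
buf.add(0, 1, 1.0, 1)             # in state 0, action 1 gave reward 1.0
batch = buf.sample(4, rng)
loss_before = td_loss(q, batch, gamma=0.9)
q_update(q, batch, gamma=0.9, lr=0.5)
loss_after = td_loss(q, batch, gamma=0.9)
```

On this single transition the loss drops after one update, and the greedy action in state 0 becomes action 1. A larger `capacity` keeps older experience around for longer, which is the replay-buffer knob discussed in the side points above.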