Multi-agent Papers

There is a question on my mind: does surviving in a society push agents to become more intelligent? To explore this, I’m recording my paper-reading notes on multi-agent learning here. Maybe the next step is to enable learning inside a society.

1. Conceptual Frameworks

  1. Language Model Teams as Distributed Systems. Distributed systems serve as a conceptual baseline for interpreting the failure modes and trade-offs in MAS.
  2. Collective cooperative intelligence. MARL is engineering-driven (build it, then explain it), while CSS (complex systems science) is science-driven (explain it, then intervene).

2. Papers

2.1 Multi-agent Reinforcement Learning in Sequential Social Dilemmas

This work (arXiv:1702.03037) trains two agents in an environment and observes whether they cooperate or defect under different environment configurations.

Two games:

  1. Gathering: Two players move on a 2D gridworld to collect apples, each worth a reward of $1$. Collected apples respawn after $N_{\text{apple}}$ frames, which controls scarcity. Players have 8 actions: move forward / back / left / right, rotate left / right, fire beam, stand still. The beam gives no reward but tags any player it hits twice, removing them from the game for $N_{\text{tagged}}$ frames. Defection corresponds to aggressive beam use to monopolize apples; cooperation corresponds to peaceful gathering. Beam-use rate serves as a social metric of aggressiveness.
  2. Wolfpack: Two wolves chase a prey on a 2D gridworld. When either wolf touches the prey, all wolves within the capture radius share a reward. Lone capture yields $r_{\text{lone}}$; joint capture (both wolves within radius) yields the higher $r_{\text{team}}$. Cooperation corresponds to coordinated hunting; defection corresponds to lone-wolf capture.

The result is intuitive. In the gathering game, when resource redundancy is high (small $N_{\text{apple}}$) or the cost of being tagged is small (small $N_{\text{tagged}}$), agents learn to be more cooperative. In the wolfpack game, when the advantage of joint capture is high (large $r_{\text{team}} / r_{\text{lone}}$) or the difficulty of joint capture is low (large radius), agents learn to be more cooperative.

This simulates the learning dynamics of social behavior even without setting an intrinsic “persona” for an agent. The emergent behavior pattern is purely driven by the environment configuration.

2.1.1 Some side points made by this paper
  1. A game can be classified using the sequential social dilemma (SSD) framework. Take the gathering game as an example. We can use the policy learned in an environment with high resource redundancy and low tag cost as a cooperative policy $\pi^C$, and obtain a defective policy $\pi^D$ in the contrasting environment. Rolling out the four scenarios $(\pi^C, \pi^C)$, $(\pi^C, \pi^D)$, $(\pi^D, \pi^C)$, $(\pi^D, \pi^D)$ gives an empirical payoff in the form of a prisoner’s-dilemma-style matrix. The idea of (1) classifying agent behavior with a metric, (2) using a learned policy as the instantiation of an agent with a specific persona, and (3) rolling out agents of different personas to obtain an empirical payoff that indicates the structure of the game is very interesting.
  2. Some hyperparameters change agent behavior. Two examples. (1) A higher reward discount factor leads to less myopic agents, which can encourage defection in the gathering game (beaming delays reward). (2) A larger replay buffer leads to more refined evasive behavior in the gathering game, since the agent has more chances to model its opponent’s beaming behavior. Maybe these things should not be treated as hyperparameters but as learnable parameters (learn to be myopic / learn to memorize more).
  3. Games can also be classified by their default behavior. In the gathering game, cooperation is the default and defection is harder to learn. In the wolfpack game, defection is the default and cooperation is harder to learn. So the game structure provides an implicit bias on agent behavior.
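
The empirical-payoff idea in point 1 can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names and the rollout numbers are hypothetical, and the inequalities are the usual matrix-game social dilemma conditions on R (mutual cooperation), P (mutual defection), S (sucker) and T (temptation), assumed here rather than quoted from these notes.

```python
from statistics import mean

def empirical_payoffs(returns):
    """returns maps (row_policy, col_policy) -> list of row-player episode returns.
    Policies are labelled "C" (cooperative) or "D" (defective)."""
    R = mean(returns[("C", "C")])  # reward for mutual cooperation
    S = mean(returns[("C", "D")])  # sucker: cooperate against a defector
    T = mean(returns[("D", "C")])  # temptation: defect against a cooperator
    P = mean(returns[("D", "D")])  # punishment for mutual defection
    return R, P, S, T

def is_social_dilemma(R, P, S, T):
    """Assumed conditions: mutual cooperation is preferred, yet greed or fear pulls away from it."""
    prefers_mutual_coop = R > P and R > S and 2 * R > T + S
    greed = T > R  # exploiting a cooperator beats cooperating with them
    fear = P > S   # mutual defection beats being exploited
    return prefers_mutual_coop and (greed or fear)

# Hypothetical rollout returns, invented for illustration only.
rollouts = {
    ("C", "C"): [3, 3, 3],
    ("C", "D"): [0, 1, -1],
    ("D", "C"): [5, 4, 6],
    ("D", "D"): [1, 1, 1],
}
R, P, S, T = empirical_payoffs(rollouts)
print(R, P, S, T, is_social_dilemma(R, P, S, T))
```

With real rollouts, the returns would come from running the two learned policies against each other in the gridworld; the classification step stays the same.
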
2.1.2 Learning algorithm (Q-Learning)

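The loss is the squared TD error of agent $i$:

$$L_i = \mathbb{E}_{(s, a, r_i, s') \sim \mathcal{D}_i}\left[\left(r_i + \gamma \max_{a'} Q_i(s', a') - Q_i(s, a)\right)^2\right]$$

A replay buffer $\mathcal{D}_i$ with transitions $(s, a, r_i, s')$ is maintained. Actions are sampled as follows:

$$\pi_i(s) = \begin{cases} \arg\max_{a} Q_i(s, a) & \text{with probability } 1 - \epsilon \\ \mathcal{U}(\mathcal{A}_i) & \text{with probability } \epsilon \end{cases}$$

where $\mathcal{U}(\mathcal{A}_i)$ is a uniform distribution over agent $i$'s actions, used to encourage exploration.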
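
A minimal tabular sketch of the two ingredients, assuming epsilon-greedy sampling and a squared TD error. The names `epsilon_greedy` and `td_loss` are hypothetical, the Q-function is a plain dictionary instead of a network, and terminal states are ignored for brevity.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: list of Q(s, a), one entry per action a."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # uniform exploration
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy action

def td_loss(q, q_target, batch, gamma):
    """Mean squared TD error over a batch of (s, a, r, s_next) transitions.
    q and q_target map a state to its list of action values."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target[s_next])  # bootstrapped target
        total += (target - q[s][a]) ** 2
    return total / len(batch)

# Toy values, invented for illustration only.
q = {"s0": [0.0, 1.0], "s1": [2.0, 0.0]}
replay_batch = [("s0", 1, 1.0, "s1")]
print(td_loss(q, q, replay_batch, gamma=0.5))
print(epsilon_greedy(q["s0"], epsilon=0.0))
```

Each agent would keep its own `q` and its own buffer, which is what makes the other agent part of the (non-stationary) environment.
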

2.2 Social Influence as Intrinsic Motivation for MARL

This paper (arXiv:1810.08647) explores an intrinsic reward of social influence in the MARL setting, defined as having a causal influence over other agents’ actions.

2.2.1 Definition of social influence and its connection with mutual information

Social influence reward for agent $k$ on agent $j$:

$$c_t^{k \to j} = D_{KL}\left[\, p(a_t^j \mid a_t^k, s_t^j) \,\|\, p(a_t^j \mid s_t^j) \,\right]$$

where the marginal is

$$p(a_t^j \mid s_t^j) = \sum_{\tilde{a}_t^k} p(a_t^j \mid \tilde{a}_t^k, s_t^j)\, p(\tilde{a}_t^k \mid s_t^j)$$

Intuition: If $k$’s specific action $a_t^k$ shifts $j$’s distribution far from the average-over-$k$’s-actions baseline, then $k$’s action mattered: high social influence from $k$ to $j$, and thus high intrinsic reward for $k$. This is also why the reward equals mutual information in expectation: MI measures how much knowing $a^k$ changes the distribution of $a^j$ relative to not knowing $a^k$ at all (i.e., the marginal). The authors therefore say that social influence can be seen as a social form of empowerment.

In other words, the social influence reward is defined through a counterfactual analysis.
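
The counterfactual computation can be sketched for a single state with discrete actions. This is a minimal illustration under assumed inputs: `policy_k` is the influencer's action distribution, `cond_j[a_k]` is the other agent's action distribution given the influencer's action, and all names and numbers are hypothetical. It also checks numerically that the expected reward equals the mutual information between the two actions.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def marginal(policy_k, cond_j):
    """Average the conditional over the influencer's counterfactual actions."""
    n_j = len(cond_j[0])
    return [sum(policy_k[ak] * cond_j[ak][aj] for ak in range(len(policy_k)))
            for aj in range(n_j)]

def influence_reward(a_k, policy_k, cond_j):
    """KL between the other agent's policy given a_k and its marginal policy."""
    return kl(cond_j[a_k], marginal(policy_k, cond_j))

def mutual_information(policy_k, cond_j):
    marg = marginal(policy_k, cond_j)
    return sum(pk * pj * math.log(pj / marg[aj])
               for ak, pk in enumerate(policy_k)
               for aj, pj in enumerate(cond_j[ak]) if pk > 0 and pj > 0)

# Toy distributions, invented for illustration only.
policy_k = [0.5, 0.5]
cond_j = [[0.9, 0.1], [0.2, 0.8]]
expected_reward = sum(pk * influence_reward(ak, policy_k, cond_j)
                      for ak, pk in enumerate(policy_k))
print(expected_reward, mutual_information(policy_k, cond_j))
```

If the rows of `cond_j` are identical, the other agent ignores the influencer and the reward is zero for every action.
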

2.2.2 Emergent behavior
  1. In a harvest task where agents collect apples, the influencer (the agent that receives the intrinsic reward) learns to stand still when no apple spawns and to move toward apples when they spawn, while other agents move randomly when there is no apple. This indicates that the influencer learns to be both informative and altruistic. “Informative” means the influencer’s behavior pattern indicates whether an apple has spawned. “Altruistic” means the influencer does not convey misleading information. The influencer learns to be informative because it needs intrinsic rewards (implicitly maximizing MI), and it learns to be altruistic because the others receive extrinsic reward and can learn to ignore the influencer if it is not benign (which would give the influencer a low intrinsic reward).
  2. Highly influential moments are sparse. “Because the listener agent is not compelled to listen to any given speaker, listeners selectively listen to a speaker only when it is beneficial, and influence cannot occur all the time.” My further conjecture is that the influencer also learns a behavior pattern that is highly classifiable (i.e., has high MI with some external signal).

The emergence of a communication protocol between agents driven by an intrinsic reward is fascinating. The connection between the social intrinsic reward and the influencer–environment MI / influencer–influencee MI is interesting.

2.3 MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

This paper (arXiv:2602.24188) proposes to evaluate the collaboration capacity of LLMs by checking whether they can efficiently convey information to achieve a common goal. The task is formulated as two agents completing a job by sharing their private information under (1) a per-round token budget and (2) a budget on the number of rounds.

2.3.1 Some thoughts on this paper
  1. Token and round-count budgets are good levers for testing whether agents can communicate sufficiently and efficiently toward a shared goal. Another consideration is to make sure that extracting a single agent message is not itself hard. For instance, an image-selection task is (in my opinion) less effective than a name-game, because an agent may fail to describe an image correctly (so the conveyed information is itself wrong), whereas an agent can describe a selected portion of a table well. The remedy is to (1) use a strong enough model so that message construction is robust, and (2) make message construction an easy task while making which information to include the hard part.
  2. What makes a good conceptual game for evaluating efficient communication? A game that (1) has a known optimal solution (the number of conversation rounds needed) under a budget, (2) can be constructed at scale with precise control over difficulty, and (3) makes single-round message construction easy.
  3. Is a single scalar metric enough for such benchmarks? The qualitative analysis in Section 5 is interesting and cannot be conveyed by a single scalar. See this related position paper.
  4. Sycophantic language-model players might be too willing to accept claims or ideas from the other player, potentially overriding private information that would help them reach a more accurate answer. Sycophancy may be a serious problem in agent-to-agent collaboration and is worth a paper of its own. A related paper is here.
  5. The paper uses two metrics: the content-word ratio penalizes verbosity (models that produce more conversational scaffolding — acknowledgments, hedges — score lower regardless of informational value), and the novelty component penalizes repetition across turns. The result shows no correlation between these metrics and task success rate. This suggests that current LLMs can generate content-rich utterances but struggle to deploy that content strategically in service of collaborative goals.
  6. Tracking the metrics of human dialogue when humans solve the same problems would also be an interesting direction.

2.4 CooperBench: Why Coding Agents Cannot be Your Teammates Yet

This paper (arXiv:2601.13295) checks whether agents can collaborate to complete coding tasks. The result: GPT-5 and Claude Sonnet 4.5 achieve only 25% success when two agents cooperate, 50% lower than a single agent doing both features alone.

2.4.1 Some thoughts on this paper
  1. The results don’t change even with git. I think agents should be able to see what each other is doing. git shows the results, not what the partner is doing. git is enough for humans because humans complete tasks much more slowly than agents do. A partner’s action is a resource for real-time reasoning. The author says: “agents are trained to verify, but collaboration requires them to trust, and this mismatch may explain why they fail to update their picture of what their partner is doing.” I disagree — just let agents see each other’s actions.
  2. The point above can be extended into a pattern: instead of forcing agents to build an internal MOA (model of the other agent), we should just let the policy condition on the others’ actions.
  3. The task taxonomy should be quantified. How much coordination is needed? To what extent are subtasks independent? This is necessary for a benchmark.
  4. “Cursor reported running hundreds of concurrent agents on a single project by separating them into planners and workers with explicit role hierarchies. This scaffolding is a workaround, not a solution. It places the coordination burden on human developers who must design the right structures.” Nice words.

3. MARL Algorithms

3.1 Value-Decomposition Networks for Cooperative Multi-Agent Learning

This paper (arXiv:1706.05296) tackles credit assignment in MARL with a team reward.

3.1.1 Technical challenges
  1. Lazy agent in centralized training (CT). In CT, you concatenate the agents’ observations into one big observation and treat the joint action as a single action drawn from the product action space $\mathcal{A}_1 \times \cdots \times \mathcal{A}_n$. The team is treated as if it were a single “large” agent. The lazy-agent problem arises when one agent learns a useful policy but a second agent is discouraged from learning, because its exploration would hinder the first agent and lead to a worse team reward.
  2. Spurious reward in decentralized training (DT). For instance, in Fetch, agent 1 receives a team reward when agent 2 drops off an item somewhere off-screen. Agent 1 had nothing to do with it, but the reward arrives during whatever action agent 1 happened to take, so Q-learning incorrectly credits that action. This happens because each agent’s $Q_i$ is trained directly on the team reward as if it were its own reward.
3.1.2 Method

Each agent contributes to the team value through an additive formulation:

$$Q_{tot}(\mathbf{o}, \mathbf{a}) \approx \sum_{i=1}^{n} Q_i(o_i, a_i)$$

  1. As long as $\sum_i Q_i = Q_{tot}$, the network is fine, so we have infinitely many valid decompositions. Why does this even work? Because of the SGD process. $Q_i(o_i, a_i)$ will be pushed toward $y - \sum_{j \neq i} Q_j(o_j, a_j)$, where $y$ is the team TD target. Since we sample many values of the other agents’ terms under the same $(o_i, a_i)$, $Q_i$ eventually learns the correct value. Intuitively, training will not be stable until the high-rewarding region of the space is well explored.
  2. Limitation: it can’t represent coordination. In coordination, agents should act in a temporally aligned way. For instance, two agents need to push a heavy cube together. For a value function to capture coordination, it needs to see all actions from the coordinating team, e.g. $Q(s, a_1, a_2)$. This may motivate an asymmetric actor–critic where the critic network can see the global action information.
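
The coordination limitation can be made concrete with a one-step, two-agent payoff table. The sketch below is an illustration of my own, not from the paper: it computes the least-squares additive fit of a joint payoff matrix (row means plus column means minus the grand mean) and reports the largest residual. The function names and the payoff tables are hypothetical.

```python
def best_additive_fit(q):
    """Least-squares fit q[i][j] ~ q1[i] + q2[j] for a joint payoff table."""
    n, m = len(q), len(q[0])
    grand = sum(map(sum, q)) / (n * m)
    row_means = [sum(row) / m for row in q]
    col_means = [sum(q[i][j] for i in range(n)) / n for j in range(m)]
    # Split the grand mean evenly; any other split gives the same sum.
    q1 = [r - grand / 2 for r in row_means]
    q2 = [c - grand / 2 for c in col_means]
    return q1, q2

def max_residual(q):
    """Largest gap between the joint payoff and its additive approximation."""
    q1, q2 = best_additive_fit(q)
    return max(abs(q[i][j] - (q1[i] + q2[j]))
               for i in range(len(q)) for j in range(len(q[0])))

# Coordination game: reward only when both agents pick the same side.
coordination = [[1, 0], [0, 1]]
# Independent contributions: each agent adds its own bonus.
independent = [[0, 1], [2, 3]]

print(best_additive_fit(coordination), max_residual(coordination))
print(best_additive_fit(independent), max_residual(independent))
```

In the coordination game the fitted per-agent values come out flat, so neither agent can tell from its own value which action is good; in the independent game the additive fit is exact.
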
3.1.3 Follow-up works

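QMIX and QTRAN take the essence of value decomposition to be keeping the global $Q_{tot}$ aligned with the per-agent $Q_i$ when selecting actions, and they use more expressive value-mixing function classes (VDN: sum; QMIX: monotonic; etc.) to express this:

$$\arg\max_{\mathbf{a}} Q_{tot}(\mathbf{o}, \mathbf{a}) = \left(\arg\max_{a_1} Q_1(o_1, a_1), \ldots, \arg\max_{a_n} Q_n(o_n, a_n)\right)$$
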
I don’t appreciate these works, because “local = global” is not the essence of VDN.
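
For reference, the "local = global" property these follow-ups target can be checked by brute force on a toy example. This is a sketch under assumptions: the per-agent values and the three mixing functions are made up, and the mixers are plain Python functions rather than the learned networks used in those papers.

```python
from itertools import product

def greedy_joint(per_agent_q, mix):
    """Brute-force argmax of Q_tot = mix([q_1[a_1], ..., q_n[a_n]]) over joint actions."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(joint_actions,
               key=lambda a: mix([q[ai] for q, ai in zip(per_agent_q, a)]))

def greedy_local(per_agent_q):
    """Each agent picks the argmax of its own values."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)

# Toy per-agent values, invented for illustration only.
per_agent_q = [[0.1, 0.7, 0.3], [0.9, 0.2]]

sum_mix = sum                                           # VDN-style sum
monotone_mix = lambda qs: 2.0 * qs[0] + qs[1] ** 3      # increasing in every input
non_monotone_mix = lambda qs: -qs[0] + qs[1]            # decreasing in the first input

print(greedy_local(per_agent_q))
print(greedy_joint(per_agent_q, sum_mix))
print(greedy_joint(per_agent_q, monotone_mix))
print(greedy_joint(per_agent_q, non_monotone_mix))
```

Any mixer that is increasing in every per-agent value keeps the decentralized greedy actions equal to the joint greedy action; the non-monotone mixer breaks that.
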

4. Side Remarks

  1. Current MAS work uses benchmarks that are designed for a single agent, for example arXiv:2604.25917, arXiv:2511.20639, arXiv:2510.08529, and arXiv:2604.00344. This feels fundamentally wrong intuitively, but why?
  2. This paper: (1) Generative collective behavioral outcomes are shaped by the interplay between internalized priors (the “person”-perspective) and the interactive, in-context social-learning phase (the “situation”-perspective). (2) MARL theory assumes that agents start with minimal inductive bias and must learn coordination through exploration and explicit reward shaping, but LLMs are different. (3) Identifying causal pathways is essential for credit assignment in a society.