Evolution of Cooperation [5]: Brain Evolution and MARL Reflections

Reading notes on Michael Tomasello’s work, with my own reflections. Translated from Chinese and lightly polished with Claude.

This essay is mostly my own thinking, on two topics:

the debate over what drove primate brain evolution;
how multi-agent reinforcement learning might model the evolutionary mechanism of cooperative behavior.

1. The debate over what drove primate brain evolution

I am of course a complete amateur in anthropology and evolutionary biology, so what follows is only a rough thought.

Tomasello’s research can be placed in a much bigger research landscape (Dunbar, Robin I. M., and Susanne Shultz. “Why are there so many explanations for primate brain evolution?” Philosophical Transactions of the Royal Society B 372.1727 (2017)). In that landscape, the central questions are: why do primates live in groups, and why did they evolve such large brains, with brain size showing a strikingly strong positive correlation with group size? See the Dunbar–Shultz paper for details (recommended reading).

From the broader vantage point, two things about Tomasello’s theory stand out:

Tomasello (hereafter T) mainly thinks about fitness from a foraging angle. But another important motive for group living is predation risk. T does not examine the behavioral patterns of apes, or of primates more broadly, in the context of predator avoidance, nor the potential cooperative motives behind such behavior.
T’s main interest is the cognitive abilities unique to humans, so he focuses on behavioral analyses and comparisons of humans (especially infants) and great apes. He therefore concentrates on late-stage primate evolution, and pays less attention to other primates (e.g., baboons).

These are real blind spots in T’s work. Yet I find T’s research very interesting — in both method and direction — and very meaningful for the study of intelligence:

Dunbar, the first author of the Dunbar–Shultz paper, originated the social brain hypothesis (SBH). As he says, the heart of SBH is this: group living arises because certain ecological challenges must be faced by a group rather than by an individual (“an individual in a group has higher fitness than an individual alone”); and intelligence is not fundamentally determined by group size, but by the social skills an individual evolves in the course of life within a group.

I largely agree with this. Two things follow from it:

The environmental challenge is primary. One has to argue that a particular challenge is more advantageously faced by a group.
A qualitative analysis of behavior and cognitive abilities is necessary. Statistical correlation (e.g., between neocortex size and some behavior) is not the point.

The analysis of cognitive ability is crucial because behavior alone can be deceptive. Recall from Evolution of Cooperation [4] that apes hunting monkeys in coordinated formation does not imply that they have the cognitive capacity to coordinate group behavior. Yet much of the research that links predation risk to group intelligence stops at correlation, with little deep analysis of the cognitive capacity embedded in the group’s anti-predator behavior. (There may be such work, and I should look, but Dunbar does not cite any.) In fact, as Evolution of Cooperation [2] noted, vocalization is part of the primate response to predators, but primate vocalization has no real cognitive scaffolding: it is just a gene-encoded emotional response.

A related question: what cognitive abilities, exactly, do different primate groups evolve in the course of their group lives? “Life processes” include things like (1) the “small-group bonding” produced by grooming; (2) inhabiting different layers of social relations, from close kin to distant acquaintances; (3) attacking another individual while having to consider its status in the wider group, including its ties to individuals not currently visible. The same principle applies: a “life process” does not entail a cognitive capacity. A qualitative analysis of cognitive abilities across different primates, different life processes, and different collective actions is what really matters.

T focuses on qualitative, comparative research and proposes a reasonable environmental challenge that favors group life. His theory is self-consistent. Still, I have one more intuition:

The ape-to-human transition may be special because we truly moved from large-scale collective life to large-scale collective action. The difficulty of such collective action forces many cognitive abilities to be deployed at once for the action to succeed.

So, for intelligence (or for group intelligence), perhaps the complexity of an environment should be defined by the minimum set of cognitive abilities required to act within it. This idea matches Leibo’s view (Joel Leibo, YouTube talk). It also relates, perhaps, to the notion of “horizontal capacity.”

2. Modeling the evolution of cooperative behavior with multi-agent RL

My deepest motivation is not really primate research, but this: what are the unique capabilities that cannot be acquired in a solipsistic way? In other words: if a single agent is already strong enough (or rather, useful enough, since usefulness is not intelligence), does “learning inside a group” add anything that is uniquely intelligent?

You can approach this from many angles — for example, by asking what tasks or environments truly require a group (perhaps the most important first step). But here I want to focus on the angle of multi-agent reinforcement learning algorithms, since that is closest to Tomasello’s work.

T’s key claim from Evolution of Cooperation [3], recapped here:

A person has an individual goal. To achieve it, they may need another person’s participation: a social intention. To make participation possible, they need to form a common conceptual ground with the other person. Common conceptual ground rests on joint attention: wanting the other person and oneself to jointly attend to something (the referential intention), and expecting or directing the other to act accordingly. One route to joint attention is communication. All of this is enveloped in social norms, including the mutual assumption, or norm, of cooperation, which takes hold whenever behavior is made public.

This packs in many “cognitive capabilities” — too many to list. But the most important may include:

Common conceptual ground and recursive belief (shared goal, joint attention, understanding others as agents who act on perception, etc.).
Coordinated action (shared-goal understanding, role understanding, joint attention, communication, etc.).
Identification (recognizing which group one belongs to — including conventions, trust, etc.).

From a modeling perspective, a few principles may be useful:

Both individual rewards and group rewards: Much work focuses on credit assignment in a team-play setting, but that framing always starts from the replay buffer: in state s, action a yields group reward r, so how much did each individual contribute? But individuals do not only contribute to a group reward, and rewards should not exist only at the group level to be split among individuals afterwards. Each individual should have its own reward function, e.g. an intrinsic reward. A group reward is still necessary because in a complex environment there is also a layer of competition and cooperation between groups.
Two critics, group-level and individual-level: A lot of work focuses on asymmetric actor–critic. This makes sense: when modeling coordination, the reward is triggered only when several agents take temporally aligned actions, so the critic can assign credit sensibly only if it sees the global state and global action. A global critic is therefore necessary. Local actors and critics are necessary too, because the world is big (the big world hypothesis): once the global critic has handled the highly abstract state, only zooming in to a more fine-grained local state lets us model individual behavior and reward prediction reasonably. Individual reward can also provide a signal for credit assignment.
Mind of Agent (MoA): Some form of MoA is necessary, or at least some kind of visibility mechanism that lets each agent record the intersection of others’ observation spaces with its own, and that allows asymmetric distinction between agents. Modeling common conceptual ground of course requires a hierarchical structure (again from the big world hypothesis), with each layer representing relational distance and intimacy: pairwise conventions, then group-wide norms, with a prior–posterior relationship between layers.
Reputation modeling: Intuitively the replay buffer is enough, but one might need something like a reward bundle or reward state (Abel, David, et al. “Expressing non-Markov reward to a Markov agent.” Multidisciplinary Conference on Reinforcement Learning and Decision Making, vol. 9, 2022) to preserve the Markov property. The reward state stores the rewards from a particular agent’s past state–action history. And there should be two layers of reputation: person-to-person and person-to-group.
Group-level identification: First, the environment fundamentally needs to introduce group-vs-group competition and cooperation. Group identification might come through observing the shared state space of person-to-person interactions. Intuitively, people prefer people similar to themselves, from the same group. How to model “convention”? “Convention” means that something could be done in an arbitrary way, yet a group consistently chooses one specific way — the learned policy is the convention. The key, then, is using the inconsistency between two individuals’ behavior policies to judge whether they belong to the same group. An intrinsic reward defined over abstract actions in the replay buffer might suffice.
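
To make the first and last principles above a little more concrete, here is a minimal Python sketch. Every name in it is an assumption made for illustration: `mixed_reward`, its `weight` coefficient, `convention_affinity`, the choice of Jensen-Shannon divergence as the measure of "inconsistency" between two policies, and the toy policies themselves. None of this comes from Tomasello or from an existing MARL library; it only shows that the two ideas can be written down in a few lines.

```python
import math
from typing import Dict, Sequence


def mixed_reward(individual: float, group: float, weight: float = 0.5) -> float:
    """Blend an agent's own reward with its group's reward.

    `weight` is a hypothetical mixing coefficient in [0, 1]:
    0 means purely individual, 1 means purely group-level.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * individual + weight * group


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (log base 2) between two action distributions.

    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def convention_affinity(
    policy_a: Dict[str, Sequence[float]],
    policy_b: Dict[str, Sequence[float]],
) -> float:
    """Score in [0, 1] for how similarly two agents act in the states both know.

    Each policy maps a state name to a probability distribution over actions.
    A high score is read here as "probably follows the same convention".
    """
    shared = policy_a.keys() & policy_b.keys()
    if not shared:
        return 0.0  # no common states, so no evidence of a shared convention
    mean_div = sum(js_divergence(policy_a[s], policy_b[s]) for s in shared) / len(shared)
    return 1.0 - mean_div


# Toy example: in the state "greet", actions are [bow, handshake].
alice = {"greet": [0.9, 0.1]}
bob = {"greet": [0.9, 0.1]}    # same convention as alice
carol = {"greet": [0.0, 1.0]}  # a different convention

in_group = convention_affinity(alice, bob)     # identical policies
out_group = convention_affinity(alice, carol)  # diverging policies
```

In this sketch the affinity score could itself serve as an intrinsic reward term passed to `mixed_reward`, which is one way to read the suggestion that an intrinsic reward over abstract actions might suffice. Whether divergence between policies is the right signal, and over which abstraction of actions it should be computed, is left open.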

A separate observation: a lot of work either trains MARL from scratch and looks for emergent behavior, or uses LLMs for simulation (mostly in scenarios that don’t really require cooperation). In-context learning is, of course, a form of learning — the core mechanism is shifting the output distribution. But why not provide the right environment (with person-to-person and group-to-group mixed-motive dynamics), run MARL on LLMs, and watch what emerges? It is messy, sure — but worth doing.
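
For a sense of what "person-to-person and group-to-group mixed-motive dynamics" could look like as a payoff rule, here is a toy sketch of a nested public goods game. The function name `nested_public_goods` and the parameters `multiplier`, `prize`, and `endowment` are hypothetical choices made for this illustration, not a reference environment. Within a group, contributing is individually costly but collectively useful; between groups, the group that contributes most earns a bonus.

```python
from typing import Dict, List


def nested_public_goods(
    contributions: Dict[str, List[int]],
    multiplier: float = 1.6,
    prize: float = 2.0,
    endowment: float = 1.0,
) -> Dict[str, List[float]]:
    """Return per-agent payoffs for one round of a two-level dilemma.

    `contributions` maps a group name to each member's choice (1 = contribute,
    0 = keep the endowment). All parameter values are illustrative assumptions.
    """
    if any(len(members) == 0 for members in contributions.values()):
        raise ValueError("every group needs at least one member")

    totals = {g: sum(members) for g, members in contributions.items()}
    best = max(totals.values())
    winners = [g for g, total in totals.items() if total == best]

    payoffs: Dict[str, List[float]] = {}
    for group, members in contributions.items():
        # Person-to-person level: the pooled contributions are multiplied and
        # shared equally, so free riders benefit as much as contributors.
        share = multiplier * totals[group] / len(members)
        # Group-to-group level: each member of a top-contributing group gets
        # a bonus, split evenly when several groups tie.
        bonus = prize / len(winners) if group in winners else 0.0
        payoffs[group] = [endowment - c + share + bonus for c in members]
    return payoffs


result = nested_public_goods({"red": [1, 1], "blue": [1, 0]})
for group, values in result.items():
    print(group, [round(v, 2) for v in values])
```

With these default numbers, the free rider in the "blue" group earns more than the contributing teammate, while every member of the "red" group earns more than anyone in "blue". That tension between the individual incentive and the group incentive is what makes the setting mixed-motive, and any learner (an LLM-based policy included) could be dropped in as the source of the contribution choices.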

My views keep changing. The content above is valid as of May 27, 2026.

Additional reference: Herrmann, Esther, et al. “Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis.” Science 317.5843 (2007): 1360–1366.