Evolution of Cooperation [3]: Joint Attention and Pre-Linguistic Gestures

Reading notes on Michael Tomasello’s work. Translated from Chinese and lightly polished with Claude.

This essay sketches Tomasello’s answer to one question: to what extent do basic human gestures already contain the cognitive infrastructure that modern language requires?

Tomasello identifies two basic human gestures: pointing and pantomiming. By “pantomiming” he means something like this: imagine yourself at a bar with an empty glass in front of you. You mime a drinking motion at the bartender, empty-handed. The bartender pours you a drink.

His conclusion: an analysis of how humans use pointing and pantomime shows that these basic gestures already carry the full cognitive infrastructure needed for using language and constructing social norms.

1. Two basic gestures

In pointing, the referent is typically something both parties can see in their shared visual field — it is deictic.

In pantomime, the referent is typically absent from the current physical scene. The gesture mobilizes the receiver’s imagination to evoke a scene, an event, an object — it is iconic.

Both gestures, Tomasello argues, are pre-linguistic — both ontogenetically and evolutionarily. They are extremely simple, yet in the right context they can convey a great deal of information (consider the bar example). As a rule, the better two people know each other, the less they need to spell out in words: more is left “to be understood.” A natural question follows: how is this possible? What cognitive capacities does it require?

2. The cognition behind gestural communication

2.1 Common conceptual ground

Recall the bicycle example from Evolution of Cooperation [1], which involves pointing. Why is the gesture meaningful? Because (1) in our past shared experience, we both know that the bicycle belongs to my ex-boyfriend, and we both know about the unhappy breakup; and (2) in our current physical space, both of us can see the bicycle. The pointing only directs attention to that one object. The visual field contains many other objects; I can tell that the friend is pointing at the bicycle — and not at something else — only because of our shared experience.

So common conceptual ground has two layers. First, what is immediately perceivable in the shared physical space. Second, the relevant shared experience from our past interactions. In reinforcement-learning terms, the shared physical space maps to the current observation, and the shared past experience maps to the history of previous states. Common ground is thus analogous to the agent’s full state representation, but held jointly by two parties.
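The RL analogy can be made concrete with a small sketch. Everything in it is a hypothetical illustration rather than anything from Tomasello or from an RL library: the `Agent` class, the set-based representation, and the `common_ground` function are assumptions. It treats each agent's state as the union of its current observation and its remembered history, and approximates common ground as the overlap of two such states, ignoring the recursive "we both know that we both know" part.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent whose state is observation plus history."""
    name: str
    observation: set[str] = field(default_factory=set)  # perceivable right now
    history: set[str] = field(default_factory=set)      # remembered past experience

    def state(self) -> set[str]:
        # Full state = current observation together with remembered history.
        return self.observation | self.history


def common_ground(a: Agent, b: Agent) -> set[str]:
    # Simplifying assumption: common ground is what both states contain.
    return a.state() & b.state()


me = Agent(
    "me",
    observation={"bicycle", "lamp post", "cafe"},
    history={"bicycle belongs to ex", "unhappy breakup"},
)
friend = Agent(
    "friend",
    observation={"bicycle", "lamp post", "bus"},
    history={"bicycle belongs to ex", "unhappy breakup", "dentist appointment"},
)

shared = common_ground(me, friend)
```

In this toy version, the bicycle stands out from the lamp post only because it is also tied to items in the shared history, which is the point of the bicycle example.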

Now consider this scene: A and B are friends doing a jigsaw puzzle together. A circles a hole where a piece is missing; B finds the missing piece and hands it to A. Two things are at work when B interprets A’s gesture and acts on it. First, on the basis of a shared goal (completing the puzzle), B infers A’s intention. Second, B is bound by an expectation to help: even if B does not want to, B cannot just walk away in silence. B has to give a reason (“I need a bathroom break”), or A will be baffled.

So common conceptual ground has two further layers. First, a shared goal, and the chain of intention and action that follows from it. Second, social norms: a commitment to the shared goal, an expectation that the partner will help / behave altruistically toward it, and so on. The second point is unmistakable: if you silently stop cooperating or actively sabotage the puzzle, your partner will be shocked. Either the friendship is over, or they will not invite you to do things together again.

Taken together, common conceptual ground is what makes “the unsaid” possible: with little verbal explanation, one gesture is enough — you understand me.

2.2 Social motives, expectations of altruism, and joint attention

Social motives are motives whose fulfillment requires another person. Tomasello identifies three:

Requesting: I point at something, hoping you will hand it to me.
Informing: As you cross the street, I point to a bicycle coming from the side, warning you to be careful.
Sharing: I just saw a concert I loved, and I tell you how excited I am.

Requesting hopes that you will help me. Informing assumes that what I am offering is unknown to you but useful to you. Sharing is expressive — its aim is to share an emotion, a perception, and to expand our shared experience.

How do humans differ from apes here (see Evolution of Cooperation [2])? Apes already understand others as agents with perception and intention, but humans understand others as agents with cooperative intention, and this cooperative expectation has become a social norm — even a kind of implicit assumption. Tomasello develops the point as follows:

When we try to make another person notice or know something, beyond wanting them to attend to “that object,” there is a second-order intention: we want them to notice the very intention that we want them to notice.

And when the other person naturally notices this “I want you to notice,” they seek to discover why I want them to notice — they search for the relevance within our common conceptual ground.

For example, when A points out a tree to B, B not only notices the tree but also recognizes “A wants me to notice this tree.” B then asks: is A informing me that this tree is useful to me, requesting that I notice this tree because A wants me to use it for A’s sake, or sharing some feeling about this tree?

More precisely, the essence of this process is: I want us to know this together. That is, we jointly invoke joint attention.

Joint attention is a core concept. It is the foundation of common ground, and it also makes human communication public, placing behavior in the domain of social norms. To see why “publicness” matters, consider this somewhat involved scenario:

I am invited to a party. There is a drink I love, but out of politeness I do not want to ask the host directly for one. So I place my glass somewhere the host can see it, hoping it will be refilled. If the host never sees that the glass is mine — or I never notice that the host saw it was mine (recall the recursive belief from Evolution of Cooperation [1]) — then a non-refill will not upset me. But if somehow the host sees me set down the glass, and I then realize that the host saw me set it down, and yet the host still does not refill it, I will feel a small slight.

There is a specific term for the behavior in the first scenario: hidden authorship, that is, concealing that I am the author of the signal. The behavior is not made public, joint attention is not established, and the norms of host–guest etiquette are not invoked. In the second scenario, the behavior is public, joint attention is established, and social norms kick in.
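The logic of the two glass scenarios can be written out as a tiny truth-functional sketch. The function names and the two-flag model are hypothetical simplifications introduced here for illustration; the sketch assumes that the act counts as public only when the host saw me set down the glass and I know that the host saw it.

```python
def is_public(host_saw_me: bool, i_know_host_saw: bool) -> bool:
    # Joint attention (in this simplified model) needs both levels:
    # the host saw the act, and I know that the host saw it.
    return host_saw_me and i_know_host_saw


def feels_slighted(host_saw_me: bool, i_know_host_saw: bool, refilled: bool) -> bool:
    # The host-guest norm is only invoked when the act is public,
    # so a non-refill stings only in that case.
    return is_public(host_saw_me, i_know_host_saw) and not refilled


# First scenario: authorship stays hidden, so no norm is invoked.
scenario_one = feels_slighted(host_saw_me=False, i_know_host_saw=False, refilled=False)

# Second scenario: the act is public and the glass stays empty.
scenario_two = feels_slighted(host_saw_me=True, i_know_host_saw=True, refilled=False)
```

Only the second call evaluates to `True`: the same empty glass produces a slight only once the act has entered joint attention.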

And what is the most important social norm humans hold about each other? A norm of cooperation: that within a shared conceptual frame, we act cooperatively. The human expectation of altruism is itself a recursive belief — I know I am cooperative, I know you are cooperative, I know that you know I am cooperative, and so on. The clearest illustration: when a friend points at something, you, drawing on your common ground, infer that they want you to hand it to them. If you don’t, you feel motivated (and they expect you) to explain why.

This is also why, in the bucket-choice case from Evolution of Cooperation [2], humans interpret the pointing as “telling me where the food is hidden”: humans ask themselves “why are you pointing at this bucket — is it useful to me?” because they assume that the pointer is motivated to help them achieve their goal. Apes do not — they read the human as “just another agent commanding me to do something for its purposes.”

3. Summary

A person has an individual goal. To achieve it, they may need another person’s participation, which gives rise to a social intention. To make participation possible, they need to form a common conceptual ground with the other person. Common conceptual ground rests on joint attention: wanting the other person and oneself to jointly attend to something that ought to be attended to (the referential intention), and expecting or directing the other to act accordingly. One route to joint attention is communication (the communicative intention). All of this is enveloped in social norms, including a mutual assumption, or norm, of cooperation, which takes hold whenever behavior is made public.

These are the cognitive processes that human language needs in order to work. They are already in place in the most basic gestures — pointing and pantomime.