Meta-Learning and Reward Learning Algorithms

1. Meta-Learning

1.1 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks [Link]

The goal is to learn a model that can adapt to a new task from only a few examples, as quickly as possible. The meta-learning mechanism should be model-agnostic so that it applies to any model trained with gradient descent.

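The first step is a general task formulation. $\mathcal{T} = \{\mathcal{L}(x_1, a_1, \ldots, x_H, a_H),\ q(x_1),\ q(x_{t+1} \mid x_t, a_t),\ H\}$ is a task, where $\mathcal{L}$ is the loss function, $q(x_1)$ is the initial state distribution, $q(x_{t+1} \mid x_t, a_t)$ is the transition distribution, and $H$ is the episode length. In i.i.d. supervised learning problems, $H = 1$.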

The “task” of meta-learning is to learn a new task well as fast as possible. More formally, for a task $\mathcal{T}_i$, after the network $f_\theta$ is trained on few-shot data from $\mathcal{T}_i$ to become $f_{\theta_i'}$, the test error of $f_{\theta_i'}$ on held-out data from $\mathcal{T}_i$ should be low. Therefore, the per-task test error serves as the training error of the meta-learning problem.

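The update rule of meta-learning is $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$, where $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$. The former is the meta-learning formula and the latter is the few-shot adaptation formula. The meta-learning rule in words is “how to change parameter $\theta$ to make the test error on $\mathcal{T}_i$ as small as possible using few data examples”.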
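To make the two nested updates concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration rather than the paper's setup: the task family is linear regression with a task-specific weight vector, the model is linear so the second-order term can be written analytically, and the names `sample_task`, `maml_meta_gradient`, and `meta_train` are hypothetical. `alpha` is the adaptation step size and `beta` is the meta step size.

```python
import numpy as np

def loss_grad_hess(theta, X, y):
    """Mean squared error of a linear model, with its gradient and Hessian."""
    n = len(y)
    residual = X @ theta - y
    loss = 0.5 * residual @ residual / n
    grad = X.T @ residual / n
    hess = X.T @ X / n
    return loss, grad, hess

def maml_meta_gradient(theta, task, alpha):
    """Test loss after one adaptation step, and its gradient w.r.t. the initial theta."""
    X_tr, y_tr, X_te, y_te = task
    _, g_tr, H_tr = loss_grad_hess(theta, X_tr, y_tr)
    theta_adapted = theta - alpha * g_tr              # few-shot adaptation step
    test_loss, g_te, _ = loss_grad_hess(theta_adapted, X_te, y_te)
    # Chain rule: d(theta_adapted)/d(theta) = I - alpha * H_tr (symmetric here).
    meta_grad = (np.eye(len(theta)) - alpha * H_tr) @ g_te
    return test_loss, meta_grad

def sample_task(rng, dim=3, n_train=5, n_test=20):
    """Hypothetical task family: noise-free linear regression with task-specific weights."""
    w = rng.normal(loc=1.0, scale=0.5, size=dim)
    X_tr = rng.normal(size=(n_train, dim))
    X_te = rng.normal(size=(n_test, dim))
    return X_tr, X_tr @ w, X_te, X_te @ w

def meta_train(steps=200, tasks_per_step=8, alpha=0.1, beta=0.05, seed=0):
    """Repeat the meta-update; returns the final theta and the post-adaptation test losses."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)
    history = []
    for _ in range(steps):
        results = [maml_meta_gradient(theta, sample_task(rng), alpha)
                   for _ in range(tasks_per_step)]
        history.append(np.mean([loss for loss, _ in results]))
        theta = theta - beta * np.mean([g for _, g in results], axis=0)  # meta-update
    return theta, history
```

The meta-gradient differentiates through the adaptation step, which is why the Hessian of the training loss appears. With a deep network this term would normally come from automatic differentiation instead of a closed form.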

1.2 On First-Order Meta-Learning Algorithms [Link]

The paper asks what meta-learning algorithms do in essence. Its answer is that (1) these algorithms optimize for within-task generalization and (2) they optimize generalization by aligning gradients between mini-batches. The baselines are MAML and first-order MAML, and the authors propose a new algorithm, Reptile, that is simple yet captures the “essence”.

1.2.1 Algorithmic foundation

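Let $\phi$ be the model parameter vector to be optimized and $U$ be an update function; e.g., SGD gives $U(\phi) = \phi - \alpha g$, where $g$ is the gradient. Assume the task is $\tau$ and the training/test datasets of $\tau$ are $D^{\tau}_{\text{train}}$ and $D^{\tau}_{\text{test}}$. The gradient for meta-learning is $g_{\text{MAML}} = \frac{\partial}{\partial \phi} L\big(U(\phi; D^{\tau}_{\text{train}}); D^{\tau}_{\text{test}}\big) = U'(\phi; D^{\tau}_{\text{train}})\, L'(\tilde{\phi}; D^{\tau}_{\text{test}})$ with $\tilde{\phi} = U(\phi; D^{\tau}_{\text{train}})$, where $L$ is the loss function. In short, meta-learning in this form optimizes for within-task generalization that can be reached as quickly as possible.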

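First-order meta-learning ignores the Jacobian of the update function, $U'(\phi)$, and computes the gradient for meta-learning as $g_{\text{FOMAML}} = L'(\tilde{\phi})$, the test-loss gradient evaluated at the adapted parameter $\tilde{\phi} = U(\phi)$. That is, when considering how changes in the parameter $\phi$ influence within-task generalization, it assumes $U'(\phi) = I$.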
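A short worked case may help show what is being dropped. Assume, purely for illustration, that the update function is a single SGD step on the training loss with step size $\alpha$, and write $H_{\text{train}}(\phi)$ for the Hessian of the training loss (a symbol introduced here, not taken from the notes above):

```latex
\begin{aligned}
U(\phi) &= \phi - \alpha\, L'_{\text{train}}(\phi), \qquad \tilde{\phi} = U(\phi), \\
U'(\phi) &= I - \alpha\, H_{\text{train}}(\phi), \\
g_{\text{MAML}} &= U'(\phi)\, L'_{\text{test}}(\tilde{\phi})
  = L'_{\text{test}}(\tilde{\phi}) - \alpha\, H_{\text{train}}(\phi)\, L'_{\text{test}}(\tilde{\phi}), \\
g_{\text{FOMAML}} &= L'_{\text{test}}(\tilde{\phi}).
\end{aligned}
```

Under this assumption the two gradients differ only by the term $\alpha\, H_{\text{train}}(\phi)\, L'_{\text{test}}(\tilde{\phi})$, so the first-order version avoids computing any second derivatives.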

The proposed Reptile algorithm goes one step further: it treats a stream of mini-batches as a stream of training/test set constructions. Thus, it does three things iteratively: (1) sample a task $\tau$ and $k$ mini-batches from $\tau$; (2) do $k$ SGD steps on these batches, starting from $\phi$, and get $\tilde{\phi}$; (3) update $\phi$ according to $\phi \leftarrow \phi + \epsilon(\tilde{\phi} - \phi)$.
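The three steps can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's experimental setup: the task family is 1-D linear regression with task-specific slope and intercept, the update that moves the parameters a fraction `epsilon` toward the result of `k` inner SGD steps is the standard Reptile rule, and the names `sample_task`, `inner_sgd`, and `reptile` are hypothetical.

```python
import numpy as np

def sample_task(rng):
    """Hypothetical task family: y = a * x + b with task-specific (a, b)."""
    a, b = rng.normal(2.0, 0.3), rng.normal(-1.0, 0.3)

    def sample_batch(n=10):
        x = rng.uniform(-1.0, 1.0, size=n)
        return x, a * x + b

    return sample_batch

def grad(phi, batch):
    """Gradient of 0.5 * mean((w * x + c - y)^2) for phi = (w, c)."""
    x, y = batch
    residual = phi[0] * x + phi[1] - y
    return np.array([np.mean(residual * x), np.mean(residual)])

def inner_sgd(phi, sample_batch, k, alpha):
    """Step (2): k SGD steps, each on a fresh mini-batch from the same task."""
    phi_tilde = phi.copy()
    for _ in range(k):
        phi_tilde = phi_tilde - alpha * grad(phi_tilde, sample_batch())
    return phi_tilde

def reptile(meta_steps=500, k=5, alpha=0.1, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.zeros(2)
    for _ in range(meta_steps):
        sample_batch = sample_task(rng)                      # step (1)
        phi_tilde = inner_sgd(phi, sample_batch, k, alpha)   # step (2)
        phi = phi + epsilon * (phi_tilde - phi)              # step (3)
    return phi
```

No test set and no second derivative appear anywhere: the only signal is where plain SGD ends up after a few steps on the sampled task.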

1.2.2 Analysis

The main result of the paper’s analysis is that MAML, first-order MAML, and Reptile all achieve the effect of “meta-learning” by aligning gradients across mini-batches. See page 6 of the paper for the Taylor expansion analysis.

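The main idea is to expand $g_i = L_i'(\phi_i)$ at $\phi_1$, where $\phi_1$ is the initial parameter, $\phi_i$ is the parameter after being updated on $i-1$ mini-batches, and $L_i(\phi)$ is the loss of the $i$-th mini-batch with model parameter $\phi$.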

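Take Reptile as an example. The result is that the expected gradients of all three algorithms can be split into two parts: AvgGrad and AvgGradInner. $\text{AvgGrad} = \mathbb{E}_{\tau,1}[\bar{g}_1]$ and $\text{AvgGradInner} = \mathbb{E}_{\tau,1,2}[\bar{H}_2 \bar{g}_1] = \frac{1}{2}\,\mathbb{E}_{\tau,1,2}\big[\frac{\partial}{\partial \phi_1}(\bar{g}_1 \cdot \bar{g}_2)\big]$. Here, the subscripts $\tau$ and $1, 2$ on the expectation operator denote averaging over the task choice and over the first and second mini-batch choices, respectively. $\bar{g}_i$ means $L_i'(\phi_1)$, that is, the gradient of the $i$-th mini-batch under the initial model parameter $\phi_1$, and $\bar{H}_i = L_i''(\phi_1)$ is the corresponding Hessian. For Reptile with two inner steps, $\mathbb{E}[g_{\text{Reptile}}] = 2\,\text{AvgGrad} - \alpha\,\text{AvgGradInner} + O(\alpha^2)$.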

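AvgGradInner is the steepest-ascent direction for the inner product $\bar{g}_1 \cdot \bar{g}_2$, i.e., the direction that best aligns $\bar{g}_1$ and $\bar{g}_2$. It enters the expected gradient with a negative sign and the gradient is used for descent, so each update moves the parameters toward better-aligned mini-batch gradients. This alignment allows within-task generalization intuitively, because if the model is trained on mini-batch 1, then the resulting model “looks” as if it were trained on mini-batch 2, and thus can achieve good performance on mini-batch 2 even though the model is not trained on it.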
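The expansion can be checked numerically on a case where it is exact. The sketch below assumes each mini-batch loss is a quadratic, so the Hessian is constant and the Taylor series stops after the first-order term. The quadratic losses and all function names are hypothetical, and `g_bar` / `H_bar` denote the gradient and Hessian evaluated at the initial parameter.

```python
import numpy as np

def random_quadratic(rng, dim):
    """Hypothetical mini-batch loss L(phi) = 0.5 * (phi - c)^T A (phi - c), A symmetric PD."""
    M = rng.normal(size=(dim, dim))
    A = M @ M.T + np.eye(dim)
    c = rng.normal(size=dim)
    return A, c

def gradient(phi, quad):
    """Gradient of the quadratic loss at phi; its Hessian is the constant matrix A."""
    A, c = quad
    return A @ (phi - c)

def two_step_gradient_sum(phi1, quad1, quad2, alpha):
    """Sum of the two SGD gradients, and its expansion around the initial phi1."""
    g1 = gradient(phi1, quad1)
    phi2 = phi1 - alpha * g1                 # parameter after the first mini-batch
    g2 = gradient(phi2, quad2)               # gradient actually used in the second step
    g1_bar, g2_bar = gradient(phi1, quad1), gradient(phi1, quad2)
    H2_bar = quad2[0]
    expansion = g1_bar + g2_bar - alpha * H2_bar @ g1_bar
    return g1 + g2, expansion

def alignment_gradient(phi1, quad1, quad2):
    """Gradient of the inner product g1_bar . g2_bar with respect to phi1."""
    g1_bar, g2_bar = gradient(phi1, quad1), gradient(phi1, quad2)
    return quad1[0] @ g2_bar + quad2[0] @ g1_bar
```

For general losses the two returned vectors of `two_step_gradient_sum` would agree only up to terms of order `alpha**2`. The extra term `-alpha * H2_bar @ g1_bar` is one half of `alignment_gradient` once the roles of the two mini-batches are averaged, which is where the gradient-alignment reading comes from.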

1.2.3 Thoughts

SGD on batched data is implicitly doing gradient alignment across mini-batches within the same task. The difference between SGD and Reptile is visualized below. To what extent does optimization on batched data achieve generalization? What if we align gradients on data from multiple tasks rather than a single task? Reptile is not model-agnostic; how can it be made to work for RL?

Figure 1: SGD vs. Reptile

2. Meta-Gradient RL and Reward Learning

2.1 On Learning Intrinsic Rewards for Policy Gradient Methods [LIRPG]

The philosophy of the learned intrinsic reward in this work: the ultimate measure of performance we care about improving is the value of the extrinsic rewards achieved by the agent; the intrinsic rewards serve only to influence the change in policy parameters.

The algorithm:

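The goal of step 1 is to “understand” how a change in the intrinsic-reward parameters $\eta$ can lead to a change in the extrinsic objective $J^{ex}$. The key is to get $\nabla_\eta J^{ex}$ via the chain rule $\nabla_\eta J^{ex} = \nabla_{\theta'} J^{ex} \, \nabla_\eta \theta'$, where $\theta'$ denotes the policy parameters after an update that uses the intrinsic reward.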
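The chain rule behind this meta-gradient can be illustrated on a problem small enough to differentiate by hand. The sketch below is not the paper's algorithm, which works with sampled trajectories. It assumes a one-step bandit with a softmax policy, exact expected returns, and an intrinsic reward that is simply one learnable number per arm. The names `policy_update` and `intrinsic_meta_gradient` are hypothetical.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_return_grad(theta, reward):
    """Exact gradient of J(theta) = sum_a pi_theta(a) * reward[a] for a softmax policy."""
    pi = softmax(theta)
    return pi * (reward - pi @ reward)

def policy_update(theta, eta, r_ex, alpha):
    """theta' = theta + alpha * grad_theta J^{ex+in}(theta; eta)."""
    return theta + alpha * expected_return_grad(theta, r_ex + eta)

def extrinsic_return(theta, r_ex):
    """J^ex(theta): expected extrinsic reward under the softmax policy."""
    return softmax(theta) @ r_ex

def intrinsic_meta_gradient(theta, eta, r_ex, alpha):
    """grad_eta J^ex(theta') via the chain rule through the policy update."""
    pi = softmax(theta)
    theta_new = policy_update(theta, eta, r_ex, alpha)
    dJex_dtheta_new = expected_return_grad(theta_new, r_ex)
    # eta enters the update linearly, so d(theta')/d(eta) = alpha * (diag(pi) - pi pi^T).
    dtheta_new_deta = alpha * (np.diag(pi) - np.outer(pi, pi))
    return dtheta_new_deta.T @ dJex_dtheta_new
```

A gradient-ascent step `eta + step_size * intrinsic_meta_gradient(...)` then changes the intrinsic reward so that the next policy update earns more extrinsic reward, which matches the role the notes give to the intrinsic reward: it only influences how the policy parameters change.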

The same idea is used in different settings: credit assignment in MARL, and learning to reward others for cooperation.

2.2 What Can Learned Intrinsic Rewards Capture? [Link]

This is an interesting paper. The rationale for distinguishing extrinsic and intrinsic reward is a deep question: “extrinsic rewards define the task and capture the designer’s preferences over agent behavior, whereas intrinsic rewards serve as helpful signals to improve the learning dynamics of the agent”.

The setting of intrinsic reward is philosophically worth thinking about. A method like this is asking “should I explore object A, or object B?”, whereas a policy is asking “what action sequence should I emit to explore object A?”. Therefore, this method is more “meta”:

  1. It is defined on the trajectory: $\tau_t$ is the trajectory (the history up to time $t$) and $r_\eta(\tau_t)$ is the intrinsic reward parametrized by $\eta$.
  2. The intrinsic reward is the only thing that can be “inherited” across a lifetime. The learning setting is illustrated in Figure 2. The goal of the intrinsic reward is to enable a randomly initialized policy to learn from online experience to maximize its extrinsic reward across a lifetime. In the authors’ words: “the objective of our method is to learn knowledge that is useful for training randomly-initialised policies by capturing what to do”.

Figure 2: The lifetime learning setting of intrinsic reward

2.2.1 Experiments

Several things to notice:

  1. The paper mainly focuses on exploration-exploitation.
  2. The intrinsic reward captures “meta knowledge”: it defines a general guidance of the whole behavior pattern — when to explore, when to exploit, what objects/places are worth exploring, etc.

Experiment settings (interesting):