Meta-Learning and Reward Learning Algorithms

1. Meta-Learning

1.1 Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks [Link]

The goal is to learn a good weight initialization such that it is ready to be fine-tuned on any task. A good property is adapting to a task in a few-shot learning setting as fast as possible. This meta-learning mechanism should be designed in a model-agnostic way so that it is general.

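The first step is a general task formulation. $\mathcal{T} = \{\mathcal{L}(x_1, a_1, \dots, x_H, a_H),\ q(x_1),\ q(x_{t+1} \mid x_t, a_t),\ H\}$ is a task, where $\mathcal{L}$ is the loss, $q(x_1)$ is the initial state distribution, $q(x_{t+1} \mid x_t, a_t)$ is the transition distribution, and $H$ is the episode length. In i.i.d. supervised learning problems, $H = 1$.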

The “task” of meta-learning is to learn a new task well as fast as possible. More formally, it means that for a task $\mathcal{T}_i$, after the network $f_\theta$ is trained on few-shot data to become $f_{\theta_i'}$, the test error on test data of $\mathcal{T}_i$ should be low. Therefore, the test error is the training error of the meta-learning problem.

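The update rule of meta-learning is $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$, where $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$. The former is the meta-learning formula and the latter is the few-shot adaptation formula. The meta-learning rule in words is “how to change parameter $\theta$ to make the test error on $\mathcal{T}_i$ as small as possible using few data examples”.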
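The two-level update above can be made concrete on a toy problem. The sketch below is not the paper's setup: it assumes scalar tasks with loss $L_c(\theta) = \frac{1}{2}(\theta - c)^2$, reuses the same loss for adaptation and evaluation, and computes the second-order term by hand. The names `maml_meta_gradient`, `train_maml`, and `task_centers` are made up for illustration.

```python
import numpy as np

def maml_meta_gradient(theta, c, alpha):
    """Exact MAML meta-gradient for the toy loss L_c(theta) = 0.5 * (theta - c)**2."""
    inner_grad = theta - c                      # gradient of L_c at theta
    theta_adapted = theta - alpha * inner_grad  # few-shot adaptation step
    outer_grad = theta_adapted - c              # gradient of L_c at the adapted parameter
    # Chain rule through the adaptation step: d(theta_adapted)/d(theta) = 1 - alpha.
    return (1.0 - alpha) * outer_grad

def train_maml(task_centers, alpha=0.1, beta=0.5, steps=200, theta0=0.0):
    """Meta-train the initialization theta by averaging meta-gradients over tasks."""
    theta = theta0
    for _ in range(steps):
        meta_grad = np.mean([maml_meta_gradient(theta, c, alpha) for c in task_centers])
        theta = theta - beta * meta_grad        # meta-learning update
    return theta

task_centers = np.array([-1.0, 0.0, 4.0])      # hypothetical task optima
theta_star = train_maml(task_centers)
print(theta_star)
```

With quadratic tasks of identical curvature, the learned initialization lands at the average of the task optima; richer task families are where the initialization becomes more interesting.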

1.2 On First-Order Meta-Learning Algorithms [Link]

The paper thinks about “what in essence meta-learning algorithms do”. The answer of this paper is that (1) these algorithms optimize for within-task generalization and (2) they optimize generalization by aligning gradients between mini-batches. The baselines are MAML and first-order MAML, and the authors propose a new algorithm, Reptile, that is simple yet captures the “essence”.

1.2.1 Algorithmic foundation

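Let $\phi$ be the model parameter vector to be optimized and $U$ be an update function, e.g., one SGD step gives $U(\phi) = \phi - \alpha g$, where $g$ is the gradient and $\alpha$ is the step size. Assume the task is $\tau$ and the training/test datasets of $\tau$ are $A$ and $B$. The gradient for meta-learning is $g_{\text{MAML}} = \frac{\partial}{\partial \phi} L_{\tau,B}(U_{\tau,A}(\phi)) = U'_{\tau,A}(\phi)\, L'_{\tau,B}(\tilde\phi)$, where $L$ is the loss function and $\tilde\phi = U_{\tau,A}(\phi)$. In short, meta-learning in this form is optimizing for within-task generalization that can be learned as quickly as possible.

First-order meta-learning ignores $U'_{\tau,A}(\phi)$, and computes the gradient for meta-learning as $L'_{\tau,B}(\tilde\phi)$. That is, when considering how changes in the parameter $\phi$ influence within-task generalization, it assumes $U'_{\tau,A}(\phi) = I$.

The proposed Reptile algorithm takes one step further: it treats a stream of mini-batches as a stream of training/test set constructions. Thus, it does three things iteratively: (1) sample a task $\tau$ and $k$ mini-batches from $\tau$; (2) do $k$ SGD steps on these batches and get $\tilde\phi = U^k_\tau(\phi)$; (3) update $\phi$ according to $\phi \leftarrow \phi + \epsilon(\tilde\phi - \phi)$.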
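The three Reptile steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: each task is a noisy quadratic $\frac{1}{2}\lVert \phi - c \rVert^2$ with a hypothetical optimum $c$, and `sgd_steps`, `reptile`, `centers`, and all hyperparameter values are invented for the example.

```python
import numpy as np

def sgd_steps(phi, center, k, alpha, rng, noise=0.1):
    """Run k SGD steps on noisy mini-batch gradients of 0.5 * ||phi - center||^2."""
    phi = phi.copy()
    for _ in range(k):
        grad = (phi - center) + noise * rng.standard_normal(phi.shape)
        phi -= alpha * grad
    return phi

def reptile(task_centers, k=5, alpha=0.1, eps=0.02, iters=3000, seed=0):
    """Reptile outer loop: move the initialization toward the task-adapted weights."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(task_centers.shape[1])
    for _ in range(iters):
        center = task_centers[rng.integers(len(task_centers))]  # (1) sample a task
        phi_tilde = sgd_steps(phi, center, k, alpha, rng)        # (2) k SGD steps
        phi += eps * (phi_tilde - phi)                           # (3) interpolate
    return phi

centers = np.array([[-1.0, 0.0], [1.0, 2.0], [3.0, 1.0]])  # hypothetical task optima
phi_final = reptile(centers)
print(phi_final, centers.mean(axis=0))
```

Note that step (3) needs no second derivatives and no train/test split, which is what makes Reptile simpler than MAML.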

1.2.2 Analysis

The main result of the paper’s analysis is that MAML, FO meta-learning, and Reptile all achieve the effect of “meta-learning” by aligning gradients across mini-batches. Go to page 6 of the paper for the Taylor expansion analysis.

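The main idea is to expand $g_i = L_i'(\phi_i)$ at $\phi_1$, where $\phi_i$ is the parameter after being updated on $i-1$ mini-batches and $L_i(\phi)$ is the loss of the $i$-th mini-batch with model parameter $\phi$.

Take Reptile as an example. The result is that the expected gradient on the $i$-th mini-batch can be split into two parts: $\text{AvgGrad}$ and $\text{AvgGradInner}$, with $\text{AvgGrad} = \mathbb{E}_{\tau,1}[\bar g_1]$ and $\text{AvgGradInner} = \mathbb{E}_{\tau,1,2}[\bar H_2 \bar g_1] = \frac{1}{2}\mathbb{E}_{\tau,1,2}\big[\frac{\partial}{\partial \phi_1}(\bar g_1 \cdot \bar g_2)\big]$. Here, the expectation operators $\mathbb{E}_{\tau,1}$ and $\mathbb{E}_{\tau,1,2}$ rule out randomness from the task choice and the first and second mini-batch choices, respectively. $\bar g_i$ means $L_i'(\phi_1)$, that is, the gradient of the $i$-th mini-batch under the initial model parameter $\phi_1$, and $\bar H_i = L_i''(\phi_1)$ is the corresponding Hessian.

$\text{AvgGradInner}$ is the steepest direction for aligning $\bar g_1$ and $\bar g_2$, i.e., for increasing their inner product. This alignment intuitively allows within-task generalization: if the model is trained on mini-batch 1, then the resulting model “looks” as if it were trained on mini-batch 2, and thus can achieve good performance on mini-batch 2 even though the model is not trained on it.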
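For two mini-batches, the expansion can be written out in a few lines. The notation here is assumed to follow the paper's: $\phi_1$ is the initial parameter, $\alpha$ the inner step size, $\bar g_i = L_i'(\phi_1)$, and $\bar H_i = L_i''(\phi_1)$. It is a sketch of the argument, not a full proof.

```latex
\begin{align*}
\phi_2 &= \phi_1 - \alpha \bar g_1, \\
g_2 &= L_2'(\phi_2)
     = L_2'(\phi_1) + L_2''(\phi_1)(\phi_2 - \phi_1) + O(\alpha^2)
     = \bar g_2 - \alpha \bar H_2 \bar g_1 + O(\alpha^2), \\
\mathbb{E}_{\tau,1,2}\!\left[\bar H_2 \bar g_1\right]
    &= \tfrac{1}{2}\,\mathbb{E}_{\tau,1,2}\!\left[\bar H_2 \bar g_1 + \bar H_1 \bar g_2\right]
     = \tfrac{1}{2}\,\mathbb{E}_{\tau,1,2}\!\left[\frac{\partial}{\partial \phi_1}\left(\bar g_1 \cdot \bar g_2\right)\right].
\end{align*}
```

The middle equality in the last line uses the fact that mini-batches 1 and 2 are drawn from the same distribution, so their roles can be swapped inside the expectation.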

1.2.3 Thoughts

SGD on batched data is implicitly doing gradient alignment across mini-batches within the same task. The difference between SGD and Reptile is visualized below. To what extent does optimization on batched data achieve generalization? What if we align gradients on data from multiple tasks rather than a single task? Reptile is not model-agnostic — how to make it work on RL?

1.3 Meta-Learning and Universality [Link]

1.3.1 In theory

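The main result is that gradient-based meta-learning can approximate any learning algorithm. A learning algorithm is defined as a function mapping a training dataset and a test input to a test output: $f_{\text{target}}: (\mathcal{D}, x^\star) \mapsto y^\star$.

Gradient-based meta-learning means that from a model initialized at $\theta$, gradient descent is conducted using $\mathcal{D}$: $\hat y^\star = \hat f(x^\star; \theta')$ with $\theta' = \theta - \alpha \nabla_\theta \mathcal{L}(\mathcal{D}, \theta)$. Another class, RNN-based meta-learning, is seemingly more complicated: it updates a hidden state using $\mathcal{D}$ with an RNN and reads the prediction off it, $\hat y^\star = f_{\text{RNN}}(\mathcal{D}, x^\star; \theta)$.

Therefore, “gradient-based meta-learning can approximate any learning algorithm” means that for every $f_{\text{target}}$ and every $\epsilon > 0$, there exists a neural network $\hat f(\cdot\,; \theta)$ such that, for all $(\mathcal{D}, x^\star)$ in the compact set:

$$\lVert f_{\text{target}}(\mathcal{D}, x^\star) - \hat f(x^\star; \theta') \rVert \le \epsilon,$$

where $\theta'$ is produced by one gradient step:

$$\theta' = \theta - \alpha \nabla_\theta \mathcal{L}(\mathcal{D}, \theta).$$

Here $f_{\text{target}}$ is a learning algorithm. This theorem means that one gradient step on a common starting point can recover any learning algorithm to arbitrary precision. The expressive power of gradient-based meta-learning is very strong theoretically, and we just need to learn/find this good common starting point.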

One more thing: network depth matters for gradient-based meta-learning. A single step of gradient descent on a single layer (for a single example) is just a rank-1 update on the network weight, while the induced update on a product of $N$ weight matrices $W_N \cdots W_1$ can be rank-$N$. Depth in MAML buys expressiveness of the update.
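The rank claim can be checked numerically. The sketch below assumes a linear network, a squared loss on a single example, and random weights; the variable names and sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 5, 0.1
x = rng.standard_normal(d)   # single training input
y = rng.standard_normal(d)   # single training target

# One linear layer: the gradient of 0.5 * ||W x - y||^2 is an outer product.
W = rng.standard_normal((d, d))
delta = W @ x - y
update_single = -alpha * np.outer(delta, x)
rank_single = np.linalg.matrix_rank(update_single, tol=1e-8)

# Two linear layers: prediction is W2 @ W1 @ x.
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
h = W1 @ x
delta2 = W2 @ h - y
grad_W2 = np.outer(delta2, h)            # rank-1 gradient for the top layer
grad_W1 = np.outer(W2.T @ delta2, x)     # rank-1 gradient for the bottom layer

# Change of the end-to-end linear map after one gradient step on both layers.
product_before = W2 @ W1
product_after = (W2 - alpha * grad_W2) @ (W1 - alpha * grad_W1)
rank_deep = np.linalg.matrix_rank(product_after - product_before, tol=1e-8)

print(rank_single, rank_deep)
```

Each layer still receives a rank-1 gradient, but the two updates enter the end-to-end map through different row and column spaces, so the combined change generically has rank 2.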

1.3.2 In practice

The gradient-based method is more resilient to overfitting on a small dataset than RNN-based methods. When adapted to OOD tasks with a small dataset, the gradient-based method can even keep improving without overfitting when the number of test-time update steps is larger than the number of inner-loop steps used during meta-training. Two levels of generalization:

Generalize to unseen tasks. The learned initialization $\theta$ can identify the test-time task using $\mathcal{D}$ even though the task is not seen during meta-training.
Generalize to test-time update steps. More steps can buy improvement without overfitting, suggesting strong within-task generalization.

Both suggest that MAML finds a better learning procedure than RNN-based methods.

1.3.3 Discussion

In theory, both MAML and RNN-based meta-learners can represent any learning algorithm. In practice, the gradient-based method performs better than the RNN-based method. Therefore, gradient-based meta-learners have a “strong inductive bias for reasonable learning strategies”. But what inductive bias exactly? The paper leaves this unstated.

1.4 Recasting Gradient-Based Meta-Learning as Hierarchical Bayes [Link]

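Gradient-based meta-learning adapts $\theta$ to $\phi_j$ according to data $x_j$ from task $j$. The result of this paper is that by optimizing $\prod_j p(x_j \mid \hat\phi_j)$, we are also optimizing the marginal likelihood $p(X \mid \theta) = \prod_j \int p(x_j \mid \phi_j)\, p(\phi_j \mid \theta)\, d\phi_j$. The magic of meta-learning is to formulate the task of optimizing $p(X \mid \theta)$ into a hierarchical model: if we directly optimize $\prod_j p(x_j \mid \theta)$, we will get an uninteresting “mean”; but if we first adapt $\theta$ to data and optimize $\prod_j p(x_j \mid \hat\phi_j)$, then we will get an interesting solution $\theta$.

MAML just approximates $\int p(x_j \mid \phi_j)\, p(\phi_j \mid \theta)\, d\phi_j$ with a point estimate $\hat\phi_j$. In this way, $p(X \mid \theta) \approx \prod_j p(x_j \mid \hat\phi_j)$. But to make this approximation valid, we want the integrand to be sharply concentrated at $\hat\phi_j$. Fortunately, gradient descent supports this (partially). Gradient descent is essentially finding a good solution that is close to the starting point. For instance, early-stopped gradient descent on a linear model with design matrix $Z$ and targets $y$ is doing $\min_\phi \lVert y - Z\phi \rVert_2^2 + \lVert \phi - \theta \rVert_Q^2$, where $Q$ depends on the step size and the number of steps. In this way, we can expect that $\hat\phi_j$, as a result of doing gradient descent from $\theta$ with data $x_j$, is a mode of $p(x_j \mid \phi_j)\, p(\phi_j \mid \theta)$, because the first term wants to predict the data with $\phi_j$ and the second term wants $\phi_j$ to be not far away from $\theta$.

All in all, $\hat\phi_j$ is a valid point estimate of $\phi_j$, and optimizing $\prod_j p(x_j \mid \hat\phi_j)$ is implicitly optimizing $p(X \mid \theta)$. And the magic of meta-learning is to formulate the learning on different data as a hierarchical model.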

1.5 Some Considerations on Learning to Explore via Meta-Reinforcement Learning [Link]

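This paper considers the right formulation of meta-RL. The proposed formulation is:

$$J(\theta) = \mathbb{E}_{\bar\tau \sim \pi_\theta}\Big[\mathbb{E}_{\tau \sim \pi_{\theta'}}\big[R(\tau)\big]\Big] = \iint R(\tau)\, \pi_{\theta'}(\tau)\, \pi_\theta(\bar\tau)\, d\bar\tau\, d\tau, \qquad \theta' = U(\theta, \bar\tau).$$

The key thing is to consider how changes in $\theta$ will change the collected dataset $\bar\tau$, which will lead to a different adaptation result $\theta'$. Mathematically, it weights the post-adaptation return using $\pi_\theta(\bar\tau)$, the probability of collecting a specific dataset. Intuitively, meta-RL should achieve a good $\theta$ that can enable quick learning on new RL tasks, where an informative $\bar\tau$ is the key. So meta-RL should learn a good $\pi_\theta$ that can sample a good $\bar\tau$.

Differentiating the proposed formulation with the product rule and the log-derivative trick gives:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\bar\tau \sim \pi_\theta,\ \tau \sim \pi_{\theta'}}\big[R(\tau)\, \nabla_\theta \log \pi_{\theta'}(\tau)\big] + \mathbb{E}_{\bar\tau \sim \pi_\theta,\ \tau \sim \pi_{\theta'}}\big[R(\tau)\, \nabla_\theta \log \pi_\theta(\bar\tau)\big].$$

The first term is MAML. The gradient flow is meta_loss → θ' → (θ - α * ∇_θ inner_loss) → θ. Pay special attention to $\bar\tau$ in this term: it is held fixed. It acts as if, in the “local region” of $\theta$, we can gather the same data $\bar\tau$, and we only consider how $\theta$ influences $\theta'$.

The second term is the key. It reinforces trajectory $\bar\tau$ according to $R(\tau)$ rather than $R(\bar\tau)$. That is, the trajectory $\bar\tau$ emitted by the initialization policy $\pi_\theta$ can only be reinforced if it leads to a future updated policy $\pi_{\theta'}$ emitting a high-reward trajectory $\tau$. Therefore, it is exploratory, because it allows reinforcing a trajectory even if it itself leads to bad reward.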

1.X Side Works

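Meta-Learning with Implicit Gradients. Instead of ignoring $\frac{d\phi}{d\theta}$ as in first-order meta-learning, it formulates how $\theta$ influences $\phi$ in a closed-form term $\frac{d\phi}{d\theta} = \big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}(\phi)\big)^{-1}$. This is valid because gradient descent can be deemed as a regularization term that asks the destination not to be far away from the starting point, plus a loss term to minimize the task cost. The proximal loss is $\hat{\mathcal{L}}(\phi) + \frac{\lambda}{2}\lVert \phi - \theta \rVert^2$.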
Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. Does its meta-initialization succeed through rapid learning (the inner loop drives large, efficient representational change per task) or feature reuse (the initialization already holds good features, barely changed by the inner loop)? The paper shows body-layer representations stay near-identical before and after adaptation, while only the head shifts — evidence for feature reuse. This motivates ANIL (Almost No Inner Loop), which applies inner-loop adaptation only to the task-specific head, dropping it for the body during training and testing. ANIL matches MAML’s accuracy on classification and RL benchmarks while running substantially faster. My opinion: MAML can learn both things; it is just because the task is too simple that “representation reuse” is enough (i.e., adapt the task head only).
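For the implicit-gradient item above, the Jacobian of the proximal solution can be checked on a quadratic task loss, where everything has a closed form. This sketch assumes $\hat{\mathcal{L}}(\phi) = \frac{1}{2}\phi^\top A \phi - b^\top \phi$ with a positive-definite $A$; the function names and the values of `A`, `b`, `theta`, and `lam` are illustrative only.

```python
import numpy as np

def proximal_solution(theta, A, b, lam):
    """argmin_phi 0.5 * phi^T A phi - b^T phi + (lam / 2) * ||phi - theta||^2."""
    d = len(theta)
    # Stationarity: A phi - b + lam * (phi - theta) = 0.
    return np.linalg.solve(A + lam * np.eye(d), b + lam * theta)

def implicit_jacobian(A, lam):
    """d(phi*)/d(theta) = (I + Hessian / lam)^{-1}; the Hessian of the quadratic is A."""
    d = A.shape[0]
    return np.linalg.inv(np.eye(d) + A / lam)

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)          # positive-definite Hessian
b = rng.standard_normal(3)
theta = rng.standard_normal(3)
lam = 2.0

J_implicit = implicit_jacobian(A, lam)

# Central finite differences: column j is d(phi*)/d(theta_j).
h = 1e-6
J_fd = np.column_stack([
    (proximal_solution(theta + h * e, A, b, lam)
     - proximal_solution(theta - h * e, A, b, lam)) / (2 * h)
    for e in np.eye(3)
])
print(np.max(np.abs(J_implicit - J_fd)))
```

The Jacobian depends only on the curvature at the solution and on $\lambda$, not on the path the inner optimizer took, which is why no backpropagation through the inner loop is needed.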

2. Meta-Gradient RL and Reward Learning

2.1 On Learning Intrinsic Rewards for Policy Gradient Methods [LIRPG]

The philosophy of the learned intrinsic reward in this work: the ultimate measure of performance we care about improving is the value of the extrinsic rewards achieved by the agent; the intrinsic rewards serve only to influence the change in policy parameters.

The algorithm:

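Step 1: update $\theta$ (policy parameter) to maximize the in+ex reward, getting $\theta'$.
Step 2: update $\eta$ (intrinsic reward parameter) by asking: how can I change $\eta$ such that $\theta'$ can lead to higher ex reward?

The goal of step 1 is to “understand” how a change in $\eta$ can lead to a change in ex. The key is to get $\nabla_\eta J^{\text{ex}}$ via the chain rule $\nabla_\eta J^{\text{ex}} = \nabla_{\theta'} J^{\text{ex}} \cdot \nabla_\eta \theta'$.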
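The two steps and the chain rule can be traced on a scalar toy problem. This is only a sketch under made-up assumptions: the extrinsic objective is $J^{\text{ex}}(\theta) = -(\theta - 1)^2$, the intrinsic objective is $J^{\text{in}}(\theta; \eta) = \eta\,\theta$, and gradients are written analytically rather than estimated from trajectories as a policy-gradient method would. The name `lirpg_step` and the step sizes are hypothetical.

```python
def lirpg_step(theta, eta, alpha=0.1, beta=0.5):
    """One round of the two-step update on a scalar toy problem."""
    # Step 1: ascend the in+ex objective w.r.t. the policy parameter.
    grad_theta = -2.0 * (theta - 1.0) + eta       # d(J_ex + J_in)/d(theta)
    theta_new = theta + alpha * grad_theta

    # Step 2: ascend J_ex(theta_new) w.r.t. the intrinsic reward parameter.
    dJex_dtheta_new = -2.0 * (theta_new - 1.0)    # gradient of J_ex at theta_new
    dtheta_new_deta = alpha                       # theta_new depends on eta through step 1
    eta_new = eta + beta * dJex_dtheta_new * dtheta_new_deta
    return theta_new, eta_new

theta, eta = -2.0, 0.0
for _ in range(500):
    theta, eta = lirpg_step(theta, eta)
print(theta, eta)
```

In this toy, the intrinsic term temporarily pushes $\theta$ toward the extrinsic optimum faster and then decays once the extrinsic reward is maximized, matching the idea that intrinsic rewards serve only to influence the change in policy parameters.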

The same idea is used in different settings: credit assignment in MARL, and learning to reward others for cooperation.

2.2 What Can Learned Intrinsic Rewards Capture? [Link]

This is an interesting paper. It is a deep question to think about the rationality of distinguishing extrinsic and intrinsic reward: “extrinsic rewards define the task and capture the designer’s preferences over agent behavior, whereas intrinsic rewards serve as helpful signals to improve the learning dynamics of the agent”.

The setting of intrinsic reward is philosophically worth thinking about. A method like this is asking “should I explore object A, or object B?”, whereas a policy is asking “what action sequence should I emit to explore object A?”. Therefore, this method is more “meta”:

It is defined on the trajectory: $\tau_t$ is the trajectory (the lifetime history up to time $t$) and $r_\eta(\tau_t)$ is the intrinsic reward parametrized by $\eta$.
The intrinsic reward is the only thing that can be “inherited” across lifetimes. The learning setting is illustrated in figure 1. The goal of the intrinsic reward is to enable a randomly initialized policy to learn from online experience to maximize its extrinsic reward across a lifetime. In the authors’ words: “the objective of our method is to learn knowledge that is useful for training randomly-initialised policies by capturing what to do”.

2.2.1 Experiments

Several things to notice:

The paper mainly focuses on exploration-exploitation.
The intrinsic reward captures “meta knowledge”: it defines a general guidance of the whole behavior pattern — when to explore, when to exploit, what objects/places are worth exploring, etc.

Experiment settings (interesting):

Empty Rooms: The agent must visit an “invisible goal location, which is fixed within each lifetime but varies across lifetimes”, over 200 episodes. Good intrinsic reward “should encourage the agent to go to unvisited locations to locate the goal, and then to exploit that knowledge for the rest of the lifetime.”
Random ABC: Rewards for objects A, B, C are “uniformly sampled from [−1, 1], [−0.5, 0], and [0, 0.5]… but are held fixed within the lifetime.” Good intrinsic reward should learn that “1) B should be avoided, 2) A and C have uncertain rewards, hence require systematic exploration (first go to one and then the other), and 3) once it is determined which of the two, A or C, is better, exploit that knowledge.”
Key-Box: Similar to Random ABC but “the agent needs to collect the key first to open one of the boxes (A, B, and C)”; the key gives a neutral reward of 0. Good intrinsic reward should capture “that the key is necessary to open any box, which is true across many lifetimes.”
Non-stationary ABC: Rewards: A is 1 or −1, B is −0.5, C is the negative value of A; “the rewards of A and C are swapped every 250 episodes”, over 1000 episodes. Good intrinsic reward should “capture the (regularly) repeated non-stationarity across many lifetimes and make the agent intrinsically motivated not to commit too firmly to a policy, in anticipation of changes in the environment.”