Policy generated by a pre-trained language model for a given household task
The policy learned by fine-tuning the pre-trained language model successfully finishes the task described in the goal predicates.
We highlight the key actions in the map, where the agent is finding, grabbing, or placing objects in the target positions.
Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. We then examine how our framework may be used in environments without pre-collected expert data. To do this, we integrate an active data gathering procedure into pre-trained LMs. The agent iteratively learns by interacting with the environment, relabeling the language goal of past "failed" experiences, and updating the policy in a self-supervised loop. The active data gathering procedure also enables effective combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization. Surprisingly, however, the format of the policy input encoding (e.g., as a natural language string vs. an arbitrary sequential encoding) has little influence.
Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
Qualitative results of our model on VirtualHome and BabyAI. We only show a sub-trajectory in each example to save space. The interacted objects are labelled by green bounding boxes.
Failure cases. We show failure cases caused by grounding errors and policy errors. We only show a sub-trajectory in each example and omit most exploration actions to save space. The interacted objects are labelled by green bounding boxes.
Can pre-trained language models be used as a general framework for tasks across different environments?
In this paper, we study this question through the lens of embodied decision-making, investigating the effectiveness of LM pre-training as a general framework for learning policies across a variety of environments.
We propose to use pre-trained language models as a general framework for interactive decision-making across a variety of environments by converting all policy inputs into sequential data.
This framework is generic, accommodating goals and environment states represented as natural language strings, image patches, or scene graphs.
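As a rough illustration of converting policy inputs into sequential data, the sketch below serializes a goal, an observation, and an action history into one string. The function names, the predicate format, the "name is state" template, and the " | " separator are all hypothetical choices made for this example; they are not taken from the paper.

```python
# Hypothetical serialization of policy inputs into a single sequence.
# The templates and separator are illustrative assumptions only.

def goal_to_text(goal_counts):
    """Turn {(relation, object, target): count} into a readable string."""
    parts = []
    for (relation, obj, target), count in sorted(goal_counts.items()):
        parts.append(f"{relation}({obj}, {target}): {count}")
    return "; ".join(parts)


def observation_to_text(objects):
    """Turn a list of (object name, object state) pairs into a string."""
    return ", ".join(f"{name} is {state}" for name, state in objects)


def build_policy_input(goal_counts, objects, past_actions):
    """Concatenate goal, observation, and history into one sequence."""
    segments = [
        goal_to_text(goal_counts),
        observation_to_text(objects),
        ", ".join(past_actions),
    ]
    return " | ".join(segments)


example = build_policy_input(
    {("on", "apple", "table"): 2},
    [("apple", "grabbed"), ("fridge", "open")],
    ["walk kitchen", "grab apple"],
)
print(example)
```

A string like this could then be tokenized and passed to the policy network; image patches or scene graphs would need their own encoders to produce a comparable sequence of embeddings.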
Learning without pre-collected expert data
We further examine how our method may be used in environments where expert data is not available and an agent must actively gather data from the surrounding environment. To do this, we integrate an Active Data Gathering (ADG) procedure into pre-trained LMs.
ADG consists of three parts: exploration, hindsight relabeling, and policy update. First, exploration collects trajectories using a mix of random actions and actions generated by the current policy. Exploration alone is insufficient in this high-dimensional problem, and most trajectories will likely fail to achieve the end goal. A key insight is that even failed trajectories contain useful sub-trajectories that solve certain sub-goals, so in the hindsight relabeling stage we relabel their goals: the relabeled goal describes what was achieved in the extracted sub-trajectory. Finally, the policy update samples relabeled trajectories to update the policy.
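The loop described above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the action set, the predicate strings, the rule of keeping the prefix that ends at the last state change, and the epsilon-style mixing of random and policy actions. The actual policy update (a gradient step on the LM-initialized policy) is omitted; the sketch only fills the buffer it would sample from.

```python
import random

# Toy, hypothetical action set: two actions each make one predicate true.
ACTIONS = ["put_apple_on_table", "put_cup_on_table", "wait"]
EFFECTS = {
    "put_apple_on_table": "on(apple,table)",
    "put_cup_on_table": "on(cup,table)",
}


def step(state, action):
    """Toy transition: add the action's predicate (if any) to the state."""
    effect = EFFECTS.get(action)
    return state | {effect} if effect else state


def explore(policy, goal, horizon, epsilon, rng):
    """Roll out a mix of random actions and actions from the current policy."""
    state, trajectory = frozenset(), []
    for _ in range(horizon):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = policy(goal, state)
        state = step(state, action)
        trajectory.append((action, state))
    return trajectory


def hindsight_relabel(trajectory):
    """Keep the prefix ending at the last state change and label it
    with the predicates that prefix actually achieved."""
    last_change, previous = -1, frozenset()
    for i, (_, state) in enumerate(trajectory):
        if state != previous:
            last_change = i
        previous = state
    if last_change < 0:
        return None  # nothing achieved, so there is no goal to relabel with
    sub = trajectory[: last_change + 1]
    achieved_goal = sub[-1][1]
    return achieved_goal, [action for action, _ in sub]


def adg_iteration(policy, goal, buffer, rng, horizon=5, epsilon=0.5):
    """One exploration + relabeling pass; a policy update would sample `buffer`."""
    relabeled = hindsight_relabel(explore(policy, goal, horizon, epsilon, rng))
    if relabeled is not None:
        buffer.append(relabeled)
    return buffer
```

For instance, a rollout that only manages to place the cup while the original goal asked for both objects would be stored under the goal `{"on(cup,table)"}`, turning a failed episode into a valid demonstration of a simpler task.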
Combinatorial generalization to out-of-distribution tasks
We find that using pre-trained LMs as policy initializers improves in-domain performance and enables several forms of strong generalization over tasks.
For i.i.d. training and evaluation tasks, we find that this approach yields 20% more successful policies than other baseline methods in VirtualHome.
For combinatorial generalization to out-of-distribution tasks, i.e. tasks involving new combinations of goals, states or objects, we find that LM pre-training confers even more benefits: it improves task completion rates by 43.6% for tasks involving novel goals.
Pre-trained Language Model with Active Data Gathering (LID-ADG)
We compare LID-ADG, the proposed LM framework for decision-making using actively gathered data, to a variety of baselines that do not use pre-collected expert data on VirtualHome.
LID-ADG (Ours) outperforms all the baselines.
Does this effective combinatorial generalization arise because LMs are effective models of relations between natural language descriptions of states and actions, or because they provide a more general framework for combinatorial generalization in decision-making?
We hypothesize and investigate three possible factors underlying the effectiveness of language modeling for generalization in policy learning:
(1) input encoding scheme;
(2) sequential input representations;
and (3) favorable weight initialization.
(1) Input encoding scheme
We investigate (1) by encoding the environment as different types of sequences. Different input encoding schemes have only a negligible impact on model performance: the effectiveness of language modeling is not limited to utilizing natural strings, but in fact extends to arbitrary sequential encodings.
Success rates of policies trained with different input encodings in the Novel Tasks setting on VirtualHome. The text encoding is most sample-efficient, but all models converge to similar performance given sufficient training data.
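To make "arbitrary sequential encoding" concrete, the sketch below builds one possible such encoding: a fixed one-to-one shuffle of the vocabulary, so every word is consistently replaced by another word. This construction, the function names, and the seed parameter are assumptions for illustration and may differ from the encodings actually compared in the experiments.

```python
import random


def make_arbitrary_encoding(vocabulary, seed=0):
    """Build a fixed one-to-one map from each word to a word of the
    same vocabulary, chosen by a seeded shuffle (hypothetical scheme)."""
    words = sorted(vocabulary)
    shuffled = words[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(words, shuffled))


def encode(text, mapping):
    """Replace every whitespace-separated word using the mapping."""
    return " ".join(mapping[word] for word in text.split())


def decode(text, mapping):
    """Invert the mapping; possible because it is one-to-one."""
    inverse = {target: source for source, target in mapping.items()}
    return " ".join(inverse[word] for word in text.split())


mapping = make_arbitrary_encoding({"put", "apple", "on", "table"})
scrambled = encode("put apple on table", mapping)
```

The scrambled string no longer reads as natural language, but it keeps the same length and the same sequential structure, and the original can be recovered exactly with `decode`.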
(2) Sequential input representations
We investigate (2) by encoding observations with a single vector embedding, thereby removing its sequential structure (No-Seq). This operation significantly hurts the model's performance on novel tasks.
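One way to see what is lost when a sequence is collapsed into a single vector is the toy sketch below. It uses mean pooling as the collapsing operation, which is an assumption chosen for simplicity and not necessarily how the No-Seq variant is built; the two-dimensional "embeddings" are made up.

```python
def mean_pool(token_embeddings):
    """Collapse a sequence of equal-length vectors into their average."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / count for d in range(dim)]


# Two different toy observations, each a sequence of two object embeddings.
scene_a = [[1.0, 0.0], [0.0, 1.0]]
scene_b = [[0.5, 0.5], [0.5, 0.5]]

pooled_a = mean_pool(scene_a)
pooled_b = mean_pool(scene_b)
print(pooled_a == pooled_b)  # the two scenes become indistinguishable
```

A sequential representation keeps one embedding per element, so a policy can still tell these two observations apart; the pooled vector cannot.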
(3) Parameter pre-training
Finally, we investigate (3) by learning the parameters of the policy network from scratch (No-Pretrain). The success rate on novel tasks after removing the pre-trained LM weights drops by 11.2%.
"LID-Text (Ours)" fine-tunes a pre-trained LM, while "No-Pretrain" learns it from scratch. "No-FT" freezes the pre-trained weights. "No-Seq" uses non-sequential inputs. Fine-tuning the pre-trained weights and using a sequential encoding are both important for combinatorial generalization.
We find that sequential input representations (vs. fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization. However, the input encoding scheme (e.g., a natural language string vs. an arbitrary encoding) has little influence.
More Qualitative Results
1. Policy with pre-trained Language Model vs. Policy without pre-trained Language Model
2. Policy with pre-trained Language Model vs. Policy with LSTM
3. Policy with pre-trained Language Model on different test settings