Learning a hierarchy

October 26, 2017

More resources

Read paper

View code

Reinforcement learning, Meta-learning, Publication, Release

[Video: Learning a hierarchy, 00:45]

Humans solve complicated challenges by breaking them up into small, manageable components. Making pancakes consists of a series of high-level actions, such as measuring flour, whisking eggs, transferring the mixture to the pan, turning the stove on, and so on. Humans are able to learn new tasks rapidly by sequencing together these learned components, even though the task might take millions of low-level actions, i.e., individual muscle contractions.

On the other hand, today’s reinforcement learning methods operate through brute force search over low-level actions, requiring an enormous number of attempts to solve a new task. These methods become very inefficient at solving tasks that take a large number of timesteps.

Our solution is based on the idea of hierarchical reinforcement learning, where agents represent complicated behaviors as a short sequence of high-level actions. This lets our agents solve much harder tasks: while the solution might require 2000 low-level actions, the hierarchical policy turns this into a sequence of 10 high-level actions, and it’s much more efficient to search over the 10-step sequence than the 2000-step sequence.
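As a rough back-of-the-envelope illustration (the branching factor of 5 choices per decision is a made-up number purely for illustration; only the 2000 and 10 step counts come from the text above), the gap between the two search spaces looks like this:

```python
# Back-of-the-envelope illustration of the search-space reduction.
choices_per_decision = 5     # hypothetical branching factor at each decision

low_level_decisions = 2000   # planning directly over low-level actions
high_level_decisions = 10    # planning over sub-policy choices instead

# Candidate sequences grow exponentially in the number of decisions,
# so shortening the decision horizon shrinks the search space dramatically.
high_level_space = choices_per_decision ** high_level_decisions
print(f"high-level sequences: {high_level_space:,}")                  # 9,765,625
print(f"low-level sequences:  5**{low_level_decisions} (astronomically larger)")
```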

Meta-learning shared hierarchies

Our algorithm, meta-learning shared hierarchies (MLSH), learns a hierarchical policy in which a master policy switches between a set of sub-policies. The master selects an action every N timesteps, where we might take N=200. A sub-policy executed for N timesteps constitutes a high-level action, and for our navigation tasks, the sub-policies correspond to walking or crawling in different directions.
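A minimal sketch of this control flow, with hypothetical `master_policy`, `sub_policies`, and a Gym-style environment standing in for the real components (this is not the released implementation), might look like:

```python
N = 200  # the master acts once every N low-level timesteps

def hierarchical_rollout(env, master_policy, sub_policies, max_steps=2000):
    """Run one episode: every N steps the master picks a sub-policy index
    (the high-level action); the chosen sub-policy then emits low-level
    actions (e.g. joint torques) until the master acts again."""
    obs = env.reset()
    total_reward, active = 0.0, 0
    for t in range(max_steps):
        if t % N == 0:
            active = master_policy.act(obs)       # high-level action
        action = sub_policies[active].act(obs)    # low-level action
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```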

In most prior work, hierarchical policies have been explicitly hand-engineered. Instead, we aim to discover this hierarchical structure automatically through interaction with the environment. Taking a meta-learning perspective, we define a good hierarchy as one that quickly reaches high reward when training on unseen tasks. Hence, the MLSH algorithm aims to learn sub-policies that enable fast learning on previously unseen tasks.

We train on a distribution over tasks, sharing the sub-policies while learning a new master policy on each sampled task. By repeatedly training new master policies, this process automatically finds sub-policies that accommodate the master policy’s learning dynamics.
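One plausible way to organize that outer loop is sketched below; `MasterPolicy`, `collect_rollouts`, and the split into a master-only phase followed by a joint update phase are hypothetical placeholders chosen for illustration, not the released code:

```python
def train_mlsh(task_distribution, sub_policies,
               num_iterations=1000, master_steps=20, joint_steps=30):
    """Outer-loop sketch: sub-policies are shared across all tasks, while a
    fresh master policy is trained from scratch on every sampled task."""
    for _ in range(num_iterations):
        task = task_distribution.sample()
        master_policy = MasterPolicy(num_subs=len(sub_policies))  # reset per task

        # First adapt only the master to the new task, holding the shared
        # sub-policies fixed, so its learning speed reflects how reusable they are.
        for _ in range(master_steps):
            rollouts = collect_rollouts(task, master_policy, sub_policies)
            master_policy.update(rollouts)

        # Then update the master and sub-policies together, nudging the shared
        # sub-policies toward behaviors that let future masters learn quickly.
        for _ in range(joint_steps):
            rollouts = collect_rollouts(task, master_policy, sub_policies)
            master_policy.update(rollouts)
            for sub in sub_policies:
                sub.update(rollouts)
    return sub_policies
```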

Experiments

In our AntMaze environment, a MuJoCo Ant robot is placed into a distribution of 9 different mazes and must navigate from the starting position to the goal. Our algorithm successfully finds a diverse set of sub-policies that can be sequenced together to solve the maze tasks, solely through interaction with the environment. This set of sub-policies can then be used to master a larger task than the ones they were trained on (see the video at the beginning of the post).

Code

We’re releasing the code for training the MLSH agents, as well as the MuJoCo environments we built to evaluate these algorithms.

Authors

Kevin Frans

Jonathan Ho

Peter Chen

Pieter Abbeel

