Imitation Learning is a promising approach to endow robots with various complex manipulation capabilities. By allowing robots to learn from datasets collected by humans, robots can learn to perform the same skills that were demonstrated by the human. Typically, these datasets are collected by having humans control robot arms, guiding them through different tasks. While this paradigm has proved effective, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation.

Based on the study, we derive several lessons to understand the challenges in learning from human demonstrations, including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available.

We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Please see the robomimic website for more information.

In this study, we investigate several challenges of offline learning from human datasets and extract lessons to guide future work.

Why is learning from human-labeled datasets difficult?

We explore five challenges in learning from human-labeled datasets.

  • (C1) Unobserved Factors in Human Decision Making. Humans are not perfect Markovian agents. In addition to what they currently see, their actions may be influenced by other external factors - such as the device they are using to control the robot and the history of the actions that they have provided.

  • (C2) Mixed Demonstration Quality. Collecting data from multiple humans can result in mixed quality data, since some people might be better quality supervisors than others.

  • (C3) Dependence on dataset size. When a robot learns from an offline dataset, it needs to understand how it should act (action) in every scenario that it might encounter (state). This is why the coverage of states and actions in the dataset matters. Larger datasets are likely to contain more situations, and are therefore likely to train better robots.

  • (C4) Train Objective ≠ Eval Objective. Unlike traditional supervised learning, where validation loss is a strong indicator of how good a model is, policies are usually trained with surrogate losses. Consider an example where we train a policy via Behavioral Cloning from a set of demonstrations on a block lifting task. Here, the policy is trained to replicate the actions taken by the demonstrator, but this is not necessarily equivalent to optimizing the block lifting success rate (see the Dagger paper for a more precise explanation). This makes it hard to know which trained policy checkpoints are good without trying out each and every model directly on the robot – a time consuming process.

  • (C5) Sensitivity to Agent Design Decisions. Performance can be very sensitive to important agent design decisions, like the observation space and hyperparameters used for learning.

Study Design

In this section, we summarize the tasks (5 simulated and 3 real), datasets (3 different variants), algorithms (6 offline methods, including 3 imitation and 3 batch reinforcement), and observation spaces (2 main variants) that we explored in our study.


Tool Hang
Lift (Real)
Can (Real)
Tool Hang (Real)
We collect datasets across 6 operators of varying proficiency and evaluate offline policy learning methods on 8 challenging manipulation tasks that test a wide range of manipulation capabilities including pick-and-place, multi-arm coordination, and high-precision insertion and assembly.

Task Reset Distributions

When measuring the task success rate of a policy, the policy is evaluated across several trials. At the start of each trial, the initial placement of all objects in the task are randomized from a task reset distribution. The videos below show this distribution for each task. This gives an impression of the range of different scenarios that a trained policy is supposed to be able to handle.

Tool Hang
Lift (Real)
Can (Real)
Tool Hang (Real)
We show the task reset distributions for each task, which governs the initial placement of all objects in the scene at the start of each episode. Initial states are sampled from this distribution at both train and evaluation time.


We collected 3 kinds of datasets in this study.


These datasets consist of rollouts from a series of SAC agent checkpoints trained on Lift and Can, instead of humans. As a result, they contain random, suboptimal, and expert data due to the varied success rates of the agents that generated the data. This kind of mixed quality data is common in offline RL works (e.g. D4RL, RLUnplugged).

Lift (MG)
Can (MG)
Lift and Can Machine-Generated datasets.


These datasets consist of 200 demonstrations collected from a single proficient human operator using RoboTurk.

Lift (PH)
Can (PH)
Square (PH)
Transport (PH)
Tool Hang (PH)
Proficient-Human datasets generated by 1 proficient operator (with the exception of Transport, which had 2 proficient operators working together).


These datasets consist of 300 demonstrations collected from six human operators of varied proficiency using RoboTurk. Each operator falls into one of 3 groups - “Worse”, “Okay”, and “Better” – each group contains two operators. Each operator collected 50 demonstrations per task. As a result, these datasets contain mixed quality human demonstration data. We show videos for a single operator from each group.

Lift (MH) - Worse
Lift (MH) - Okay
Lift (MH) - Better
Multi-Human Lift dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).
Can (MH) - Worse
Can (MH) - Okay
Can (MH) - Better
Multi-Human Can dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).
Square (MH) - Worse
Square (MH) - Okay
Square (MH) - Better
Multi-Human Square dataset. The videos show three operators - one that's "worse" (left), "okay" (middle) and "better" (right).
Transport (MH) - Worse-Worse
Transport (MH) - Okay-Okay
Transport (MH) - Better-Better
Transport (MH) - Worse-Okay
Transport (MH) - Worse-Better
Transport (MH) - Okay-Better
Multi-Human Transport dataset. These were collected using pairs of operators with Multi-Arm RoboTurk (each one controlled 1 robot arm). We collected 50 demonstrations per combination of the operator subgroups.


We evaluated 6 different offline learning algorithms in this study, including 3 imitation learning and 3 batch (offline) reinforcement learning algorithms.

We evaluated 6 different offline learning algorithms in this study, including 3 imitation learning and 3 batch (offline) reinforcement learning algorithms.
  • BC: standard Behavioral Cloning, which is direct regression from observations to actions.
  • BC-RNN: Behavioral Cloning with a policy network that’s a recurrent neural network (RNN), which allows modeling temporal correlations in decision-making.
  • HBC: Hierarchical Behavioral Cloning, where a high-level subgoal planner is trained to predict future observations, and a low-level recurrent policy is conditioned on a future observation (subgoal) to predict action sequences (see Mandlekar*, Xu* et al. (2020) and Tung*, Wong* et al. (2021) for more details).
  • BCQ: Batch-Constrained Q-Learning, a batch reinforcement learning method proposed in Fujimoto et al. (2019).
  • CQL: Conservative Q-Learning, a batch reinforcement learning method proposed in Kumar et al. (2020).
  • IRIS: Implicit Reinforcement without Interaction, a batch reinforcement learning method proposed in Mandlekar et al. (2020).

Observation Spaces

We study two different observation spaces in this work – low-dimensional observations and image observations.

We study two different observation spaces in this work.

Image Observations

We provide examples of the image observations used in each task below.

Most tasks have a front view and wrist view camera. The front view matches the view provided to the operator during data collection.
Tool Hang has a side view and wrist view camera. The side view matches the view provided to the operator during data collection.
Transport has a shoulder view and wrist view camera per arm. The shoulder view cameras match the views provided to each operator during data collection.

Summary of Lessons Learned

In this section, we briefly highlight the lessons we learned from our study. See the paper for more thorough results and discussion.

Lesson 1: History-dependent models are extremely effective.

We found that there is a substantial performance gap between BC-RNN and BC, which highlights the benefits of history-dependence. This performance gap is larger for longer-horizon tasks (e.g. ~55% for the Transport (PH) dataset compared to ~5% for the Square (PH) dataset)) and also larger for multi-human data compared to single-human data (e.g.~25% for Square (MH) compared to ~5% for Square (PH)).

Methods that make decisions based on history, such as BC-RNN and HBC, outperform other methods on human datasets.

Lesson 2: Batch (Offline) RL struggles with suboptimal human data.

Recent batch (offline) RL algorithms such as BCQ and CQL have demonstrated excellent results in learning from suboptimal and multi-modal machine-generated datasets. Our results confirm the capacity of such algorithms to work well – BCQ in particular performs strongly on our agent-generated MG datasets that consist of a diverse mixture of good and poor policies (for example, BCQ achieves 91.3% success rate on Lift (MG) compared to BC which achieves 65.3%).

Surprisingly though, neither BCQ nor CQL performs particularly well on these human-generated datasets. For example, BCQ and CQL achieve 62.7% and 22.0% success respectively on the Can (MH) dataset, compared to BC-RNN which achieves 100% success. This puts the ability of such algorithms to learn from more natural dataset distributions into question (instead of those collected via RL exploration or pre-trained agents). There is an opportunity for future work in batch RL to resolve this gap.

While batch (offline) RL methods are proficient at dealing with mixed quality machine-generated data, they struggle to deal with mixed quality human data.
To further evaluate methods in a simpler setting, we collected the Can Paired dataset, where every task instance has two demonstrations, one success and one failure. Even this simple setting, where each start state has exactly one positive and one negative demonstration, poses a problem.

Lesson 3: Improving offline policy selection is important.

The mismatch between train and evaluation objective causes problems for policy selection - unlike supervised learning, the best validation loss does not correspond to the best performing policy. We found that the best validation policy is 50 to 100% worse than the best performing policy. Thus, each policy checkpoint needs to be tried directly on the robot – this can be costly.

Lesson 4: Observation space and hyperparameters play a large role in policy performance.

We found that observation space choice and hyperparameter selection is crucial for good performance. As an example, not including wrist camera observations can reduce performance by 10 to 45 percent

Lesson 5: Using human data for manipulation is promising.

Studying how dataset size impacts performance made us realize that using human data holds much promise. For each task, the bar chart shows how performance changes going from 20% to 50% to 100% of the data. Simpler tasks like Lift and Can require just a fraction of our collected datasets to learn, while more complex tasks like Square and Transport benefit substantially from adding more human data, suggesting that more complex tasks could be addressed by using large human datasets.

Lesson 6: Study results transfer to real world.

We collected 200 demonstrations per task, and trained a BC-RNN policy using identical hyperparameters to simulation, with no hyperparameter tuning. We see that in most cases, performance and insights on what works in simulation transfer well to the real world.

Lift (Real). 96.7% success rate. Nearly matches performance in simulation (100%).
Can (Real). 73.3% success rate. Nearly matches performance in simulation (100%).
Tool Hang (Real). 3.3% success rate. Far from simulation (67.3%) - the real task is harder.

Below, we present examples of policy failures on the Tool Hang task, which illustrate its difficulty, and the large room for improvement.

Insertion Miss
Failed Insertion
Failed Tool Grasp
Tool Drop
Failures which illustrate the difficulty of the Tool Hang task.

We also show that results from our observation space study hold true in the real world – visuomotor policies benefit strongly from wrist observations and pixel shift randomization.

Can (no Wrist). 43.3% success rate (compared to 73.3% with wrist).
Can (no Rand). 26.7% success rate (compared to 73.3% with randomization).
Without wrist observations (left) the success rate decreases from 73.3% to 43.3%. Without pixel shift randomization (right), the success rate decreases from 73.3% to 26.7%.


  1. Learning from large multi-human datasets can be challenging.

  2. Large multi-human datasets hold promise for endowing robots with dexterous manipulation capabilities.

  3. Studying this setting in simulation can enable reproducible evaluation and insights can transfer to real world.

Please see the robomimic website for more information.

This blog post is based on the following paper: