*
We have applied our apprenticeship learning and reinforcement learning
algorithms to the problem of autonomous helicopter flight. This
resulted in a robust, highly capable controller for our helicopter.
In particular, our helicopter can now perform very difficult aerobatic
maneuvers, such as in-place flips
(pitching backward to perform a 360-degree rotation---imagine a
backward somersault), in-place rolls, a steep ``funnel'' maneuver
(flying sideways in a circle while steeply pitched forwards or
backwards, so that the helicopter traces out the surface of a
``funnel''), and even tic-tocs (analogous to a metronome or an inverted
pendulum, where the helicopter, with nose up and tail down, quickly
pitches approximately 30 degrees back and forth) and chaos (arguably the most challenging aerobatic
maneuver). Such maneuvers are well beyond the abilities of all but
the best human pilots; to our knowledge, these are also by far
the most difficult maneuvers performed on any autonomous helicopter.
*

For more information, also see the Stanford Autonomous Helicopter Project.

** An Application of Reinforcement Learning to Aerobatic Helicopter Flight**,

Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng.

In * NIPS 19*, 2007.

** Using Inaccurate Models in Reinforcement Learning**,

Pieter Abbeel, Morgan Quigley and Andrew Y. Ng.

In * Proceedings of ICML*, 2006.
(ps,
pdf,
long version:
ps,
pdf)

** Modeling Vehicular Dynamics, with Application to Modeling Helicopters**,

Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng.

In * NIPS 18*, 2006.
(ps,
pdf)

** Exploration and Apprenticeship Learning in Reinforcement Learning**,

Pieter Abbeel and Andrew Y. Ng.

In * Proceedings of ICML*, 2005.
(ps,
pdf,
long version:
ps,
pdf)

** Learning First Order Markov Models for Control**,

Pieter Abbeel and Andrew Y. Ng.

In * NIPS 17*, 2005.
(ps,
pdf)

** Apprenticeship Learning via Inverse Reinforcement Learning**,

Pieter Abbeel and Andrew Y. Ng.

In * Proceedings of ICML*, 2004.
(ps,
pdf,
supplement:
ps,
pdf,
supplementary webpage here)

*
The RC car and flight simulator videos below illustrate our algorithm for reinforcement learning with inaccurate models,
presented at ICML 2006. We have since also applied an extension of that idea to design
controllers for our autonomous helicopter.
*

*
In the model-based policy search approach to reinforcement
learning (RL), policies are found using a model (or ``simulator'') of
the Markov decision process. However, for high-dimensional
continuous-state tasks, it can be extremely difficult to build an
accurate model, and thus the algorithm often returns a policy that
works in simulation but not in real life. The other extreme,
model-free RL, tends to require infeasibly large numbers of real-life
trials. In this paper, we present a hybrid algorithm that requires
only an approximate model, and only a small number of real-life
trials. The key idea is to successively ``ground'' the policy
evaluations using real-life trials, but to rely on the approximate
model to suggest local changes. Our theoretical results show that this
algorithm achieves near-optimal performance in the real system, even
when the model is only approximate. Empirical results also
demonstrate that---when given only a crude model and a small number of
real-life trials---our algorithm can obtain near-optimal performance
in the real system.
*
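
As a rough illustration of the grounding idea, here is a minimal sketch on a hypothetical one-dimensional double-integrator task; the "real" dynamics, the crude model, the quadratic cost, the linear policy class, and the step sizes are all assumptions made for the example, not details from the paper. Each iteration runs one real-life trial, adds time-indexed corrections so that the approximate model reproduces that trial exactly, and then makes a small, locally searched policy change using only the corrected model.

```python
import numpy as np

# Toy 1-D double-integrator task, used purely for illustration.
DT, HORIZON = 0.1, 50
TARGET = np.array([1.0, 0.0])            # desired (position, velocity)


def f_real(x, u):
    """'Real' dynamics: extra friction and a different control gain."""
    return x + DT * np.array([x[1], 0.8 * u - 0.3 * x[1]])


def f_model(x, u):
    """Crude approximate model: no friction, wrong control gain."""
    return x + DT * np.array([x[1], u])


def policy(theta, x):
    """Linear feedback policy u = theta . (x - target)."""
    return float(theta @ (x - TARGET))


def cost(xs):
    return sum(float(np.sum((x - TARGET) ** 2)) for x in xs)


def real_rollout(theta):
    """One real-life trial with the current policy."""
    xs, x = [np.zeros(2)], np.zeros(2)
    for _ in range(HORIZON):
        x = f_real(x, policy(theta, x))
        xs.append(x)
    return xs


def grounded_cost(theta, bias):
    """Policy evaluation in the bias-corrected ("grounded") model: each model
    prediction is shifted so that, for the policy that generated the last
    real trial, the model reproduces that trial exactly."""
    xs, x = [np.zeros(2)], np.zeros(2)
    for t in range(HORIZON):
        x = f_model(x, policy(theta, x)) + bias[t]
        xs.append(x)
    return cost(xs)


theta = np.zeros(2)
for it in range(15):
    real_xs = real_rollout(theta)                      # 1. real-life trial
    bias = [real_xs[t + 1] - f_model(real_xs[t], policy(theta, real_xs[t]))
            for t in range(HORIZON)]                   # 2. ground the model
    # 3. Local change suggested by the corrected model: finite-difference
    #    gradient plus a crude line search, evaluated only in simulation.
    eps = 1e-3
    grad = np.array([(grounded_cost(theta + eps * e, bias)
                      - grounded_cost(theta - eps * e, bias)) / (2 * eps)
                     for e in np.eye(2)])
    theta = min((theta - s * grad for s in [0.0, 1e-4, 1e-3, 1e-2]),
                key=lambda th: grounded_cost(th, bias))
    print(f"iteration {it:2d}: real-life cost {cost(real_xs):8.3f}")
```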

Learning to execute a closed-loop "figure-8" trajectory: mpg.

Learning to execute an open-loop turning trajectory: mpg.

** Using Inaccurate Models in Reinforcement Learning**,

Pieter Abbeel, Morgan Quigley and Andrew Y. Ng.

In * Proceedings of ICML*, 2006.
(ps,
pdf,
long version:
ps,
pdf)

*
The highway driving videos below illustrate our "apprenticeship
learning via inverse reinforcement learning" algorithm presented at
ICML 2004.
*

*
We consider learning in a Markov decision process where we are not explicitly
given a reward function, but where instead we can observe an expert demonstrating
the task that we want to learn to perform. This setting is useful in applications
(such as the task of driving) where it may be difficult to write down an
explicit reward function specifying exactly how different desiderata should be
traded off.
We think of the expert as trying to maximize a reward function that is expressible
as a linear combination of known features, and give an algorithm for
learning the task demonstrated by the expert. Our algorithm is based
on using ``inverse reinforcement learning'' to try to recover the unknown
reward function.
We show that our algorithm terminates in a small number of iterations,
and that even though we may never recover the expert's reward function,
the policy output by the algorithm will attain performance close to
that of the expert, where here performance is measured with respect to
the expert's *unknown* reward function.
*
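
As a concrete toy example of this feature-matching idea, the sketch below runs a projection-style update on a hypothetical five-state chain MDP; the MDP, the one-hot features, the "expert" policy, and the particular update rule are assumptions made for illustration rather than details taken from the paper. The learner repeatedly guesses reward weights from the gap between the expert's feature expectations and its own, solves the MDP for that guessed reward, and stops once the gap is small.

```python
import numpy as np

# Tiny 5-state chain MDP, used only for illustration.
N_S, GAMMA, START = 5, 0.9, 0
ACTIONS = (-1, +1)                       # move left / right along the chain
PHI = np.eye(N_S)                        # phi(s) = one-hot state indicator


def step(s, a):
    return min(max(s + a, 0), N_S - 1)


def feature_expectations(policy, horizon=200):
    """mu(pi) = E[ sum_t gamma^t phi(s_t) ] for a deterministic policy;
    with one-hot features this is just a discounted count of visited states."""
    mu, s = np.zeros(N_S), START
    for t in range(horizon):
        mu += GAMMA ** t * PHI[s]
        s = step(s, policy[s])
    return mu


def optimal_policy(w, iters=200):
    """Value iteration for the guessed reward R(s) = w . phi(s)."""
    V = np.zeros(N_S)
    for _ in range(iters):
        V = np.array([w @ PHI[s] + GAMMA * max(V[step(s, a)] for a in ACTIONS)
                      for s in range(N_S)])
    return [max(ACTIONS, key=lambda a: V[step(s, a)]) for s in range(N_S)]


# The "expert" demonstration: a policy that always heads for the right end.
mu_E = feature_expectations([+1] * N_S)

# Apprenticeship learning loop: guess reward weights, solve the MDP for that
# reward, and project until the learner's feature expectations match.
mu_bar = feature_expectations([-1] * N_S)        # arbitrary initial policy
for i in range(20):
    w = mu_E - mu_bar                            # guessed reward weights
    if np.linalg.norm(w) < 1e-3:                 # feature expectations matched
        break
    pi = optimal_policy(w)                       # RL step under guessed reward
    mu = feature_expectations(pi)
    d = mu - mu_bar
    mu_bar = mu_bar + (d @ w) / (d @ d) * d      # projection-style update
    print(f"iteration {i}: ||mu_E - mu_bar|| = {np.linalg.norm(mu_E - mu_bar):.4f}")
```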

1: Nice. (expert demonstration, learned controller, expert and learned side-by-side)

2: Bad. (expert demonstration, learned controller, expert and learned side-by-side)

3: Right lane nice. (expert demonstration, learned controller, expert and learned side-by-side)

4: Right lane bad. (expert demonstration, learned controller, expert and learned side-by-side)

5: Middle lane. (expert demonstration, learned controller, expert and learned side-by-side)

** Apprenticeship Learning via Inverse Reinforcement Learning**,

Pieter Abbeel and Andrew Y. Ng.

In * Proceedings of ICML*, 2004.
(ps,
pdf,
supplement:
ps,
pdf,
supplementary webpage here)

*
Legged robots, unlike wheeled robots, have the potential to access
nearly all of the earth's land mass, enabling robotic applications in
areas where they are currently infeasible. However, the current
control software for legged robots is quite limited, and does not let
them realize this potential.
*

*
We proposed a method for hierarchical apprenticeship learning: our
algorithm accepts advice at different hierarchical levels of the
quadruped locomotion control task. Our algorithm then
uses this advice to find a controller that allows the quadruped to
successfully traverse highly non-trivial, previously unseen terrains.
*

Planning Before/After Learning #2 (9/2007). (mp4, wmv)

For more information, also see the Stanford Learning Locomotion Project.

** Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion**,

J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng.

In