Researchers at DeepMind are working on complex artificial intelligence techniques that could find uses in animation and game development.
Last week a very interesting paper appeared on arXiv (hosted by Cornell University Library), along with a video that is now making the rounds on the internet. The paper is penned by a team of researchers from DeepMind Technologies: Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, David Silver. The man behind DeepMind is probably one of the most inspiring minds in the fields of game design and artificial intelligence. We’re talking, of course, about Demis Hassabis.
Hassabis worked at Bullfrog Productions and Lionhead, where he programmed AI for Black & White – a technological marvel at the time. He later founded Elixir Studios, where he helped to create Republic: The Revolution and Evil Genius. After that studio closed, Hassabis went on to found a new start-up called DeepMind. In 2014 the company was acquired by Google.
DeepMind performs research in various fields, with a particular focus on artificial intelligence. The current paper deals with a learning paradigm, described in “Emergence of Locomotion Behaviours in Rich Environments”. In plain English, the researchers taught virtual AI characters to move through complex environments, guided only by a simple drive to make forward progress. While hilarious, the video does demonstrate the incredible power of artificial intelligence and suggests various ways it could be applied. Games and animation are among the most obvious fields, but the key findings of this research could also fuel the performance of future generations of robots – or whatever we’ll be enslaved by in the future.
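To make the “simple reward” idea concrete, here is a minimal, hypothetical sketch of a forward-progress reward of the kind the paper describes. The function name, the fall penalty value, and the exact terms are illustrative assumptions, not DeepMind’s actual reward code:

```python
def progress_reward(x_before: float, x_after: float, fell: bool) -> float:
    """Illustrative locomotion reward: pay the agent for distance gained
    along the track each step, with a penalty for falling over.

    This is a simplified sketch, not the exact reward from the paper.
    """
    if fell:
        return -1.0  # hypothetical penalty for falling
    # Reward is simply the forward progress made this timestep
    return x_after - x_before
```

The point the paper makes is that rewards this simple, combined with varied terrain, are enough for sophisticated gaits to emerge without hand-designed motion rules.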
Here’s a short excerpt from the paper, which you can find over here.
We focus on a set of novel locomotion tasks that go significantly beyond the previous state-of-the-art for agents trained directly from reinforcement learning. They include a variety of obstacle courses for agents with different bodies (Quadruped, Planar Walker, and Humanoid [5, 6]). The courses are procedurally generated such that every episode presents a different instance of the task.
Our environments include a wide range of obstacles with varying levels of difficulty (e.g. steepness, unevenness, distance between gaps). The variations in difficulty present an implicit curriculum to the agent – as it increases its capabilities it is able to overcome increasingly hard challenges, resulting in the emergence of ostensibly sophisticated locomotion skills which may naïvely have seemed to require careful reward design or other instruction. We also show that learning speed can be improved by explicitly structuring terrains to gradually increase in difficulty so that the agent faces easier obstacles first and harder obstacles only when it has mastered the easy ones.
In order to learn effectively in these rich and challenging domains, it is necessary to have a reliable and scalable reinforcement learning algorithm. We leverage components from several recent approaches to deep reinforcement learning. First, we build upon robust policy gradient algorithms, such as trust region policy optimization (TRPO) and proximal policy optimization (PPO) [7, 8], which bound parameter updates to a trust region to ensure stability. Second, like the widely used A3C algorithm and related approaches, we distribute the computation over many parallel instances of agent and environment. Our distributed implementation of PPO improves over TRPO in terms of wall clock time with little difference in robustness, and also improves over our existing implementation of A3C with continuous actions when the same number of workers is used.
The paper proceeds as follows. In Section 2 we describe the distributed PPO (DPPO) algorithm that enables the subsequent experiments, and validate its effectiveness empirically. Then in Section 3 we introduce the main experimental setup: a diverse set of challenging terrains and obstacles. We provide evidence in Section 4 that effective locomotion behaviours emerge directly from simple rewards; furthermore we show that terrains with a “curriculum” of difficulty encourage much more rapid progress, and that agents trained in more diverse conditions can be more robust.
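The PPO idea the excerpt mentions – bounding how far each update can move the policy – boils down to a clipped surrogate objective. Below is a minimal NumPy sketch of that loss (following PPO as published by Schulman et al.; the function name and the batch shapes are my own illustrative choices, and a real implementation would operate on autodiff tensors rather than NumPy arrays):

```python
import numpy as np

def ppo_clip_loss(ratio: np.ndarray, advantage: np.ndarray, epsilon: float = 0.2) -> float:
    """Clipped surrogate objective used by PPO.

    ratio:     pi_new(a|s) / pi_old(a|s) for a batch of sampled actions
    advantage: advantage estimates for those same actions
    epsilon:   clip range; updates that move the ratio outside
               [1 - epsilon, 1 + epsilon] stop gaining extra reward
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Pessimistic bound: take the minimum, then average over the batch.
    # Negated so it can be minimized like a loss.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

The clipping is what lets the distributed version stay stable: no matter how many parallel workers contribute gradients, a single update cannot push the policy far from the one that collected the data.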
And while most of us will be interested in the implications for pathfinding and similar tasks, what you really want to think about is what kind of reward you should be giving to your AI. It’s a long-term question, but in this day and age ‘long-term’ quickly becomes ‘short-term’.