Slava Smirnov and Gleb Sterkin discussed why it's so challenging to recreate hand movement, talked about current approaches to the task and their disadvantages, and shared details about the new AI-powered software they are developing at Capture Tech. They are open to creative collaborations, so you can get a chance to test out their tool for free.
In case you missed it: read our previous interview with Slava and Gleb.
Slava Smirnov and Gleb Sterkin from Capture Tech joined the new episode of 80 Level Round Table to talk about deep learning and how new technologies affect game development. We discussed machine learning and the ways it can be applied to games. Slava and Gleb also shared insights into modern animation pipelines, motion capture, different solutions, and more.
Slava Smirnov: My name is Slava and I'm responsible for the business side of things at Capture Tech. The company is democratizing CG content production by leveraging AI technologies.
Gleb Sterkin: My name is Gleb and I'm in charge of the tech side within Capture Tech. We are doing deep learning for animation and as a starting point, we are making motion capture from a regular camera.
Slava: First and foremost, I'd like to thank the editorial team at 80 Level and the community for being so responsive. As some may recall, we recently introduced AI-powered body mocap from a regular camera, and we've been receiving a tremendous amount of messages from all around the world: from a guy who educates people around the globe on a rare African language, a guy from England working to bring club life to VR, VFX professionals from Australia with titles rated over 8.0 on IMDb, and animation studios with Netflix titles. And there was also this one company whose games have the word “craft” in them, if you know what I mean. The 80 Level community is truly one of a kind. We are deeply sorry we cannot answer all of the amazing people out there. However, we are working 24/7 to make our software public as soon as possible.
Gleb: As we say to ourselves: "sleep is overrated". Aside from the business side, the team is working hard on many things in tech: improving the quality and speed of our neural networks, bringing the solution to Unreal, and polishing the ways our software can be delivered to the end user in the most friendly, plug-and-play way.
The Challenges of Hand Motion
Slava: The difficulty appears during all phases: concept, rigging, animation, and late-stage iterations. The concept phase is difficult because it's hard to find a good reference, and references matter since hands tend to be occluded in a lot of important poses. The rigging phase comes down to building a high-quality rig, not to mention that hands differ between characters. Finally, there are no public libraries of hand movements. Now think of the amount of work when you get those "minor" late-stage comments on hands interacting with objects (let's try another cup, or let's change the cup to a stove) and you have to tune or rework everything all over again. Two hands? Think twice. Hand motion therefore has to be well budgeted, and that's the reason so many cool projects are missing hand animation. And it's not only games: movies, animated series, YouTube, and streaming all require different sets of tweaks. All of this eventually makes hand animation quite challenging.
Approaches to Hand Animation
Slava: Almost like with body motion capture, there are three ways to approach this:
- keyframing (unscalable)
- hardware (pricey to get, around $4K per glove, for one hand)
- software-only (the new one, affordable to indie and mid-level studios, scales gracefully)
Obviously, the keyframing approach requires you to rig a model and animate it with keyframes; the 3D software interpolates the frames between them for you, and thus the animation is born. This is tedious work, though it works well for short sequences. However, when you need more than 10 seconds, the keyframing approach becomes hard to scale.
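To make the interpolation step concrete, here is a minimal sketch of what the 3D software does between two keyframes, assuming simple linear interpolation of a single joint angle. The function name and data layout are illustrative only; real packages interpolate with spline curves and quaternion slerp for rotations.

```python
def interpolate_keyframes(keyframes, frame):
    """Linearly interpolate a joint angle between keyframes.

    keyframes: list of (frame_number, angle_degrees), sorted by frame.
    A simplified stand-in for what 3D packages do; real software
    uses spline curves and quaternion slerp for bone rotations.
    """
    # Clamp to the keyed range
    if frame <= keyframes[0][0]:
        return keyframes[0][1]
    if frame >= keyframes[-1][0]:
        return keyframes[-1][1]
    # Find the surrounding pair of keys and blend linearly
    for (f0, a0), (f1, a1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return a0 + t * (a1 - a0)

# Two keys on an index-finger curl: frame 0 at 0 deg, frame 24 at 90 deg
keys = [(0, 0.0), (24, 90.0)]
print(interpolate_keyframes(keys, 12))  # halfway -> 45.0
```

The artist only places the keys; everything in between is generated, which is exactly why short sequences are cheap and long ones are not.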
As for the hardware approach, you are required to purchase hardware in advance. The hardware provides the required flexibility, yet it lacks affordability. And if you are in a part of the world where there's no guaranteed shipping, the package can get lost and you'll have to wait months, which doesn't make your life easier.
Gleb: As humans, we tend to spot even the tiniest flaws in hand movements since hands are something we see and use almost every second of our day. So any error in hand animation might be fatal, and this raises additional technical challenges for artists. For keyframing, the technical challenges lie on the 3D software side: how one builds bone constraints and interpolates in the most efficient way can become the crucial factor distinguishing a good animation from a mediocre one. As for the hardware case, the noise that affects inertial sensors can easily ruin the whole experience, as can the glove itself, which restricts the actor from some movements. So the sensors have to be quite small but precise (meaning pricey).
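The sensor-noise problem Gleb mentions can be illustrated with a toy filter. This sketch smooths a noisy 1-D joint-angle stream with an exponential moving average, one of the simplest smoothers; actual glove firmware typically uses Kalman or complementary filters, so treat this as an assumption-laden illustration, not a description of any product.

```python
import random
import statistics

def ema_smooth(samples, alpha=0.2):
    """Exponential moving average over a 1-D sensor stream.

    alpha controls smoothing: lower = smoother but more lag.
    Illustrative only; real IMU pipelines use Kalman or
    complementary filters.
    """
    smoothed = []
    value = samples[0]
    for s in samples:
        value = alpha * s + (1 - alpha) * value
        smoothed.append(value)
    return smoothed

# Simulate a steady 30-degree joint angle corrupted by sensor noise
random.seed(0)
noisy = [30.0 + random.uniform(-3, 3) for _ in range(100)]
clean = ema_smooth(noisy)
print(statistics.stdev(clean[20:]) < statistics.stdev(noisy[20:]))
```

The trade-off is visible in the `alpha` parameter: heavier smoothing hides noise but lags behind fast finger motion, which is part of why precise, low-noise sensors end up expensive.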
What Method Is the Most Efficient?
Slava: In short, it all depends on the task being solved, and that depends on why you need hand animation in the first place. If you need 2 seconds of hand animation, you would probably choose keyframing at a rate of roughly $30 per hour. Depending on the person's skillset, in 5-6 hours you can get 5-6 seconds of hand animation, so you end up with 5 seconds of hand movement for about $150. This means that if you need a lot of sequences or would prefer to experiment on the go, this doesn't scale in terms of budget, though it can produce quite astonishing results. Now if you need more, like minutes and hours or even infinite real-time motion, you are better off with other approaches.
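The budgeting argument can be made concrete with a back-of-the-envelope calculation, using only the figures quoted above ($30 per hour, roughly one second of finished animation per hour of work). The function and its defaults are illustrative assumptions; real rates vary widely by artist.

```python
def keyframing_cost(seconds, rate_per_hour=30.0, seconds_per_hour=1.0):
    """Estimated keyframing cost in dollars.

    Assumes roughly one second of finished hand animation per hour
    of work at $30/hour, the figures quoted in the interview; real
    rates and speeds vary widely by artist.
    """
    hours = seconds / seconds_per_hour
    return hours * rate_per_hour

print(keyframing_cost(5))   # 5 seconds -> 150.0, matching the estimate above
print(keyframing_cost(60))  # a one-minute sequence -> 1800.0
```

The linear growth is the point: cost scales one-to-one with footage length, which is fine for a 2-second shot and prohibitive for minutes of motion.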
Disadvantages of VR and Gloves Solutions
Slava: Basically, there are two types of hardware solutions: DIY-style setups with VR controllers, and gloves. We need to keep in mind that VR platforms are not meant to serve as hand-capture hardware but rather to control things within a game, so building pipelines around them is a bit risky. As for the gloves, they are limited by wires, plus the requirement to wear something on your hand eventually limits your expressiveness. And the price: I can't spend thousands of dollars just to try one out. And if I need two hands...
Gleb: VR controllers give you tracking info, yet the finger movement data is quite limited. Most of them track fingers with a small number of discrete states: pushed, hovered, non-hovered, etc. Glove issues stem from the fact that IMU sensors are affected by outside signals generated by magnetic objects, and there are usually lots of those around us in cities. All these factors combined made us go for deep learning for hand motion capture.
Exploring AI-Powered Hand Motion Capture
Slava: If you think about great stories and experiences, hands are an essential part of most of them. Hands add expressiveness and a new level of story depth when used in the right way. Yet, as history shows, hand animation is a difficult task: it's tedious work and pricey to get, so hands are missing in a lot of great projects. We believe AI is set to free up creativity, and today we are showing an early version of AI-powered hand motion capture with just a regular camera.
Gleb: Software-only hand motion capture has its own challenges, and in order to tackle them better, we'd like the industry to tell us what's important for hand capture next. That's why, just as with the body solution, we invite all sorts of creatives to free-of-charge creative collaborations on hand motion capture. So if you have a project in mind where hand motion could elevate your story or provide a crucial experience for your audience, or you simply want to test it out, do drop us a line here.
Future Plans for the Software
Slava: We aim to provide the best quality and precision possible with just a regular camera plus deep learning magic. For the quality to be the best, there are three things to be done. The first is to support both hands simultaneously; we are working on providing that to the creatives we collaborate with within a matter of weeks. The second is increased motion precision: the team aims to better capture detailed movements of the phalanges, fists, and fingers' complex interactions. This is a bit more challenging, yet the team is on track to solve it soon.
Gleb: Occlusion handling is the third fundamental issue to be solved. Some hand motions can't be fully seen with a single camera. For example, when you look at your hand and pick up an object from a table, a big part of your hand might be invisible to the camera. Our current solution can handle some of these poses. Eventually, we'll support most occlusions with as few mistakes and as little manual post-processing as possible. This will probably involve a mix of machine learning and some physics simulation.
Deep learning is here. Do your great stuff; the boring stuff will be done by others.