Capturing Facial Expressions with iPhone X

Cory Strassburger, a co-founder of the VR studio Kite & Lightning, talked about using iPhone X to create facial animations.

Intro 

My name is Cory Strassburger and I’m co-founder, along with Ikrima Elhassan, of a VR studio called Kite & Lightning. I started out as a visual effects artist and back in the day worked on a lot of TV shows (The X-Files, Star Trek: Deep Space Nine, etc.) and movies like Minority Report. Ikrima is a master developer who did a lot of cinematic and real-time graphics R&D at Microsoft and NVIDIA. Together we started out doing augmented reality (just for fun) after he showed me some of his tests and I fell in love with the possibilities. We got more serious about cinematic AR (this was 2012), and just as we created Kite & Lightning to pursue it, Ikrima showed off his Oculus DK1 fresh from the Kickstarter and we immediately pivoted to VR!
Since then we’ve created a handful of cinematic VR experiences, saved up our earnings, and split to Paris (just because it’s a cool city and a fun place to get creative) to create our latest project, Bebylon Battle Royale, a VR party brawler set in an outrageous future-world of immortal “Bebies,” with comedic arena combat at its core! Bebylon is a much bigger endeavor, though, with the initial game being more like the gateway drug! After we release the game next year we’ll continually follow up with episodic releases expanding the game and all the immersive, story-driven content that ties it all together. A big satirical social metaverse of fun and entertainment, complete with spectating, gambling, the “Bebits” cryptocurrency, all the good stuff! Of course, our imaginations are way bigger than our resources, so it’s going to be a long but fun road for us!

Why iPhone X  

I’ve been building this world of Bebylon for quite a while, developing some really cool, vibrant characters and super fun stories and backstories. And as we’ve been building the actual game, deep down I’ve been dying to bring these characters and stories to life for in-game intros, cinematics, traditional trailers, mini vignettes, short movies, etc., but I didn’t want to wait until the core game was finished to start (even though that would be the smart thing to do!). This of course requires pretty decent facial capture, and my problem is that I have no time or real budget for facial animation at this point; that’s all slated for later. Despite this, on recent Sundays I’ve been exploring different facial capture solutions with the hope of finding something that was super fast, really good quality, and affordable for us, i.e. cheap! While some of the software I tested showed promise, it was clear that the time involved in the process would be too much for me to fit the amount of content I wanted to make into my weekends.

So literally I went to sleep on a Friday night (a month ago) really bummed out about my facial capture situation and woke up Saturday wondering if the new iPhone X gave you access to its depth camera. Over coffee I checked out Apple’s ARKit API, saw that it outputs the 51 blendshapes, and I got really excited and lucky! It was the launch weekend of the iPhone X, and somehow I ordered a phone online that morning and picked it up later that day. The next day I made this test, which showed a heck of a lot of promise. The main potential was that the pipeline was fast, the quality of the data was really clean (no serious noise to smooth out), and finally, the quality of the expressivity was in the ballpark of what I could expect from a real-time solution.

Capturing Process 

The whole test was pretty easy thanks to Unity having already rolled Apple’s ARKit access to the front depth camera into its plugin. We’re a UE4 shop, and I believe they’re expanding their ARKit integration to access the face tracking data, at which point I’ll port this over, since our characters are already set up in there. Now, it’s not essential that you import your character into the system in order to capture the data. In fact, the raw data is the same regardless of your character (it’s based on your physical facial movements), but as a performer you might make very different decisions about how you speak or how you shape and exaggerate your expressions when you can see the actual character you’re driving.
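To make the hookup described in the first step below concrete: the test itself was done in Unity, but as an illustration of the same idea in native ARKit, a minimal Swift/SceneKit sketch might read the per-frame blendshape coefficients off the ARFaceAnchor and copy them onto a character’s morph targets. The assumption here is that the morph targets are named after ARKit’s blendshape keys; as noted below, real characters usually need some remapping.

```swift
import ARKit
import SceneKit

// Minimal sketch: drive a SceneKit character's morph targets with ARKit's
// per-frame face blendshape coefficients. Assumes the character's morph
// targets are named after ARKit's keys (e.g. "jawOpen", "browInnerUp").
class FaceDriver: NSObject, ARSCNViewDelegate {
    let characterNode: SCNNode   // node whose geometry carries the blendshapes

    init(characterNode: SCNNode) {
        self.characterNode = characterNode
    }

    // Called whenever ARKit updates the tracked face anchor.
    func renderer(_ renderer: SCNSceneRenderer, didUpdate node: SCNNode, for anchor: ARAnchor) {
        guard let faceAnchor = anchor as? ARFaceAnchor else { return }

        // blendShapes is a dictionary of ~51 coefficients in the 0...1 range.
        for (location, weight) in faceAnchor.blendShapes {
            characterNode.morpher?.setWeight(CGFloat(weight.floatValue),
                                             forTargetNamed: location.rawValue)
        }
    }
}

// In the view controller that owns the ARSCNView (kept as a stored property
// so the delegate isn't deallocated):
//   faceDriver = FaceDriver(characterNode: babyHeadNode)
//   sceneView.delegate = faceDriver
//   sceneView.session.run(ARFaceTrackingConfiguration())
```

In Unity the equivalent is reading the face anchor’s blendshape dictionary from the ARKit plugin and writing the values to the character’s SkinnedMeshRenderer blendshape weights.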
 
– Step one: I imported a work-in-progress Bebylon character and hooked its facial expression blendshapes into the facial capture data that ARKit outputs. This let me drive the baby’s face animation based on my own expressions.
A great place to get 120 blendshapes for your character’s head is http://www.eiskoservices.com. Their data is really good and is the basis for my baby characters. I did have to create some new blendshapes and modify a bunch of theirs in order to conform to what ARKit requires.
 
 
–  Next, I needed to capture this expression data in order to import it into Maya. I added a record function to stream the facial expression data into a text file, saved locally on the iPhone (see the sketch after these steps). Each take, from start to stop, becomes a separate text file and can be named/renamed within the capture app. Eventually I will probably add streaming so I can plug the iPhone in via USB and capture the data right to the desktop. I imagine someone could write a plugin to stream right into Maya.

– Next I copy the text files from the iPhone X to the desktop via USB.

– The captured data needs to be reformatted for import into Maya, so I wrote a simple desktop app to do this. It takes the chosen text file(s) and converts them to Maya .anim files (see the sketch after the next paragraph). This converter needs a lot of work. Because the .anim files require a lot of specific information about the names of your blendshapes, the frame rate, etc., I ended up hardwiring that data to my specific setup, so if I rename a blendshape or blendshape node I have to change the data. It’s a bit of a pain, though one that will vanish once I add some features to the converter.

– Lastly, I import the .anim file into Maya (via the Shape Editor) and voila, your Maya character is mimicking what you saw on the iPhone during capture. Of course, this data could drive any character, assuming it has the same blendshape set. If not, you could manually connect the animation data to any channels you want.
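The capture app’s record function isn’t shown here, but conceptually it just appends one line of blendshape values per frame and writes the take out as a text file. A minimal Swift sketch of that idea (the CSV-style layout and names are illustrative assumptions, not the actual Kite & Lightning code) could look like this:

```swift
import Foundation
import ARKit

// Rough sketch of a record function: buffer one line per frame of
// blendshape coefficients and flush them to a text file per take.
// The CSV-style layout and file naming here are illustrative only.
class TakeRecorder {
    private var lines: [String] = []
    private var header: [ARFaceAnchor.BlendShapeLocation] = []

    func record(frameTime: TimeInterval, faceAnchor: ARFaceAnchor) {
        if header.isEmpty {
            // Fix a stable channel order on the first frame and write a header row.
            header = Array(faceAnchor.blendShapes.keys)
            lines.append("time," + header.map { $0.rawValue }.joined(separator: ","))
        }
        let values = header.map { String(format: "%.4f", faceAnchor.blendShapes[$0]?.floatValue ?? 0) }
        lines.append("\(frameTime)," + values.joined(separator: ","))
    }

    // Stop the take and write it into Documents so it can be pulled
    // off the phone over USB (assuming file sharing is enabled for the app).
    func stop(takeName: String) throws {
        let docs = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
        let url = docs.appendingPathComponent("\(takeName).txt")
        try lines.joined(separator: "\n").write(to: url, atomically: true, encoding: .utf8)
        lines.removeAll()
        header.removeAll()
    }
}
```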

The whole process is so fast: just whip out the phone, plug in a mic, and start recording, and importing into Maya is just a couple of clicks. The beauty for me is that, with the way my facial rig is set up now, I can theoretically import the data, clean up any artifacts pretty quickly (as there are so few), and then animate quickly on top of the data with the core part of my facial rig to add more expressivity. I just got all this working, so I haven’t tested that theory yet. Another big improvement that’s not shown in either test is that the jaw articulation is now way more physically accurate. In Unity, for quick testing, I just linked the lower jaw, teeth, and tongue to the “JawOpen” blendshape, which is a hack and not the way your mouth works. With my actual facial rig, however, the jaw responds to the mouth movements more accurately, so the next tests should reflect that.
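As for the text-to-.anim conversion step above, a heavily simplified Swift sketch of the converter is below: it parses a recorded take and writes one unitless animation curve per blendshape channel in Maya’s ASCII .anim layout. The header fields, attribute naming, and per-key flags are assumptions hardwired to one hypothetical setup (much like the real converter, per the steps above) and should be checked against a reference .anim exported from Maya.

```swift
import Foundation

// Simplified sketch of the desktop converter: read a recorded take
// (CSV-style text from the capture sketch above) and emit a Maya-style
// .anim file with one unitless curve per blendshape channel.
// The header values, attribute naming, and key flags are assumptions;
// compare against a .anim exported from Maya before trusting them.
func convertTake(at inputURL: URL, to outputURL: URL,
                 blendShapeNode: String = "blendShape1") throws {
    let text = try String(contentsOf: inputURL, encoding: .utf8)
    var rows = text.split(separator: "\n").map { $0.split(separator: ",").map { String($0) } }
    guard rows.count > 1 else { return }

    let channels = Array(rows.removeFirst().dropFirst())   // drop the "time" column label
    let frameCount = rows.count

    // Minimal header; a real converter would also match the scene's frame rate.
    var anim = "animVersion 1.1;\n"
    anim += "timeUnit ntscf;\nlinearUnit cm;\nangularUnit deg;\n"
    anim += "startTime 1;\nendTime \(frameCount);\n"

    for (column, channel) in channels.enumerated() {
        // One curve per blendshape channel, keyed on every captured frame.
        anim += "anim weight.\(channel) \(channel) \(blendShapeNode) 0 0 \(column);\n"
        anim += "animData {\n  input time;\n  output unitless;\n  weighted 0;\n"
        anim += "  preInfinity constant;\n  postInfinity constant;\n  keys {\n"
        for (frame, row) in rows.enumerated() {
            let value = row[column + 1]    // +1 skips the per-frame timestamp
            anim += "    \(frame + 1) \(value) linear linear 1 1 0;\n"
        }
        anim += "  }\n}\n"
    }
    try anim.write(to: outputURL, atomically: true, encoding: .utf8)
}
```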

Performance

Haha, it is exactly the same data that’s driving the emojis (well, assuming Apple didn’t keep any special treats locked up for themselves), though it’s all exposed via ARKit, so no real hacking is needed. I am no expert in facial animation tech, but from my experience I’ve found that facial tracking technology that uses a depth camera, like the iPhone X or Faceshift (which Apple bought to create these features), can generate a more accurate facial model than one using a mono 2D camera. From my observations it’s not a dramatic difference, but I usually see it in the puckering and mouth articulations where the lips protrude. In 2D, I think it’s hard to tell exactly what the mouth is doing without that extra depth data. I imagine the 2D algorithm has to cleverly make those decisions based on the mouth shape along with what the other facial shapes are doing.

One interesting discovery I made was that there is a distance sweet spot where the depth data is cleanest and most accurate. Once I got within this range, the stability improved quite a bit. I also discovered that light contamination is a factor. I’m pretty sure the iPhone X is generating its depth by capturing structured light, which requires projecting an infrared pattern onto the face. And though the projection is infrared, I noticed that the light coming from my computer monitor would introduce noise, as would any other hard light sources in my house. I thought a black room would be best, but that was wrong too. It seems like flat, neutral lighting works best, but I need to dig into the science a little to understand it better.

Content for Games

Many of us were really pissed at Apple for buying Faceshift, which in my opinion held a great place in the facial capture spectrum for affordable, easy, and good facial capture data. With the iPhone X it feels like we’ve gotten back a slice of Faceshift, and it’s become mobile, which isn’t critical but has some real advantages. Of course, Faceshift had a lot of important production features that allowed you to dial in your base mesh and its relationship to the capture data, import audio, pre-shot video, etc., so there would still need to be a solid app written for artists and animators that would hopefully bring back some of these key features.

However, in one sense the iPhone X is just a portable depth camera, and Apple has incorporated the Faceshift tech to interpret the depth camera data into useful data to drive the blendshapes (or emojis!). And it’s the Faceshift tech that’s made it easy for someone like me to get results quickly (interpreting depth data into usable facial capture is not trivial). The problem with building a program around this is that one day Apple might just change that underlying tech and it would kind of break your capture program. You could just access the depth data and write your own algorithms for interpreting it, but does it need to run on a $1,000 mobile phone? A desktop depth camera is pretty cheap, and there are still some unknowns about capturing with an iPhone X in real production situations.

I do think we can get some real use out of this tech right now (ghetto style) and actually use it in our game and for marketing content and shorts. I also think there is a lot more room for improving my last test through rig and blendshape improvements, so once I’ve maxed out what the iPhone X data can do, it will be a better gauge for me. I also want to test out Faceware Live and see how it does relative to my needs. If I can get better quality than what the iPhone X can produce, and it’s fast, I will definitely go that route. I’d also love to see what this data can do with something like the Snapper rig!

Ultimately, I hope AR and Apple’s involvement will trigger more development in facial capture. Sadly, we might have emojis and a pile of poo to thank, but who cares anyway!

Cory Strassburger, Co-Founder, Kite & Lightning

Interview conducted by Kirill Tokarev.
