How to extend HoloLens Gestures with Deep Learning AI

A practical guide


Since the HoloLens launched, only a few hand gestures have been available to the developer, primarily the Air Tap. The Air Tap is exposed through a high-level API and is analogous to a click on a PC or a tap on mobile. These interaction limitations have led to many contextual interfaces that pop up during a HoloLens experience, sometimes moving to stay in the user's field of view. These interfaces can occlude the scene, suffer visibility problems of their own, and typically house only a limited number of options.

To escape this interactive limitation, it's possible to build a gesture interaction system for the HoloLens using Microsoft's Custom Vision Cognitive Service, a machine learning framework. Once the user starts an Air Tap, the gesture system begins coloring in pixels of an image at positions corresponding to the tracked hand position. When the user lifts their finger and concludes the Air Tap, the generated image is sent to the Custom Vision endpoint, which classifies it as one of five gestures and returns the result as a JSON (JavaScript Object Notation) response.
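To make the pixel-painting step concrete, here is a minimal sketch of the idea (in Python rather than Unity C#, and not the project's actual code): accumulate normalized 2D hand positions during an Air Tap and color in a small square of pixels around each sample on a low-resolution canvas. The canvas size and brush radius here are assumptions.

```python
# Sketch: "color in" pixels around each tracked hand sample on a small
# grayscale canvas. All sizes are illustrative assumptions.

WIDTH, HEIGHT = 64, 64  # assumed low-resolution canvas
BRUSH = 1               # paint a (2*BRUSH+1)^2 square around each sample

def paint_path(samples, width=WIDTH, height=HEIGHT, brush=BRUSH):
    """samples: iterable of (x, y) hand positions normalized to [0, 1]."""
    canvas = [[0] * width for _ in range(height)]
    for nx, ny in samples:
        # Map the normalized hand position onto the canvas, clamped to bounds.
        px = min(width - 1, max(0, int(nx * (width - 1))))
        py = min(height - 1, max(0, int(ny * (height - 1))))
        # Paint a small square of white pixels around the sample.
        for dy in range(-brush, brush + 1):
            for dx in range(-brush, brush + 1):
                x, y = px + dx, py + dy
                if 0 <= x < width and 0 <= y < height:
                    canvas[y][x] = 255
    return canvas
```

Calling `paint_path` once per frame's hand position over the course of an Air Tap leaves a white line-drawing of the gesture on a black canvas, ready to encode as a PNG.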


The first concern was that Azure Custom Vision might not be able to reliably classify the gestures. To test this, it's possible to create a fake data set of gesture images simply by using Microsoft Paint. With a handful of drawings of circles, triangles, and squares, the Custom Vision model can be trained and tested. In practice it classified new images correctly over 95% of the time, demonstrating that the service can recognize basic line shapes even on very low-resolution images.
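Such a synthetic data set could also be generated programmatically rather than drawn by hand. A hedged sketch (all coordinates, sample counts, and jitter amounts below are illustrative assumptions, not values from the project): express each shape as a list of points along its outline, jitter them slightly for variety, then rasterize the point lists into training images.

```python
import math
import random

def polyline_points(vertices, samples_per_edge=40):
    """Sample evenly spaced points along each edge of a polyline."""
    pts = []
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        for i in range(samples_per_edge):
            t = i / samples_per_edge
            pts.append((x0 + (x1 - x0) * t, y0 + (y1 - y0) * t))
    return pts

def jitter(pts, amount=0.02):
    """Perturb each point a little so generated examples aren't identical."""
    return [(x + random.uniform(-amount, amount),
             y + random.uniform(-amount, amount)) for x, y in pts]

# Shape outlines in normalized [0, 1] coordinates (closing vertex repeated).
square = [(0.2, 0.2), (0.8, 0.2), (0.8, 0.8), (0.2, 0.8), (0.2, 0.2)]
triangle = [(0.5, 0.15), (0.85, 0.8), (0.15, 0.8), (0.5, 0.15)]
circle = [(0.5 + 0.3 * math.cos(2 * math.pi * i / 48),
           0.5 + 0.3 * math.sin(2 * math.pi * i / 48)) for i in range(49)]
```

Each jittered point list can then be painted onto a small canvas and uploaded to a Custom Vision project as a labeled training image.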

The next challenge was uploading the generated gesture images with appropriately formatted metadata. It's difficult to find basic examples of uploading an image file to the endpoint with Unity's networking libraries; in the end it's necessary to encode the texture to a PNG, save that PNG to disk, load it back into a byte array, and then convert that into a string.
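Stripped of the Unity plumbing, the upload itself boils down to a single POST of raw PNG bytes with a prediction key header. A hedged Python sketch of that request shape (the URL, project ID, iteration name, and key below are placeholders, not real values; the actual project would go through Unity's networking libraries instead):

```python
import urllib.request

# Placeholders -- substitute your own Custom Vision project's values.
PREDICTION_URL = ("https://example.cognitiveservices.azure.com/"
                  "customvision/v3.0/Prediction/PROJECT_ID/"
                  "classify/iterations/ITERATION/image")
PREDICTION_KEY = "YOUR_PREDICTION_KEY"

def build_prediction_request(png_bytes):
    """Build (but don't send) the classification request for a gesture image."""
    return urllib.request.Request(
        PREDICTION_URL,
        data=png_bytes,
        method="POST",
        headers={
            "Prediction-Key": PREDICTION_KEY,
            "Content-Type": "application/octet-stream",
        },
    )

# Usage (network call, not executed here):
# req = build_prediction_request(open("gesture.png", "rb").read())
# with urllib.request.urlopen(req) as resp:
#     result_json = resp.read()
```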


Once the system is running on the HoloLens, it's time to have users test it. The first thing to keep in mind is teaching people how to perform the gestures. One of the most common problems users run into is that their hands drift outside the hand-tracking boundaries. Because of the delay between making a gesture and receiving a prediction for it from Azure, users can have a hard time knowing when or how they made a mistake. To make this clearer, it helps to play an audible popping noise and make the gesture visual disappear instantly as soon as the user lifts their finger.

This lets them know the gesture is complete even before its classification is returned from the Azure deep learning service, making the experience vastly more tactile and usable. If the gesture response comes back successfully, a slightly deeper popping sound is played and a picture of the recognized gesture type is shown in the bottom-left corner. While the system doesn't always predict the user's intended gesture correctly, performing the gesture in a certain way can greatly improve the chance of recognition. By adding the noises and quickly removing the in-view gesture trail and predicted gesture classifications, the user receives clear, instant feedback when recognition fails and can adjust their technique accordingly.
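Once the JSON response arrives, choosing which gesture picture and sound to play is a matter of taking the highest-probability prediction and gating it behind a confidence threshold. A small sketch, assuming an abbreviated version of the Custom Vision response shape and a made-up threshold value:

```python
import json

# Abbreviated example of a Custom Vision classification response
# (tagName/probability fields; values here are invented for illustration).
sample_response = json.loads("""
{
  "predictions": [
    {"tagName": "circle",   "probability": 0.91},
    {"tagName": "square",   "probability": 0.06},
    {"tagName": "triangle", "probability": 0.03}
  ]
}
""")

CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff; tune per project

def top_gesture(response, threshold=CONFIDENCE_THRESHOLD):
    """Return the winning gesture's tag, or None if nothing is confident enough."""
    best = max(response["predictions"], key=lambda p: p["probability"])
    return best["tagName"] if best["probability"] >= threshold else None
```

Returning `None` below the threshold is what drives the failure feedback: no gesture icon is shown, signaling the user to adjust and try again.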

While there is still plenty of room for improvement, the resulting ability to expand the gesture vocabulary speaks loudly to 3D developers who have grown accustomed to limitations. Another exciting path to explore is exporting the trained model to ONNX and running it with the WinML on-device inference introduced in the Windows 10 RS4 update. This would improve latency and also help overcome potential issues with backend connectivity and network limitations.