Since the dawn of time, only a few limited hand gestures have ever been available to the HoloLens developer, primarily the Air Tap. The Air Tap is exposed to the developer through a high-level API and is analogous to a click on a PC or a tap on mobile. These interaction limitations have led to many different contextual interfaces that pop up during the HoloLens experience, sometimes moving to stay in the user's field of view. These interfaces can obscure the scene, suffer visibility problems of their own and typically house only a limited number of options.
The first concern was whether the Azure Custom Vision service would be able to reliably classify the gestures. To test this, it's possible to create a fake data set of gesture images simply by using Microsoft Paint. Drawing shapes such as circles, triangles and squares makes it possible to train and test the Custom Vision model. In fact, it classified new images correctly over 95% of the time, demonstrating that Azure can accurately recognize basic line shapes even in very low-resolution images.
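A synthetic shape dataset of this kind can also be generated programmatically rather than drawn by hand. The sketch below is a minimal, illustrative Python stand-in for the Paint-drawn images (the function names and 32×32 resolution are assumptions, not details from the original project): each "image" is a low-resolution grid with a 1 wherever the stroke passes.

```python
import math

def draw_circle(size=32, radius=10):
    """Render a low-resolution circle outline as a 2D grid of 0/1 pixels."""
    cx = cy = size // 2
    img = [[0] * size for _ in range(size)]
    for step in range(360):
        a = math.radians(step)
        x = int(cx + radius * math.cos(a))
        y = int(cy + radius * math.sin(a))
        img[y][x] = 1
    return img

def draw_square(size=32, half=10):
    """Render a low-resolution square outline as a 2D grid of 0/1 pixels."""
    img = [[0] * size for _ in range(size)]
    c = size // 2
    for d in range(-half, half + 1):
        img[c - half][c + d] = 1  # top edge
        img[c + half][c + d] = 1  # bottom edge
        img[c + d][c - half] = 1  # left edge
        img[c + d][c + half] = 1  # right edge
    return img

# A tiny labelled "dataset"; a real training set would jitter the radius,
# position and line thickness to produce many variations per class.
dataset = {"circle": draw_circle(), "square": draw_square()}
```

Jittering the shape parameters per sample is what gives the classifier enough variety to generalise to real hand-traced strokes.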
The next challenge was uploading the generated gesture images with appropriately formatted metadata. Basic examples of uploading an image file to the endpoint with Unity's networking libraries are hard to find; in the end it was necessary to encode the texture to a PNG, save that PNG to disk, load it back into a byte array and then convert that array into a string.
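The round trip described above was done in Unity with C#; the following Python sketch mirrors the same four steps (encode to PNG, save to disk, reload as bytes, convert to a string), assuming a Base64 string encoding for the upload body. The minimal PNG writer is a stand-in for Unity's `Texture2D.EncodeToPNG`.

```python
import base64
import os
import struct
import tempfile
import zlib

def encode_minimal_png(width, height, pixels):
    """Build a minimal 8-bit greyscale PNG in memory
    (stand-in for Unity's Texture2D.EncodeToPNG)."""
    def chunk(tag, data):
        return (struct.pack(">I", len(data)) + tag + data
                + struct.pack(">I", zlib.crc32(tag + data)))
    # One filter byte (0 = None) prefixes each scanline.
    raw = b"".join(b"\x00" + bytes(row) for row in pixels)
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b""))

# Step 1: encode the "texture" to a PNG.
png = encode_minimal_png(2, 2, [[0, 255], [255, 0]])

# Step 2: save that PNG to disk.
path = os.path.join(tempfile.gettempdir(), "gesture.png")
with open(path, "wb") as f:
    f.write(png)

# Step 3: load it back into a byte array.
with open(path, "rb") as f:
    payload = f.read()

# Step 4: convert the byte array into a string for the request body.
body = base64.b64encode(payload).decode("ascii")
```

The detour through disk looks redundant, but it matches the workflow the article describes; the essential part is that the endpoint ultimately receives the raw PNG bytes in string form.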
This lets the user know when the gesture is complete even before its classification is returned from the Azure deep-learning service, making the experience vastly more tactile and usable. If the gesture response comes back successfully, a slightly deeper popping sound is played and a picture of the gesture type is shown in the bottom-left corner. While the system doesn't always predict the user's intended gesture correctly, performing the gesture in a certain way can greatly improve the chance of recognition. Thanks to the sounds and the prompt removal of the in-view gesture trace and predicted classification, the user receives clear, instant feedback when gesture recognition fails and can adjust their technique accordingly.
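The two-stage feedback above can be sketched as a small event mapping: an immediate pop on stroke completion, then either a deeper pop plus an icon on success or a cleanup of the in-view trace on failure. This is a hypothetical Python sketch of that logic only; the event names and function are illustrative, not from the original HoloLens project.

```python
def gesture_feedback(classification):
    """Return the ordered feedback events for one completed gesture.

    `classification` is the label returned by the cloud service,
    or None if recognition failed.
    """
    # Immediate pop when the stroke ends, before the cloud result arrives.
    events = [("sound", "pop")]
    if classification is not None:
        events.append(("sound", "deep_pop"))        # success: deeper pop
        events.append(("icon", classification))     # show gesture type bottom-left
    else:
        events.append(("clear", "gesture_overlay")) # failure: remove in-view trace
    return events
```

Splitting the feedback this way is what makes the experience feel responsive: the first event depends only on local stroke tracking, so round-trip latency to the classifier never delays the user's sense that the gesture "landed".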
While there is still plenty of room for improvement, the resulting ability to expand the set of gestures speaks loudly to 3D developers who have grown accustomed to limitations. Another exciting path to explore is exporting the model to ONNX and running it with the WinML on-device inference introduced in the Windows RS4 update. This would improve latency and also help overcome potential issues with backend connectivity and network limitations.