What is Tango?

Tango is a technology platform developed and authored by Google that uses computer vision to enable mobile devices, such as smartphones and tablets, to detect their position relative to the world around them without using GPS or other external signals. This allows application developers to create user experiences that include indoor navigation, 3D mapping, physical space measurement, environmental recognition, augmented reality, and windows into a virtual world.

(Source: https://en.wikipedia.org/wiki/Tango_(platform))

Four devices support the Tango technology at the time of writing this article:

  1. The Yellowstone tablet (Project Tango Tablet Development Kit), a 7-inch tablet with full Tango functionality, released in June 2014 (this is the device we use in our showcase).
  2. The Peanut was the first production Tango device, released in the first quarter of 2014.
  3. Lenovo’s Phab 2 Pro is the first smartphone with Tango technology; it was announced at the beginning of 2016.
  4. Asus ZenFone AR is the world’s first 5.7-inch smartphone with Tango and Daydream by Google.

The idea behind building a showcase with Tango was to learn more about this smart and powerful device.

The showcase consists of two requirements:

  1. Place a virtual Modeso logo on a real-world surface (e.g. wall)
  2. Have real-world objects in front of the virtual object (the Modeso logo in our case) occlude it instead of being covered by it.

In the red rectangle you can see the goal we want to reach

The Tango project consists of three core technologies:

  • Motion tracking
  • Area learning
  • Depth perception

Our target required working with both area learning and depth perception. Area learning is essential for learning the surrounding environment, and it depends on depth perception: you cannot enable area learning in an application without also enabling depth perception.

First Requirement

Place a virtual Modeso logo on a real-world surface (e.g. a wall) by touching the screen. While the device camera points at a flat surface, touching the screen should place the virtual Modeso logo on that targeted surface.

We broke this down into the following approach:

  1. Detect the position (x, y) of the touch on the screen and convert it to normalized (u, v) coordinates.
  2. Calculate the color-camera-to-depth-camera pose using the calculateRelativePose method from the Tango support library provided by Google.
  3. Get the last valid depth data provided by Tango.
  4. Get the intrinsics of the currently connected Tango camera.
  5. From the data above, the intersection between the touch point and a plane model can be calculated. This is easily done with the support library method fitPlaneModelNearClick, which returns null if there is no clear plane near the touch point.
  6. Get the transform with OpenGL as the base engine and Tango as the target engine. This transform is needed to calculate the virtual object’s pose in the OpenGL environment.
  7. From the transform and the IntersectionPointPlaneModelPair we can calculate the object’s pose, which the OpenGL renderer uses to draw the object on the screen.
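
For illustration, here is a condensed sketch of steps 2–5 using the C flavour of the Tango support library (our showcase used the equivalent Java calls named above; the exact C function names and signatures are assumptions based on the SDK version we remember and should be checked against your own headers):

```cpp
#include <tango_client_api.h>
#include <tango_support_api.h>

// Sketch of steps 2-5: given the latest point cloud, the color camera
// intrinsics, the color frame timestamp and the normalized touch
// coordinates (u, v), fit a plane near the touch and return the
// intersection point. Signatures are assumptions; check your SDK headers.
bool FitPlaneAtTouch(const TangoXYZij* point_cloud,
                     const TangoCameraIntrinsics* color_intrinsics,
                     double color_timestamp, float u, float v,
                     double intersection_point[3], double plane_model[4]) {
  // Step 2: pose of the depth cloud relative to the color camera at the
  // time the color frame was captured.
  TangoPoseData color_T_depth;
  if (TangoSupport_calculateRelativePose(
          color_timestamp, TANGO_COORDINATE_FRAME_CAMERA_COLOR,
          point_cloud->timestamp, TANGO_COORDINATE_FRAME_CAMERA_DEPTH,
          &color_T_depth) != TANGO_SUCCESS) {
    return false;
  }

  // Step 5: intersect the touch ray with a plane fitted to nearby depth
  // points; this fails if no clear plane is found near the touch.
  const float uv[2] = {u, v};
  return TangoSupport_fitPlaneModelNearClick(
             point_cloud, color_intrinsics, &color_T_depth, uv,
             intersection_point, plane_model) == TANGO_SUCCESS;
}
```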

To make working with OpenGL easier, we used the Rajawali library. Google provides examples that achieve the same concept with some extra functions and features.

Result demo of first requirement in the Swiss office

Second Requirement

From the first requirement we already know the depth of the logo in the 3D world. With the help of depth perception and the provided point cloud we can filter these points and get the subset whose depth is less than the depth of the logo, taking the logo’s orientation (quaternion) into account.

Through the Tango update listener we can use the onXyzIjAvailable callback, which is invoked whenever new point cloud data becomes available from Tango. It is important to know that this callback does not run on the main thread.

Google issued a warning about these callbacks: you have to be very restrictive with the work you do inside them, because you will not receive new point cloud data until you have returned from the callback.

So, for example, doing heavy work inside the callback will affect how frequently you receive fresh readings.

Every time we receive new cloud points from Tango, we filter them against the depth of the logo and update the Rajawali renderer with the new data, as sketched below.
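
A minimal sketch of that filtering step (written against the C API for consistency with the later sections, while the showcase did this in the Java onXyzIjAvailable callback; logo_depth_m is a hypothetical value taken from the plane fit of the first requirement, and the logo’s orientation is ignored here for simplicity):

```cpp
#include <vector>
#include <tango_client_api.h>

// Called from the point cloud callback: copy only the points that lie
// closer to the camera than the logo, so the renderer can mask them.
// logo_depth_m: distance of the logo plane from the camera, in meters.
std::vector<float> FilterPointsInFrontOfLogo(const TangoXYZij* xyz_ij,
                                             float logo_depth_m) {
  std::vector<float> occluding_points;
  occluding_points.reserve(xyz_ij->xyz_count * 3);
  for (uint32_t i = 0; i < xyz_ij->xyz_count; ++i) {
    const float* p = xyz_ij->xyz[i];  // x, y, z in the depth camera frame
    if (p[2] > 0.0f && p[2] < logo_depth_m) {
      occluding_points.insert(occluding_points.end(), p, p + 3);
    }
  }
  // Keep this cheap: Tango will not deliver a new cloud until the
  // callback that invoked us has returned.
  return occluding_points;
}
```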

Belal covered with the virtual Modeso logo and rendered cloud points
The virtual Modeso logo in front of an iPad, which was detected by Tango

The second step is to place a real object in front of the model and have it occlude the logo, i.e. achieve occlusion of the logo by real-world objects.

In order to make this possible the following three steps are required:

  1. Get depth matrix of the model
  2. Get intersection points
  3. Add mask to intersection points

We used the Rajawali method calculateModelMatrix from the ATransformable3D class, passing it the matrix of the current point cloud. With the resulting points we could apply the mask.

We were lucky to stumble upon https://github.com/stetro/project-Tango-poc which was a great reference and gave us the right idea on how to implement the masking.

Unmasked hand in front of the flipped virtual Modeso logo on the wall

Approaches

Here we will go a little bit deeper into the different approaches used and proposed for achieving the target goal.

Depth Mapping & 3D Reconstruction

This is the technique used by the repo at https://github.com/stetro/project-Tango-poc, and it is based on depth mapping. In short, a depth map is a regular image (often grayscale) that encodes the distance of the scene surfaces in the image from a specific viewpoint/camera. The value of each pixel of the depth image represents the real-world distance (depth) of that pixel from the camera: in grayscale depth images, for example, dark areas represent points closer to the camera while lighter areas represent points further away (or the reverse, depending on how the image is encoded).

How will depth maps help us achieve the goal?

If we have a depth map for our camera view, i.e. if we know the depth of each pixel of the camera view, we can perform what is called the “depth test” or “z-buffering” in computer graphics. While rendering the 3D content, the hardware compares the depth of each rendered pixel to the corresponding pixel in the depth map and decides whether to draw it, depending on whether it is occluded, i.e. whether another object at the same pixel is closer to the camera. That is basically the idea.

How is this done in code?

The first step is to compute the depth map of our view using the Tango color camera. To achieve that we have to:


1. Initialize the Tango service as documented in the C API guide: https://developers.google.com/tango/apis/c/

2. Configure the Tango device to use the color camera:
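
A minimal sketch using the standard C API config flag (error handling omitted):

```cpp
// Get a default configuration and enable the color camera.
TangoConfig config = TangoService_getConfig(TANGO_CONFIG_DEFAULT);
TangoConfig_setBool(config, "config_enable_color_camera", true);
```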

3. Configure the Tango device to use the depth camera:
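
Analogously for depth, on the same config object (point clouds are only delivered when this flag is set):

```cpp
// Enable depth sensing; required for receiving XYZij point clouds.
TangoConfig_setBool(config, "config_enable_depth", true);
```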

4. Connect local callbacks for the different data feeds from the device:
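
A sketch of wiring up the callbacks and connecting; OnFrameAvailableRouter is the callback referenced in the next step, and OnXYZijAvailableRouter is a hypothetical name for the point cloud router:

```cpp
// Route color frames and point clouds to our callbacks, then connect.
TangoService_connectOnFrameAvailable(TANGO_CAMERA_COLOR, /*context=*/nullptr,
                                     OnFrameAvailableRouter);
TangoService_connectOnXYZijAvailable(OnXYZijAvailableRouter);  // hypothetical router
TangoService_connect(/*context=*/nullptr, config);
```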

5. In the “OnFrameAvailableRouter” callback, which is called whenever a new camera frame arrives together with its image buffer, construct the depth image from the camera image:
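
A skeleton of that callback (a sketch; ProcessColorFrame is a hypothetical helper shown after the explanation below):

```cpp
// Color frame callback: hand the YUV buffer to the image processing step.
void OnFrameAvailableRouter(void* context, TangoCameraId id,
                            const TangoImageBuffer* buffer) {
  if (id != TANGO_CAMERA_COLOR) {
    return;
  }
  // buffer->data holds the YUV frame; buffer->width / buffer->height its size.
  ProcessColorFrame(buffer);
}
```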

But what are we doing inside this callback exactly?

First, we convert the received image buffer (TangoImageBuffer) from the default YUV color space to RGB. This is done by applying a YUV2RGB conversion to each pixel.

Next, we construct an OpenCV matrix (image) over the RGB image from the first step and create a grayscale version of it.

After that we apply a guided filter to the OpenCV image; a guided filter is an edge-preserving smoothing filter. In effect this filtering smooths the occluded parts of the 3D object so it looks more natural. The resulting filtered grayscale image is used as our depth map, as sketched below.
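
A hedged sketch of this processing chain, assuming an NV21 buffer with stride equal to the width and using OpenCV’s cvtColor in place of the per-pixel YUV2RGB loop described above (the guided filter lives in the opencv_contrib ximgproc module; the radius and eps values are tuning assumptions):

```cpp
#include <tango_client_api.h>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/ximgproc.hpp>  // guidedFilter (opencv_contrib)

// Hypothetical helper: YUV (NV21) -> RGB -> grayscale -> guided filter.
// The filtered grayscale image is what the text above uses as the depth map.
cv::Mat ProcessColorFrame(const TangoImageBuffer* buffer) {
  // Wrap the NV21 buffer (assumes stride == width) and convert to RGB.
  cv::Mat yuv(buffer->height + buffer->height / 2, buffer->width, CV_8UC1,
              buffer->data);
  cv::Mat rgb;
  cv::cvtColor(yuv, rgb, cv::COLOR_YUV2RGB_NV21);

  // Grayscale version of the camera image.
  cv::Mat gray;
  cv::cvtColor(rgb, gray, cv::COLOR_RGB2GRAY);

  // Edge-preserving smoothing; radius and eps are tuning assumptions.
  cv::Mat filtered;
  cv::ximgproc::guidedFilter(gray, gray, filtered, /*radius=*/8,
                             /*eps=*/0.02 * 255 * 255);
  return filtered;
}
```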

Based on this depth map we will try to construct a 3D representation of the camera view. We now have the Z coordinate of each pixel in the camera view and need the other two coordinates (X, Y). Here Tango helps us by providing an equation that translates 2D coordinates to 3D ones and vice versa using the camera intrinsics.

Given a 3D point (X, Y, Z) in camera coordinates, the corresponding pixel coordinates (x, y) are:
x = X / Z * fx * rd / ru + cx
y = Y / Z * fy * rd / ru + cy

After solving the previous equations for X and Y we have all three components X, Y and Z, so we can construct a 3D point version of the camera’s real view. As a last step we build a vector of vertices and fill it with this data.
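
Ignoring the rd/ru distortion factor for simplicity, solving these equations for X and Y gives the usual pinhole back-projection (a sketch; fx, fy, cx and cy come from TangoCameraIntrinsics):

```cpp
#include <tango_client_api.h>

// Back-project a pixel (x, y) with known depth Z into camera coordinates.
// The rd/ru distortion term is ignored here for brevity.
void PixelToCamera(const TangoCameraIntrinsics* in, double x, double y,
                   double Z, double* X, double* Y) {
  *X = (x - in->cx) / in->fx * Z;
  *Y = (y - in->cy) / in->fy * Z;
}
```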

With all that in place we can perform the depth test in OpenGL, as we have both the 3D model to render and a 3D representation of the camera view. To render the parts closer to the camera and occlude the ones further away, we only need to check them against each other:
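
Conceptually this is standard OpenGL z-buffering: render the reconstructed camera-view geometry into the depth buffer first, then render the virtual model with depth testing enabled. A sketch of the per-frame order (renderCameraDepthMesh and renderVirtualModel are hypothetical draw calls):

```cpp
#include <GLES2/gl2.h>

// Per-frame rendering order for occlusion via the depth buffer (sketch).
void RenderFrame() {
  glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
  glEnable(GL_DEPTH_TEST);

  // 1. Assuming the camera image was already drawn as the background, draw
  //    the reconstructed camera-view geometry into the depth buffer only
  //    (color writes masked), so the camera image stays visible.
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  renderCameraDepthMesh();   // hypothetical: the 3D points built above
  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

  // 2. Draw the virtual model. Fragments behind the real geometry fail the
  //    depth test and are discarded, which produces the occlusion.
  renderVirtualModel();      // hypothetical: the Modeso logo / marker
}
```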

This is what you will get:

3d marker occluded by office chair
Gray scaled depth image (darker means closer)

Conclusion

The results were not accurate enough to proceed with the targeted scenario.

We noticed that the Tango device heats up heavily when used for more than five minutes, a problem Google already mentions. Area learning requires heavy processing power, causing the processor to heat up; to protect itself, the device throttles the processor speed, which has a negative effect on the readings and on the data produced by the Tango sensors in general.

Furthermore, object detection is very imprecise; the camera is not yet good enough to produce good results from image processing. Even with a better camera, the image processing would be very expensive in processing power and would impact overall performance.

On the other hand, Tango is very good at augmented reality in general and at placing objects on walls or floors without the need for markers. In a direct comparison with Vuforia and its markers, Tango is the clear winner.

But for real-world, real-time object occlusion it still seems to be missing needed functionality.

Credits: Modeso’s Mobile Engineers Belal Mohamed, Mahmoud Abd El Fattah Galal & Mahmoud Galal.