Motion & Gesture Recognition for low-res TOF-Sensors


Low-cost Time-Of-Flight sensors like the "ST VL53L5CX" usually have low resolution SPAD arrays (e.g. 4x4 or 8x8) with individual zones, each measuring a distance to an object located in a section of the FOV of the sensor. These measurement can be processed on a MCU like a STM32 to recognize motion and gestures. For this purpose I created a no-std, no-alloc Rust library with C-bindings to be able to integrate it into C-based embedded codebases, such as the ones created by STM32CubeMX.
This blog post explains the algorithms how this library calculates positions and recognizes motions and gestures from the sensor values.


First, let us establish how positions in relation to the sensor are represented as coordinates.

$$ C_{cart} = \begin{pmatrix} x: \text{distance to origin on x-axis}\newline y: \text{distance to origin on y-axis}\newline z: \text{distance to origin on z-axis}\newline \end{pmatrix} $$

$$ C_{spher} = \begin{pmatrix} r: \text{distance to the origin}\newline \theta \text{ (theta): angle to x-axis (azimuth)}\newline \phi \text{(phi): angle to y-axis (zenith)}\newline \end{pmatrix} $$

This spherical notation is the mathematical convention.


Convert between the two: From spherical to cartesian coordinates $$ \begin{align} x &= r \cdot \cos(\theta) \cdot \sin(\phi)\newline y &= r \cdot \sin(\theta) \cdot \sin(\phi)\newline z &= r \cdot \cos(\phi) \end{align} $$

and from cartesian to spherial $$ \begin{align} \textcolor{green}{r} &= \sqrt{x^2 + y^2 + z^2}\newline \theta &= \arctan(y / x)\newline \phi &= \arccos( z / \textcolor{green}{r} ) \end{align} $$

Sensor Grid

We have for each zone:

$$ \begin{align} dist_{i_x, i_y} &: \text{the measured distance}\newline i_x &: \text{horizontal index in the sensor grid}\newline i_y &: \text{vertical index in the sensor grid}\newline \end{align} $$

To determine the position of the measurement in each zone, we need to calculate it with the Sensor-FOV:

$$ \begin{align} Res_{hor} &: \text{horizontal resolution}\newline Res_{vert} &: \text{vertical resolution}\newline Fov_{hor} &: \text{horizontal sensor field-of-view}\newline Fov_{vert} &: \text{vertical sensor field-of-view} \end{align} $$

We can determine the zone angle deltas:

$$ \begin{align} a_{zone, hor} &= \frac{Fov_{hor}}{Res_{hor}}\newline a_{zone, vert} &= \frac{Fov_{vert}}{Res_{vert}}\newline \end{align} $$ and then we can get the spherical coordinates: $$ C_{spher, i_x, i_y} = \begin{pmatrix} \begin{align} r &= dist_{i_x, i_y}\newline \theta &= (i_x - \frac{Res_{hor}}{2}) \cdot a_{zone, hor}\newline \phi &= \frac{\pi}{2} - (i_y - \frac{Res_{vert}}{2}) \cdot a_{zone, vert}\newline \end{align} \end{pmatrix} $$

Object Position

To determine the position of an object in front of the sensor relatively accurately even with the low resolution, we use the following method:
We calculate the coordinates for all zones. Then we take the one with the smallest distance $r$ as the initial position $C_{min}$.
Because that alone is not very accurate, we need to weigh in the other measurements. We use the weighted average mean:

$$ C_{obj} = \frac{ \sum_{i_x = 0, i_y = 0}^{n_x = Res_{hor}, n_y = Res_{vert}} \omega_{i_x, i_y} \cdot C_{i_x, i_y} }{ \sum_{i_x = 0, i_y = 0}^{n_x = Res_{hor}, n_y = Res_{vert}} \omega_{i_x, i_y} } $$

Now, how should this weight $\omega_{i_x, i_y}$ look like? Because the object is likely larger than the FOV of a single zone, it should count in neighboring zones as well. The further away a measurement is to $ C_{min} $, the less it should be weighted. With this criteria, we can construct a formula like this:

$$ \omega_{i_x, i_y}(C_{min}, C_{i_x, i_y}) = f \cdot \frac{1}{| C_{min} - C_{i_x, i_y}|} $$

$f$ is a factor that determines how much the distance weighs in. This can be adjusted based on the size and distance of the to be detected objects and is an entirely subjective value.

Motion and Gesture Recognition

Now that we have determined the object position, we save it for every sensor readout in a vector, with the time when it was measured. To be able to reliably recognize gestures, this vector should hold at least ca 2 seconds of position data, so for a 15Hz sensor it should hold at least 30 elements, for a 60Hz sensor it should at least hold 120 elements. The size of course also depends on the memory constraints of the device that runs the library.
To avoid accidental recognition of far away objects, detection can be aborted if the distances are above a certain threshold.


To detect swipes, we look at the measurements newer than a specified time (ca 600ms) and look if the object position has moved horizontally more than a certain distance. To make this more robust, we only recognize the gesture if the object also has not moved much towards or away from the sensor.

The recognition of a swipe gesture

Static Holds

Static holds are determined when the object position has not changed above a distance threshold for a specified amount of time.

The recognition of a hold gesture


stay tuned for another post that shows the implementation as a no-std, no-alloc Rust library with integration into an existing STM32 C-based project. It will also show how HID keyboard reports are sent when a gesture is recognized, and how to debug the Rust library on an STM32-F4 MCU.