PLACE Method: A Human Body Generator for 3D Scenes

Emil Bogomolov
Jul 30, 2021 · 7 min read

Researchers from the Max Planck Institute for Intelligent Systems and ETH Zurich have proposed a very elegant way to generate plausible human bodies in a given 3D scene. Let us discuss why it is relevant and dive into the details.

The current landscape

Industry and academia are fueling the digitalization of real-world environments. While Apple (and other smartphone and tablet producers) introduces new devices with depth sensors and LiDAR (check the demo here), researchers all over the world are looking for new ways to use this new type of data. The data mostly consists of 3D scans of everyday living rooms and bathrooms, offices and canteens, captured with quite a good level of detail. There are plenty of tasks that work with such data, e.g. semantic and instance segmentation of furniture, as in the OccuSeg paper, or part completion of furniture, as in the paper of my scientific group, Part-Based Understanding of RGB-D Scans.
These tasks (and many others) are crucial for creating intelligent assistants that may help people with dementia, elderly people, or disabled persons live full lives in their own homes.
We know that real-world indoor environments may contain people, but most existing datasets don't include them. The authors of the PLACE paper (PLACE: Proximity Learning of Articulation and Contact in 3D Environments) address this significant limitation of existing 3D virtual environments such as Habitat. They offer a method that generates human meshes inside 3D environments.

Renders from the paper (source)

Architecture

Let's have a closer look at the components of the method.

Basis Point Sets

PLACE piggy-backs on the idea of basis point sets proposed in this ICCV 2019 paper.

source

They encode a raw body point cloud with distances to a fixed set of basis points in space, then use that encoding to infer a high-resolution full-body SMPL-X mesh with a simple MLP. "Fixed" basis points means that for every different input body point cloud we compute distances to the same points in space. In the original paper the number of fixed points was 1024; in the PLACE paper it goes up to 10k. As you can see from the slide, it is possible to create such an encoding without any knowledge of the scene in which the human is placed.
But the authors go further and give us a method to create a human-scene interaction representation.
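To make this concrete, here is a minimal sketch of BPS encoding with NumPy and SciPy. The basis points and the body point cloud below are random placeholders, not actual SMPL-X data; the point counts only mirror the numbers mentioned above.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Fixed basis point set: the same points are reused for every input body.
# The original BPS paper used 1024 points; PLACE uses about 10k.
basis_points = rng.uniform(-1.0, 1.0, size=(1024, 3))

# A raw body point cloud (placeholder for a sampled body surface).
body_points = rng.uniform(-0.5, 0.5, size=(8000, 3))

# BPS encoding: for each basis point, the distance to its nearest body point.
tree = cKDTree(body_points)
distances, _ = tree.query(basis_points)   # shape: (1024,)

# `distances` is a fixed-length vector regardless of the body point count,
# so it can be fed directly into an MLP or a VAE.
print(distances.shape)
```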

Scene BPS and BPS features (source)

Given a scene mesh (a 3D environment) and a human body mesh, it is possible to place the basis points on the scene vertices. From here we derive two definitions: the scene BPS, a fixed set of basis points on the scene, and the BPS features, the distances from these basis points to the vertices of the human body mesh. In this video you can see that, with a fixed set of points in the environment, different bodies produce different sets of distances; in other words, different poses have different BPS features.
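Under the same assumptions (random placeholder meshes rather than PROX data), the scene BPS and the BPS features could be computed roughly like this:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)

scene_vertices = rng.uniform(-3.0, 3.0, size=(50000, 3))  # placeholder scene mesh
body_vertices = rng.uniform(-0.3, 0.3, size=(10475, 3))   # placeholder SMPL-X body

# Scene BPS: a fixed subset of scene vertices used as basis points.
basis_idx = rng.choice(len(scene_vertices), size=10000, replace=False)
scene_bps = scene_vertices[basis_idx]

# BPS features: distance from each scene basis point to the nearest body vertex.
body_tree = cKDTree(body_vertices)
bps_features, _ = body_tree.query(scene_bps)   # shape: (10000,)
```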

Distance-based human mesh generator

Human mesh generator architecture (source)

To create a generator that produces plausible human bodies, the authors propose the following pipeline. Given the scene and human meshes, they compute the BPS features (distances) and train a variational autoencoder (VAE) to reconstruct these distances. The reconstructed body features then go to an MLP, which regresses them into full-body vertices. The MLP outputs two things: a global 3D translation for all body vertices and intermediate reconstructed vertices; their sum gives the reconstructed body vertices. The initial data source used for all training in this paper is the PROX dataset.
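Here is a minimal PyTorch sketch of that regression stage: an MLP takes the reconstructed BPS features and outputs a global translation plus intermediate vertices, and their sum gives the body. The layer sizes and module names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

N_BPS = 10000        # number of BPS features (distances)
N_VERTS = 10475      # SMPL-X vertex count

class BodyRegressor(nn.Module):
    """Regress full-body vertices from reconstructed BPS features."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(N_BPS, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.translation_head = nn.Linear(hidden, 3)          # global 3D translation
        self.vertices_head = nn.Linear(hidden, N_VERTS * 3)   # intermediate vertices

    def forward(self, bps_features):
        h = self.backbone(bps_features)
        translation = self.translation_head(h)                # (B, 3)
        verts = self.vertices_head(h).view(-1, N_VERTS, 3)    # (B, V, 3)
        # Reconstructed body = intermediate vertices + global translation.
        return verts + translation.unsqueeze(1)

body = BodyRegressor()(torch.randn(2, N_BPS))   # (2, 10475, 3)
```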

Generation pipeline (source)

At test time, one can sample a random vector from a Gaussian distribution and pass it to the decoder of the VAE. That gives us a human mesh generation pipeline.
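A hedged sketch of that test-time generation step, with untrained placeholder networks standing in for the learned decoder and regressor (the latent size is an assumption, not the paper's value):

```python
import torch
import torch.nn as nn

latent_dim, n_bps, n_verts = 32, 10000, 10475

# Placeholders for the trained VAE decoder and the vertex-regression MLP.
decoder = nn.Sequential(nn.Linear(latent_dim, 1024), nn.ReLU(), nn.Linear(1024, n_bps))
regressor = nn.Sequential(nn.Linear(n_bps, 1024), nn.ReLU(), nn.Linear(1024, n_verts * 3))

z = torch.randn(4, latent_dim)                 # sample from a standard Gaussian
bps_features_hat = decoder(z)                  # decoder reconstructs BPS distances
bodies = regressor(bps_features_hat).view(4, n_verts, 3)   # full-body vertices
```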

One might notice that this model only works for a single scene and a single fixed set of points on the scene mesh. We'll discuss how the authors propose to overcome this issue after a short introduction to the VAE mentioned above.

A short intro to Variational Autoencoders

This work uses VAEs for reconstructing human meshes and, as we'll see later, for the environment representation. To get on the same page, let's quickly discuss how a VAE works.

An autoencoder at work on one of the MNIST images

A standard autoencoder consists of two parts: an encoder and a decoder. The encoder compresses high-dimensional data, such as a 3D mesh, to a low-dimensional representation, usually a vector of size N. The decoder, on the contrary, expands this vector back to the original data. The two are neural networks trained jointly with a reconstruction loss. This loss improves the encoder's ability to discard unnecessary information and the decoder's ability to produce output close to the source data.

One can use autoencoders as generators of new data samples similar to the training data. But decoders of vanilla autoencoders produce valid data only from latent vectors that were present during training. To overcome this limitation we can use variational autoencoders, or VAEs.

VAE encodes parameters of distribution instead of samples

Have a look at the picture from this post. In the middle of the network we see μ, σ and a Sample layer. The main difference between the two models is that, while a vanilla AE produces a single latent representation (of length 30, for example) for each input, a VAE produces two vectors of normal distribution parameters (mu and sigma). The actual latent representation of the input is then a realization of 30 random variables obtained at the Sample layer.

This makes our encoding robust to deviations, because the decoder is taught to predict the same output not just for a single vector, but for a whole set of nearby points in the latent space (distributed around μ with deviation σ). The whole thing is trained with two losses: reconstruction and KL divergence. The latter, in this case, forces the parameters μ, σ not to diverge too far from those of a standard normal distribution (with parameters 0, I). You can learn more about how these losses are weighted together here.
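As a concrete illustration, here is a minimal PyTorch VAE with the reparameterization trick and the two losses mentioned above; all sizes are arbitrary and unrelated to PLACE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim=784, hidden=256, latent=30):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)        # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)    # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

x = torch.rand(16, 784)                 # e.g. flattened MNIST images
model = VAE()
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
```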

Two-stage distance-based human-scene encoding

Moving back to the human generation task, it is hard to disagree that it is highly preferable to be able to generate humans not just in one scene, but in any given scene. To overcome this issue, the authors propose to encode both the human body in the scene and the scene itself using the same technique of basis points and VAEs.

Distance-based encoding of the scene (in blue, top) and of the human body (in orange, bottom) (source)

Basis points are fixed on the walls and ceiling of a cubic cage in 3D space. The same set of basis points is used for every crop of an input scene from the PROX dataset. This approach helps to learn both the context around the human mesh and the body features themselves. The human autoencoder becomes a conditional VAE, because it is conditioned on the latent vector of the scene. In practice this conditioning can be achieved by concatenating the latent vectors from the two networks before passing them to the decoder of the human generator.
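In code, that conditioning could be as simple as the following sketch, where the scene latent is concatenated with the body latent before entering the body decoder (all names and sizes are assumptions):

```python
import torch
import torch.nn as nn

body_latent_dim, scene_latent_dim, n_bps = 32, 32, 10000

# Hypothetical decoder of the conditional body VAE: it consumes the body latent
# concatenated with the scene latent produced by the scene VAE encoder.
body_decoder = nn.Sequential(
    nn.Linear(body_latent_dim + scene_latent_dim, 1024), nn.ReLU(),
    nn.Linear(1024, n_bps),
)

z_body = torch.randn(4, body_latent_dim)    # sampled or encoded body latent
z_scene = torch.randn(4, scene_latent_dim)  # latent of the scene crop (from the scene VAE)
bps_features_hat = body_decoder(torch.cat([z_body, z_scene], dim=-1))
```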

The rest of the network stays the same (source)

On top of the two proposed VAEs there is still the MLP that regresses the high-resolution human mesh. According to the authors, this approach already leads to good results, but the results become even better when they add one more conditioning.

VAE for absolute (x, y, z) locations of the scene (in green, top) (source)

The researchers propose to encode the scene not only with distances but also with the absolute (x, y, z) coordinates of the mesh surface. Another VAE is responsible for this encoding, and its latent vector is fed to the regression MLP at the end of the pipeline.
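Following the same pattern, the latent vector from this coordinate VAE can simply be appended to the input of the regression MLP; below is a hedged sketch with assumed sizes.

```python
import torch
import torch.nn as nn

n_bps, coord_latent_dim, n_verts = 10000, 32, 10475

# The regression MLP now sees reconstructed BPS features plus the (x, y, z) scene latent.
regressor = nn.Sequential(
    nn.Linear(n_bps + coord_latent_dim, 1024), nn.ReLU(),
    nn.Linear(1024, n_verts * 3),
)

bps_features_hat = torch.randn(4, n_bps)
z_coords = torch.randn(4, coord_latent_dim)      # latent of absolute scene coordinates
bodies = regressor(torch.cat([bps_features_hat, z_coords], dim=-1)).view(4, n_verts, 3)
```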

Interaction-based optimization

The last step of the whole PLACE method is the interaction-based optimization.

Top row: results before optimization, bottom row: after optimization (source)

They introduce a composite loss on the θ parameters of the body. On the one hand, this loss helps to overcome interpenetrations (as in the third column from the right); on the other hand, it forces the network to produce more natural body poses (see the first column).
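As a very rough sketch of what such an interaction-based refinement could look like: the paper optimizes body parameters with contact and collision terms computed against the scene, while the snippet below uses simplified placeholder terms, optimizes vertex offsets instead of SMPL-X parameters, and relies on a hypothetical scene_sdf function.

```python
import torch

def scene_sdf(points):
    """Hypothetical signed distance to the scene surface (negative inside objects)."""
    return points.norm(dim=-1) - 1.0          # placeholder: a unit-sphere "scene"

body_verts = torch.randn(10475, 3) * 0.3      # placeholder generated body vertices
init_verts = body_verts.clone()

offset = torch.zeros_like(body_verts, requires_grad=True)
opt = torch.optim.Adam([offset], lr=1e-2)

for _ in range(200):
    verts = body_verts + offset
    sdf = scene_sdf(verts)
    collision = torch.relu(-sdf).mean()            # penalize vertices inside the scene
    contact = torch.relu(sdf).min()                # pull the closest vertex toward the surface
    fidelity = (verts - init_verts).pow(2).mean()  # stay close to the generated body
    loss = 10.0 * collision + contact + fidelity
    opt.zero_grad()
    loss.backward()
    opt.step()
```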

Results evaluation

To evaluate the quality of the results, the authors involve human assessors. They built a tool in which a user can compare two different human meshes and decide which of the models is "better".

Comparison tool (source)

That leads to the following results:

Source

During the evaluation, around 70% of users considered the proposed model better than the previous PSI model (from the same authors) on two datasets. It is very encouraging and interesting that 48.5% of users considered the generated humans more plausible than the ground truth itself. One might also note that the results are available only for PROX, the same dataset the model was trained on.

Show me the code!

This paper demonstrates great results in the field of 3D human body generation. These results might be useful for researchers and developers who work with indoor 3D models; maybe we'll see similar approaches used in virtual assistance technologies or computer games in the near future, who knows. But you can reproduce the results today, because the authors have provided the code base for their experiments on GitHub.

References

[1] PLACE: Proximity Learning of Articulation and Contact in 3D Environments https://arxiv.org/abs/2012.12877

[2] Intuitively Understanding Variational Autoencoders https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf


Emil Bogomolov

Deep Learning enthusiast working as a Research Engineer at ADASE Group at Skoltech University