Learning Structured Latent Space Encoding for
Part Based Generative Modeling of 3D Objects
Cormac OMeadhra Stephen Landers
Abstract
Real-world operation of autonomous systems necessitates modeling and inference in 3D environments that are complex, varied and highly cluttered. This work achieves increased generalizability through the use of a part-based generative surface model that is capable of generating high-fidelity surfaces from an unordered point cloud. The expectation-maximization algorithm is employed to learn the latent correspondences between the input data and the inferred parts at test time, enabling the approach to better generalize to unseen parts by focusing on capturing surface patch regularity rather than part distributions.
1. Introduction
Real-world operation of autonomous systems necessitates modeling and inference in 3D environments. Point clouds are a widely used representation of 3D environments that are compatible with modern 3D sensors, such as RGB-D and LIDAR. However, the point clouds produced by these sensors are typically large ($> 10^5$ points) and, as they are samples from 2D surfaces, highly redundant, with neighboring points carrying overlapping information about local neighborhoods and shared surfaces. From a representational efficiency standpoint, it is desirable to identify a model of the more compact surface from which the point cloud was sampled.
To that end, this work aims to develop a part-based generative surface model that is capable of generating high-fidelity surfaces from an unordered point cloud. Given a point cloud $\mathcal{X}$ representing an object model $O$, consisting of parts $\{P_i\}$ such that $\bigcup_i P_i = O$, we wish to construct a generative model $g_\phi(f_\theta(\mathcal{X}_i))$, acting on a subset of the input point cloud $\mathcal{X}_i \subseteq \mathcal{X}$, where $f_\theta(\mathcal{X}_i)$ is an encoder that generates a latent encoding $z_i$ of the input and $g_\phi(\cdot)$ is a decoder that outputs a surface model $S$. The parts, $\{P_i\}$, are learned such that their union forms a surface $S$ that models the generating surface of the input point set well. As such, the parts are not necessarily semantically meaningful, but a semantically meaningful decomposition can be enforced when labeled training data is provided. In the absence of semantic-part-segmented data, we train the generative model in an unsupervised manner, with a heuristic-based labeling, to minimize reconstruction error as detailed in section 3.
We are interested in exploring the effect of employing the expectation-maximization (EM) algorithm to determine point-set partitioning at test time, such that a set of $J$ identical part models each operate on a set of potentially partially overlapping point clouds. Thus, at training time, we learn a relatively small autoencoder model that captures local structure for the $J$ subsets of the input point cloud. We hypothesize that removing the need for the network to learn input-point partitioning will enable increased model capacity and therefore fidelity.
2. Related Work
While several representations of 3D objects are commonly
used, point clouds are the most easily obtained from com-
mon commercially available sensors. Learning features directly on point clouds is therefore desirable. However, unlike images, point clouds do not
exist on a structured lattice and any function that operates
on point clouds must be invariant to permutation of the
input points. A successful solution to this problem was
found in the form of PointNet and its successors (Qi et al.,
2017), which utilize point-wise features and max-pooling
to achieve a permutation invariant feature extractor. Subse-
quently, the requirement for point-wise feature extraction
has been relaxed to enable learning of 3D convolutional
filters as a direct analogy to the 2D image case (Xu et al.,
2018). These 3D convolutional filters enable efficient ex-
traction of features from unordered point clouds, which can
be leveraged to create an expressive latent space encoding.
The most direct method for generating models of 3D ob-
jects from latent codes is to output a point cloud. However,
other representations, such as triangulated meshes, are better
suited to many applications, e.g. graphics and computational
geometry. Recently, approaches have been presented to pre-
dict mesh models from a latent space encoding, using, for
example, planar patch deformation (Groueix et al., 2018) to map a set of 2D planes onto the surface of a 3D object. However, these approaches have demonstrated limited fidelity. An alternative strategy that has yielded much higher fidelity is
to utilize the decoder as a classifier and implicitly encode
the object surface as a classification boundary (Mescheder
et al., 2018). A mesh model can then be obtained using
standard reconstruction techniques.
Due to the structure inherent in part-based models, it is de-
sirable to embed explicit knowledge of this structure into
the point-cloud encoding architecture. This can be achieved
through the use of a factorized graphical model in the form of a conditional random field, from which the latent space distributions can be extracted using mean-field inference (Huang et al., 2015). Improved performance has been observed through the use of a VAE architecture, with additional structure enforced in the latent space. Specifically, the VAE
latent encoding is combined with a learned Binomial exis-
tence distribution to model the presence of parts within an
object instance, which enables capturing discontinuous part
removal and addition actions (Nash & Williams, 2017). Gen-
erative Adversarial Networks (GANs) offer an alternative
strategy to VAEs for generative modeling. Li et al. (2017) propose
a multi-stage framework that utilizes a part-based bounding
box GAN to achieve an object skeleton latent encoding fol-
lowed by a second stage, which learns to fill the bounding
boxes based on the part encodings. While the GAN provides
an alternative optimization formulation, a general principle
from this approach is the introduction of part-based skele-
tons into the latent space to yield increased accuracy by
progressively growing the model complexity. As an alter-
native strategy to combining parts within the same latent
space, Dubrovina et al. (2019) propose learning disjoint sub-spaces
for each part, which can be linearly combined to reconstruct
a composite object. Furthermore, Dubrovina et al. tackle
the problem of part alignment through the use of a spatial
transformer network (Jaderberg et al., 2015), which learns
to transform parts from a canonical frame to the desired
object pose.
The problem of part-based 3D modelling has strong parallels
to 3D indoor scene synthesis. However, indoor scenes often
exhibit greater diversity in relative part placement, grouping
and orientation, which requires additional latent space struc-
ture (Li et al., 2019). Transferring these techniques to the
problem of 3D part-based modeling can enable increased
robustness and diversity in the reconstructed models.
Variational autoencoders (VAEs) are a widely used technique for generating a latent space encoding of an input distribution (Kingma & Welling, 2013). The VAE learns a mapping from a data distribution to a Normal distribution with diagonal covariance over the latent variables. The mapping is learned by maximizing a variational lower bound on the data likelihood, which trades off reconstruction error against the deviation of the latent space distribution from the desired diagonal Gaussian prior. Achieving a diagonal covariance is referred to as disentangling the latent variables. This can be achieved more reliably through the use of constrained optimization, where the deviation between the diagonal latent prior and the learned latent distribution is bounded (Burgess et al., 2018). This concept can be extended further through the introduction of hierarchies of independent variables, which can achieve a more expressive latent space factorization (Esmaeili et al., 2018).
Several surface model variations were considered for this
work. Ultimately, we elected to use an explicit surface
model achieved through a parametric surface function.
Implicit Surface Models
Learning implicit models of
object surfaces has recently been shown to achieve high-
fidelity surface models (Mescheder et al., 2018; Park et al.,
2019). One particularly interesting implicit representation
is using a signed distance field (SDF) (Park et al., 2019).
Given a network $f_{\mathrm{SDF}}(x, \theta)$ that outputs a signed distance field from a surface $S$, we can extract the surface via the level-set $f_{\mathrm{SDF}}(x, \theta) = 0$. Training such a network simply requires regressing to an SDF learned over the input shape. We follow the approach of Park et al. (2019) to construct an SDF for each mesh in the input dataset and use a cost function of the form
$$\mathcal{L}_{\mathrm{SDF}}(f_{\mathrm{SDF}}(x, \theta), s) = \left| \mathrm{clamp}(f_{\mathrm{SDF}}, \tau) - \mathrm{clamp}(s, \tau) \right| \qquad (1)$$
where $\mathrm{clamp}(x, \tau) = \min(\tau, \max(-\tau, x))$ and $\tau$ is a parameter controlling the maximum extent of the SDF.
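For concreteness, the clamped objective in Equation (1) can be written in a few lines. The following PyTorch sketch is illustrative (the tensor arguments and the mean reduction are our assumptions; the paper specifies only the loss form):

```python
import torch

def sdf_loss(pred_sdf, true_sdf, tau=0.1):
    """Clamped L1 SDF regression loss of Eq. (1).

    Clamping to [-tau, tau] removes gradient signal far from the
    surface, concentrating capacity near the zero level-set.
    """
    clamp = lambda d: torch.clamp(d, min=-tau, max=tau)
    return torch.abs(clamp(pred_sdf) - clamp(true_sdf)).mean()
```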
Explicit Surface Models
While implicit surface modeling has the benefit that it outputs the distance field directly, which is useful for the E-step calculation (section 3.2), explicit surface modeling has the benefit of directly outputting a usable shape model such as a mesh. We follow the approach of Groueix et al. (2018) and minimize the symmetric Chamfer distance between the learned mesh, $f_{\mathrm{MESH}}(x, \theta)$, and the ground truth $S$. The loss function is evaluated over a point cloud $\tilde{\mathcal{X}}$ resampled from $f_{\mathrm{MESH}}(x, \theta)$ and a point cloud $\mathcal{S}$ resampled from $S$.
$$\mathcal{L}_{\mathrm{MESH}}(\tilde{\mathcal{X}}, \mathcal{S}) = \sum_{s \in \mathcal{S}} \min_{x \in \tilde{\mathcal{X}}} \|x - s\|_2^2 + \sum_{x \in \tilde{\mathcal{X}}} \min_{s \in \mathcal{S}} \|x - s\|_2^2 \qquad (2)$$
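The symmetric Chamfer loss of Equation (2) admits a direct, if brute-force, implementation. The NumPy sketch below (the array shapes are our assumption) accumulates squared nearest-neighbor distances in both directions; a k-d tree would scale better for large clouds:

```python
import numpy as np

def chamfer_distance(X, S):
    """Symmetric squared Chamfer distance of Eq. (2) between
    point sets X (n, 3) and S (m, 3)."""
    d2 = np.sum((X[:, None, :] - S[None, :, :]) ** 2, axis=-1)  # (n, m) pairwise
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```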
Primitive Surface Models
The final approach we con-
sider is learning primitive-based models. Specifically, we consider superellipsoids, which implicitly define a surface by the equation
$$\left( |x|^r + |y|^r \right)^{t/r} + |z|^t = 1 \qquad (3)$$
A significant benefit of this model is that it only requires the prediction of two parameters, $r$ and $t$. The loss function is identical to $\mathcal{L}_{\mathrm{MESH}}$, but with points resampled from the surface of the primitive in this case.
3. Methodology
3.1. Probabilistic Surface Model
We wish to model a surface in $\mathbb{R}^3$, which we represent in parameterized form
$$S_j(\Theta) = f_j(u, v; \Theta), \qquad f_j : \mathbb{R}^2 \to \mathbb{R}^3 \qquad (4)$$
where the parameters $\Theta$ define the chart from a two-dimensional space on which $u, v$ are defined. We model a point cloud, $\mathcal{X}$, as a set of noise-corrupted samples from a uniform distribution over the surface $S(\Theta)$,
$$x = s_{\min} + \epsilon n \qquad (5)$$
where $s_{\min}$ is the point closest to $x$ lying on $S$, $n$ is the unit-normal vector at $s_{\min}$ and $\epsilon$ is zero-mean Gaussian noise, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The distribution over point cloud samples can then be expressed as (dropping the dependency on $\Theta$ for brevity)
$$p(x) = p(x \mid s_{\min}; \epsilon)\, p(s_{\min}) = \mathcal{N}(d_{\min}(x); 0, \sigma^2)\, \frac{1}{|S|} \qquad (6)$$
where $d_{\min}(x) = \|x - s_{\min}\|_2$ and $|S|$ is the surface area of $S$. We note that modelling the noise as being directed normal to the surface is a choice made for mathematical convenience, in that it results in a tractable form for the point likelihood function.
We further elect to model $S(\Theta)$ as the union of $J$ surface parts $S_j(\theta_j)$, where $\Theta = \{\theta_1, \ldots, \theta_J\}$. Each part $S_j$ is associated with a corresponding probabilistic model $p_j(x)$, which is defined according to Equation (6). By constraining the surface parts to be disjoint, i.e. $x \in S_i \Rightarrow x \notin S_j \ \forall j \neq i$, we can express the union of surface parts as a sum
$$p(x) = \sum_{j=1}^{J} \pi_j\, p_j(x) = \sum_{j=1}^{J} \frac{1}{|S|} \exp\!\left( -\frac{d_{\min,j}^2(x)}{\sigma^2} \right) \qquad (7)$$
where $\{\pi_j\}$ are a set of normalizing coefficients, which are given by $\pi_j = |S_j| / |S|$ in order to achieve the desired uniform surface distribution.
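As a concrete reading of Equation (7), the sketch below (NumPy; the per-part squared distances are assumed to be precomputed) evaluates the surface mixture for a batch of points:

```python
import numpy as np

def surface_mixture_likelihood(d2_min, areas, sigma2):
    """Evaluate Eq. (7): p(x) = sum_j (1/|S|) exp(-d2_min_j(x) / sigma2).

    d2_min: (N, J) squared distances from each point to each part surface.
    areas:  (J,) part surface areas |S_j|, summing to |S|.
    """
    return np.exp(-d2_min / sigma2).sum(axis=1) / areas.sum()
```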
3.2. Learning Surface Models via the EM algorithm
Given a point set $\mathcal{X}$, the parameters $\Theta$ can be learned through maximum likelihood estimation, which is achieved by maximizing the log-likelihood of the probabilistic surface model, $\log p(x; \Theta)$. However, due to the presence of the sum inside the log, this objective is difficult to optimize. The expectation-maximization (EM) algorithm (Dempster et al., 1977) provides an iterative method for optimizing the log-likelihood. We proceed by introducing binary, latent correspondence variables, $z$, and using the complete data log-likelihood to optimize a lower bound
$$\ell(x) = \sum_{i=1}^{N} \log \sum_{z} p(x_i, z) \geq \sum_{i=1}^{N} \sum_{j=1}^{J} \gamma_{ij} \log \frac{p(x_i, z = j)}{\gamma_{ij}}$$
where $\gamma_{ij} = q(z = j \mid x_i)$ and $q(z \mid x)$ is a proxy posterior distribution over the latent correspondence variables. It can be shown (Bishop, 2006) that the lower bound is tight when the $\gamma_{ij}$ terms are selected as
$$\gamma_{ij} = p(z = j \mid x_i, \Theta) = \frac{\hat{p}_j(x_i; \theta_j)}{\sum_{k} \hat{p}_k(x_i; \theta_k)} \qquad (8)$$
where $\hat{p}_k(x; \theta_k) = \exp\!\left( -d_{\min,k}^2(x) / \sigma^2 \right)$. However, in the case of distributions with non-compact support, as in eq. (7), this will assign all $N$ points to each of the $J$ surface parts with nonzero weight. Under an alternative interpretation of EM, the E-step can be viewed as minimizing the divergence between the true latent variable posterior $p(z \mid x)$ and the proxy $q(z \mid x)$ (Neal & Hinton, 1998). By placing constraints on the families of valid posteriors $q(z \mid x)$, we can retain the guarantee of monotonically increasing the lower bound, while achieving significant sparsification of the assignment distribution $\gamma_{ij}$.
Maximal Assignment In maximal assignment, we set
$$\gamma_{ij} = 1 \quad \text{if } j = \arg\max_k \hat{p}_k(x_i; \theta_k) \qquad (9)$$
and zero otherwise. This results in each point being assigned to a single part, which maximizes numerical efficiency and avoids the need for the distance function $d_{\min}(\cdot)$ to reason about point-wise weights. However, with this approach it is more difficult to impose inter-part smoothness constraints, which may result in a more jagged final surface.
δ-responsibility Assignment To overcome this risk of a non-smooth output surface, we propose to select the parts with the largest assignments such that a fraction $\delta$ of the responsibility is accounted for. We denote this set as $\mathcal{I}_i$. The latent distribution is then
$$\gamma_{ij} = \frac{\hat{p}_j(x_i; \theta_j)}{\sum_{k \in \mathcal{I}_i} \hat{p}_k(x_i; \theta_k)} \quad \text{if } j \in \mathcal{I}_i \qquad (10)$$
and zero otherwise. A benefit of this approach is that, as the parts are no longer disjoint, it becomes possible to enforce smoothness constraints during the training stage (Williams et al., 2018). However, this also necessitates reasoning about point-wise weights, which will be discussed later.
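The truncation in Equation (10) can be sketched as follows (our own illustrative implementation; the paper does not specify one): sort each point's responsibilities, keep the smallest set of parts covering a fraction δ of the mass, and renormalize over that set.

```python
import numpy as np

def delta_responsibility(gamma, delta=0.9):
    """Truncate soft responsibilities (N, J) to the top parts
    covering a fraction `delta` of each row's mass (Eq. 10)."""
    order = np.argsort(-gamma, axis=1)               # parts by decreasing weight
    sorted_g = np.take_along_axis(gamma, order, axis=1)
    prior_mass = np.cumsum(sorted_g, axis=1) - sorted_g
    keep_sorted = prior_mass < delta                 # keep until delta is covered
    keep = np.zeros_like(gamma, dtype=bool)
    np.put_along_axis(keep, order, keep_sorted, axis=1)
    out = np.where(keep, gamma, 0.0)
    return out / out.sum(axis=1, keepdims=True)
```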
M-Step Given the responsibilities $\gamma_{ij}$, the maximization step becomes
$$\Theta = \arg\max_{\theta_1, \ldots, \theta_J} \sum_{i=1}^{N} \sum_{j=1}^{J} \gamma_{ij} \log \hat{p}_j(x_i; \theta_j) \qquad (11)$$
Figure 1. Proposed network architecture. From left to right, the input point cloud and point-wise weights are transformed and passed through a PointNet (Qi et al., 2016) encoder to obtain a feature representation of the patch points. The surface is then reconstructed by passing samples from a 2D grid, along with the patch feature, through a decoder network.
Each of the $\theta_j$ can be optimized independently, giving the following part-wise updates
$$\theta_j = \arg\min_{\theta_j} \sum_{i=1}^{N} \gamma_{ij} \left\| x_i - s_{\min} \right\|_2^2 + \sigma^2 \log |S_j(\theta_j)| \qquad (12)$$
The first term of this objective seeks to find the minimum-error interpolating surface, while the second term acts as a surface-area regularizer. The noise variance, $\sigma^2$, controls the regularization strength, as expected.
3.3. Learning Surface Patches
In this work, we follow Groueix et al. (2018) and use an autoencoder to model the function $f_j(u, v; \theta_j)$. Specifically, we train an encoder modeled after PointNet (Qi et al., 2016) with a latent space of size 1024. However, in order to assess the increased capacity of the proposed model, we also train smaller PointNet encoders with reduced latent space sizes. The encoder $g^{(e)}_{\theta_j}(\mathcal{X}_j)$ learns a local surface code describing an individual surface part defined by $\mathcal{X}_j$.
Each of the $J$ latent codes is then passed through identical decoder networks, with tied weights, which output the learned surface representation $h^{(d)}_{\theta_j}$. The decoder consists of a stack of fully-connected layers, mapping from the latent code of size $L$ to the output size, which corresponds to a point in $\mathbb{R}^3$. An illustration of the proposed network is shown in Figure 1.
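As a rough sketch of the decoder half of this architecture (hidden layer sizes are our assumptions; the paper fixes only the latent code size and the $\mathbb{R}^3$ output), a patch decoder can be an MLP mapping a $(u, v)$ sample concatenated with the patch code to a 3D point:

```python
import torch
import torch.nn as nn

class PatchDecoder(nn.Module):
    """Maps a 2D sample (u, v) plus a patch latent code to a 3D point.

    Hidden sizes are illustrative only; the paper specifies just the
    latent size (e.g. 1024) and the R^3 output.
    """
    def __init__(self, latent_dim=1024, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, uv, code):
        # uv: (N, 2) grid samples; code: (latent_dim,) patch feature
        codes = code.unsqueeze(0).expand(uv.shape[0], -1)
        return self.mlp(torch.cat([uv, codes], dim=1))  # (N, 3) surface points
```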
We train this network using the squared Chamfer distance loss function, which is given by
$$d_{\mathrm{CH}}(x, S) = \min_{s \in S} \|x - s\|_2^2 \qquad (13)$$
The Chamfer distance computes an approximation of the
normal distance to the surface as shown in Figure 2.
In order to prevent the network overfitting to the training
data, we regularize the surface area of the reconstructed
Figure 2.
Visualization of the parametric surface representation
and the relationship between the Chamfer distance and the normal
surface distance.
patch. Given the parametric surface representation $f_j(u, v)$, the surface area of a patch is given by
$$|S_j| = \int_{(0,1)^2} \sqrt{ \left\| \frac{\partial f_j}{\partial u} \right\|_2^2 \left\| \frac{\partial f_j}{\partial v} \right\|_2^2 - \left( \frac{\partial f_j}{\partial u}^{T} \frac{\partial f_j}{\partial v} \right)^{2} } \; du\, dv \approx \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \| f_{u,j} \|_2^2 \, \| f_{v,j} \|_2^2 - \left( f_{u,j}^T f_{v,j} \right)^2 } \qquad (14)$$
where we made use of a Monte Carlo estimate of the surface integral, evaluated using the $N$ samples drawn from the 2D latent space when reconstructing the surface $S_j$.
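The Monte Carlo estimate in Equation (14) can be evaluated with automatic differentiation. The PyTorch sketch below (the decoder signature matches the sketch in section 3.3 and is our assumption) recovers the tangent vectors $f_u$ and $f_v$ with one backward pass per output coordinate:

```python
import torch

def patch_surface_area(decoder, code, n_samples=256):
    """Monte Carlo estimate of Eq. (14) over uniform (u, v) samples."""
    uv = torch.rand(n_samples, 2, requires_grad=True)
    xyz = decoder(uv, code)  # (n_samples, 3)
    # Per-sample Jacobian rows via one backward pass per output coordinate.
    grads = [torch.autograd.grad(xyz[:, k].sum(), uv, create_graph=True)[0]
             for k in range(3)]
    J = torch.stack(grads, dim=-1)      # (n_samples, 2, 3): rows are f_u, f_v
    fu, fv = J[:, 0, :], J[:, 1, :]
    EG = (fu * fu).sum(1) * (fv * fv).sum(1)   # |f_u|^2 |f_v|^2
    F2 = (fu * fv).sum(1) ** 2                 # (f_u . f_v)^2
    return torch.sqrt((EG - F2).clamp_min(1e-12)).mean()
```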
Given this choice of regularized training loss, we see that
Algorithm 1 Patch Initialization, K-init
Input: Point cloud $\mathcal{X}$, target # patches $J$, initial # patches $N_{\mathrm{INIT}}$
Result: Point-part labels $\Gamma_{\mathrm{init}}$
Init: $\Gamma_{\mathrm{init}} = \text{k-means}(N_{\mathrm{INIT}}, \mathcal{X})$
While $N_{\mathrm{patches}} > J$:
    Get patch $k$ with minimum $C_k = N_k + \omega_\lambda \lambda_{\min}$
    Merge patch $k$ with the neighbor with minimum $C_j$
Algorithm 2 Patch EM
Input: Point cloud $\mathcal{X}$, # patches $J$
Result: $J$-patch surface reconstruction $\mathcal{X}_{\mathrm{recon}}$, $\{S_k\}$
Init: $\Gamma_{\mathrm{init}} = \text{k-init}(J, \mathcal{X})$
While not converged:
    M-step: $\mathcal{X}_r = \bigcup_k \text{PatchNet}(\mathcal{X}, \Gamma(k))$ (Equation (12))
    E-step: $\Gamma(k) = p(z_k \mid \mathcal{X}, \Theta)$ (Equation (8))
the network is trained to optimize the same function as
in Equation (12) from the M-step of the EM algorithm,
where the regularization parameter can be interpreted as the
variance of the Gaussian sampling noise.
3.4. Initialization Strategies
In order to initialize the EM algorithm and to provide the network with suitable training data, we utilize a heuristic-based partitioning algorithm, whose pseudocode is presented in Algorithm 1. The algorithm proceeds by learning a k-means partitioning of the data with some number of patches $N_{\mathrm{INIT}}$, which is greater than the target number of patches $J$. The k-means patches are progressively merged until the target number of patches has been reached. The merging strategy selects patches according to a weighting function defined by the weighted sum of the patch size $N_k$ and $\lambda_{\min}$, the minimum eigenvalue of the patch sample covariance; the term $\omega_\lambda$ is a mixing weight. An example partitioning for a chair model taken from the ShapeNet dataset is shown in fig. 3.
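A sketch of Algorithm 1 follows (our illustrative implementation: the use of scikit-learn's KMeans and the centroid-based neighbor rule are assumptions, as the paper leaves these details open):

```python
import numpy as np
from sklearn.cluster import KMeans

def k_init(X, J, n_init_patches, omega=1.0, n_neighbors=3):
    """Heuristic patch initialization (Algorithm 1): over-segment with
    k-means, then greedily merge cheap patches until J remain."""
    labels = KMeans(n_clusters=n_init_patches).fit_predict(X)

    def cost(ids):
        if len(ids) < 4:                        # tiny patch: size term only
            return float(len(ids))
        lam_min = np.linalg.eigvalsh(np.cov(X[ids].T))[0]
        return len(ids) + omega * lam_min       # C_k = N_k + omega * lambda_min

    while len(np.unique(labels)) > J:
        ks = np.unique(labels)
        patches = {k: np.where(labels == k)[0] for k in ks}
        costs = {k: cost(ids) for k, ids in patches.items()}
        cents = {k: X[ids].mean(axis=0) for k, ids in patches.items()}
        k = min(costs, key=costs.get)           # patch with minimum C_k
        nbrs = sorted((m for m in ks if m != k),
                      key=lambda m: np.linalg.norm(cents[m] - cents[k]))
        j = min(nbrs[:n_neighbors], key=costs.get)  # neighbor with minimum C_j
        labels[labels == k] = j
    return labels
```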
3.5. Algorithm
The full EM algorithm used to fit a patch model requires relatively few steps. Given a point cloud and a target number of patches, initialization is performed using the k-init algorithm. The initialization is then passed through the learned surface parameterization network, which outputs a new surface parameterization (M-step). The responsibility weights are then recomputed (E-step) and the process is repeated until the responsibility converges. The pseudocode for the algorithm is given in Algorithm 2.
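The outer loop of Algorithm 2, expressed as a Python sketch reusing the `k_init` and `e_step` sketches above (the `patch_net` call and its `squared_distance` interface are assumed stand-ins for the learned network and the part distance function):

```python
import numpy as np

def patch_em(X, J, patch_net, sigma2, n_init=50, max_iters=20, tol=1e-4):
    """Fit a J-patch surface model to point cloud X by alternating
    the learned M-step (patch_net) with the E-step of Eq. (8)."""
    _, labels = np.unique(k_init(X, J, n_init), return_inverse=True)
    gamma = np.eye(J)[labels]                   # (N, J) hard initial assignment
    surfaces = None
    for _ in range(max_iters):
        # M-step: refit each patch to its weighted points, per Eq. (12).
        surfaces = [patch_net(X, gamma[:, j]) for j in range(J)]
        # E-step: recompute responsibilities from part distances, per Eq. (8).
        d2 = np.stack([s.squared_distance(X) for s in surfaces], axis=1)
        new_gamma = e_step(d2, sigma2)
        if np.abs(new_gamma - gamma).max() < tol:
            break
        gamma = new_gamma
    return surfaces, gamma
```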
Figure 3.
Heuristic patch initialization using the K-init algorithm
with K = 25.
4. Experiments
4.1. Dataset
Evaluation was performed on the ShapeNet dataset (Chang et al., 2015), which provides approximately 40k instances over 13 different object classes. The ShapeNet dataset was used in the modified form provided by Groueix et al. (2018), which extracts the subset of each shape that is visible by raycasting from an external viewpoint. This removes points lying on the interior of the object, which could give the point cloud volumetric properties. The ShapeNet training data used considers the classes plane, bench, cabinet, car, chair, monitor, lamp, speaker, firearm, couch, table, cellphone and watercraft.
4.2. Results
The training loss, presented in Figure 4, shows near-identical performance across training and validation data for both networks. This suggests that, at the scale of the patches used, the objects show significant regularity. This is in keeping with the central hypothesis of this work: that at a local level, surfaces exhibit substantial similarities, despite stark global differences. Figure 4 also demonstrates the expected result that increasing the number of trainable parameters in the network results in improved performance. However, we note that the performance of the smaller network remains competitive even at a substantially reduced size.
A qualitative evaluation for the small network is presented
in Figure 5 and Figure 6. These figures demonstrate that
Figure 4. Training loss for the small and large variations of the autoencoder architecture. Note the similarity between the training and validation error.
the training loss achieved in Figure 4 corresponds to strong reconstruction performance, with the majority of the data structure being preserved. However, we note that the reconstructed surfaces exhibit some deviation from the input point clouds, which can be attributed to two shortcomings of this version of the proposed algorithm. Firstly, in the hard-partitioning approach, patches are independent, which does not impose any constraints on smoothness across patch transitions; using the $\delta$-responsibility E-step may alleviate this. Secondly, the noise variance $\sigma^2$ was set to zero for these trials, which results in the networks being unregularized. This may result in overfitting to some of the training data. However, since the training and validation losses in Figure 4 are very similar, it does not seem likely that this is a significant issue.
5. Conclusion
Surface reconstruction can be achieved in an efficient and
generalizable manner using a collection of small, learned
patch primitives. By encoding these patches through a neu-
ral network, a flexible representation is achieved that is
capable of generalizing across classes ranging from planes
to cars. The flexibility of the approach can be further in-
creased through the use of the EM algorithm to learn the
latent correspondences between the patches and the input
data. The validity of the approach was demonstrated quan-
titatively and qualitatively on a modified version of the
ShapeNet dataset. Further work is required to evaluate the
performance of the complete algorithm and to determine the
optimal architecture and hyperparameter selection.
References
Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P.,
Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S.,
Su, H., et al. Shapenet: An information-rich 3d model
repository. arXiv preprint arXiv:1512.03012, 2015.
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maxi-
mum likelihood from incomplete data via the em algo-
rithm. Journal of the Royal Statistical Society: Series B
(Methodological), 39(1):1–22, 1977.
Dubrovina, A., Xia, F., Achlioptas, P., Shalah, M., and
Guibas, L. Composite shape modeling via latent space
factorization. arXiv preprint arXiv:1901.02968, 2019.
Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N.,
Paige, B., Brooks, D. H., Dy, J., and van de Meent, J.-W.
Structured disentangled representations. stat, 1050:12,
2018.
Groueix, T., Fisher, M., Kim, V. G., Russell, B. C., and Aubry, M. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 216-224, 2018.
Huang, H., Kalogerakis, E., and Marlin, B. Analysis and
synthesis of 3d shape families via deep-learned genera-
tive models of surfaces. In Computer Graphics Forum,
volume 34, pp. 25–38. Wiley Online Library, 2015.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial
transformer networks. In Advances in neural information
processing systems, pp. 2017–2025, 2015.
Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.
Li, J., Xu, K., Chaudhuri, S., Yumer, E., Zhang, H., and
Guibas, L. Grass: Generative recursive autoencoders for
shape structures. ACM Transactions on Graphics (TOG),
36(4):52, 2017.
Li, M., Patil, A. G., Xu, K., Chaudhuri, S., Khan, O.,
Shamir, A., Tu, C., Chen, B., Cohen-Or, D., and Zhang,
H. Grains: Generative recursive autoencoders for indoor
scenes. ACM Transactions on Graphics (TOG), 38(2):12,
2019.
Figure 5. Table class training data input (left) and reconstruction (right). Colored points correspond to different patches.
Figure 6. Table class validation data input (left) and reconstruction (right). Colored points correspond to different patches.
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin,
S., and Geiger, A. Occupancy networks: Learning
3d reconstruction in function space. arXiv preprint
arXiv:1812.03828, 2018.
Nash, C. and Williams, C. K. The shape variational autoen-
coder: A deep generative model of part-segmented 3d
objects. In Computer Graphics Forum, volume 36, pp.
1–12. Wiley Online Library, 2017.
Neal, R. M. and Hinton, G. E. A view of the em algorithm
that justifies incremental, sparse, and other variants. In
Learning in graphical models, pp. 355–368. Springer,
1998.
Park, J. J., Florence, P., Straub, J., Newcombe, R., and
Lovegrove, S. Deepsdf: Learning continuous signed dis-
tance functions for shape representation. arXiv preprint
arXiv:1901.05103, 2019.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. Pointnet++: Deep
hierarchical feature learning on point sets in a metric
space. In Advances in Neural Information Processing
Systems, pp. 5099–5108, 2017.
Williams, F., Schneider, T., Silva, C., Zorin, D., Bruna,
J., and Panozzo, D. Deep geometric prior for surface
reconstruction. arXiv preprint arXiv:1811.10943, 2018.
Xu, Y., Fan, T., Xu, M., Zeng, L., and Qiao, Y. Spidercnn:
Deep learning on point sets with parameterized convolu-
tional filters. In Proceedings of the European Conference
on Computer Vision (ECCV), pp. 87–102, 2018.
A. Alternative Grouping Strategies
Initializing our EM point cloud groupings randomly or using
k-means causes the point cloud to be split along arbitrary
lines. It would be ideal if this initial grouping made an
attempt to adhere to the natural groupings humans perceive.
While human perceptual groupings are a complex topic and
a complete treatment is out of the scope of this project, one
simple way we group points is by looking for smooth sur-
faces. The goal of this section is to produce an initialization
method that will group some parts of a point cloud by esti-
mating how well they are represented by smooth surfaces,
resulting in a more natural looking split.
Perceptual grouping
This method operates by building
up groupings from a low level. It can build up arbitrarily
large groupings given that it can do the following:
- Smooth surfaces can be detected in small patches of the point cloud.
- Smooth surfaces can be detected in pairs of small patches of the point cloud.
The method for detecting smooth surfaces will be covered
in the next section. Assuming that such a method exists, we
can break the point cloud into small arbitrary pieces and
look for smooth surfaces in those pieces. We then attempt
to combine each of these pieces with its neighbors, making
pairwise smooth surfaces. Once we have done this, we can
build larger pieces by looking for pairwise smooth surfaces
that share the small smooth surfaces that we detected. The
collection of all pairwise smooth surfaces that are connected
by small smooth surface subsets is a perceptual grouping,
and could grow to represent very complex surfaces given
that the smaller pieces of those complex surfaces fit our
smooth surface model.
To formalize this concept, given
$$\{s_i : s_i \in \text{surfaces}\} \qquad (15)$$
and
$$\{p_k : p_k \supseteq \{s_i, s_j\},\ i \neq j,\ p_k \in \text{surfaces}\}, \qquad (16)$$
a perceptual grouping is defined as
$$\{p_k : \exists \{p_m, s_i\} \mid k \neq m,\ s_i \subset p_k,\ s_i \subset p_m\}. \qquad (17)$$
Smooth surface model
There are many ways to represent
a 3D surface. Here are the criteria we had for this model:
- The surface model should take inputs in 2 dimensions and give values in the third. This prevents the surface from representing geometry that is more complex than we intend. For example, we do not want to be able to represent a sphere, because we do not expect small patches of a larger object to resemble spheres.
- Fitting the surface to a point cloud should be efficient. For that reason, we restrict ourselves to surfaces that can be produced by sequences of linear solutions.
To meet these criteria, we fit a surface to a point cloud using
the following method:
1. We fit a plane to the point cloud using the standard $Ax + By + Cz + D = 0$ equation.
2. We use the solved plane parameters to create a new reference frame aligned with the plane. To recover the other two axes (the plane normal is the third), we find the most direct rotation that brings the Z-axis to meet our normal, and rotate the X and Y axes by the same rotation. The origin of our reference frame is the mean of the point cloud, projected onto the plane.
3. We transform the point cloud into this reference frame. To standardize the size of the point cloud, we divide by the largest of the standard deviations of the three axes. We don't want to apply different scales for each axis, as this would change the curvature of the point cloud.
4. We solve the system $Ax^2 + By^2 + Cxy + Dx + Ey + F = z$. In order to limit ourselves to surfaces without heavy curvature, we throw out any point clouds where $A$, $B$, or $C$ have an absolute value over 0.1.
All linear solves are performed either using standard linear
least squares, or SVD when the target values are all zero.
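A compact sketch of steps 1-4 (illustrative NumPy; here the SVD basis is used directly as the plane-aligned frame, a simplification of the "most direct rotation" described in step 2):

```python
import numpy as np

def fit_smooth_surface(P, curv_limit=0.1):
    """Fit the quadratic z = Ax^2 + By^2 + Cxy + Dx + Ey + F in a
    plane-aligned frame. Returns coefficients, or None if the patch
    is too curved (|A|, |B| or |C| > curv_limit)."""
    # Step 1: plane fit via SVD; the least-variance direction is the normal.
    centroid = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - centroid)
    # Step 2: plane-aligned frame; rows of Vt are tangent, tangent, normal.
    # Step 3: transform and normalize scale uniformly across all axes.
    Q = (P - centroid) @ Vt.T
    Q /= Q.std(axis=0).max()
    # Step 4: linear least squares for the quadratic height field.
    x, y, z = Q[:, 0], Q[:, 1], Q[:, 2]
    M = np.stack([x**2, y**2, x*y, x, y, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(M, z, rcond=None)
    if np.abs(coeffs[:3]).max() > curv_limit:   # reject heavy curvature
        return None
    return coeffs
```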
Determining if a point cloud is a smooth surface
We
start by determining what small point cloud patches to at-
tempt to fit smooth surfaces to. We do the following:
1. We split up the point cloud using k-means M times. This lets us look at M different splits of the point cloud.
2. We fail a patch immediately if it does not have enough points. This limit is set based on expected density. In our case, for a point cloud with 2000 points, this limit was set to 15.
3. We use RANSAC to fit our smooth surface to the patch. If 75% of the points in the patch fit the model, we pass it. We consider a point as fitting the model if the Z value predicted by the model is within a threshold of the Z value of the point in the surface's reference frame. The threshold is computed as a function of the 3D size of the patch, and in general needs to require that the surface is much less thick than it is long or wide.
4. We compute the eigenvalues of the covariance matrix of the patch, and reject any patches where more than one of the values is similar in size to the inlier threshold. This allows us to avoid fitting surfaces to lines.
Once we have a list of patches that were successfully repre-
sented by smooth surfaces, we do the following:
1. We make a neighbor list for each patch that includes the N closest other patches. In our implementation, we used 30 for N.
2. For every pair of patches that are neighbors, we combine their points and use RANSAC to fit a surface to them. The pair is considered a surface if 95% of the points are inliers.
3. We then find the unions of pairs with patch subsets in common to produce our perceptual groupings.
Figure 7. Input box point cloud.
Figure 8. Box point cloud and perceptual split.
Figure 9. Input hourglass point cloud.
Sampling of results
Figure 10. Hourglass point cloud and perceptual split.
Future work
The perceptual grouping method rarely
crosses perceptual lines, but it does happen. This is possible
when a smooth surface can be found in two patches from
two different surfaces that have a gap in between. This could
be solved by adding a requirement for surfaces to have their
points evenly distributed.
There are portions of each point cloud which are ignored
by this method, as random sampling is involved in finding
these surfaces. Also, we make the assumption that each
patch only contains a single surface. A more principled
method of splitting the point cloud into patches or a method
for extracting multiple distinct surfaces from a patch could
help to alleviate this.
The PartNet data set used to test this method has many
instances of double walling, where two versions of the same
surface exist side by side. This is difficult for our method
to deal with, due to the assumption that each patch only has
one surface. Removing this requirement should help in this
case.
Long, thin objects, such as the stem of a desk lamp, are not
well represented by these surfaces. Adding additional types
of objects such as cylinders may help.