1 Introduction
While images admit a standard representation in the form of a scalar function uniformly discretized on a grid, the curse of dimensionality has prevented the effective use of analogous representations for learning 3D geometry. Voxel representations have shown some promise at low resolution [10, 18, 33, 54, 59, 66, 71], while hierarchical representations have attempted to reduce the memory footprint required for training [55, 61, 70], but at the significant cost of complex implementations. Rather than representing the volume occupied by a 3D object, one can resort to modeling its surface via a collection of points [1, 17], polygons [29, 53, 68], or surface patches [24]. Alternatively, one might follow Cezanne's advice and "treat nature by means of the cylinder, the sphere, the cone, everything brought into proper perspective", and approximate 3D geometry as geons [4] – collections of simple-to-interpret geometric primitives [65, 74] and their compositions [57, 19]. Hence, one might rightfully wonder "why do so many representations of 3D data exist, and why would one be more advantageous than another?" One observation is that multiple equivalent representations of 3D geometry exist because real-world applications need to perform different operations and queries on this data [9, Ch. 1]. For example, in computer graphics, points and polygons allow for very efficient rendering on GPUs, volumes allow artists to sculpt geometry without having to worry about tessellation [48] or to assemble geometry by smooth composition [2], and primitives enable highly efficient collision detection [63] and resolution [64]. In computer vision and robotics, analogous trade-offs exist: surface models are essential for the construction of low-dimensional parametric templates for tracking [6, 8], volumetric representations are key to capturing 3D data whose topology is unknown [45, 44], and part-based models provide a natural decomposition of an object into its semantic components. The latter creates a representation useful to reason about extent, support, mass, contact, … quantities that are key to describing the scene and, eventually, to designing action plans [27, 26].

Contributions
In this paper, we propose a novel representation for geometry based on primitive decomposition. The representation is parsimonious, as we approximate geometry via a small number of convex elements, and we seek to allow this low-dimensional representation to be inferred from data automatically – without any human supervision. More specifically, inspired by recent works [65, 19, 41], we train our pipeline in a self-supervised manner: we predict the primitive configuration as well as its parameters, checking whether the reconstructed geometry matches the geometry of the target. We note how we inherit a number of interesting properties from several of the aforementioned representations. As it is part-based, our representation is naturally locally supported, and by training on a shape collection, parts acquire a semantic association (i.e. the same element is used to represent the backs of chairs). Although part-based, the parts are not restricted to the class of boxes [65], ellipsoids [19], or sphere-meshes [64], but belong to the more general class of convexes. As a convex is defined by a collection of half-space constraints, it can be decoded simultaneously into an explicit (polygonal mesh) and an implicit (indicator function) representation. Because our encoder decomposes geometry into convexes, its output is immediately usable in any application requiring real-time physics simulation, as collision resolution between convexes is efficiently decided by GJK [21] (Figure 1). Finally, parts can interact via structuring [19] to generate smooth blending between parts.
2 Related works
Voxels
One of the simplest high-dimensional representations is voxels, and they are the most commonly used representation for discriminative models [40, 51, 58], due to their similarity to image-based convolutions. Voxels have also been used successfully for generative models [72, 14, 22, 54, 59, 71]. However, the memory requirements of voxels make them unsuitable at higher resolutions. One can reduce the memory consumption significantly by using octrees that take advantage of the sparsity of voxels [55, 69, 70, 61]. This extends the achievable resolution, but comes at the cost of a more complicated implementation.
Surfaces
In computer graphics, polygonal meshes are the standard representation of 3D objects. Meshes have also been considered for discriminative classification by applying graph convolutions to the mesh [39, 11, 25, 43]. Recently, meshes have also been considered as the output of a network [24, 30, 68]. A key weakness of these models is their tendency to produce self-intersecting meshes. Another natural high-dimensional representation that has garnered some traction in vision is the point cloud. Point clouds are the natural representation of objects acquired with sensors such as depth cameras or LiDAR, and they require far less memory than voxels. Qi et al. [50, 52] used point clouds as a representation for discriminative deep learning tasks. Hoppe et al. [28] used point clouds for surface mesh reconstruction (see also [3] for a survey of other techniques). Fan et al. [17] and Lin et al. [35] used point clouds for 3D reconstruction via deep learning. However, these approaches require additional non-trivial post-processing steps to generate the final 3D mesh.

Primitives
Far more common is to approximate the input shape by a set of volumetric primitives. From this perspective, representing shapes as voxels is a special case in which the primitives are unit cubes on a lattice. Another fundamental way to describe 3D shapes is via Constructive Solid Geometry [31]. Sharma et al. [57] present a model that outputs a program (i.e. a set of Boolean operations on shape primitives) that generates the input image or shape; in general, this is a fairly difficult task. Some of the classical primitives used in graphics and computer vision are blocks world [56], generalized cylinders [5], geons [4], and even Lego pieces [67]. In [65], a deep CNN is used to interpret a shape as a union of simple rectangular prisms. The authors also note that their model provides a consistent parsing across shapes (i.e. the head is captured by the same primitive), allowing some interpretability of the output. In [47], cuboids are extended to superquadrics, showing that the extra flexibility results in much lower error and qualitatively better-looking results.
Implicit surfaces
If one generalizes the shape primitives to analytic surfaces (i.e. level sets of analytic functions), new analytic tools become available for generating shapes. In [41], for instance, a model is trained to discriminate inside coordinates from outside coordinates (referred to as an occupancy function in the paper, and as an indicator function in the graphics community). Park et al. [46] used the signed distance function to the surface of the shape to achieve the same goal. One disadvantage of such implicit descriptions is that the explicit geometry is absent from the final answer and must be extracted in post-processing. In [19], a more geometric approach is taken by restricting the functions to level sets of Gaussian mixtures. Partly due to the restrictions of the functions they use, their representation struggles on shapes with angled parts, but it recovers the interpretability of [65] by considering the level set of each Gaussian component individually.
Convex decomposition
In graphics, a common method for representing shapes is to describe them as a collection of convex objects, and several methods for the convex decomposition of meshes have been proposed [23, 49]. In machine learning, however, we only find initial attempts to approach convex hull computation via neural networks [32]. Splitting a mesh into exactly convex parts generally produces too many pieces [13]. As such, it is more prudent to seek a small number of approximately convex objects that generate the input shape [20, 34, 36, 38, 37]. Recently, [63] extended convex decomposition to the spatio-temporal domain by considering moving geometry. Our method is most related to [65] and [19], in that we train an occupancy function for the input shapes. However, we choose our space of functions so that their level sets are approximately convex, and use these as building blocks.

3 Method – CvxNets
Our object O is represented via an indicator function O: ℝ³ → [0, 1], and with ∂O we indicate the surface of the object. The indicator function is defined such that O(x) = 0 for points outside the object and O(x) = 1 for points inside. Given an input (e.g. an image, point cloud, or voxel grid), an encoder estimates the parameters {β_k} of our template representation with K primitives (indexed by k). We then evaluate the template at random sample points x, and our training loss ensures Ô(x) ≈ O(x). In the discussion below, without loss of generality, we use 2D illustrative examples where x ∈ ℝ². Our representation is a differentiable convex decomposition, which is used to train an image encoder in an end-to-end fashion. We begin by describing a differentiable representation of a single convex object (Section 3.1). Then we introduce an autoencoder architecture to create a low-dimensional family of approximate convexes (Section 3.2). These allow us to represent objects as spatial compositions of convexes (Section 3.3). We then describe the losses used to train our networks (Section 3.4) and mention a few implementation details (Section 3.5).

3.1 Differentiable convex indicator – Figure 2
We define a decoder that, given a collection of (unordered) half-space constraints, constructs the indicator function of a single convex object; such a function can be evaluated at any point x ∈ ℝ³. We define H_h(x) = n_h · x + d_h as the signed distance of the point x from the h-th plane with normal n_h and offset d_h. Given a sufficiently large number H of half-planes, the signed distance function of any convex object can be approximated by taking the intersection (max operator) of the signed distance functions of the planes. To facilitate gradient learning, instead of the maximum we use the smooth maximum function LogSumExp and define the approximate signed distance function Φ(x):
$$\Phi(x) = \mathrm{LogSumExp}_h\{\delta\,\mathcal{H}_h(x)\} \tag{1}$$
Note this is an approximate SDF, as the property ‖∇Φ(x)‖ = 1 is not necessarily satisfied everywhere. We then convert the signed distance function to an indicator function C:
$$\mathcal{C}(x\mid\beta) = \mathrm{Sigmoid}(-\sigma\,\Phi(x)) \tag{2}$$
We denote the collection of hyperplane parameters as h = {(n_h, d_h)}, and the overall set of parameters for a convex as β. We treat δ as a hyperparameter, and consider the rest as the learnable parameters of our representation. As illustrated in Figure 2, the parameter δ controls the smoothness of the generated convex, while σ controls the sharpness of the transition of the indicator function. Similar to the smooth maximum function, the soft classification boundary created by the Sigmoid facilitates gradient learning. In summary, given a collection of hyperplane parameters, this differentiable module generates a function that can be evaluated at any position x.
3.2 Convex encoder/decoder – Figure 3
A sufficiently large set of hyperplanes can represent any convex object, but one may ask whether it is possible to discover some form of correlation between their parameters. Towards this goal, we employ the bottleneck autoencoder architecture illustrated in Figure 3. Given an input, the encoder derives a latent representation, from which a decoder derives the collection of hyperplane parameters. While in theory permuting the hyperplanes generates the same convex, the decoder correlates each particular hyperplane with a corresponding orientation. This is visible in Figure 4, where we color-code different 2D hyperplanes and indicate their orientation distribution in a simple 2D autoencoding task. As ellipsoids and oriented cuboids are convexes, we argue that the architecture in Figure 3 generalizes the core geometric primitives proposed in VP [65] and SIF [19]; we verify this claim in Figure 5.
3.3 Multi-convex decomposition – Figure 6
Having a learnable pipeline for a single convex object, we can now generalize the expressivity of our model by representing generic non-convex objects as compositions of convex elements [63]. To achieve this, an encoder outputs a low-dimensional latent representation of all K convexes, which is decoded into a collection of K parameter tuples. Each tuple (indexed by k) comprises a shape code β_k and a corresponding transformation T_k(x) = x + t_k that transforms the point x from world coordinates to local coordinates; t_k is the predicted translation vector (Figure 6).
3.4 Training losses
First and foremost, we want the (ground truth) indicator function of our object to be well approximated:
$$\mathcal{L}_{\mathrm{approx}}(\omega) = \mathbb{E}_{x\sim\mathbb{R}^3}\,\|\hat{O}(x) - O(x)\|^2 \tag{3}$$
where Ô(x) = max_k C_k(x|β_k) and O(x) is the ground-truth indicator. The application of the max operator produces a perfect union of indicator functions. While constructive solid geometry typically applies the min operator to compute the union of signed distance functions, note that we apply the max operator to indicator functions with the same effect; see Appendix A for more details. We couple the approximation loss with a small set of auxiliary losses that enforce the desired properties of our decomposition.
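A minimal numerical sketch of the union-via-max and its approximation loss; the hard-ball indicators here are toy stand-ins for learned convexes, and all names (`union_indicator`, `approx_loss`, `ball`) are hypothetical:

```python
import numpy as np

def union_indicator(convexes, x):
    """Composite indicator O-hat(x) = max_k C_k(x): the union of parts."""
    return max(C(x) for C in convexes)

def approx_loss(convexes, samples, labels):
    """Monte-Carlo estimate of the approximation loss: squared error between
    the composite indicator and ground-truth inside/outside labels."""
    pred = np.array([union_indicator(convexes, x) for x in samples])
    return float(np.mean((pred - labels) ** 2))

# Toy convexes: hard indicators of two unit balls (stand-ins for C_k).
ball = lambda c: (lambda x: 1.0 if np.linalg.norm(x - c) < 1.0 else 0.0)
convexes = [ball(np.zeros(3)), ball(np.array([3.0, 0.0, 0.0]))]

samples = np.array([[0.0, 0, 0], [3.0, 0, 0], [10.0, 0, 0]])
labels = np.array([1.0, 1.0, 0.0])   # inside, inside, outside
```

With labels consistent with the two balls, `approx_loss(convexes, samples, labels)` evaluates to exactly zero, illustrating that the max-union reproduces the ground truth at those samples.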
Decomposition loss (auxiliary).
We seek a parsimonious decomposition of an object, akin to Tulsiani et al. [65]. Hence, overlap between elements should be discouraged:
$$\mathcal{L}_{\mathrm{decomp}}(\omega) = \mathbb{E}_{x\sim\mathbb{R}^3}\,\Big\|\mathrm{relu}\Big(\sum_k \mathcal{C}_k(x\mid\beta_k) - \tau\Big)\Big\|^2 \tag{4}$$
where we use a permissive threshold τ, and note how the relu activates the loss only when an overlap occurs.
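A small sketch of this overlap penalty, operating on precomputed indicator values; the function name and the (K, S) array layout are assumptions of this sketch:

```python
import numpy as np

def decomp_loss(C_vals, tau=2.0):
    """Overlap penalty on precomputed indicators: C_vals has shape (K, S),
    the K per-convex indicator values at S sample points.  The relu fires
    only where the summed occupancy exceeds the permissive threshold tau."""
    overlap = np.maximum(C_vals.sum(axis=0) - tau, 0.0)  # relu
    return float(np.mean(overlap ** 2))
```

For example, with three convexes fully covering the first of two samples, the summed occupancy 3 exceeds τ = 2 by 1, so the loss averages 1² and 0 over the two samples.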
Unique parameterization loss (auxiliary).
While each convex is parameterized with respect to the origin, there is a null-space of solutions – we can move the origin to another location within the convex and update the offsets and the transformation accordingly – see Figure 7 (left). To remove this null-space, we simply regularize the magnitudes of the offsets for each of the elements:
$$\mathcal{L}_{\mathrm{unique}}(\omega) = \sum_k \sum_h \|d_h^k\|^2 \tag{5}$$
This loss further ensures that "inactive" hyperplane constraints can be readily re-activated during learning: because the regularization keeps them fitting tightly around the surface, they remain sensitive to shape changes.
Guidance loss (auxiliary).
As we will describe in Section 3.5, we use offline sampling to speed up training. However, this can cause severe issues. In particular, when a convex "falls within the cracks" of sampling (i.e. no sample point lands inside it), it can be effectively removed from the learning process. This can easily happen when the convex enters a degenerate state (i.e. it becomes empty). Unfortunately, such degenerate configurations are encouraged by the loss (5). We can prevent collapses by ensuring that each convex represents a certain amount of information (i.e. samples):
$$\mathcal{L}_{\mathrm{guide}}(\omega) = \frac{1}{N}\sum_k \sum_{x\in\mathcal{N}_k} \|\mathcal{C}_k(x\mid\beta_k) - O(x)\|^2 \tag{6}$$
where 𝒩_k is the subset of N samples from inside the ground-truth shape with the smallest distance values Φ_k(x) from the k-th convex. In other words, each convex is responsible for representing at least the N closest interior samples.
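The sample-selection mechanism can be sketched as follows; the function name, the (K, S) array layout, and evaluating on interior samples only (so the target label is 1 everywhere) are assumptions of this sketch:

```python
import numpy as np

def guidance_loss(phi, C_vals, n_closest=2):
    """Guidance sketch: phi and C_vals are (K, S) arrays of per-convex
    approximate SDF and indicator values at S samples drawn inside the
    ground-truth shape.  Each convex k is held responsible for the
    n_closest samples with the smallest phi_k (its subset N_k)."""
    loss = 0.0
    for phi_k, C_k in zip(phi, C_vals):
        idx = np.argsort(phi_k)[:n_closest]   # N_k: the closest samples
        loss += np.mean((C_k[idx] - 1.0) ** 2)  # interior target label is 1
    return float(loss / len(phi))
```

When a convex already covers its closest samples the loss is zero; when it has collapsed away from them (indicators near 0), the loss pulls those indicators back toward 1, preventing the convex from being silently dropped.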
Localization loss (auxiliary).
When a convex is far from interior points, the loss in (6) suffers from vanishing gradients due to the sigmoid function. We overcome this problem by adding a loss with respect to t_k, the translation vector of the k-th convex:

$$\mathcal{L}_{\mathrm{loc}}(\omega) = \sum_k \sum_{x\in\mathcal{N}_k} \|t_k - x\|^2 \tag{7}$$
Observations.
Note that we supervise the indicator function C rather than Φ, as the latter does not represent the exact signed distance function of a convex. Also note how the loss in (4) is reminiscent of SIF [19, Eq. 1], where the overall surface is modeled as a sum of metaball implicit functions [7] – which the authors call "structuring". The core difference lies in the fact that SIF [19] models the surface of the object as an iso-level of the function post-structuring – therefore, in most cases, the iso-surfaces of the individual primitives do not approximate the target surface, resulting in a slight loss of interpretability in the generated representation.
3.5 Implementation details
To increase training speed, we sample a set of points on the ground-truth shape offline, precompute the ground-truth quantities, and then randomly subsample from this set during our training loop. For volumetric samples, we use the samples from OccNet [41], while for surface samples we employ the "near-surface" sampling described in SIF [19]. We draw random samples from the bounding box of the shape and from its surface to construct the point samples and labels, and at training time we subsample an equal number of points from both sample sources. Although Mescheder et al. [41] claim that uniform volumetric samples are more effective than surface samples, we find that balancing these two strategies yields the best performance – this can be attributed to the complementary effect of the losses in (3) and (4).
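The balanced subsampling step can be sketched as below; the function name and the exact 50/50 split are assumptions of this sketch (the text only states that the two sample sources are balanced):

```python
import numpy as np

def sample_batch(vol_pts, vol_lbl, surf_pts, surf_lbl, n=1024, seed=0):
    """Draw a training batch from two precomputed, offline sample pools:
    half uniform volumetric samples and half near-surface samples."""
    rng = np.random.default_rng(seed)
    i = rng.choice(len(vol_pts), n // 2, replace=False)   # volumetric subset
    j = rng.choice(len(surf_pts), n // 2, replace=False)  # near-surface subset
    pts = np.concatenate([vol_pts[i], surf_pts[j]])
    lbl = np.concatenate([vol_lbl[i], surf_lbl[j]])
    return pts, lbl
```

Because the pools and their inside/outside labels are precomputed once, each training iteration only pays for the cheap index selection above rather than for fresh sampling and labeling.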
Architecture details.
In all our experiments, we use the same architecture while varying the number of convexes and hyperplanes. For the {Depth}-to-3D task, we use 50 convexes, each with 50 hyperplanes. For the RGB-to-3D task, we use 50 convexes, each with 25 hyperplanes. Similarly to OccNet [41], we use ResNet18 as the encoder for both the {Depth}-to-3D and RGB-to-3D experiments. A fully connected layer then generates the latent code that is provided as input to the decoder. For the decoder we use a sequential model with four hidden layers. For each of the K elements, the output specifies a translation (3 DOFs), and each hyperplane is specified by its (unit) normal and its offset from the origin. In all our experiments, we train with Adam with a fixed learning rate; as determined by grid-search on the validation set, we set the weights for our losses.
4 Experiments
We use the ShapeNet [12] dataset in our experiments, with the same voxelization, image renderings, and train/test split as Choy et al. [14]; we further divide the training set into a training and a validation set to select our hyperparameters, as in [41]. Moreover, we use the same multi-view depth renderings as [19] for our {Depth}-to-3D experiments, where we render each example from cameras placed on the vertices of a dodecahedron. At training time we need ground-truth inside/outside labels, so we employ the watertight meshes from [41] – this also ensures a fair comparison to that method. For the quantitative evaluation of semantic decomposition, we use labels from PartNet [42] and exploit its geometric overlap with ShapeNet.
Methods.
We quantitatively compare our method to a number of self-supervised algorithms with different characteristics. First, we consider VP [65], which learns a parsimonious approximation of the input via (the union of) oriented boxes. We also compare to the Structured Implicit Function (SIF) [19] method, which represents solid geometry as an iso-level of a sum of weighted Gaussians; like VP [65], and in contrast to OccNet [41], this method provides an interpretable encoding of geometry. Finally, from the class of techniques that directly learn non-interpretable representations of implicit functions, we select OccNet [41], P2M [68], and AtlasNet [24]; in contrast to the previous methods, these solutions do not provide any form of shape decomposition. As OccNet [41] only reports results on RGB-to-3D tasks, we extend the original codebase to also solve {Depth}-to-3D tasks. We follow the same pre-processing, architecture, and training hyperparameters used by SIF [19].
Metrics.
With Ô and ∂Ô we respectively indicate the indicator and the surface of the union of our primitives. We then use three quantitative metrics to evaluate the performance of 3D reconstruction: (1) the Volumetric IoU; note that by using uniform samples to estimate this metric, our estimate is more accurate than the voxel-grid estimate used by [14]; (2) the Chamfer-L1 distance, a smooth relaxation of the symmetric Hausdorff distance averaging reconstruction accuracy and completeness [16]; (3) following the arguments presented in [62], the F-score, which can be understood as "the percentage of correctly reconstructed surface".
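The F-score metric can be sketched as follows on point samples of the two surfaces; the function name, the brute-force pairwise distances (real evaluations typically use a KD-tree), and the default threshold are assumptions of this sketch:

```python
import numpy as np

def f_score(pred_pts, gt_pts, tau=0.01):
    """F-score at distance threshold tau: the harmonic mean of precision
    (fraction of predicted surface points within tau of the ground truth)
    and recall (fraction of ground-truth points within tau of the
    prediction)."""
    # Pairwise distances between predicted and ground-truth surface samples.
    d = np.linalg.norm(pred_pts[:, None] - gt_pts[None, :], axis=-1)
    precision = np.mean(d.min(axis=1) < tau)
    recall = np.mean(d.min(axis=0) < tau)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```

A perfect reconstruction scores 1.0 (every point is within τ of the other surface), while a reconstruction entirely farther than τ from the target scores 0.0.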
4.1 Abstraction – Figure 8, 9, 10
As our convex decomposition is learnt on a shape collection, the convex elements produced by our decoder are in natural correspondence – e.g. we expect the same k-th convex to represent the leg of a chair across the chairs dataset. We analyze this quantitatively on the PartNet dataset by verifying whether the k-th component is consistently mapped to the same PartNet part label; see Figure 8 (left) for the distribution of PartNet labels within each component. We can then assign the most commonly associated label to a given convex to segment the PartNet point cloud, achieving a relatively high accuracy; see Figure 8 (right). This reveals how our latent representation captures the semantic structure of the dataset. We also evaluate our shape abstraction capabilities by varying the number of components and evaluating the trade-off between representation parsimony and reconstruction accuracy; we visualize this via Pareto-optimal curves in the plot of Figure 9. We compare with SIF [19], and note that, thanks to the generalized shape space of our model, our curve dominates theirs regardless of the number of primitives chosen. We further investigate the use of natural correspondence in a part-based retrieval task: we first encode an input into our representation, allow a user to select a few parts of interest, and then use this (incomplete) shape code to fetch the elements in the training set with the closest (partial) shape code; see Figure 10.
4.2 Reconstruction – Table 1 and Figure 12
We quantitatively evaluate the reconstruction performance against a number of state-of-the-art methods given multiple depth maps ({Depth}-to-3D) or a single color image (RGB-to-3D) as input; see Table 1. A few qualitative examples are displayed in Figure 12. We find that CvxNet is: (1) consistently better than other part-decomposition methods (SIF, VP, and SQ), which share the common goal of learning shape elements; (2) in general comparable to the state-of-the-art reconstruction methods; (3) significantly better than the leading technique (OccNet [41]) when evaluated in terms of F-score and tested on multi-view depth input. Note that SIF [19] first trains the template parameters on {Depth}-to-3D with a reconstruction loss, and then trains the RGB-to-3D image encoder with a parameter regression loss; conversely, our method trains both the encoder and the decoder of the RGB-to-3D task from scratch.

Category  IoU  Chamfer  F-Score

OccNet  SIF  Ours  OccNet  SIF  Ours  OccNet  SIF  Ours  
airplane  0.728  0.662  0.739  0.031  0.029  0.025  79.52  71.40  84.68 
bench  0.655  0.533  0.631  0.041  0.058  0.043  71.98  58.35  77.68 
cabinet  0.848  0.783  0.830  0.138  0.039  0.048  71.31  59.26  76.09 
car  0.830  0.772  0.826  0.071  0.022  0.031  69.64  56.58  77.75 
chair  0.696  0.572  0.681  0.124  0.102  0.115  63.14  42.37  65.39 
display  0.763  0.693  0.762  0.087  0.049  0.065  63.76  56.26  71.41 
lamp  0.538  0.417  0.494  0.678  0.216  0.352  51.60  35.01  51.37 
speaker  0.806  0.742  0.784  0.440  0.067  0.112  58.09  47.39  60.24 
rifle  0.666  0.604  0.684  0.033  0.028  0.023  78.52  70.01  83.63 
sofa  0.836  0.760  0.828  0.052  0.039  0.036  69.66  55.22  75.44 
table  0.699  0.572  0.660  0.152  0.112  0.121  68.80  55.66  71.73 
phone  0.885  0.831  0.869  0.022  0.024  0.018  85.60  81.82  89.28 
vessel  0.719  0.643  0.708  0.070  0.041  0.052  66.48  54.15  70.77 
mean  0.744  0.660  0.731  0.149  0.064  0.080  69.08  59.02  73.49 
{Depth}-to-3D
Category  IoU  Chamfer  F-Score

P2M  AtlasNet  OccNet  SIF  Ours  P2M  AtlasNet  OccNet  SIF  Ours  AtlasNet  OccNet  SIF  Ours  
airplane  0.420    0.571  0.530  0.598  0.187  0.104  0.147  0.065  0.093  67.24  62.87  52.81  68.16 
bench  0.323    0.485  0.333  0.461  0.201  0.138  0.155  0.131  0.133  54.50  56.91  37.31  54.64 
cabinet  0.664    0.733  0.648  0.709  0.196  0.175  0.167  0.102  0.160  46.43  61.79  31.68  46.09 
car  0.552    0.737  0.657  0.675  0.180  0.141  0.159  0.056  0.103  51.51  56.91  37.66  47.33 
chair  0.396    0.501  0.389  0.491  0.265  0.209  0.228  0.192  0.337  38.89  42.41  26.90  38.49 
display  0.490    0.471  0.491  0.576  0.239  0.198  0.278  0.208  0.223  42.79  38.96  27.22  40.69 
lamp  0.323    0.371  0.260  0.311  0.308  0.305  0.479  0.454  0.795  33.04  38.35  20.59  31.41 
speaker  0.599    0.647  0.577  0.620  0.285  0.245  0.300  0.253  0.462  35.75  42.48  22.42  29.45 
rifle  0.402    0.474  0.463  0.515  0.164  0.115  0.141  0.069  0.106  64.22  56.52  53.20  63.74 
sofa  0.613    0.680  0.606  0.677  0.212  0.177  0.194  0.146  0.164  43.46  48.62  30.94  42.11 
table  0.395    0.506  0.372  0.473  0.218  0.190  0.189  0.264  0.358  44.93  58.49  30.78  48.10 
phone  0.661    0.720  0.658  0.719  0.149  0.128  0.140  0.095  0.083  58.85  66.09  45.61  59.64 
vessel  0.397    0.530  0.502  0.552  0.212  0.151  0.218  0.108  0.173  49.87  42.37  36.04  45.88 
mean  0.480    0.571  0.499  0.567  0.216  0.175  0.215  0.165  0.245  48.57  51.75  34.86  47.36 
RGB-to-3D
4.3 Latent space analysis – Figure 11
We investigate the latent representation learnt by our network by computing a t-SNE embedding. Notice how (1) nearby (distant) samples within the same class have a similar (dissimilar) geometric structure, and (2) the overlap between cabinets and speakers is meaningful, as both exhibit cuboid geometry. Our interactive exploration of the t-SNE space revealed that our method produces more meaningful embeddings than OccNet [41]; we illustrate this qualitatively in Figure 11.
4.4 Ablation studies
We summarize the results of several ablation studies found in the supplementary material. Our analysis reveals that the method is relatively insensitive to the chosen size of the latent space. We also investigate the effect of varying the number of convexes and the number of hyperplanes on reconstruction accuracy and inference/training time. Finally, we perform an ablation study with respect to our losses, and verify that each is beneficial towards effective learning.
5 Conclusions
We propose a differentiable representation of convex primitives that is amenable to learning. The inferred representations are directly usable in graphics/physics pipelines; see Figure 1. Our self-supervised technique provides more detailed reconstructions than recently proposed part-based techniques (SIF [19] in Figure 9), and even consistently beats the leading reconstruction technique on multi-view input (OccNet [41] in Table 1). In the future we would like to generalize the model to predict a variable number of parts [65], to understand symmetries and modeling hierarchies [73], and to include the modeling of rotations [65]. Leveraging the invariance to hyperplane ordering, it would be interesting to investigate the effect of permutation-invariant encoders [60], or to remove encoders altogether in favor of auto-decoder architectures [46].
6 Acknowledgements
We would like to thank Luca Prasso and Timothy Jeruzalski for their help with preparing the rigid-body simulations, Avneesh Sud for reviewing our draft, and Simon Kornblith, Anton Mikhailov, Tom Funkhouser, Erwin Coumans, and Bill Freeman for fruitful discussions.
Appendix A Union of Smooth Indicator Functions
We define the smooth indicator function for the k-th object:

$$\mathcal{C}_k(x) = \mathrm{Sigmoid}(-\sigma\,\Phi_k(x)) \tag{8}$$

where Φ_k is the k-th object's (approximate) signed distance function. In constructive solid geometry, the union of signed distance functions is defined using the min operator. Therefore, the union operator for our indicator functions can be written:
$$\hat{O}(x) = \mathrm{Sigmoid}\big(-\sigma \min_k \Phi_k(x)\big) \tag{9}$$
$$= \mathrm{Sigmoid}\big(\sigma \max_k \{-\Phi_k(x)\}\big) \tag{10}$$
$$= \mathrm{Sigmoid}\big(\max_k \{-\sigma\,\Phi_k(x)\}\big) \tag{11}$$
$$= \max_k\,\mathrm{Sigmoid}(-\sigma\,\Phi_k(x)) = \max_k\,\mathcal{C}_k(x) \tag{12}$$
Note that the max operator commutes with monotonically increasing functions, allowing us to extract it from the Sigmoid function in (11).
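This monotonicity argument is easy to verify numerically; the values of φ and σ below are arbitrary toy choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoid is monotonically increasing, so it commutes with max:
# Sigmoid(max_k{-sigma Phi_k}) == max_k Sigmoid(-sigma Phi_k),
# which is exactly the step from (11) to (12).
phi = np.array([0.7, -0.2, 1.4])   # toy per-convex SDF values at one point
sigma = 75.0
lhs = sigmoid((-sigma * phi).max())   # apply max first, then Sigmoid
rhs = sigmoid(-sigma * phi).max()     # apply Sigmoid first, then max
```

Both evaluations produce the same number, confirming the exchange of max and Sigmoid used in the derivation.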
References
 [1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. 2018.
 [2] Baptiste Angles, Marco Tarini, Loic Barthe, Brian Wyvill, and Andrea Tagliasacchi. Sketch-based implicit blending. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2017.
 [3] Matthew Berger, Andrea Tagliasacchi, Lee M Seversky, Pierre Alliez, Gael Guennebaud, Joshua A Levine, Andrei Sharf, and Claudio T Silva. A survey of surface reconstruction from point clouds. In Computer Graphics Forum, volume 36, pages 301–329. Wiley Online Library, 2017.
 [4] Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 1987.
 [5] Thomas Binford. Visual perception by computer. In IEEE Conference of Systems and Control, 1971.
 [6] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In ACM Trans. on Graphics (Proc. of SIGGRAPH), 1999.
 [7] James F Blinn. A generalization of algebraic surface drawing. ACM Trans. on Graphics (TOG), 1(3):235–256, 1982.
 [8] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In Proc. of the European Conf. on Comp. Vision, 2016.
 [9] Mario Botsch, Leif Kobbelt, Mark Pauly, Pierre Alliez, and Bruno Lévy. Polygon mesh processing. AK Peters/CRC Press, 2010.
 [10] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
 [11] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [12] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

 [13] Bernard M Chazelle. Convex decompositions of polyhedra. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pages 70–79. ACM, 1981.
 [14] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. In Proc. of the European Conf. on Comp. Vision. Springer, 2016.
 [15] Erwin Coumans and Yunfei Bai. PyBullet, a python module for physics simulation for games, robotics and machine learning. pybullet.org, 2016–2019.
 [16] Siyan Dong, Matthias Niessner, Andrea Tagliasacchi, and Kevin Kai Xu. Multirobot collaborative dense scene reconstruction. ACM Trans. on Graphics (Proc. of SIGGRAPH), 2019.

 [17] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
 [18] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In International Conference on 3D Vision (3DV), 2017.
 [19] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. arXiv preprint arXiv:1904.06447, 2019.
 [20] Mukulika Ghosh, Nancy M Amato, Yanyan Lu, and Jyh-Ming Lien. Fast approximate convex decomposition using relative concavity. Computer-Aided Design, 45(2):494–504, 2013.
 [21] Elmer G Gilbert, Daniel W Johnson, and S Sathiya Keerthi. A fast procedure for computing the distance between complex objects in threedimensional space. IEEE Journal on Robotics and Automation, 1988.
 [22] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In Proc. of the European Conf. on Comp. Vision, pages 484–499. Springer, 2016.
 [23] Ronald L. Graham. An efficient algorithm for determining the convex hull of a finite planar set. Info. Pro. Lett., 1:132–133, 1972.
 [24] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In Proc. of Comp. Vision and Pattern Recognition (CVPR), 2018.

[25] Kan Guo, Dongqing Zou, and Xiaowu Chen. 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics (TOG), 35(1):3, 2015.
[26] Eric Heiden, David Millard, and Gaurav Sukhatme. Real2sim transfer using differentiable physics. Workshop on Closing the Reality Gap in Sim2real Transfer for Robotic Manipulation, 2019.
 [27] Eric Heiden, David Millard, Hejia Zhang, and Gaurav S Sukhatme. Interactive differentiable simulation. arXiv preprint arXiv:1905.10706, 2019.
 [28] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In ACM SIGGRAPH Computer Graphics, volume 26, pages 71–78. ACM, 1992.
[29] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In Proc. of the European Conf. on Comp. Vision, 2018.
[30] Chen Kong, Chen-Hsuan Lin, and Simon Lucey. Using locally corresponding cad models for dense 3d reconstructions from a single image. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 4857–4865, 2017.
 [31] David H Laidlaw, W Benjamin Trumbore, and John F Hughes. Constructive solid geometry for polyhedral objects. In ACM Trans. on Graphics (Proc. of SIGGRAPH), 1986.
 [32] Yee Leung, JiangShe Zhang, and ZongBen Xu. Neural networks for convex hull computation. IEEE Transactions on Neural Networks, 8(3):601–611, 1997.
 [33] Yiyi Liao, Simon Donne, and Andreas Geiger. Deep marching cubes: Learning explicit surface representations. In Proc. of Comp. Vision and Pattern Recognition (CVPR), 2018.
[34] Jyh-Ming Lien and Nancy M Amato. Approximate convex decomposition of polyhedra. In Proceedings of the 2007 ACM symposium on Solid and physical modeling, pages 121–131. ACM, 2007.

[35] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[36] Guilin Liu, Zhonghua Xi, and Jyh-Ming Lien. Nearly convex segmentation of polyhedra through convex ridge separation. Computer-Aided Design, 78:137–146, 2016.
 [37] Khaled Mamou and Faouzi Ghorbel. A simple and efficient approach for 3d mesh approximate convex decomposition. In 2009 16th IEEE international conference on image processing (ICIP), pages 3501–3504. IEEE, 2009.
[38] Khaled Mamou. Volumetric hierarchical approximate convex decomposition. In Eric Lengyel, editor, Game Engine Gems 3, pages 141–158. A K Peters, 2016.
 [39] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pages 37–45, 2015.
[40] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
 [41] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828, 2018.
[42] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.
 [43] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 5115–5124, 2017.
[44] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
[45] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Proc. ISMAR. IEEE, 2011.
 [46] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.
 [47] Despoina Paschalidou, Ali Osman Ulusoy, and Andreas Geiger. Superquadrics revisited: Learning 3d shape parsing beyond cuboids. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
 [48] Jason Patnode. Character Modeling with Maya and ZBrush: Professional polygonal modeling techniques. Focal Press, 2012.
 [49] Franco P Preparata and Se June Hong. Convex hulls of finite sets of points in two and three dimensions. Communications of the ACM, 20(2):87–93, 1977.
 [50] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [51] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multiview cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
 [52] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
 [53] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. Generating 3d faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
 [54] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems, 2016.
 [55] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.
[56] Lawrence G Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of Technology, 1963.
 [57] Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. Csgnet: Neural shape parser for constructive solid geometry. In Proc. of Comp. Vision and Pattern Recognition (CVPR), 2018.
[58] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
 [59] David Stutz and Andreas Geiger. Learning 3d shape completion from laser scan data with weak supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1955–1964, 2018.
[60] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Attentive context normalization for robust permutation-equivariant learning. arXiv preprint arXiv:1907.02545, 2019.
[61] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2088–2096, 2017.
[62] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3405–3414, 2019.
 [63] Daniel Thul, Sohyeon Jeong, Marc Pollefeys, et al. Approximate convex decomposition and transfer for animated meshes. In SIGGRAPH Asia 2018 Technical Papers, page 226. ACM, 2018.
[64] Anastasia Tkach, Mark Pauly, and Andrea Tagliasacchi. Sphere-meshes for real-time hand modeling and tracking. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2016.
 [65] Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [66] Ali Osman Ulusoy, Andreas Geiger, and Michael J Black. Towards probabilistic volumetric reconstruction using ray potentials. In International Conference on 3D Vision (3DV), 2015.
[67] Anton van den Hengel, Chris Russell, Anthony Dick, John Bastian, Daniel Pooley, Lachlan Fleming, and Lourdes Agapito. Part-based modelling of compound scenes from images. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 878–886, 2015.
 [68] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and YuGang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proc. of the European Conf. on Comp. Vision, 2018.
[69] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
[70] Peng-Shuai Wang, Chun-Yu Sun, Yang Liu, and Xin Tong. Adaptive o-cnn: a patch-based deep representation of 3d shapes. In SIGGRAPH Asia 2018 Technical Papers, page 217. ACM, 2018.
[71] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems, pages 82–90, 2016.
 [72] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015.
[73] Fenggen Yu, Kun Liu, Yan Zhang, Chenyang Zhu, and Kai Xu. Partnet: A recursive part decomposition network for fine-grained and hierarchical shape segmentation. Proc. of Comp. Vision and Pattern Recognition (CVPR), 2019.

[74] Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017.