A Moving Least Squares Material Point Method with Displacement Discontinuity and Two-Way Rigid Body Coupling (Supplementary Document)

Size: px

Start display at page:

Download "A Moving Least Squares Material Point Method with Displacement Discontinuity and Two-Way Rigid Body Coupling (Supplementary Document)"

Jordan Warner
5 years ago
Views:

1 A Moving Least Squares Material Point Method with Dislacement Discontinuity and Two-Way Rigid Body Couling (Sulementary Document) Yuanming Hu Yu Fang Ziheng Ge Ziyin Qu Yixin Zhu Andre Pradhana Chenfanfu Jiang 1 Equivalence of APIC/PolyPIC and MLS on velocities 1.1 MLS with a linear olynomial basis In the simle case of a comlete linear olynomial basis (m = l = 1) for meshless solids, MLS is a 1 st -orderconsistent interolation scheme for scattered data and derivatives. With P(x i x) = [1, (x i x) T ] T, the reconstructed function value and its gradient estimation at x are [ ] û û = M (x)q T Ξ(x) u 1. u N, (1) where N is the total number of samle data oints, Ξ(x) is the diagonal weighting matrix with Ξ ii = ξ i (x), and Q(x) = [P(x 1 x),..., P(x N x)] T, M(x) = Q T ΞQ. 1.2 APIC and MLS APIC [Jiang et al., 2015] reresents the local velocity field near each article with v and matrix C. At time n, the momentum transfer m n i v n i = m N i (x n ) ( v n + C n (x i x n ) ) is equivalent to accumulating momentum contributions from article-wise local olynomial (affine) velocity fields to the grid. When we choose the weighting function ξ i (x) to be N i (x) where N i (x) is the B-sline kernel, grid-to-article transfer is exactly evaluating Eq. 1 after grid velocity is integrated from vi n to ˆv n+1 i. To see that, we first write down APIC transfers v n+1 C n+1 = = i ( i N i (x n )ˆv n+1 i, (2) N i (x n )ˆv n+1 i (x i x n ) T ) D, (3) where D = i N i(x n )(x i x n )(x i x n ) T. The equivalence to Eq. 1 relies on realizing that, in the MLS ersective, v n+1 is the reconstructed function value û and C n+1 is the reconstructed function gradient û. 1

2 In this case, the samle data {u 1,..., u N } in Eq. 1 corresonds to the velocities of all grid nodes influenced by article. Corresondingly we have Q T (x n ) = [ 1 (x 1 x n ) T 1 (x 2 x n ) T... ] and M(x n ) = i [ ξ i (x ) 1 (x i x n ) T (x i x n ) (x i x n )(x i x n ) T ]. (4) Substituting them into Eq. 1 and relacing ξ i with N i results in a block-wise result containing Eq. 2 and Eq. 3. It is not immediately clear what the relationshi is between D in APIC and M(x ) in MLS. Note that in other meshless methods like EFG, M can be singular if neighboring data samles are co-lanar, co-linear, or insufficient, in which case the seudo-inverse is required. Note that M(x ) in 3D is a general 4 4 matrix (in linear olynomial sace) if no restriction is ut on ξ i (x). However when i ξ i(x) = 1 and i ξ i(x)(x i x) = 0, we can see that M(x ) in Eq. 4 is block diagonal with its to left entry being just 1. The choice of quadratic/cubic B-slines for N i (x) makes M(x ) even simler. In fact, it is diagonal for APIC. The to left element is simly 1, while the bottom right block D is 1 4 x2 I for quadratic N i (x) and 1 3 x2 I for cubic N i (x), where x is the grid sacing. The simle form for D comes from the roerties of B-slines, and is not generally true for other interolation functions. 1.3 PolyPIC and MLS A related discussion without exlicitly mentioning the notion of MLS is given by Fu et al. [2017] for PolyPIC. Here we briefly oint out the main idea using the MLS view. PolyPIC uses a higher order basis for P(x). While article-to-grid is also accumulating momentum contributions from article-wise local olynomial velocity fields, the grid-to-article transfer for each article is essentially evaluating basis coefficients c(x) for minimizing the functional J x (c). As discussed in the aer, J x (c) reresents a weighted average error encoding the difference between actual data samles ˆv n+1 i and their aroximation through the olynomial basis. In ractice PolyPIC stores coefficients c(x ) on each article. The size of this coefficient vector deends on the dimension of the olynomial sace. When only linear functions are used, PolyPIC reduces to APIC, and the coefficients simly contain v n+1 and all comonents in C n+1. When higher order basis functions are added to P(x) (such as multilinear functions and quadratic functions), PolyPIC rovides a more accurate olynomial fit to the local velocity field. Notice that Fu et al. [2017] additionally use Gram-Schmidt to enforce a mass orthogonal quadratic basis to reach a diagonal moment matrix M(x), which is highly relevant to the classical idea of using an orthogonal olynomial basis in EFG methods [Lu et al., 1994]. 2 MLS-MPM force differential Comuting the MLS-MPM force differential requires treating the force as a function of fictitiously deformed grid node ositions (x i = x n i + tˆv i where ˆv i are the imlicit unknown grid velocities and x n i are regular Cartesian lattice locations). Secifically, f i (x i ) = N i (x n ) D Ψ F (F (x i ))F nt (x n i x n ), (5) where the imlicit deformation gradient udate is related to x i as F (x i ) = (I + tc (x i )) F n, (6) 2

3 and C is given by C (x i ) = ( i N i (x n ) x i x n i t (x n i x n ) T ) M, (7) where M is a constant in the case of quadratic/cubic B-sline N i (x). Based on these relationshis, the differentiation can be carried out directly using the chain rule. Imlicit time steing in MLS-MPM requires comuting the action of the energy Hessian on an arbitrary grid increment δu (just like traditional MPM). f iα = δf iα = j δf iα = δf iα = δf iα = Ψ F nt F γσd N i (x n )(x n iσ x n σ) αγ 2 E x iα x jβ δu jβ V 0 2 Ψ F ωη δu jβ F nt F j αγ F ωη x γσd jβ 2 Ψ j j N i (x n )(x n iσ x n σ) ( D N j (x n )(x n jξ x n F αγ F ξ)f n ) ξηδ ωβ δujβ F nt γσd N i (x n )(x n iσ x n σ) ωη 2 Ψ ( D F αγ F ωη N j (x n )(x n jξ x n ξ)fξη n ) δujω F nt γσd N i (x n )(x n iσ x n σ), (8) therefore, δf i = A F nt D N i (x n )(x n i x n ), (9) where 2 Ψ A = F F : j D N j (x n )δu j (x n j x n ) T F n. (10) 3 High-erformance imlementation details Benchmark settings We uniformly samle 8 articles er cell, with x = , in region [0.1, 0.9] 3. This results in a grid and 8M articles. APIC transfer and a linear elasticity constitutive model is used for simlicity. To get more accurate results and cover the total run time better, we make P2G and G2P cover more oerations: P2G : Calculate force based on deformation gradient Scatter mass, article elasticity force and APIC affine momentum to grid nodes Normalize grid velocity and aly gravity G2P : Gather velocity from grid Gather velocity gradient (if necessary) and the APIC affine velocity field Udate deformation gradient Advect articles 3

4 Imlementation details Our discussion is focused on 3D cases, which are significantly more comutationally exensive than 2D ones. To make discussions concrete, we further assume 32-bit floating-oint recision and an Intel Core CPU with an architecture later than Haswell (released 2013), though most ideas can be used in other settings or architectures. We further focus the imlementation on quadratic B-slines, the smallest kernel that makes all constitutive models stable. 3.1 Algorithmic imrovement As discussed in the main body of our aer, MLS-MPM halves the number of FLOPs needed for each article. The unification of affine velocity field and deformation gradient eliminates the need for N i (x n ), which seeds u both P2G and G2P. During P2G, MLS-MPM fuses the scattering of the affine momentum and article force contribution into N i (x n )Q (x i x n ), where Q = t M Ψ F (Fn )F nt + m C, so that only one matrix-vector multilication is needed for the inner loo (27 iterations for 3D and quadratic B-slines). During deformation gradient udate in G2P, MLS-MPM avoids evaluating v x with N i(x) and can directly re-use C from APIC. 3.2 Performance engineering When the algorithm kees the same, the FLOPs required for each article is at the same level for comarable imlementations. In this case, the erformance of such a comute-intensive rogram is roughly roortional to floating oint unit (FPU) utilization, which means how many floating-oint oerations are actually done er cycle in the back-end of CPUs. Our otimization goal is to (1) reduce the memory bandwidth consumtion and to (2) imrove floating oint unit (FPU) utilization, while (3) keeing the imlementation simle. Our high-erformance MPM solver is designed based on these three rinciles. Note that due to issues such as memory alignment, comilers may not automatically generate vectorized instructions. In our imlementation, we do that manually via CPU intrinsics. As a result, our low-level erformance engineering significantly imroves FPU utilization by reducing memory bandwidth bound and unnecessary run-time comutations such as memory addresses generation. As noted in the main aer, our P2G and G2P imlementations lead to 1.93 and 7.38 FPU utilization comared with [Tamubolon et al., 2017]. Sarse grid storage and blocked transfer SPGrid [Setaluri et al., 2014] is used for background grid storage. SPGrid cleverly uses Morton coding and the translation lookaside buffer (TLB) to maximize locality and minimize access latency. Grid nodes are arranged in the unit of blocks, and our block size is Notably, the blocked nature of SPGrid effectively reduces memory bandwidth usage during transfer. Padded local grids of size nodes are used as arenas for transfers, where articles within the block region will only read from and write to their arenas. Before P2G or G2P, we gather local arena information from the global SPGrid and write back the change only after P2G. This strategy fits grid data into the L1 cache for maximum throughut and minimum latency. We call different otimized kernels for kernels with and without rigid boundary articles. Doing so raises the branching from article level to block level, which maximizes the ower of seculative execution. Practically, 4

5 only a narrow band of rigid boundaries needs CPIC. Tyically, around 90% blocks use regular MLS-MPM transfers. Avoiding data race in P2G Even though P2G will be executed on local grid arenas only, data race will still be an issue when two neighboring local grid arenas are writing back to the same global grid node. Locking is one straightforward solution. We, however, want to avoid locking for erformance and ease of imlementation. To this end, one solution is to divide the blocks into searate grous where no global nodes are shared by two or more blocks in one grou. In this way, rocessing blocks in a single grou in arallel has no data race. One way to achieve such grouing is via the block coordinates, which are defined to be a Cartesian coordinate system in the unit of SPGrid blocks. We grou blocks according to the arity of block coordinates (all axes considered jointly), leading to 4 grous in 2D and 8 grous in 3D. Low-level erformance engineering We exlicitly write CPU intrinsics to force the comiler to generate desired instructions. Fused multily-add (FMA) instructions that comute a b + c are articularly useful in our case. Note that a, b and c are vectors of four 32-bit floating oint numbers, which can be used as 3D vectors. The remaining entry can be mass, which is fully utilized during P2G. Comilers are conservative regarding unrolling, esecially for a relatively large loo (27 elements for 3D quadratic B-sline transfers). In our case, manually unrolling the loos not only saves the cost of maintaining loo variables but also allows the comiler to generate array access offsets at comile time. 4 3D rinted robot Overall, the robot erforms one degree-of-freedom rotation in an alternating triod gait. A stable 4.5V is alied to the motor to ensure constant rotation seed. Secifically, following the descritions in [Li et al., 2013], we use the same robot (RoboXlorer, Smart Lab) as the main body. Six legs are 3D rinted using ABS material and attached to the main body. These legs have the same length of 2R = 4.1 cm, width = 1.0 cm, and thickness = 0.3 cm, but only two different curvatures 1/r = [, 1]/R are used in our exeriments. The belly area of the main body is 13 2cm 2, and the total weight is 85g. The track used for the exeriment is 180 cm long and 40 cm wide, filled with fine-grained sand 8 cm dee. Two cameras are used to record the motion of the robot from the front and the to at 30 FPS. References C. Fu, Q. Guo, T. Gast, C. Jiang, and J. Teran. A olynomial article-in-cell method. ACM Trans Grah, 36(6):222:1 222:12, C. Jiang, C. Schroeder, A. Selle, J. Teran, and A. Stomakhin. The affine article-in-cell method. ACM Trans Grah, 34(4):51:1 51:10, C. Li, T. Zhang, and D. Goldman. A terradynamics of legged locomotion on granular media. Science, 339 (6126): , Y. Lu, T. Belytschko, and L. Gu. A new imlementation of the element-free galerkin method. Comuter methods in alied mechanics and engineering, 113(3-4): , R. Setaluri, M. Aanjaneya, S. Bauer, and E. Sifakis. Sgrid: A sarse aged grid structure alied to adative smoke simulation. ACM Trans Grah, 33(6):205, A. P. Tamubolon, T. Gast, G. Klár, C. Fu, J. Teran, C. Jiang, and K. Museth. Multi-secies simulation of orous sand and water mixtures. ACM Trans Grah, 36(4),

Kinematics. Why inverse? The study of motion without regard to the forces that cause it. Forward kinematics. Inverse kinematics

Kinematics. Why inverse? The study of motion without regard to the forces that cause it. Forward kinematics. Inverse kinematics Kinematics Inverse kinematics The study of motion without regard to the forces that cause it Forward kinematics Given a joint configuration, what is the osition of an end oint on the structure? Inverse