Optimizing Irregular Adaptive Applications on Multi-threaded Processors: The Case of Medium-Grain Parallel Delaunay Mesh Generation


Optimizing Irregular Adaptive Applications on Multi-threaded Processors: The Case of Medium-Grain Parallel Delaunay Mesh Generation

Filip Blagojević
The College of William & Mary
CSci 710 Master's Project
filip@cs.wm.edu
July 12, 2006

Abstract

The importance of parallel mesh generation and the rapid growth of SMT architectures raise an important question: how should parallel mesh generation software be adapted to the SMT architecture? In this work we focus on Parallel Constrained Delaunay Mesh Generation. We explore medium-grain parallelism at the sub-domain level. This parallel approach targets commercially available SMT processors. Our goal is to improve the performance of the existing, MPI-based, parallel mesh generation software (PCDT) by exploiting multi-threading inside a single SMT chip. This report presents parallel mesh generation software based on medium-grain parallelism, which we developed on top of the existing PCDT program. By extending PCDT instead of creating completely new software, we reduced the development complexity and achieved 100% code reuse. Experimental evaluation shows that using different contexts of an SMT processor can improve the performance of parallel mesh generation software. However, the medium-grain approach suffers from significant synchronization overhead caused by different threads working on the same sub-domain. This work makes two general contributions. First, we extended the coarse-grain parallel mesh generation software (PCDT), combining the coarse-grain approach with a medium-grain approach. Second, we significantly improved the performance of the PCDT software using the optimizations developed for the MPCDT code and the SMT architecture. These changes made PCDT faster than Triangle, the best publicly available 2D Delaunay mesh generation software, when executed on a single physical SMT processor.

1 Introduction

Simultaneous Multi-Threading (SMT) technology makes a single physical processor appear as multiple logical processors and achieves higher throughput via simultaneous execution of multiple instruction streams and overlapping of memory latencies. It allows multiple threads to issue instructions each cycle. The die area of a hyper-threaded chip is only about 5% larger than that of an ordinary processor, yet this technology can provide benefits much greater than 5% [16]. When executing on an SMT processor, multiple threads share hardware resources.

Typical shared hardware resources are the execution units, the branch prediction unit, the bus, and all levels of cache. The successful development of SMT processors [13, 17] introduces new challenges for application developers. Exploiting on-chip parallelism, in general, can be significantly different from exploiting inter-processor parallelism. For example, on an SMT architecture a problem can occur when one thread experiences a very long latency operation, such as a load miss. That thread will stall for a certain number of cycles, holding shared resources that could be used by the other thread [23].

Mesh generation algorithms are essential in many scientific computing applications in health care, engineering, and science. Mesh generation applications expose multiple levels of parallelism, at different granularities. This can be a big advantage when they are executed on an SMT architecture. In this project we explore the multilevel parallelization of mesh generation software on SMT architectures. The focus of this project is Parallel Constrained Delaunay mesh triangulation [7]. Delaunay meshes are composed of triangles and satisfy the empty-circle criterion: for any triangle $t$ that is part of a Delaunay mesh, no vertex of the mesh lies inside the circumcircle of $t$ [19]. In a constrained Delaunay triangulation, the final mesh has to contain the external boundaries of the domain and has to be as close to a Delaunay triangulation as possible [8].

This paper presents the steps in developing parallel mesh generation software based on a medium-grain approach; it is an extension of the coarse-grain and fine-grain parallel software previously developed by Chernikov [6]. Building a stable and reliable mesh generation code is a difficult and labor-intensive task. Therefore, we want to achieve code reuse by building our program on top of the existing coarse-grain code, PCDT, developed by Chernikov [6]. PCDT is a scalable, MPI-based code whose performance on a single SMT chip is significantly worse than the performance of Triangle, the state-of-the-art sequential meshing software [21]. By introducing medium-grain parallelism in PCDT, we intend to improve the single-chip performance of this program. The final version of the extended PCDT code is a scalable multi-threaded code whose performance improves as the number of threads increases, especially inside a single SMT chip. We achieved 100% code reuse. As we will see later in the paper, the medium-grain approach suffered significant overhead caused by synchronization between the threads in a single MPI process. However, a large set of the optimizations developed for SMT architectures during the design process proved to be suitable for PCDT as well. These optimizations reduced the execution time of PCDT by half and made it 6% faster than Triangle when executed on a single SMT processor with two hardware contexts.

2 Related work

In this section we mention papers that are closely related to this research. A more detailed description of different parallel meshing techniques can be found in [9].

In [18, 10], the authors developed a distributed memory parallel Delaunay refinement algorithm that guarantees the quality of each triangle that belongs to the mesh. In [18, 10], the initial domain is divided into many sub-domains. Adjacent sub-domains share faces of the mesh, and the set of all faces shared between two sub-domains is called an interface. Elements of each interface are replicated in each sub-domain that contains them. The most important feature of the algorithm is that sub-domain interfaces can be changed after insertion of the vertices that affect the faces in the interfaces. Changing the interfaces allows possible conflicts among different cavities, which are resolved by allowing rollbacks. The algorithm used in our project is basically the same as the algorithm described in [18, 10]. The main difference is that in our case an interface cannot contain faces, only edges that belong to the mesh. Also, [18, 10] targets distributed memory systems, while our approach combines distributed and shared memory systems.

In [5], the authors developed a theoretical framework for parallel Delaunay refinement, where multiple vertices are concurrently inserted into the mesh. Based on this framework, the authors developed an algorithm that avoids the difficult domain decomposition problem. Instead, they used buffer zones as a guarantee that there will be no conflicts among concurrently developed cavities. However, as the authors show in their evaluation, this approach introduces noticeable communication costs. Nevertheless, we used the theoretical framework developed in [5] in order to exploit parallelism inside a single sub-domain.

In order to reduce the communication and synchronization costs that can occur when a domain is divided into multiple sub-domains, Linardakis and Chrisochoides present a Parallel Delaunay Domain Decoupling method [15]. Here, a domain is divided into sub-domains that are meshed independently, and there is no communication among processes working on different sub-domains. The algorithm contains two steps:

1. The domain is divided into sub-domains using the medial axis domain decomposition.
2. New points are inserted into the newly created separators.

The medial axis domain decomposition creates sub-domains of approximately the same size, and the newly created separators have good quality in terms of shape and size. After new points are inserted in the separators, each sub-domain can be meshed with existing sequential mesh generation code without inserting any new points in the separators. In this project a similar algorithm is used, but instead of inserting points in the separators at the beginning, we allow communication among processes, and the new points are therefore inserted in the separators at runtime.

In the remainder of this section we describe the work that directly motivated the development of the medium-grain meshing software, specifically optimized for the SMT architecture.

2.1 PCDT

PCDT is an MPI-based code. It uses a coarse-grain approach for meshing given domains. Before the meshing procedure of PCDT starts, a domain is decomposed into multiple sub-domains (Figure 1). This decomposition is done by Metis [14]. The number of created sub-domains is usually much greater than the number of MPI processes [3].

The actual input for PCDT is the decomposition created by Metis.

Figure 1: Decomposition of the pipe domain into 32 sub-domains.

In order to better understand the algorithm implemented in PCDT, we first define the term conformal triangulation [11].

Definition 1. Let $V$ be a set of points in the domain $\Omega \subset \mathbb{R}^2$, and let $T$ be a set of triangles whose vertices are in $V$. We call $\mathcal{T} = (V, T)$ a conformal triangulation if the following conditions hold:

1. The union of the vertices of all triangles in $T$ is exactly $V$.
2. The union of all triangles in $T$ is exactly $\Omega$.
3. There are no empty (degenerate) triangles in $T$.
4. The intersection of any two triangles is either the empty set, a vertex, or an edge.

After the initialization of the MPI processes in PCDT, each process is assigned a single sub-domain. The mesh generation procedure for each sub-domain starts with the creation of an initial Delaunay mesh that conforms to the input vertices and segments. After its creation, the initial mesh of each sub-domain is refined until the quality criteria are satisfied. The quality criteria are specified by the user and contain two conditions:

- The area of each triangle in the final mesh is smaller than a user-specified value, and
- The minimum angle of each triangle in the final mesh is greater than a user-specified value.

To refine the initial mesh, PCDT uses Delaunay refinement. The general idea of Delaunay refinement is to insert new vertices into the mesh and to slightly change the part of the mesh around each newly inserted vertex, improving the quality of the mesh. The newly inserted vertices are the circumcenters of the triangles that do not satisfy the quality criteria. New vertices are inserted as long as there are triangles that do not satisfy the quality criteria.
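
The two-part quality test can be sketched as follows. This is a minimal illustration under a simple point/triangle representation of our own; the names (Pt, Tri, isBad, and the helpers) are not taken from the PCDT sources. Section 4.6 shows how the acos() and sqrt() calls used here are later removed.

    #include <algorithm>
    #include <cmath>

    struct Pt { double x, y; };
    struct Tri { Pt a, b, c; };

    double area(const Tri& t) {
        // Half the absolute value of the cross product of two edge vectors.
        return 0.5 * std::fabs((t.b.x - t.a.x) * (t.c.y - t.a.y) -
                               (t.c.x - t.a.x) * (t.b.y - t.a.y));
    }

    double dist2(const Pt& p, const Pt& q) {
        double dx = p.x - q.x, dy = p.y - q.y;
        return dx * dx + dy * dy;
    }

    // Angle at vertex p of triangle pqr, via the law of cosines.
    double angleAt(const Pt& p, const Pt& q, const Pt& r) {
        double a2 = dist2(q, r), b2 = dist2(p, r), c2 = dist2(p, q);
        return std::acos((b2 + c2 - a2) / (2.0 * std::sqrt(b2 * c2)));
    }

    // A triangle is "bad" if it violates either user-specified bound.
    bool isBad(const Tri& t, double maxArea, double minAngle) {
        double m = std::min(angleAt(t.a, t.b, t.c),
                   std::min(angleAt(t.b, t.a, t.c), angleAt(t.c, t.a, t.b)));
        return area(t) > maxArea || m < minAngle;
    }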

For the vertex insertion procedure, PCDT uses the Bowyer-Watson (BW) algorithm [4, 25], which is based on deleting the triangles that are no longer Delaunay after the new vertex is inserted, and creating new triangles that satisfy the Delaunay property. In order to explain the BW algorithm, we introduce the following definitions:

Definition 2. The cavity $C_M(p_i)$ of a point $p_i$ with respect to a mesh $M$ is the set of triangles whose circumcircles include $p_i$. In other words, if $p_k, p_l, p_m \in M$ and we denote the circumcircle of $\triangle p_k p_l p_m$ by $O(\triangle p_k p_l p_m)$, then $C_M(p_i) = \{\triangle p_k p_l p_m \in M \mid p_i \in O(\triangle p_k p_l p_m)\}$.

Definition 3. $B_M(p_i)$ is the set of external edges of the cavity $C_M(p_i)$. In other words, $B_M(p_i)$ contains the edges that are not shared between any two triangles in the cavity $C_M(p_i)$.

In Figure 2, if $p_6$ is a newly inserted vertex, then $C_M(p_6) = \{\triangle p_1 p_2 p_5, \triangle p_2 p_3 p_5, \triangle p_3 p_4 p_5\}$ and $B_M(p_6) = \{p_1 p_2, p_2 p_3, p_3 p_4, p_4 p_5, p_5 p_1\}$.

Figure 2: Cavity definition.

Knowing these two definitions, we can easily describe the BW algorithm for refining a Delaunay mesh $M$:

1. Select a triangle from the set of bad triangles. Bad triangles are those which do not satisfy the quality criteria.
2. Compute the circumcenter $p_i$ of this triangle.
3. Find $C_M(p_i)$ and $B_M(p_i)$. This is the cavity creation phase.
4. Delete all triangles contained in $C_M(p_i)$ from $M$.
5. Add the triangles obtained by connecting $p_i$ with every edge in $B_M(p_i)$ to $M$. This is the cavity re-triangulation phase.

Delaunay refinement treats constrained segments (edges that need to be in the final mesh and cannot be changed) differently from triangle edges [20, 22]. A vertex encroaches upon a segment $s$ if it lies within the open diametral circle of $s$ [20]. When a new point is about to be inserted and it happens to encroach upon a constrained segment $s$, another point is inserted in the middle of $s$ instead [20], and the cavity of the midpoint of segment $s$ is constructed and triangulated as described in the BW algorithm.
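
The insertion step can be summarized in code. The following is a deliberately simplified, self-contained sketch of one BW insertion, written under two assumptions that do not hold in PCDT itself: the cavity is found by scanning every triangle (PCDT performs a breadth-first search from the base triangle over mesh adjacency), and the incircle test is a naive floating-point determinant (Triangle's incircle() is adaptive-precision; see Section 2.2).

    #include <set>
    #include <utility>
    #include <vector>

    struct Pt { double x, y; };
    struct Tri { int a, b, c; };  // vertex indices, counter-clockwise

    // Does the circumcircle of (pa, pb, pc) strictly contain pd?
    bool inCircumcircle(const Pt& pa, const Pt& pb, const Pt& pc, const Pt& pd) {
        double ax = pa.x - pd.x, ay = pa.y - pd.y;
        double bx = pb.x - pd.x, by = pb.y - pd.y;
        double cx = pc.x - pd.x, cy = pc.y - pd.y;
        return (ax * ax + ay * ay) * (bx * cy - cx * by)
             - (bx * bx + by * by) * (ax * cy - cx * ay)
             + (cx * cx + cy * cy) * (ax * by - bx * ay) > 0.0;
    }

    // One BW insertion: delete the cavity C_M(p), re-triangulate its boundary B_M(p).
    void insertVertex(std::vector<Pt>& pts, std::vector<Tri>& tris, const Pt& p) {
        const int pi = static_cast<int>(pts.size());
        pts.push_back(p);
        std::set<std::pair<int, int>> cavityEdges;  // directed edges of C_M(p)
        std::vector<Tri> kept;
        for (const Tri& t : tris) {
            if (inCircumcircle(pts[t.a], pts[t.b], pts[t.c], p)) {
                cavityEdges.insert({t.a, t.b});   // t belongs to the cavity:
                cavityEdges.insert({t.b, t.c});   // record its edges, drop the triangle
                cavityEdges.insert({t.c, t.a});
            } else {
                kept.push_back(t);                // t survives the deletion step
            }
        }
        // B_M(p): cavity edges whose reverse is not also a cavity edge.
        for (const auto& e : cavityEdges)
            if (!cavityEdges.count({e.second, e.first}))
                kept.push_back({e.first, e.second, pi});  // re-triangulation
        tris.swap(kept);
    }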

2.2 PCDT implementation of the Bowyer-Watson algorithm

In this section we describe how the five steps of the BW algorithm are implemented in PCDT [6]:

1. In order to keep track of all bad triangles, PCDT maintains two STL deques which keep pointers to all triangles that do not satisfy the quality criteria. Two deques are maintained because the quality criteria contain two conditions:
   (a) the bigtris list keeps track of all the triangles whose area is larger than the user-specified value;
   (b) the badtris list keeps track of all the triangles that contain at least one angle smaller than the user-specified value.
   Both of these queues are maintained in FIFO order. They are accessed at only two points in the code: when a cavity creation phase starts, a base triangle for cavity creation is obtained from one of these lists, and after a cavity is re-triangulated, all the new triangles that do not satisfy the quality criteria are put on one of these two lists (see the sketch at the end of this subsection).

2. The circumcenter $p_i$ of the base triangle is calculated using the circumcenter() function, taken from Triangle [21]. The circumcenter is calculated only for the base triangle of each cavity.

3. Finding the $C_M(p_i)$ and $B_M(p_i)$ sets is the most time-demanding part of the algorithm. Using a breadth-first search, all triangles whose circumcircle contains the circumcenter of the base triangle are added to the $C_M(p_i)$ set. All edges of the triangles in the $C_M(p_i)$ set that are not shared between any two triangles are added to the $B_M(p_i)$ set. In order to determine if the circumcircle of a triangle contains a given point, PCDT uses the incircle() test from Triangle [21]. This is an adaptive test; its running time depends on the degree of uncertainty of the result, and is usually small.

4. All triangles from $C_M(p_i)$ are marked as deleted. They are never actually deleted, and the memory for these triangles is never returned to the system. Instead, they are put on a recycling list and later reused. When memory for more triangles is required, PCDT first checks whether the recycling list is empty; only if it is does PCDT allocate more memory from the system. Otherwise, a triangle is retrieved from the recycling list. This is an optimization that reduces the number of memory allocation calls.

5. After the triangles from $C_M(p_i)$ are deleted, new triangles are created. PCDT also maintains an STL deque tris which keeps track of all triangles in the mesh. As soon as the new triangles are initialized, they are added to the tris list. Also, one by one, the new triangles are checked, and those that do not satisfy the quality criteria are added to the badtris or bigtris list.

As mentioned at the beginning of this section, before being refined by PCDT, a domain is divided into many sub-domains. The BW algorithm is used for meshing each sub-domain. The interfaces (sub-domain boundary edges) are treated as constrained segments, which means that they have to be in the final mesh. When a new vertex encroaches upon an interface, another point is inserted in the middle of that interface. However, some interfaces are shared between two sub-domains. In that case, when one MPI process splits a shared interface, it also has to notify the process working on the adjacent sub-domain that their shared interface is split. This is the only type of communication among the MPI processes in PCDT. Compared to the size of the mesh, these messages are very rare. In all our experiments with PCDT, there was no noticeable overhead introduced by inter-process communication.
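
The FIFO handling of the two work queues (step 1 above) can be sketched as follows. This is a minimal illustration, not the actual PCDT driver: Triangle is a stand-in type, and refineCavity stands for steps 2-5 and is assumed to return the newly created triangles (e.g. as a std::vector<Triangle*>).

    #include <deque>
    #include <vector>

    struct Triangle { bool tooBig; bool badAngle; bool deleted; };

    std::deque<Triangle*> bigtris;  // area above the user-specified bound
    std::deque<Triangle*> badtris;  // minimum angle below the user-specified bound

    void enqueueIfBad(Triangle* t) {
        if (t->tooBig) bigtris.push_back(t);
        else if (t->badAngle) badtris.push_back(t);
    }

    // Refinement driver: pull base triangles in FIFO order until both queues drain.
    template <class RefineFn>
    void refineLoop(RefineFn refineCavity) {
        while (!bigtris.empty() || !badtris.empty()) {
            std::deque<Triangle*>& q = !bigtris.empty() ? bigtris : badtris;
            Triangle* base = q.front();
            q.pop_front();
            if (base->deleted) continue;  // may have been consumed by an earlier cavity
            for (Triangle* fresh : refineCavity(base))  // BW steps 2-5
                enqueueIfBad(fresh);
        }
    }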

2.3 PCDT performance

The architecture used for all experiments in this paper is an SMP node with 4 Intel Pentium 4 Xeon processors. Each processor is 2-way Hyper-Threaded, running at 2 GHz. The L1 cache is 8 KB with 64 B lines, the L2 cache is 512 KB with 64 B lines, and the L3 cache is 1 MB with 64 B lines. The total size of available memory is 2 GB.

In all our experiments, PCDT scales extremely well. When only one context of each chip is used, PCDT scales linearly, as we can see in Table 1. In these experiments, we used a pipe domain (Figure 1) decomposed into 32 sub-domains, and a total mesh of 10 million triangles was created.

Table 1: PCDT, each physical CPU is reserved for only one MPI process.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                54s
  2 MPI processes                              27s
  4 MPI processes                              13.9s

When two MPI processes are executed on different contexts of the same SMT processor, PCDT obtains around 30% speedup compared to only one MPI process. Table 2 shows the execution times of PCDT, for the same experiments as above, when all available contexts on our SMP node are used.

Table 2: PCDT, 2 MPI processes bound to the same SMT chip.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  2 MPI processes                              45.14s
  4 MPI processes                              23s
  8 MPI processes                              12s

2.4 Fine-grain PCDT

PCDT scales well. However, the performance of PCDT when only one MPI process is executed is still significantly worse than Triangle. Antonopoulos et al. [2] tried to exploit fine-grain parallelism inside PCDT: multiple threads were allowed to expand a single cavity concurrently (Figure 3). This work targeted SMT architectures, and it attempted to improve the single-processor performance of PCDT and bring it closer to the performance of Triangle.

Figure 3: Fine-grain algorithm.

As stated in [2], there are two major limitations of the fine-grain parallelization of PCDT:

1. The fine-grain implementation of PCDT can effectively use up to two or three hardware execution contexts, at best.
2. The synchronization overhead among threads and the contention for shared data structures are too high.

Even though the architectural changes proposed in [2] could significantly improve the performance of the fine-grain PCDT software, their experimental results on Intel Hyper-Threaded processors indicated that the overhead related to fine-grain parallelism management and execution overruns the potential benefits, often resulting in performance degradation. This motivated exploring a medium-grain, optimistic parallelization strategy which increases the granularity and concurrency of PCDT within each sub-domain.

3 Medium-Grain Algorithm

The medium-grain optimistic Delaunay algorithm is based on the concurrent insertion of new vertices into the existing mesh, first presented in [18]. In other words, the BW algorithm is performed concurrently on the same mesh by different threads. However, there are some constraints when two or more vertices are inserted concurrently [9]. The reason is a possible conflict between two or more expanding cavities, which can cause a non-conformal mesh. Assume that two vertices $p_8$ and $p_9$ are inserted concurrently into the existing mesh $M$ (Figure 4). As specified in the BW algorithm, two cavities are created, one around each inserted vertex. If some triangle belongs to both cavities $C(p_8)$ and $C(p_9)$, then the concurrent insertion of $p_8$ and $p_9$ results in a non-conformal mesh. In other words, the edges of the new triangles created around $p_8$ and $p_9$ will intersect.

Figure 4: Conflict between cavities.

Even if the cavities of two points do not share triangles, it can happen that the resulting mesh (after re-triangulating the cavities) does not satisfy the Delaunay condition, i.e. there are new triangles whose circumcircles are not empty of mesh vertices. Assume that two vertices $p_8$ and $p_{10}$ are inserted concurrently into the existing mesh $M$ (Figure 5), and that the cavities created around these vertices are $C(p_8) = \{\triangle p_1 p_2 p_7, \triangle p_2 p_3 p_7, \ldots\}$ and $C(p_{10}) = \{\triangle p_3 p_5 p_6, \ldots\}$. If the edge $p_3 p_6$ is shared by $C(p_8)$ and $C(p_{10})$, then a new triangle incident to that edge can have the point $p_8$ inside its circumcircle, thus violating the Delaunay property.

The theoretical framework for concurrent vertex insertion was established by Chrisochoides [10], and the following lemma was provided in [5]:

Figure 5: Cavities that share an edge.

Lemma 1. Let $p_i$ and $p_j$ be vertices that are concurrently inserted into the existing mesh $M$. If $C_M(p_i)$ and $C_M(p_j)$ have no common triangles and do not share any triangle edges, then the independent insertion of $p_i$ and $p_j$ will result in a mesh which is both conformal and Delaunay.

According to Lemma 1, the medium-grain algorithm needs a mechanism for resolving possible conflicts among cavities. The standard way of resolving conflicts is that as soon as a conflict is detected, the thread which detects it cancels its cavity expansion [10, 18]. However, this approach can sometimes result in the cancellation of both conflicting cavities [10]. Consider the situation presented in Figure 6. Assume that the triangles $\triangle p_1 p_2 p_7, \ldots$ belong to the cavity $C_1$, which is in the expansion phase and is being expanded by Thread1. Assume that the triangles $\triangle p_1 p_7 p_6, \ldots$ belong to the cavity $C_2$, which is also in the expansion phase and is being expanded by Thread2. If the next triangle that Thread1 will check is $\triangle p_1 p_7 p_6$, and the next triangle that Thread2 will check is $\triangle p_4 p_7 p_3$, it can happen that both threads detect a conflict at the same time.

Figure 6: Threads can detect a conflict at the same time.

After both cavities have been canceled, the same situation can occur again, and again both cavities will be canceled. This is called livelock [10]. In [10], this problem was resolved by developing a non-deterministic algorithm, where one cavity expansion is delayed long enough so that the other cavity can complete its expansion and re-triangulation. This approach greatly reduces the likelihood that livelock will occur.

In this project, the livelock problem is resolved by introducing a canceling flag. The canceling flag is global, and therefore accessible by both threads. If there are no conflicts, the canceling flag is 0. In the previous example, as soon as Thread1 detects a conflict it also checks the canceling flag. If the canceling flag is 0, Thread1 sets it to 1. These two operations (checking and setting the flag) are done atomically. Only after setting the canceling flag can Thread1 start canceling its cavity. However, it can happen that the canceling flag has already been set to 1 when Thread1 checks its value. In that case, Thread2 has already detected the conflict and started canceling its own cavity, and Thread1 can continue its own cavity creation. The thread that canceled its cavity sets the canceling flag back to 0 after the cancellation is done. This is a deterministic approach, and it guarantees that no livelock will occur. A similar approach can be used when more than two threads work concurrently; in that case there is more than one canceling flag, and the number of canceling flags is one less than the number of threads. A more precise description of setting and resetting the canceling flag is presented in Appendix B.
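
For the two-thread case, the atomic check-and-set described above can be sketched with a compare-and-swap. This is a minimal sketch using C++11 atomics; the actual MPCDT code predates them and uses the platform's atomic primitives directly.

    #include <atomic>

    std::atomic<int> cancelingFlag{0};  // 0: no cancellation in progress

    // Called by a thread that has just detected a conflict during cavity expansion.
    // Returns true if this thread must cancel its own cavity; false means the other
    // thread is already canceling, so this thread may continue expanding.
    bool mustCancel() {
        int expected = 0;
        // Atomically: if the flag is 0, set it to 1 and win the right to cancel.
        return cancelingFlag.compare_exchange_strong(expected, 1);
    }

    // Called by the canceling thread once its cavity has been rolled back.
    void cancellationDone() {
        cancelingFlag.store(0);
    }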

4 MPCDT

Multi-threaded PCDT (MPCDT) extends the MPI implementation of PCDT to exploit SMT-level parallelism inside each MPI process. In the MPCDT software, multiple threads work on the same domain. The threads can independently expand and re-triangulate cavities, but two cavities processed independently by different threads are not allowed to share a triangle or a triangle edge. Each thread is executed on a single context of an SMT processor. Executing threads that share the same address space inside a single SMT chip is likely to give better performance than executing different MPI processes inside a single SMT chip, because data communication between threads is faster: it avoids the MPI layers.

Our implementation of MPCDT includes an extensive set of optimizations, including algorithmic optimizations, modified data structures and synchronization mechanisms that reduce contention, and techniques that reduce conflicts between threads in the shared resources of an SMT processor. Our implementation of MPCDT is built on top of PCDT [6]. The discussion in this section assumes an implementation of MPCDT for SMT processors with two thread contexts. Most of the techniques described here are applicable to SMP systems without modifications. However, some techniques specifically target the bottlenecks in shared resources of SMT architectures.

4.1 Implementation

As with PCDT, the input for MPCDT is a set of sub-domains previously created by Metis or some other domain decomposer. Again, one sub-domain is assigned to each MPI process. Before processing an assigned sub-domain, an MPCDT MPI process is split into two threads. Both threads execute exactly the same code (the BW algorithm), as described in Section 2.2.

4.2 Conflicts

The most important algorithmic problem of MPCDT is the potential occurrence of conflicts while threads are expanding cavities. Multiple threads may work on different cavities at the same time, within the same domain. According to Lemma 1, a problem can occur if two cavities processed by different threads have a common triangle, or if an edge of a triangle is shared between two cavities. In this situation one cavity has to be canceled, while the other cavity can continue its expansion. The algorithm needs a mechanism to detect conflicts and a technique to minimize their occurrences.

In order to detect conflicts, we tag each triangle with a taken flag. When a triangle becomes part of a cavity, the taken flag is set. During cavity expansion, if a thread touches a triangle whose taken flag has already been set, the thread knows that a conflict has occurred and that the whole cavity must be canceled. Updates of the taken flag need to be atomic, since two or more threads may access the same triangle. Also, if a triangle is rejected from a cavity, i.e. its circumcircle does not contain the new vertex (the center of the cavity), its taken flag is still set. This extra layer of triangles that surround the cavity but are not part of it prevents two cavities from sharing an edge of a triangle (Figure 7). In [18] this is referred to as the closure of the cavity.

Figure 7: Layer of triangles that surround a cavity.

Since it is possible for a triangle to be accessed simultaneously by different threads, every access to a triangle's taken flag has to be protected. In our first implementation, we used synchronization variables provided by the pthread library to protect accesses to the taken flag. Due to the high overhead caused by these synchronization variables, we decided to use the special atomic fetch_and_store() operations provided on Intel architectures. These instructions incur less overhead than conventional locks or semaphores under high contention, while providing other advantages such as immunity to preemption [1]. The execution times of MPCDT with pthread locks and with Intel's atomic instructions are presented in Tables 3 and 4. A more detailed explanation of setting and resetting the taken flag can be found in Appendix A.
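
Claiming a triangle can be sketched with an atomic exchange. A minimal sketch, using std::atomic as a stand-in for the fetch_and_store() primitive mentioned above:

    #include <atomic>

    struct Triangle {
        std::atomic<int> taken{0};
        // ... geometry and adjacency data ...
    };

    // Try to claim t for the expanding cavity (or for its protective closure).
    // Returns false if another thread already holds it: a conflict, and the
    // caller's whole cavity must be canceled.
    bool tryTake(Triangle& t) {
        return t.taken.exchange(1) == 0;  // atomic fetch-and-store
    }

    // Called when a cavity is canceled or committed, to free its triangles.
    void release(Triangle& t) {
        t.taken.store(0);
    }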

Table 3: Execution time of MPCDT when pthread locks are used; input domain: pipe with 32 sub-domains.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      155s
  2 MPI processes, 2 threads each               80.4s
  4 MPI processes, 2 threads each               42.23s

Table 4: Execution time of MPCDT when atomic operations are used; input domain: pipe with 32 sub-domains.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      103.6s
  2 MPI processes, 2 threads each               52.24s
  4 MPI processes, 2 threads each               25.9s

In order to reduce the number of conflicts, we split each sub-domain into two areas. Each thread is allowed to process only the triangles that belong to a certain area (Figure 8). The separator that creates the two areas is the line $x = midx$, where $midx = 0.5 \cdot (minx + maxx)$, and $minx$ and $maxx$ are the leftmost and rightmost coordinates of the sub-domain. With this decomposition, conflicts are likely to occur only around the border between the areas. The probability that conflicts will occur becomes smaller as the size of the mesh grows, because the triangles and cavities shrink as the quality of the mesh improves.

Figure 8: Separator that splits a sub-domain into different areas.

Tables 5 and 6 report the number of conflicts among cavities in MPCDT before and after each sub-domain was divided into two areas. These experiments were conducted on the Xeon SMT architecture using a single SMT processor with 2 contexts. MPCDT was running only one MPI process with two threads. The input domain was the pipe (Figure 1), decomposed into 32 sub-domains.

Table 5: Number of conflicts before splitting the sub-domains.

            Number of committed cavities    Number of conflicts
  Thread1   2,453,034                       1,199,184
  Thread2   2,462,935                       1,142,578
  Total     4,915,969                       2,341,762

Table 6: Number of conflicts after splitting the sub-domains.

            Number of committed cavities    Number of conflicts
  Thread1   2,487,237                       3,005
  Thread2   2,427,565                       2,603
  Total     4,914,802                       5,608

As Table 6 shows, introducing a separator greatly reduces the number of conflicts.
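
The report does not spell out the exact ownership test, but one plausible sketch is to assign a triangle to a thread by comparing a representative point against the separator; using the centroid is our assumption:

    struct Pt { double x, y; };
    struct Tri { Pt a, b, c; };

    // Which thread's area does t fall in? 0 = left of the separator, 1 = right.
    int ownerOf(const Tri& t, double midx) {
        double cx = (t.a.x + t.b.x + t.c.x) / 3.0;  // centroid x coordinate
        return cx < midx ? 0 : 1;
    }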

Splitting a domain can introduce load imbalance among the threads. Our mechanism for resolving load imbalance is described in Section 4.4.

4.3 Lists

Bad triangles in PCDT are the triangles that do not meet the quality criteria. As mentioned in Section 2.2, PCDT maintains two global lists of bad triangles, called badtris and bigtris. When a cavity is re-triangulated, the quality of each new triangle is checked, and a new triangle that does not satisfy the quality criteria is put on one of the two lists. After re-triangulating a cavity, a thread checks whether bigtris and badtris are non-empty. If at least one of the lists is not empty, a triangle is retrieved from the head of the list and a new cavity creation starts. If both lists are empty, there are no more triangles that violate the quality criteria and the mesh refinement stops. Therefore, before any cavity creation and refinement, bigtris or badtris needs to be accessed. In MPCDT, these lists are accessed by multiple threads and they need to be protected. Protecting the lists can cause significant overhead under high contention. In our experiments, we found that bigtris is accessed much more frequently than badtris. For example, during the creation of a 10,000,000-triangle mesh for the pipe domain, badtris is accessed 40,000 times in total, while bigtris is accessed 12,800,000 times. Therefore, we focused on reducing the contention on the bigtris list.

One potential solution for reducing contention and lock overhead is to use per-thread local lists of big triangles. Big triangles that belong to a specific area of a sub-domain are inserted in the local list of the thread working in that area. Since threads can produce big triangles that belong to other threads' areas, the local lists of big triangles still need to be protected in such an implementation.

The technique we applied to further reduce locking overhead and contention is to keep two local lists of big triangles per thread. One list is strictly private to the owning thread, while the other list can be shared with other threads, and therefore needs to be protected.

If a thread, after a cavity re-triangulation, creates a new big triangle that belongs to its own area, the new big triangle is inserted in its private local list. If the triangle belongs to the area of some other thread, it is inserted in the shared local list of that thread (Figure 9). Each thread retrieves triangles from its private list as long as the private list is not empty. Only after the private list becomes empty does a thread start retrieving triangles from its shared local list.

Figure 9: Local shared and private lists for each thread.

Our experiments show that in each thread, the private local list of big triangles is accessed much more frequently than the shared local list. During the creation of the 10,000,000-triangle mesh for the pipe domain, the shared list of big triangles is accessed 800,000 times, while the private list is accessed 12,000,000 times. Therefore, the locking overhead that comes from the shared big list is not significant. With this implementation, locking overhead and list contention do not become bottlenecks. Tables 7 and 8 present the execution times of MPCDT after introducing the private lists for each thread. The domain meshed in these experiments was the pipe (Figure 1). Each MPI process was using a single SMT processor, and each thread inside an MPI process was using one context of that processor.

Table 7: Execution time of MPCDT with only 1 local (shared) list per thread.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      64.29s
  2 MPI processes, 2 threads each               32.7s
  4 MPI processes, 2 threads each               16.5s

Table 8: Execution time of MPCDT with 2 local lists per thread: one private and one shared.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      61s
  2 MPI processes, 2 threads each               31s
  4 MPI processes, 2 threads each               15.6s

A more detailed explanation of maintaining the lists in MPCDT is provided in Appendix C.
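
The list pair can be sketched as follows; a minimal sketch with illustrative names, where the deque choice mirrors the FIFO queues of Section 2.2.

    #include <deque>
    #include <mutex>

    struct Triangle;

    struct BigTriangleLists {
        std::deque<Triangle*> priv;    // touched only by the owning thread: no lock
        std::deque<Triangle*> shared;  // other threads may push here: lock required
        std::mutex sharedLock;

        void pushOwn(Triangle* t) { priv.push_back(t); }

        // Called by another thread whose cavity produced a big triangle in our area.
        void pushForeign(Triangle* t) {
            std::lock_guard<std::mutex> g(sharedLock);
            shared.push_back(t);
        }

        // Drain the private list first; fall back to the shared list only when empty.
        Triangle* pop() {
            if (!priv.empty()) {
                Triangle* t = priv.front();
                priv.pop_front();
                return t;
            }
            std::lock_guard<std::mutex> g(sharedLock);
            if (shared.empty()) return nullptr;
            Triangle* t = shared.front();
            shared.pop_front();
            return t;
        }
    };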

4.4 Load Balancing

As stated in Section 4.2, each domain is initially divided into two areas, and each thread works on a single area. When two threads are executed inside an MPI process, the domain is simply split by the line $x = midx$, where $midx$ is the midpoint between the leftmost and rightmost $x$ coordinates of the domain. This type of decomposition can introduce significant load imbalance between the threads: with irregular domains, it is possible that one thread has to do much more work than the other (Figure 10). This problem is solved by adjusting the value of $midx$ at runtime.

Figure 10: Uneven work distribution between threads.

In previous work [2], we showed that the size of badtris is proportional to the work done by each thread (Figure 11). In our implementation of MPCDT, when the badtris list of one thread becomes 50% larger than the badtris list of the other thread, $midx$ is moved towards the area processed by the thread with the longer badtris list (Figure 12). This introduces some overhead, since the size of badtris has to be checked, but it distributes work more evenly between the threads. A similar technique can be applied when MPCDT uses more than two threads per MPI process: we can virtually draw multiple vertical lines that separate the areas processed by different threads, and the size of badtris in one thread then needs to be compared only to the lists of the threads working on adjacent areas (see the sketch at the end of this subsection).

In Figures 13 and 14, we can see the difference in the number of committed cavities between the two threads processing the same sub-domains. In this experiment, the domain is the pipe, decomposed into 32 sub-domains, and a mesh of 10 million triangles was created. In both figures, the x-axis indicates which sub-domain is processed, while the y-axis shows the number of committed cavities. As we can see in Figure 13, without load balancing there are sub-domains where one thread commits twice as many cavities as the other. On the other hand, with our load balancer (Figure 14) both threads do approximately the same amount of work in each sub-domain. A more detailed explanation of how the load balancing is handled can be found in Appendix D.
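
A minimal sketch of the runtime separator adjustment described above. The 50% trigger is from the text, while the step size dx is our assumption (the report does not give one). Moving midx into the overloaded area shrinks that area.

    #include <cstddef>

    // badLeft/badRight: current badtris sizes of the threads owning the
    // left and right areas of the sub-domain.
    void rebalanceSeparator(double& midx, double minx, double maxx,
                            std::size_t badLeft, std::size_t badRight) {
        const double dx = 0.01 * (maxx - minx);  // assumed step size
        if (2 * badLeft > 3 * badRight)          // left list 50% larger
            midx -= dx;                          // shrink the left thread's area
        else if (2 * badRight > 3 * badLeft)     // right list 50% larger
            midx += dx;                          // shrink the right thread's area
    }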

Figure 11: Size of the bigtris list during the program execution (x-axis: number of cavities created by a thread; y-axis: list size).

Figure 12: The moving separator is used for fixing the load imbalance.

4.5 Memory Management

PCDT uses a custom memory manager. After a cavity is created, the triangles that belong to the cavity are marked as deleted and removed from the global mesh. The cavity is then re-triangulated into a new set of triangles, and these new triangles become part of the global mesh. The memory allocated for deleted triangles is never returned to the system. Instead, deleted triangles are inserted in a recycling list. The next time the program requests memory for a new triangle, the code looks for deleted triangles in the recycling list; memory is allocated from the system only when the recycling list is empty.

The simple recycling mechanism of PCDT is inefficient in MPCDT for two reasons:

- The recycling list is shared between threads and needs to be protected.
- Allocating memory from different threads causes contention inside the memory allocator, which needs to be thread-safe.

We addressed the first problem by allocating a local recycling list in each thread. Each local list contains only the triangles deleted by a single thread.

Figure 13: Difference in the number of committed cavities without any load balancing.

Figure 14: Difference in the number of committed cavities with our load balancer.

Having private recycling lists raises an important question: can one private recycling list become significantly larger than the other? In other words, is it possible that one thread deletes more triangles than the other? If so, one thread could always have enough memory on its own recycling list, while the other thread would frequently have to request memory from the system. To better understand the solution to this problem, we present the following example. Assume that $\triangle p_1 p_3 p_6$ is a big triangle, and $p_7$ is the circumcenter of $\triangle p_1 p_3 p_6$ (Figure 15a). Assume that $\triangle p_1 p_2 p_3$, $\triangle p_6 p_3 p_5$, and $\triangle p_3 p_4 p_5$ are all the triangles whose circumcircles contain $p_7$. According to Definition 2, the four triangles $\triangle p_1 p_2 p_3$, $\triangle p_1 p_3 p_6$, $\triangle p_6 p_3 p_5$, and $\triangle p_3 p_4 p_5$ define the cavity $C_M(p_7)$. Cavity $C_M(p_7)$ has 6 external edges: $p_1 p_2$, $p_2 p_3$, $p_3 p_4$, $p_4 p_5$, $p_5 p_6$, $p_6 p_1$. In the re-triangulation phase, after the deletion of the existing triangles and the creation of new ones, cavity $C_M(p_7)$ will contain 6 triangles (Figure 15b). The following remark draws the more general conclusion:

Remark 1. The number of newly created triangles in a cavity is always greater than or equal to the number of triangles deleted from that cavity.

Proof. Following the previous example, we can conclude:

1. After a cavity's creation, the number of external edges of the cavity is always greater than or equal to the number of triangles in the cavity.
2. After the re-triangulation, the number of triangles in the cavity is equal to the number of external edges of the cavity.

Figure 15: (a) Cavity creation. (b) Cavity re-triangulation.

Points 1 and 2 imply that the number of newly created triangles in a cavity is always greater than or equal to the number of triangles deleted from that cavity.

Triangles deleted from a cavity are inserted in the recycling list but, according to Remark 1, during the re-triangulation phase of the same cavity they are all retrieved from the recycling list and used for the creation of new triangles. Consequently, for each thread, the size of its recycling list is always less than or equal to the size of the cavity currently being processed. Theoretically the size of a cavity is not bounded, but in all our experiments the size of a cavity never exceeded 5-6 triangles. Therefore, the recycling lists always stay approximately the same size. Table 9 shows the execution time of MPCDT after we made the recycling list private to each thread. Again, the domain used in this experiment was the pipe (Figure 1), and the created mesh contained 10 million triangles.

Table 9: Execution time of MPCDT when the recycling list is private to each thread.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      54.9s
  2 MPI processes, 2 threads each               27.9s
  4 MPI processes, 2 threads each               14.2s

To address the second problem, contention in the memory allocator, a local, system-allocated memory pool is used in each thread. When the creation of a new triangle needs memory, a thread obtains that memory from its private pool. The memory pools do not have to be protected since they are private to each thread; therefore, they do not incur contention. The use of local memory pools also reduces the number of times threads need to obtain memory from the system. We empirically determined that memory pools of size 4KB give the best performance. We repeated the experiment with the pipe domain; the results are presented in Table 10.

Table 10: Execution time of MPCDT when memory pools are used.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      52s
  2 MPI processes, 2 threads each               26.8s
  4 MPI processes, 2 threads each               13.5s

The memory management implementation details can be found in Appendix E.
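
The per-thread allocator can be sketched as a recycling free list backed by 4 KB pools. This is a minimal sketch under our own naming; alignment and object-construction details are glossed over, and the real MPCDT layout may differ.

    #include <cstddef>
    #include <vector>

    struct Triangle {
        Triangle* nextFree;  // intrusive link, used only while on the recycling list
        // ... mesh data ...
    };

    class ThreadLocalAllocator {
        static const std::size_t kPoolSize = 4096;  // 4 KB, as determined empirically
        std::vector<char*> pools;                   // blocks obtained from the system
        char* cur = nullptr;
        std::size_t bytesLeft = 0;
        Triangle* recycled = nullptr;               // head of the recycling list
    public:
        Triangle* allocate() {
            if (recycled) {                         // always reuse deleted triangles first
                Triangle* t = recycled;
                recycled = t->nextFree;
                return t;
            }
            if (bytesLeft < sizeof(Triangle)) {     // refill from a fresh 4 KB pool
                cur = new char[kPoolSize];
                pools.push_back(cur);
                bytesLeft = kPoolSize;
            }
            Triangle* t = reinterpret_cast<Triangle*>(cur);
            cur += sizeof(Triangle);
            bytesLeft -= sizeof(Triangle);
            return t;
        }
        void recycle(Triangle* t) {                 // "delete": push onto the free list
            t->nextFree = recycled;
            recycled = t;
        }
    };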

4.6 Removing STL structures and fixing MinAngle()

The original version of PCDT used STL structures. Although using the STL has several advantages in terms of code readability and code reuse, the STL itself introduces unacceptable overhead when used in multi-threaded code. The reason for this overhead can be traced to the inefficient mechanism used by the STL to protect its data structures when the code requires them to be thread-safe.

During the cavity expansion phase, a breadth-first search, PCDT keeps all the triangles that belong to a single cavity in one STL vector, because the size of the cavity is not known in advance and STL vectors can grow dynamically. Likewise, during cavity re-triangulation, the newly created triangles are kept in an STL vector. Although there is no upper bound on the number of triangles contained in a single cavity, our experiments show that the typical size of a cavity is around 5-6 triangles. In MPCDT, instead of using the two STL vectors, we use statically allocated arrays with a maximum of 20 elements each. When the size of a cavity exceeds the size of the arrays, we allocate additional memory; this happens rarely, if at all. Replacing the two STL structures with statically allocated memory improves execution time by approximately 40% (Table 11).

Table 11: Execution times of MPCDT when the STL structs used during cavity creation and re-triangulation are replaced by statically allocated arrays.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      31.4s
  2 MPI processes, 2 threads each               16s
  4 MPI processes, 2 threads each               8.6s

After the cavity expansion phase is done, the triangles that belong to a cavity, as well as the triangles rejected from the cavity, are stored in two STL vectors (trisselected and trisrejected). Replacing those two vectors with statically allocated arrays improves the execution time of MPCDT by approximately a further 10% (Table 12).

Table 12: Execution times of MPCDT when the STL vectors used for saving the information about a cavity are replaced by statically allocated arrays.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      28.4s
  2 MPI processes, 2 threads each               14.6s
  4 MPI processes, 2 threads each               7.6s
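
The fixed-size buffers with a rare dynamic fallback can be sketched as follows; the name CavityBuffer is ours, and 20 matches the array size quoted above.

    #include <cstring>

    struct Triangle;

    class CavityBuffer {
        enum { kFixed = 20 };             // enough for the typical 5-6 triangle cavity
        Triangle* fixed[kFixed];
        Triangle** data = fixed;          // points at `fixed` until the buffer spills
        int size = 0;
        int capacity = kFixed;

        void grow() {                     // rare path: cavity outgrew the fixed array
            Triangle** bigger = new Triangle*[capacity * 2];
            std::memcpy(bigger, data, size * sizeof(Triangle*));
            if (data != fixed) delete[] data;
            data = bigger;
            capacity *= 2;
        }
    public:
        void push(Triangle* t) {
            if (size == capacity) grow();
            data[size++] = t;
        }
        int count() const { return size; }
        Triangle* operator[](int i) const { return data[i]; }
        void reset() {                    // reuse the buffer for the next cavity
            if (data != fixed) delete[] data;
            data = fixed;
            size = 0;
            capacity = kFixed;
        }
    };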

We have also modified some critical computational kernels in PCDT in order to remove expensive floating point operations. Besides their cost, such operations can become a bottleneck on SMT processors that share the floating point unit between threads [24]. computeQuality(), the function that checks the quality of a triangle, uses expensive acos() and sqrt() operations. We modified the code to replace these operations, while preserving correctness.

The modifications concern the part of the code that compares the minimum angle of a triangle against the lower bound for the minimum angle. Instead of comparing the angles, we can compare the cosines of the angles. Let $\alpha$ be the minimum angle of some triangle, and let $\beta$ be the user-specified lower bound for the minimum angle. Since the three angles of a triangle sum to $\pi$, the minimum angle satisfies $\alpha \le \pi/3 < \pi/2$. This implies that $0 < \cos\alpha < 1$, and $\cos$ is monotonically decreasing on $[0, \pi/2]$. Therefore, we can compare $\cos\alpha$ and $\cos\beta$ instead of comparing $\alpha$ and $\beta$ (as is done in PCDT). Furthermore, we can compare $\cos^2\alpha$ and $\cos^2\beta$ instead of $\cos\alpha$ and $\cos\beta$. This is useful because $\cos\alpha$ is obtained from the law of cosines applied to the edges of the triangle, and computing the edge lengths requires the expensive sqrt() operation. Using multiplications to calculate $\cos^2\alpha$ is less expensive than using sqrt() to calculate $\cos\alpha$. This modification improved the execution time of MPCDT by approximately 9% (Table 13).

Table 13: Execution time of MPCDT when the MinAngle() function is optimized.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      25.6s
  2 MPI processes, 2 threads each               13.2s
  4 MPI processes, 2 threads each               6.7s
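
A minimal sketch of the resulting sqrt- and acos-free test, assuming squared edge lengths a2, b2, c2 computed directly from coordinates, and cos2beta = cos^2(beta) precomputed once from the user's bound:

    #include <algorithm>

    // Returns true when the minimum angle alpha is below the bound beta,
    // i.e. cos^2(alpha) > cos^2(beta). The minimum angle lies opposite the
    // shortest edge, so after the swaps a2 is the squared shortest edge and
    // cos(alpha) = (b2 + c2 - a2) / (2*sqrt(b2*c2)) is positive.
    bool minAngleTooSmall(double a2, double b2, double c2, double cos2beta) {
        if (a2 > b2) std::swap(a2, b2);
        if (a2 > c2) std::swap(a2, c2);
        double num = b2 + c2 - a2;        // = 2*b*c*cos(alpha) > 0
        // Compare cos^2(alpha) = num^2 / (4*b2*c2) against cos^2(beta)
        // without dividing or taking any square roots.
        return num * num > 4.0 * b2 * c2 * cos2beta;
    }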

5 Optimizations applied to PCDT

The previous section described the implementation of MPCDT in detail. If we compare the execution time of MPCDT (Section 4.6) with the execution time of PCDT (Section 2.3), we can see that MPCDT is about 1.8 times faster. However, several design solutions used in the MPCDT implementation can also be applied to PCDT. This section describes the effect that some of the optimizations from the previous section had on PCDT.

5.1 Using static arrays instead of STL structs

As described in Section 4.6, after a cavity is created, all triangles that belong to the cavity are kept in the STL vector trisselected, and all the triangles that do not belong to the cavity but are adjacent to triangles that do are kept in the STL vector trisrejected. Replacing these two vectors with statically allocated arrays improves the execution time of PCDT by approximately 3-6%. The new execution time of PCDT, when the pipe domain is meshed and 10 million triangles are created, is shown in Table 14.

Table 14: Execution time of PCDT when the STL vectors used for saving the information about a cavity are replaced by statically allocated arrays.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                52.4s
  2 MPI processes, 1 physical CPU used         43.9s
  4 MPI processes, 2 physical CPUs used        22s
  8 MPI processes, 4 physical CPUs used        10.9s

5.2 Removing STL structs from the cavity expansion phase

As described in Section 4.6, in the cavity expansion phase all triangles touched during the breadth-first search are stored in one STL vector, and in the cavity re-triangulation phase all new triangles inserted in the cavity are stored in another STL vector. As in MPCDT, these two vectors are replaced with statically allocated arrays. Again, if the size of a cavity exceeds the size of the statically allocated array, a mechanism extends the existing array. This optimization improves the execution time of PCDT by approximately 35%. The new execution time of PCDT is shown in Table 15.

Table 15: Execution times of PCDT when the STL structs used during cavity creation and re-triangulation are replaced by statically allocated arrays.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                34s
  2 MPI processes, 1 physical CPU used         28.4s
  4 MPI processes, 2 physical CPUs used        14.3s
  8 MPI processes, 4 physical CPUs used        7.25s

5.3 Memory Management

Instead of using the original memory management of PCDT, we built in the memory management designed for MPCDT: rather than obtaining memory from the system every time a new triangle is created, the memory for new triangles is allocated from a big pool of memory. The size of the pool is 4KB. When all the memory from a pool is used, a new pool is allocated from the system. This optimization improves the execution time of PCDT by approximately 6%. The new execution time of PCDT is shown in Table 16.

Table 16: Execution times of PCDT with the MPCDT memory management.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                32s
  2 MPI processes, 1 physical CPU used         26.4s
  4 MPI processes, 2 physical CPUs used        13.5s
  8 MPI processes, 4 physical CPUs used        6.7s

5.4 minangle() function

As in MPCDT, we changed the function that calculates the smallest angle of a triangle. The goal was to reduce the number of floating point operations that can conflict in the functional units of a single SMT processor. The changes to the minangle() function were already described in Section 4.6. This optimization improves the execution time of PCDT by approximately 9%. The new execution time of PCDT is shown in Table 17.

Table 17: Execution times of PCDT when the minangle() function is optimized.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                26.88s
  2 MPI processes, 1 physical CPU used         23.5s
  4 MPI processes, 2 physical CPUs used        11.8s
  8 MPI processes, 4 physical CPUs used        5.9s

After all the optimizations, the total execution time of PCDT is improved by a factor of 2. This significant improvement brought the performance of PCDT very close to Triangle. To create a mesh of approximately 10 million triangles for the pipe domain on the Intel Xeon architecture described in Section 2.3, Triangle needs 24.9 seconds; PCDT run with only one MPI process needs 26.88 seconds for the same problem size. However, using both contexts of a single SMT processor, PCDT outperforms Triangle by about 6%.

6 MPCDT Overhead

After these intensive optimizations, the PCDT software again performed better than MPCDT. If we compare the results from Table 17 and Table 13, we can see that the difference in execution times is about 9%. This section describes the problems that caused MPCDT to run slower than PCDT.

MPCDT is a multi-threaded code, and it contains variables and objects that are shared among the threads running inside a single MPI process. For example, as described in Section 4.2, every update of the taken flag has to be atomic. Also, the MPI library used by PCDT does not support multi-threading; therefore, all communication among processes in this code has to be protected by locks.


More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder]

Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder] Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder] Preliminaries Recall: Given a smooth function f:r R, the function

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Polygon Partitioning. Lecture03

Polygon Partitioning. Lecture03 1 Polygon Partitioning Lecture03 2 History of Triangulation Algorithms 3 Outline Monotone polygon Triangulation of monotone polygon Trapezoidal decomposition Decomposition in monotone mountain Convex decomposition

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures

Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures Andrey N. Chernikov 1 Christos D. Antonopoulos 2 Nikos P. Chrisochoides 1 Scott Schneider 3 Dimitrios S. Nikolopoulos

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

A Parallel 2T-LE Algorithm Refinement with MPI

A Parallel 2T-LE Algorithm Refinement with MPI CLEI ELECTRONIC JOURNAL, VOLUME 12, NUMBER 2, PAPER 5, AUGUST 2009 A Parallel 2T-LE Algorithm Refinement with MPI Lorna Figueroa 1, Mauricio Solar 1,2, Ma. Cecilia Rivara 3, Ma.Clicia Stelling 4 1 Universidad

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Operating System Concepts

Operating System Concepts Chapter 9: Virtual-Memory Management 9.1 Silberschatz, Galvin and Gagne 2005 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information

Lecture 3: Art Gallery Problems and Polygon Triangulation

Lecture 3: Art Gallery Problems and Polygon Triangulation EECS 396/496: Computational Geometry Fall 2017 Lecture 3: Art Gallery Problems and Polygon Triangulation Lecturer: Huck Bennett In this lecture, we study the problem of guarding an art gallery (specified

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Chapter 9: Virtual Memory. Operating System Concepts 9th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9th Edition Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Voronoi Diagram. Xiao-Ming Fu

Voronoi Diagram. Xiao-Ming Fu Voronoi Diagram Xiao-Ming Fu Outlines Introduction Post Office Problem Voronoi Diagram Duality: Delaunay triangulation Centroidal Voronoi tessellations (CVT) Definition Applications Algorithms Outlines

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Chapter 10: Virtual Memory

Chapter 10: Virtual Memory Chapter 10: Virtual Memory Chapter 10: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Analytical Modeling of Parallel Programs

Analytical Modeling of Parallel Programs 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Analytical Modeling of Parallel Programs Hardik K. Molia Master of Computer Engineering, Department of Computer Engineering Atmiya Institute of Technology &

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Preview. Memory Management

Preview. Memory Management Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Intel Thread Building Blocks, Part IV

Intel Thread Building Blocks, Part IV Intel Thread Building Blocks, Part IV SPD course 2017-18 Massimo Coppola 13/04/2018 1 Mutexes TBB Classes to build mutex lock objects The lock object will Lock the associated data object (the mutex) for

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction

COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University

More information

PowerVR Series5. Architecture Guide for Developers

PowerVR Series5. Architecture Guide for Developers Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Mesh Generation through Delaunay Refinement

Mesh Generation through Delaunay Refinement Mesh Generation through Delaunay Refinement 3D Meshes Mariette Yvinec MPRI 2009-2010, C2-14-1, Lecture 4b ENSL Winter School, january 2010 Input : C : a 2D PLC in R 3 (piecewise linear complex) Ω : a bounded

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

10.1 Overview. Section 10.1: Overview. Section 10.2: Procedure for Generating Prisms. Section 10.3: Prism Meshing Options

10.1 Overview. Section 10.1: Overview. Section 10.2: Procedure for Generating Prisms. Section 10.3: Prism Meshing Options Chapter 10. Generating Prisms This chapter describes the automatic and manual procedure for creating prisms in TGrid. It also discusses the solution to some common problems that you may face while creating

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information