Optimizing Irregular Adaptive Applications on Multi-threaded Processors: The Case of Medium-Grain Parallel Delaunay Mesh Generation


Optimizing Irregular Adaptive Applications on Multi-threaded Processors: The Case of Medium-Grain Parallel Delaunay Mesh Generation

Filip Blagojević
The College of William & Mary
CSci 710 Master's Project
filip@cs.wm.edu
July 12, 2006

Abstract

The importance of parallel mesh generation and the rapid growth of SMT architectures raise an important question: how should parallel mesh generation software be adapted to the SMT architecture? In this work we focus on Parallel Constrained Delaunay Mesh Generation. We explore medium-grain parallelism at the sub-domain level. This parallel approach targets commercially available SMT processors. Our goal is to improve the performance of the existing, MPI-based, parallel mesh generation software (PCDT) by exploiting multi-threading inside a single SMT chip. This report presents parallel mesh generation software based on medium-grain parallelism, which we developed on top of the existing PCDT program. By extending PCDT instead of creating completely new software, we reduced the development complexity and achieved 100% code reuse. Experimental evaluation shows that using different contexts of an SMT processor can improve the performance of parallel mesh generation software. However, the medium-grain approach suffers from significant synchronization overhead caused by different threads working on the same sub-domain. This work makes two general contributions. First, we extended the coarse-grain parallel mesh generation software (PCDT), combining the coarse-grain approach with a medium-grain approach. Second, we significantly improved the performance of the PCDT software using the optimizations developed for the MPCDT code and the SMT architecture. These changes made PCDT faster than Triangle, the best publicly available 2D Delaunay mesh generation software, when executed on a single physical SMT processor.

1 Introduction

Simultaneous Multi-Threading (SMT) technology makes a single physical processor appear as multiple logical processors and achieves higher throughput via simultaneous execution of multiple instruction streams and overlapping of memory latencies. It allows multiple threads to issue instructions each cycle. The die area of a hyper-threaded chip is only about 5% larger than that of an ordinary processor, yet this technology can provide benefits much greater than 5% [16]. When executing on an SMT processor, multiple threads share hardware resources.

Typical shared hardware resources are the execution units, the branch prediction unit, the bus, and all levels of cache. The successful development of SMT processors [13, 17] introduces new challenges for application developers. Exploiting on-chip parallelism, in general, can be significantly different from exploiting inter-processor parallelism. For example, on an SMT architecture a problem can occur when one thread experiences a very long latency operation, such as a load miss. That thread will stall for a certain number of cycles, holding shared resources that could be used by the other thread [23].

Mesh generation algorithms are essential in many scientific computing applications in health care, engineering, and science. Mesh generation applications expose multiple levels of parallelism, at different granularities. This can be a big advantage when they are executed on an SMT architecture. In this project we explore the multilevel parallelization of mesh generation software on SMT architectures. The focus of this project is Parallel Constrained Delaunay mesh triangulation [7]. Delaunay meshes are composed of triangles and satisfy the empty-circle criterion: for any triangle $t$ that is part of a Delaunay mesh, no vertex of the mesh lies inside the circumcircle of $t$ [19]. In a constrained Delaunay triangulation, the final mesh has to contain the external boundaries of the domain and has to be as close to a Delaunay triangulation as possible [8].

This paper presents the steps in developing parallel mesh generation software based on a medium-grain approach; it is an extension of the coarse-grain and fine-grain parallel software previously developed by Chernikov [6]. Building a stable and reliable mesh generation code is a difficult and labor-intensive task. Therefore, we want to achieve code reuse by building our program on top of the existing coarse-grain code, PCDT, developed by Chernikov [6]. PCDT is a scalable, MPI-based code whose performance on a single SMT chip is significantly worse than the performance of Triangle, the state-of-the-art sequential meshing software [21]. By introducing medium-grain parallelism in PCDT, we intend to improve the single-chip performance of this program. The final version of the extended PCDT code is a scalable multi-threaded code whose performance improves as the number of threads increases, especially inside a single SMT chip. We achieved 100% code reuse. As we will see later in the paper, the medium-grain approach suffered significant overhead caused by synchronization between the threads in a single MPI process. However, a large set of the optimizations developed for SMT architectures during the design process proved to be suitable for PCDT as well. These optimizations reduced the execution time of PCDT by half and made it 6% faster than Triangle when executed on a single SMT processor with two hardware contexts.

2 Related work

In this section we mention papers that are closely related to this research. A more detailed description of different parallel meshing techniques can be found in [9].

In [18, 10], the authors developed a distributed memory parallel Delaunay refinement algorithm that guarantees the quality of each triangle that belongs to the mesh. In [18, 10], the initial domain is divided into many sub-domains. Adjacent sub-domains share faces of the mesh, and the set of all faces shared between two sub-domains is called an interface. Elements of each interface are replicated in each sub-domain that contains them. The most important feature of the algorithm is that sub-domain interfaces can be changed after insertion of the vertices that affect the faces in the interfaces. Changing the interfaces allows possible conflicts among different cavities, which are resolved by allowing rollbacks. The algorithm used in our project is basically the same as the algorithm described in [18, 10]. The main difference is that in our case an interface cannot contain faces, only edges that belong to the mesh. Also, [18, 10] targets distributed memory systems, while our approach combines distributed and shared memory systems.

In [5], the authors developed a theoretical framework for parallel Delaunay refinement, where multiple vertices are concurrently inserted into the mesh. Based on this framework, the authors developed an algorithm that avoids the difficult domain decomposition problem. Instead, they used buffer zones as a guarantee that there will be no conflicts among concurrently developed cavities. However, as the authors show in their evaluation, this approach introduces noticeable communication costs. Nevertheless, we used the theoretical framework developed in [5] in order to exploit parallelism inside a single sub-domain.

In order to reduce the communication and synchronization costs that can occur when a domain is divided into multiple sub-domains, Linardakis and Chrisochoides present a Parallel Delaunay Domain Decoupling method [15]. Here, a domain is divided into sub-domains that are meshed independently, and there is no communication among processes working on different sub-domains. The algorithm contains two steps:

1. The domain is divided into sub-domains using the medial axis domain decomposition.
2. New points are inserted into the newly created separators.

The medial axis domain decomposition creates sub-domains of approximately the same size, and the newly created separators have good quality in terms of shape and size. After new points are inserted in the separators, each sub-domain can be meshed with existing sequential mesh generation code without inserting any new points in the separators. In this project a similar algorithm is used, but instead of inserting points in the separators at the beginning, we allow communication among processes, and the new points are therefore inserted in the separators at runtime.

In the remainder of this section we describe the work that directly motivated the development of the medium-grain meshing software, specifically optimized for the SMT architecture.

2.1 PCDT

PCDT is an MPI-based code. It uses a coarse-grain approach for meshing given domains. Before the meshing procedure of PCDT starts, a domain is decomposed into multiple sub-domains (Figure 1). This decomposition is done by Metis [14]. The number of created sub-domains is usually much greater than the number of MPI processes [3].

The actual input for PCDT is the decomposition created by Metis.

Figure 1: Decomposition of the pipe domain into 32 sub-domains.

In order to better understand the algorithm implemented in PCDT, we first define the term conformal triangulation [11].

Definition 1. Let $V$ be a set of points in the domain $\Omega \subset \mathbb{R}^2$, and let $T$ be a set of triangles whose vertices are in $V$. We call $\mathcal{T} = (V, T)$ a conformal triangulation if the following conditions hold:

1. The union of the vertices of all triangles in $T$ is exactly $V$.
2. The union of all triangles in $T$ is exactly $\Omega$.
3. There are no empty (degenerate) triangles in $T$.
4. The intersection of any two triangles is either the empty set, a vertex, or an edge.

After the initialization of the MPI processes in PCDT, each process is assigned a single sub-domain. The mesh generation procedure for each sub-domain starts with the creation of an initial Delaunay mesh that conforms to the input vertices and segments. After its creation, the initial mesh of each sub-domain is refined until the quality criteria are satisfied. The quality criteria are specified by the user and contain two conditions:

- The area of each triangle in the final mesh is smaller than a user-specified value, and
- The minimum angle of each triangle in the final mesh is greater than a user-specified value.

To refine the initial mesh, PCDT uses Delaunay refinement. The general idea of Delaunay refinement is to insert new vertices into the mesh and to slightly change the part of the mesh around each newly inserted vertex, improving the quality of the mesh. The newly inserted vertices are the circumcenters of the triangles that do not satisfy the quality criteria. New vertices are inserted as long as there are triangles that do not satisfy the quality criteria.
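
The two-part quality test can be sketched as follows. This is a minimal illustration under a simple point/triangle representation of our own; the names (Pt, Tri, isBad, and the helpers) are not taken from the PCDT sources. Section 4.6 shows how the acos() and sqrt() calls used here are later removed.

    #include <algorithm>
    #include <cmath>

    struct Pt { double x, y; };
    struct Tri { Pt a, b, c; };

    double area(const Tri& t) {
        // Half the absolute value of the cross product of two edge vectors.
        return 0.5 * std::fabs((t.b.x - t.a.x) * (t.c.y - t.a.y) -
                               (t.c.x - t.a.x) * (t.b.y - t.a.y));
    }

    double dist2(const Pt& p, const Pt& q) {
        double dx = p.x - q.x, dy = p.y - q.y;
        return dx * dx + dy * dy;
    }

    // Angle at vertex p of triangle pqr, via the law of cosines.
    double angleAt(const Pt& p, const Pt& q, const Pt& r) {
        double a2 = dist2(q, r), b2 = dist2(p, r), c2 = dist2(p, q);
        return std::acos((b2 + c2 - a2) / (2.0 * std::sqrt(b2 * c2)));
    }

    // A triangle is "bad" if it violates either user-specified bound.
    bool isBad(const Tri& t, double maxArea, double minAngle) {
        double m = std::min(angleAt(t.a, t.b, t.c),
                   std::min(angleAt(t.b, t.a, t.c), angleAt(t.c, t.a, t.b)));
        return area(t) > maxArea || m < minAngle;
    }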

For the vertex insertion procedure, PCDT uses the Bowyer-Watson (BW) algorithm [4, 25], which is based on deleting the triangles that are no longer Delaunay after the new vertex is inserted, and creating new triangles that satisfy the Delaunay property. In order to explain the BW algorithm, we introduce the following definitions:

Definition 2. The cavity $C_M(p_i)$ of a point $p_i$ with respect to a mesh $M$ is the set of triangles whose circumcircles include $p_i$. In other words, if $p_k, p_l, p_m \in M$ and we denote the circumcircle of $\triangle p_k p_l p_m$ by $O(\triangle p_k p_l p_m)$, then $C_M(p_i) = \{\triangle p_k p_l p_m \in M \mid p_i \in O(\triangle p_k p_l p_m)\}$.

Definition 3. $B_M(p_i)$ is the set of external edges of the cavity $C_M(p_i)$. In other words, $B_M(p_i)$ contains the edges that are not shared between any two triangles in the cavity $C_M(p_i)$.

In Figure 2, if $p_6$ is a newly inserted vertex, then $C_M(p_6) = \{\triangle p_1 p_2 p_5, \triangle p_2 p_3 p_5, \triangle p_3 p_4 p_5\}$ and $B_M(p_6) = \{p_1 p_2, p_2 p_3, p_3 p_4, p_4 p_5, p_5 p_1\}$.

Figure 2: Cavity definition.

Knowing these two definitions, we can easily describe the BW algorithm for refining a Delaunay mesh $M$:

1. Select a triangle from the set of bad triangles. Bad triangles are those which do not satisfy the quality criteria.
2. Compute the circumcenter $p_i$ of this triangle.
3. Find $C_M(p_i)$ and $B_M(p_i)$. This is the cavity creation phase.
4. Delete all triangles contained in $C_M(p_i)$ from $M$.
5. Add the triangles obtained by connecting $p_i$ with every edge in $B_M(p_i)$ to $M$. This is the cavity re-triangulation phase.

Delaunay refinement treats constrained segments (edges that need to be in the final mesh and cannot be changed) differently from triangle edges [20, 22]. A vertex encroaches upon a segment $s$ if it lies within the open diametral circle of $s$ [20]. When a new point is about to be inserted and it happens to encroach upon a constrained segment $s$, another point is inserted in the middle of $s$ instead [20], and the cavity of the midpoint of segment $s$ is constructed and triangulated as described in the BW algorithm.
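
The insertion step can be summarized in code. The following is a deliberately simplified, self-contained sketch of one BW insertion, written under two assumptions that do not hold in PCDT itself: the cavity is found by scanning every triangle (PCDT performs a breadth-first search from the base triangle over mesh adjacency), and the incircle test is a naive floating-point determinant (Triangle's incircle() is adaptive-precision; see Section 2.2).

    #include <set>
    #include <utility>
    #include <vector>

    struct Pt { double x, y; };
    struct Tri { int a, b, c; };  // vertex indices, counter-clockwise

    // Does the circumcircle of (pa, pb, pc) strictly contain pd?
    bool inCircumcircle(const Pt& pa, const Pt& pb, const Pt& pc, const Pt& pd) {
        double ax = pa.x - pd.x, ay = pa.y - pd.y;
        double bx = pb.x - pd.x, by = pb.y - pd.y;
        double cx = pc.x - pd.x, cy = pc.y - pd.y;
        return (ax * ax + ay * ay) * (bx * cy - cx * by)
             - (bx * bx + by * by) * (ax * cy - cx * ay)
             + (cx * cx + cy * cy) * (ax * by - bx * ay) > 0.0;
    }

    // One BW insertion: delete the cavity C_M(p), re-triangulate its boundary B_M(p).
    void insertVertex(std::vector<Pt>& pts, std::vector<Tri>& tris, const Pt& p) {
        const int pi = static_cast<int>(pts.size());
        pts.push_back(p);
        std::set<std::pair<int, int>> cavityEdges;  // directed edges of C_M(p)
        std::vector<Tri> kept;
        for (const Tri& t : tris) {
            if (inCircumcircle(pts[t.a], pts[t.b], pts[t.c], p)) {
                cavityEdges.insert({t.a, t.b});   // t belongs to the cavity:
                cavityEdges.insert({t.b, t.c});   // record its edges, drop the triangle
                cavityEdges.insert({t.c, t.a});
            } else {
                kept.push_back(t);                // t survives the deletion step
            }
        }
        // B_M(p): cavity edges whose reverse is not also a cavity edge.
        for (const auto& e : cavityEdges)
            if (!cavityEdges.count({e.second, e.first}))
                kept.push_back({e.first, e.second, pi});  // re-triangulation
        tris.swap(kept);
    }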

2.2 PCDT implementation of the Bowyer-Watson algorithm

In this section we describe how the five steps of the BW algorithm are implemented in PCDT [6]:

1. In order to keep track of all bad triangles, PCDT maintains two STL deques which keep pointers to all triangles that do not satisfy the quality criteria. Two deques are maintained because the quality criteria contain two conditions:
   (a) the bigtris list keeps track of all the triangles whose area is larger than the user-specified value;
   (b) the badtris list keeps track of all the triangles that contain at least one angle smaller than the user-specified value.
   Both of these queues are maintained in FIFO order. They are accessed at only two points in the code: when a cavity creation phase starts, a base triangle for cavity creation is obtained from one of these lists, and after a cavity is re-triangulated, all the new triangles that do not satisfy the quality criteria are put on one of these two lists (see the sketch at the end of this subsection).

2. The circumcenter $p_i$ of the base triangle is calculated using the circumcenter() function, taken from Triangle [21]. The circumcenter is calculated only for the base triangle of each cavity.

3. Finding the $C_M(p_i)$ and $B_M(p_i)$ sets is the most time-demanding part of the algorithm. Using a breadth-first search, all triangles whose circumcircle contains the circumcenter of the base triangle are added to the $C_M(p_i)$ set. All edges of the triangles in the $C_M(p_i)$ set that are not shared between any two triangles are added to the $B_M(p_i)$ set. In order to determine if the circumcircle of a triangle contains a given point, PCDT uses the incircle() test from Triangle [21]. This is an adaptive test; its running time depends on the degree of uncertainty of the result, and is usually small.

4. All triangles from $C_M(p_i)$ are marked as deleted. They are never actually deleted, and the memory for these triangles is never returned to the system. Instead, they are put on a recycling list and later reused. When memory for more triangles is required, PCDT first checks whether the recycling list is empty; only if it is does PCDT allocate more memory from the system. Otherwise, a triangle is retrieved from the recycling list. This is an optimization that reduces the number of memory allocation calls.

5. After the triangles from $C_M(p_i)$ are deleted, new triangles are created. PCDT also maintains an STL deque tris which keeps track of all triangles in the mesh. As soon as the new triangles are initialized, they are added to the tris list. Also, one by one, the new triangles are checked, and those that do not satisfy the quality criteria are added to the badtris or bigtris list.

As mentioned at the beginning of this section, before being refined by PCDT, a domain is divided into many sub-domains. The BW algorithm is used for meshing each sub-domain. The interfaces (sub-domain boundary edges) are treated as constrained segments, which means that they have to be in the final mesh. When a new vertex encroaches upon an interface, another point is inserted in the middle of that interface. However, some interfaces are shared between two sub-domains. In that case, when one MPI process splits a shared interface, it also has to notify the process working on the adjacent sub-domain that their shared interface is split. This is the only type of communication among the MPI processes in PCDT. Compared to the size of the mesh, these messages are very rare. In all our experiments with PCDT, there was no noticeable overhead introduced by inter-process communication.
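
The FIFO handling of the two work queues (step 1 above) can be sketched as follows. This is a minimal illustration, not the actual PCDT driver: Triangle is a stand-in type, and refineCavity stands for steps 2-5 and is assumed to return the newly created triangles (e.g. as a std::vector<Triangle*>).

    #include <deque>
    #include <vector>

    struct Triangle { bool tooBig; bool badAngle; bool deleted; };

    std::deque<Triangle*> bigtris;  // area above the user-specified bound
    std::deque<Triangle*> badtris;  // minimum angle below the user-specified bound

    void enqueueIfBad(Triangle* t) {
        if (t->tooBig) bigtris.push_back(t);
        else if (t->badAngle) badtris.push_back(t);
    }

    // Refinement driver: pull base triangles in FIFO order until both queues drain.
    template <class RefineFn>
    void refineLoop(RefineFn refineCavity) {
        while (!bigtris.empty() || !badtris.empty()) {
            std::deque<Triangle*>& q = !bigtris.empty() ? bigtris : badtris;
            Triangle* base = q.front();
            q.pop_front();
            if (base->deleted) continue;  // may have been consumed by an earlier cavity
            for (Triangle* fresh : refineCavity(base))  // BW steps 2-5
                enqueueIfBad(fresh);
        }
    }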

2.3 PCDT performance

The architecture used for all experiments in this paper is an SMP node with 4 Intel Pentium 4 Xeon processors. Each processor is 2-way Hyper-Threaded, running at 2 GHz. The L1 cache is 8 KB with 64 B lines, the L2 cache is 512 KB with 64 B lines, and the L3 cache is 1 MB with 64 B lines. The total size of available memory is 2 GB.

In all our experiments, PCDT scales extremely well. When only one context of each chip is used, PCDT scales linearly, as we can see in Table 1. In these experiments, we used a pipe domain (Figure 1) decomposed into 32 sub-domains, and a total mesh of 10 million triangles was created.

Table 1: PCDT, each physical CPU is reserved for only one MPI process.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                54s
  2 MPI processes                              27s
  4 MPI processes                              13.9s

When two MPI processes are executed on different contexts of the same SMT processor, PCDT obtains around 30% speedup compared to only one MPI process. Table 2 shows the execution times of PCDT, for the same experiments as above, when all available contexts on our SMP node are used.

Table 2: PCDT, 2 MPI processes bound to the same SMT chip.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  2 MPI processes                              45.14s
  4 MPI processes                              23s
  8 MPI processes                              12s

2.4 Fine-grain PCDT

PCDT scales well. However, the performance of PCDT when only one MPI process is executed is still significantly worse than Triangle. Antonopoulos et al. [2] tried to exploit fine-grain parallelism inside PCDT: multiple threads were allowed to expand a single cavity concurrently (Figure 3). This work targeted SMT architectures, and it attempted to improve the single-processor performance of PCDT and bring it closer to the performance of Triangle.

Figure 3: Fine-grain algorithm.

As stated in [2], there are two major limitations of the fine-grain parallelization of PCDT:

1. The fine-grain implementation of PCDT can effectively use up to two or three hardware execution contexts, at best.
2. The synchronization overhead among threads and the contention for shared data structures are too high.

Even though the architectural changes proposed in [2] could significantly improve the performance of the fine-grain PCDT software, their experimental results on Intel Hyper-Threaded processors indicated that the overhead related to fine-grain parallelism management and execution overruns the potential benefits, often resulting in performance degradation. This motivated exploring a medium-grain, optimistic parallelization strategy which increases the granularity and concurrency of PCDT within each sub-domain.

3 Medium-Grain Algorithm

The medium-grain optimistic Delaunay algorithm is based on the concurrent insertion of new vertices into the existing mesh, first presented in [18]. In other words, the BW algorithm is performed concurrently on the same mesh by different threads. However, there are some constraints when two or more vertices are inserted concurrently [9]. The reason is a possible conflict between two or more expanding cavities, which can cause a non-conformal mesh. Assume that two vertices $p_8$ and $p_9$ are inserted concurrently into the existing mesh $M$ (Figure 4). As specified in the BW algorithm, two cavities are created, one around each inserted vertex. If some triangle belongs to both cavities $C(p_8)$ and $C(p_9)$, then the concurrent insertion of $p_8$ and $p_9$ results in a non-conformal mesh. In other words, the edges of the new triangles created around $p_8$ and $p_9$ will intersect.

Figure 4: Conflict between cavities.

Even if the cavities of two points do not share triangles, it can happen that the resulting mesh (after re-triangulating the cavities) does not satisfy the Delaunay condition, i.e. there are new triangles whose circumcircles are not empty of mesh vertices. Assume that two vertices $p_8$ and $p_{10}$ are inserted concurrently into the existing mesh $M$ (Figure 5), and that the cavities created around these vertices are $C(p_8) = \{\triangle p_1 p_2 p_7, \triangle p_2 p_3 p_7, \ldots\}$ and $C(p_{10}) = \{\triangle p_3 p_5 p_6, \ldots\}$. If the edge $p_3 p_6$ is shared by $C(p_8)$ and $C(p_{10})$, then a new triangle incident to that edge can have the point $p_8$ inside its circumcircle, thus violating the Delaunay property.

The theoretical framework for concurrent vertex insertion was established by Chrisochoides [10], and the following lemma was provided in [5]:

Figure 5: Cavities that share an edge.

Lemma 1. Let $p_i$ and $p_j$ be vertices that are concurrently inserted into the existing mesh $M$. If $C_M(p_i)$ and $C_M(p_j)$ have no common triangles and do not share any triangle edges, then the independent insertion of $p_i$ and $p_j$ will result in a mesh which is both conformal and Delaunay.

According to Lemma 1, the medium-grain algorithm needs a mechanism for resolving possible conflicts among cavities. The standard way of resolving conflicts is that as soon as a conflict is detected, the thread which detects it cancels its cavity expansion [10, 18]. However, this approach can sometimes result in the cancellation of both conflicting cavities [10]. Consider the situation presented in Figure 6. Assume that the triangles $\triangle p_1 p_2 p_7, \ldots$ belong to the cavity $C_1$, which is in the expansion phase and is being expanded by Thread1. Assume that the triangles $\triangle p_1 p_7 p_6, \ldots$ belong to the cavity $C_2$, which is also in the expansion phase and is being expanded by Thread2. If the next triangle that Thread1 will check is $\triangle p_1 p_7 p_6$, and the next triangle that Thread2 will check is $\triangle p_4 p_7 p_3$, it can happen that both threads detect a conflict at the same time.

Figure 6: Threads can detect a conflict at the same time.

After both cavities have been canceled, the same situation can occur again, and again both cavities will be canceled. This is called livelock [10]. In [10], this problem was resolved by developing a non-deterministic algorithm, where one cavity expansion is delayed long enough so that the other cavity can complete its expansion and re-triangulation. This approach greatly reduces the likelihood that livelock will occur.

In this project, the livelock problem is resolved by introducing a canceling flag. The canceling flag is global, and therefore accessible by both threads. If there are no conflicts, the canceling flag is 0. In the previous example, as soon as Thread1 detects a conflict it also checks the canceling flag. If the canceling flag is 0, Thread1 sets it to 1. These two operations (checking and setting the flag) are done atomically. Only after setting the canceling flag can Thread1 start canceling its cavity. However, it can happen that the canceling flag has already been set to 1 when Thread1 checks its value. In that case, Thread2 has already detected the conflict and started canceling its own cavity, and Thread1 can continue its own cavity creation. The thread that canceled its cavity sets the canceling flag back to 0 after the cancellation is done. This is a deterministic approach, and it guarantees that no livelock will occur. A similar approach can be used when more than two threads work concurrently; in that case there is more than one canceling flag, and the number of canceling flags is one less than the number of threads. A more precise description of setting and resetting the canceling flag is presented in Appendix B.
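
For the two-thread case, the atomic check-and-set described above can be sketched with a compare-and-swap. This is a minimal sketch using C++11 atomics; the actual MPCDT code predates them and uses the platform's atomic primitives directly.

    #include <atomic>

    std::atomic<int> cancelingFlag{0};  // 0: no cancellation in progress

    // Called by a thread that has just detected a conflict during cavity expansion.
    // Returns true if this thread must cancel its own cavity; false means the other
    // thread is already canceling, so this thread may continue expanding.
    bool mustCancel() {
        int expected = 0;
        // Atomically: if the flag is 0, set it to 1 and win the right to cancel.
        return cancelingFlag.compare_exchange_strong(expected, 1);
    }

    // Called by the canceling thread once its cavity has been rolled back.
    void cancellationDone() {
        cancelingFlag.store(0);
    }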

4 MPCDT

Multi-threaded PCDT (MPCDT) extends the MPI implementation of PCDT to exploit SMT-level parallelism inside each MPI process. In the MPCDT software, multiple threads work on the same domain. The threads can independently expand and re-triangulate cavities, but two cavities processed independently by different threads are not allowed to share a triangle or a triangle edge. Each thread is executed on a single context of an SMT processor. Executing threads that share the same address space inside a single SMT chip is likely to give better performance than executing different MPI processes inside a single SMT chip, because data communication between threads is faster: it avoids the MPI layers.

Our implementation of MPCDT includes an extensive set of optimizations, including algorithmic optimizations, modified data structures and synchronization mechanisms that reduce contention, and techniques that reduce conflicts between threads in the shared resources of an SMT processor. Our implementation of MPCDT is built on top of PCDT [6]. The discussion in this section assumes an implementation of MPCDT for SMT processors with two thread contexts. Most of the techniques described here are applicable to SMP systems without modifications. However, some techniques specifically target the bottlenecks in shared resources of SMT architectures.

4.1 Implementation

As with PCDT, the input for MPCDT is a set of sub-domains previously created by Metis or some other domain decomposer. Again, one sub-domain is assigned to each MPI process. Before processing an assigned sub-domain, an MPCDT MPI process is split into two threads. Both threads execute exactly the same code (the BW algorithm), as described in Section 2.2.

4.2 Conflicts

The most important algorithmic problem of MPCDT is the potential occurrence of conflicts while threads are expanding cavities. Multiple threads may work on different cavities at the same time, within the same domain. According to Lemma 1, a problem can occur if two cavities processed by different threads have a common triangle, or if an edge of a triangle is shared between two cavities. In this situation one cavity has to be canceled, while the other cavity can continue its expansion. The algorithm needs a mechanism to detect conflicts and a technique to minimize their occurrences.

In order to detect conflicts, we tag each triangle with a taken flag. When a triangle becomes part of a cavity, the taken flag is set. During cavity expansion, if a thread touches a triangle whose taken flag has already been set, the thread knows that a conflict has occurred and that the whole cavity must be canceled. Updates of the taken flag need to be atomic, since two or more threads may access the same triangle. Also, if a triangle is rejected from a cavity, i.e. its circumcircle does not contain the new vertex (the center of the cavity), its taken flag is still set. This extra layer of triangles that surround the cavity but are not part of it prevents two cavities from sharing an edge of a triangle (Figure 7). In [18] this is referred to as the closure of the cavity.

Figure 7: Layer of triangles that surround a cavity.

Since it is possible for a triangle to be accessed simultaneously by different threads, every access to a triangle's taken flag has to be protected. In our first implementation, we used synchronization variables provided by the pthread library to protect accesses to the taken flag. Due to the high overhead caused by these synchronization variables, we decided to use the special atomic fetch_and_store() operations provided on Intel architectures. These instructions incur less overhead than conventional locks or semaphores under high contention, while providing other advantages such as immunity to preemption [1]. The execution times of MPCDT with pthread locks and with Intel's atomic instructions are presented in Tables 3 and 4. A more detailed explanation of setting and resetting the taken flag can be found in Appendix A.
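
Claiming a triangle can be sketched with an atomic exchange. A minimal sketch, using std::atomic as a stand-in for the fetch_and_store() primitive mentioned above:

    #include <atomic>

    struct Triangle {
        std::atomic<int> taken{0};
        // ... geometry and adjacency data ...
    };

    // Try to claim t for the expanding cavity (or for its protective closure).
    // Returns false if another thread already holds it: a conflict, and the
    // caller's whole cavity must be canceled.
    bool tryTake(Triangle& t) {
        return t.taken.exchange(1) == 0;  // atomic fetch-and-store
    }

    // Called when a cavity is canceled or committed, to free its triangles.
    void release(Triangle& t) {
        t.taken.store(0);
    }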

Table 3: Execution time of MPCDT when pthread locks are used; input domain: pipe with 32 sub-domains.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      155s
  2 MPI processes, 2 threads each               80.4s
  4 MPI processes, 2 threads each               42.23s

Table 4: Execution time of MPCDT when atomic operations are used; input domain: pipe with 32 sub-domains.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      103.6s
  2 MPI processes, 2 threads each               52.24s
  4 MPI processes, 2 threads each               25.9s

In order to reduce the number of conflicts, we split each sub-domain into two areas. Each thread is allowed to process only the triangles that belong to a certain area (Figure 8). The separator that creates the two areas is the line $x = midx$, where $midx = 0.5 \cdot (minx + maxx)$, and $minx$ and $maxx$ are the leftmost and rightmost coordinates of the sub-domain. With this decomposition, conflicts are likely to occur only around the border between the areas. The probability that conflicts will occur becomes smaller as the size of the mesh grows, because the triangles and cavities shrink as the quality of the mesh improves.

Figure 8: Separator that splits a sub-domain into different areas.

Tables 5 and 6 report the number of conflicts among cavities in MPCDT before and after each sub-domain was divided into two areas. These experiments were conducted on the Xeon SMT architecture using a single SMT processor with 2 contexts. MPCDT was running only one MPI process with two threads. The input domain was the pipe (Figure 1), decomposed into 32 sub-domains.

Table 5: Number of conflicts before splitting the sub-domains.

            Number of committed cavities    Number of conflicts
  Thread1   2,453,034                       1,199,184
  Thread2   2,462,935                       1,142,578
  Total     4,915,969                       2,341,762

Table 6: Number of conflicts after splitting the sub-domains.

            Number of committed cavities    Number of conflicts
  Thread1   2,487,237                       3,005
  Thread2   2,427,565                       2,603
  Total     4,914,802                       5,608

As Table 6 shows, introducing a separator greatly reduces the number of conflicts.
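
The report does not spell out the exact ownership test, but one plausible sketch is to assign a triangle to a thread by comparing a representative point against the separator; using the centroid is our assumption:

    struct Pt { double x, y; };
    struct Tri { Pt a, b, c; };

    // Which thread's area does t fall in? 0 = left of the separator, 1 = right.
    int ownerOf(const Tri& t, double midx) {
        double cx = (t.a.x + t.b.x + t.c.x) / 3.0;  // centroid x coordinate
        return cx < midx ? 0 : 1;
    }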

Splitting a domain can introduce load imbalance among the threads. Our mechanism for resolving load imbalance is described in Section 4.4.

4.3 Lists

Bad triangles in PCDT are the triangles that do not meet the quality criteria. As mentioned in Section 2.2, PCDT maintains two global lists of bad triangles, called badtris and bigtris. When a cavity is re-triangulated, the quality of each new triangle is checked, and a new triangle that does not satisfy the quality criteria is put on one of the two lists. After re-triangulating a cavity, a thread checks whether bigtris and badtris are non-empty. If at least one of the lists is not empty, a triangle is retrieved from the head of the list and a new cavity creation starts. If both lists are empty, there are no more triangles that violate the quality criteria and the mesh refinement stops. Therefore, before any cavity creation and refinement, bigtris or badtris needs to be accessed. In MPCDT, these lists are accessed by multiple threads and they need to be protected. Protecting the lists can cause significant overhead under high contention. In our experiments, we found that bigtris is accessed much more frequently than badtris. For example, during the creation of a 10,000,000-triangle mesh for the pipe domain, badtris is accessed 40,000 times in total, while bigtris is accessed 12,800,000 times. Therefore, we focused on reducing the contention on the bigtris list.

One potential solution for reducing contention and lock overhead is to use per-thread local lists of big triangles. Big triangles that belong to a specific area of a sub-domain are inserted in the local list of the thread working in that area. Since threads can produce big triangles that belong to other threads' areas, the local lists of big triangles still need to be protected in such an implementation.

The technique we applied to further reduce locking overhead and contention is to keep two local lists of big triangles per thread. One list is strictly private to the owning thread, while the other list can be shared with other threads, and therefore needs to be protected.

If a thread, after a cavity re-triangulation, creates a new big triangle that belongs to its own area, the new big triangle is inserted in its private local list. If the triangle belongs to the area of some other thread, it is inserted in the shared local list of that thread (Figure 9). Each thread retrieves triangles from its private list as long as the private list is not empty. Only after the private list becomes empty does a thread start retrieving triangles from its shared local list.

Figure 9: Local shared and private lists for each thread.

Our experiments show that in each thread, the private local list of big triangles is accessed much more frequently than the shared local list. During the creation of the 10,000,000-triangle mesh for the pipe domain, the shared list of big triangles is accessed 800,000 times, while the private list is accessed 12,000,000 times. Therefore, the locking overhead that comes from the shared big list is not significant. With this implementation, locking overhead and list contention do not become bottlenecks. Tables 7 and 8 present the execution times of MPCDT after introducing the private lists for each thread. The domain meshed in these experiments was the pipe (Figure 1). Each MPI process was using a single SMT processor, and each thread inside an MPI process was using one context of that processor.

Table 7: Execution time of MPCDT with only 1 local (shared) list per thread.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      64.29s
  2 MPI processes, 2 threads each               32.7s
  4 MPI processes, 2 threads each               16.5s

Table 8: Execution time of MPCDT with 2 local lists per thread: one private and one shared.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      61s
  2 MPI processes, 2 threads each               31s
  4 MPI processes, 2 threads each               15.6s

A more detailed explanation of maintaining the lists in MPCDT is provided in Appendix C.
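
The list pair can be sketched as follows; a minimal sketch with illustrative names, where the deque choice mirrors the FIFO queues of Section 2.2.

    #include <deque>
    #include <mutex>

    struct Triangle;

    struct BigTriangleLists {
        std::deque<Triangle*> priv;    // touched only by the owning thread: no lock
        std::deque<Triangle*> shared;  // other threads may push here: lock required
        std::mutex sharedLock;

        void pushOwn(Triangle* t) { priv.push_back(t); }

        // Called by another thread whose cavity produced a big triangle in our area.
        void pushForeign(Triangle* t) {
            std::lock_guard<std::mutex> g(sharedLock);
            shared.push_back(t);
        }

        // Drain the private list first; fall back to the shared list only when empty.
        Triangle* pop() {
            if (!priv.empty()) {
                Triangle* t = priv.front();
                priv.pop_front();
                return t;
            }
            std::lock_guard<std::mutex> g(sharedLock);
            if (shared.empty()) return nullptr;
            Triangle* t = shared.front();
            shared.pop_front();
            return t;
        }
    };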

4.4 Load Balancing

As stated in Section 4.2, each domain is initially divided into two areas, and each thread works on a single area. When two threads are executed inside an MPI process, the domain is simply split by the line $x = midx$, where $midx$ is the midpoint between the leftmost and rightmost $x$ coordinates of the domain. This type of decomposition can introduce significant load imbalance between the threads: with irregular domains, it is possible that one thread has to do much more work than the other (Figure 10). This problem is solved by adjusting the value of $midx$ at runtime.

Figure 10: Uneven work distribution between threads.

In previous work [2], we showed that the size of badtris is proportional to the work done by each thread (Figure 11). In our implementation of MPCDT, when the badtris list of one thread becomes 50% larger than the badtris list of the other thread, $midx$ is moved towards the area processed by the thread with the longer badtris list (Figure 12). This introduces some overhead, since the size of badtris has to be checked, but it distributes work more evenly between the threads. A similar technique can be applied when MPCDT uses more than two threads per MPI process: we can virtually draw multiple vertical lines that separate the areas processed by different threads, and the size of badtris in one thread then needs to be compared only to the lists of the threads working on adjacent areas (see the sketch at the end of this subsection).

In Figures 13 and 14, we can see the difference in the number of committed cavities between the two threads processing the same sub-domains. In this experiment, the domain is the pipe, decomposed into 32 sub-domains, and a mesh of 10 million triangles was created. In both figures, the x-axis indicates which sub-domain is processed, while the y-axis shows the number of committed cavities. As we can see in Figure 13, without load balancing there are sub-domains where one thread commits twice as many cavities as the other. On the other hand, with our load balancer (Figure 14) both threads do approximately the same amount of work in each sub-domain. A more detailed explanation of how the load balancing is handled can be found in Appendix D.
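
A minimal sketch of the runtime separator adjustment described above. The 50% trigger is from the text, while the step size dx is our assumption (the report does not give one). Moving midx into the overloaded area shrinks that area.

    #include <cstddef>

    // badLeft/badRight: current badtris sizes of the threads owning the
    // left and right areas of the sub-domain.
    void rebalanceSeparator(double& midx, double minx, double maxx,
                            std::size_t badLeft, std::size_t badRight) {
        const double dx = 0.01 * (maxx - minx);  // assumed step size
        if (2 * badLeft > 3 * badRight)          // left list 50% larger
            midx -= dx;                          // shrink the left thread's area
        else if (2 * badRight > 3 * badLeft)     // right list 50% larger
            midx += dx;                          // shrink the right thread's area
    }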

Figure 11: Size of the bigtris list during the program execution (x-axis: number of cavities created by a thread; y-axis: list size).

Figure 12: The moving separator is used for fixing the load imbalance.

4.5 Memory Management

PCDT uses a custom memory manager. After a cavity is created, the triangles that belong to the cavity are marked as deleted and removed from the global mesh. The cavity is then re-triangulated into a new set of triangles, and these new triangles become part of the global mesh. The memory allocated for deleted triangles is never returned to the system. Instead, deleted triangles are inserted in a recycling list. The next time the program requests memory for a new triangle, the code looks for deleted triangles in the recycling list; memory is allocated from the system only when the recycling list is empty.

The simple recycling mechanism of PCDT is inefficient in MPCDT for two reasons:

- The recycling list is shared between threads and needs to be protected.
- Allocating memory from different threads causes contention inside the memory allocator, which needs to be thread-safe.

We addressed the first problem by allocating a local recycling list in each thread. Each local list contains only the triangles deleted by a single thread.

Figure 13: Difference in the number of committed cavities without any load balancing.

Figure 14: Difference in the number of committed cavities with our load balancer.

Having private recycling lists raises an important question: can one private recycling list become significantly larger than the other? In other words, is it possible that one thread deletes more triangles than the other? If so, one thread could always have enough memory on its own recycling list, while the other thread would frequently have to request memory from the system. To better understand the solution to this problem, we present the following example. Assume that $\triangle p_1 p_3 p_6$ is a big triangle, and $p_7$ is the circumcenter of $\triangle p_1 p_3 p_6$ (Figure 15a). Assume that $\triangle p_1 p_2 p_3$, $\triangle p_6 p_3 p_5$, and $\triangle p_3 p_4 p_5$ are all the triangles whose circumcircles contain $p_7$. According to Definition 2, the four triangles $\triangle p_1 p_2 p_3$, $\triangle p_1 p_3 p_6$, $\triangle p_6 p_3 p_5$, and $\triangle p_3 p_4 p_5$ define the cavity $C_M(p_7)$. Cavity $C_M(p_7)$ has 6 external edges: $p_1 p_2$, $p_2 p_3$, $p_3 p_4$, $p_4 p_5$, $p_5 p_6$, $p_6 p_1$. In the re-triangulation phase, after the deletion of the existing triangles and the creation of new ones, cavity $C_M(p_7)$ will contain 6 triangles (Figure 15b). The following remark draws the more general conclusion:

Remark 1. The number of newly created triangles in a cavity is always greater than or equal to the number of triangles deleted from that cavity.

Proof. Following the previous example, we can conclude:

1. After a cavity's creation, the number of external edges of the cavity is always greater than or equal to the number of triangles in the cavity.
2. After the re-triangulation, the number of triangles in the cavity is equal to the number of external edges of the cavity.

Figure 15: (a) Cavity creation. (b) Cavity re-triangulation.

Points 1 and 2 imply that the number of newly created triangles in a cavity is always greater than or equal to the number of triangles deleted from that cavity.

Triangles deleted from a cavity are inserted in the recycling list but, according to Remark 1, during the re-triangulation phase of the same cavity they are all retrieved from the recycling list and used for the creation of new triangles. Consequently, for each thread, the size of its recycling list is always less than or equal to the size of the cavity currently being processed. Theoretically the size of a cavity is not bounded, but in all our experiments the size of a cavity never exceeded 5-6 triangles. Therefore, the recycling lists always stay approximately the same size. Table 9 shows the execution time of MPCDT after we made the recycling list private to each thread. Again, the domain used in this experiment was the pipe (Figure 1), and the created mesh contained 10 million triangles.

Table 9: Execution time of MPCDT when the recycling list is private to each thread.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      54.9s
  2 MPI processes, 2 threads each               27.9s
  4 MPI processes, 2 threads each               14.2s

To address the second problem, contention in the memory allocator, a local, system-allocated memory pool is used in each thread. When the creation of a new triangle needs memory, a thread obtains that memory from its private pool. The memory pools do not have to be protected since they are private to each thread; therefore, they do not incur contention. The use of local memory pools also reduces the number of times threads need to obtain memory from the system. We empirically determined that memory pools of size 4KB give the best performance. We repeated the experiment with the pipe domain; the results are presented in Table 10.

Table 10: Execution time of MPCDT when memory pools are used.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      52s
  2 MPI processes, 2 threads each               26.8s
  4 MPI processes, 2 threads each               13.5s

The memory management implementation details can be found in Appendix E.
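
The per-thread allocator can be sketched as a recycling free list backed by 4 KB pools. This is a minimal sketch under our own naming; alignment and object-construction details are glossed over, and the real MPCDT layout may differ.

    #include <cstddef>
    #include <vector>

    struct Triangle {
        Triangle* nextFree;  // intrusive link, used only while on the recycling list
        // ... mesh data ...
    };

    class ThreadLocalAllocator {
        static const std::size_t kPoolSize = 4096;  // 4 KB, as determined empirically
        std::vector<char*> pools;                   // blocks obtained from the system
        char* cur = nullptr;
        std::size_t bytesLeft = 0;
        Triangle* recycled = nullptr;               // head of the recycling list
    public:
        Triangle* allocate() {
            if (recycled) {                         // always reuse deleted triangles first
                Triangle* t = recycled;
                recycled = t->nextFree;
                return t;
            }
            if (bytesLeft < sizeof(Triangle)) {     // refill from a fresh 4 KB pool
                cur = new char[kPoolSize];
                pools.push_back(cur);
                bytesLeft = kPoolSize;
            }
            Triangle* t = reinterpret_cast<Triangle*>(cur);
            cur += sizeof(Triangle);
            bytesLeft -= sizeof(Triangle);
            return t;
        }
        void recycle(Triangle* t) {                 // "delete": push onto the free list
            t->nextFree = recycled;
            recycled = t;
        }
    };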

4.6 Removing STL structures and fixing MinAngle()

The original version of PCDT used STL structures. Although using the STL has several advantages in terms of code readability and code reuse, the STL itself introduces unacceptable overhead when used in multi-threaded code. The reason for this overhead can be traced to the inefficient mechanism used by the STL to protect its data structures when the code requires them to be thread-safe.

During the cavity expansion phase, a breadth-first search, PCDT keeps all the triangles that belong to a single cavity in one STL vector, because the size of the cavity is not known in advance and STL vectors can grow dynamically. Likewise, during cavity re-triangulation, the newly created triangles are kept in an STL vector. Although there is no upper bound on the number of triangles contained in a single cavity, our experiments show that the typical size of a cavity is around 5-6 triangles. In MPCDT, instead of using the two STL vectors, we use statically allocated arrays with a maximum of 20 elements each. When the size of a cavity exceeds the size of the arrays, we allocate additional memory; this happens rarely, if at all. Replacing the two STL structures with statically allocated memory improves execution time by approximately 40% (Table 11).

Table 11: Execution times of MPCDT when the STL structs used during cavity creation and re-triangulation are replaced by statically allocated arrays.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      31.4s
  2 MPI processes, 2 threads each               16s
  4 MPI processes, 2 threads each               8.6s

After the cavity expansion phase is done, the triangles that belong to a cavity, as well as the triangles rejected from the cavity, are stored in two STL vectors (trisselected and trisrejected). Replacing those two vectors with statically allocated arrays improves the execution time of MPCDT by approximately a further 10% (Table 12).

Table 12: Execution times of MPCDT when the STL vectors used for saving the information about a cavity are replaced by statically allocated arrays.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      28.4s
  2 MPI processes, 2 threads each               14.6s
  4 MPI processes, 2 threads each               7.6s
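
The fixed-size buffers with a rare dynamic fallback can be sketched as follows; the name CavityBuffer is ours, and 20 matches the array size quoted above.

    #include <cstring>

    struct Triangle;

    class CavityBuffer {
        enum { kFixed = 20 };             // enough for the typical 5-6 triangle cavity
        Triangle* fixed[kFixed];
        Triangle** data = fixed;          // points at `fixed` until the buffer spills
        int size = 0;
        int capacity = kFixed;

        void grow() {                     // rare path: cavity outgrew the fixed array
            Triangle** bigger = new Triangle*[capacity * 2];
            std::memcpy(bigger, data, size * sizeof(Triangle*));
            if (data != fixed) delete[] data;
            data = bigger;
            capacity *= 2;
        }
    public:
        void push(Triangle* t) {
            if (size == capacity) grow();
            data[size++] = t;
        }
        int count() const { return size; }
        Triangle* operator[](int i) const { return data[i]; }
        void reset() {                    // reuse the buffer for the next cavity
            if (data != fixed) delete[] data;
            data = fixed;
            size = 0;
            capacity = kFixed;
        }
    };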

We have also modified some critical computational kernels in PCDT in order to remove expensive floating point operations. Besides their cost, such operations can become a bottleneck on SMT processors that share the floating point unit between threads [24]. computeQuality(), the function that checks the quality of a triangle, uses expensive acos() and sqrt() operations. We modified the code to replace these operations, while preserving correctness.

The modifications concern the part of the code that compares the minimum angle of a triangle against the lower bound for the minimum angle. Instead of comparing the angles, we can compare the cosines of the angles. Let $\alpha$ be the minimum angle of some triangle, and let $\beta$ be the user-specified lower bound for the minimum angle. Since the three angles of a triangle sum to $\pi$, the minimum angle satisfies $\alpha \le \pi/3 < \pi/2$. This implies that $0 < \cos\alpha < 1$, and $\cos$ is monotonically decreasing on $[0, \pi/2]$. Therefore, we can compare $\cos\alpha$ and $\cos\beta$ instead of comparing $\alpha$ and $\beta$ (as is done in PCDT). Furthermore, we can compare $\cos^2\alpha$ and $\cos^2\beta$ instead of $\cos\alpha$ and $\cos\beta$. This is useful because $\cos\alpha$ is obtained from the law of cosines applied to the edges of the triangle, and computing the edge lengths requires the expensive sqrt() operation. Using multiplications to calculate $\cos^2\alpha$ is less expensive than using sqrt() to calculate $\cos\alpha$. This modification improved the execution time of MPCDT by approximately 9% (Table 13).

Table 13: Execution time of MPCDT when the MinAngle() function is optimized.

  MPCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process, 2 threads                      25.6s
  2 MPI processes, 2 threads each               13.2s
  4 MPI processes, 2 threads each               6.7s
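
A minimal sketch of the resulting sqrt- and acos-free test, assuming squared edge lengths a2, b2, c2 computed directly from coordinates, and cos2beta = cos^2(beta) precomputed once from the user's bound:

    #include <algorithm>

    // Returns true when the minimum angle alpha is below the bound beta,
    // i.e. cos^2(alpha) > cos^2(beta). The minimum angle lies opposite the
    // shortest edge, so after the swaps a2 is the squared shortest edge and
    // cos(alpha) = (b2 + c2 - a2) / (2*sqrt(b2*c2)) is positive.
    bool minAngleTooSmall(double a2, double b2, double c2, double cos2beta) {
        if (a2 > b2) std::swap(a2, b2);
        if (a2 > c2) std::swap(a2, c2);
        double num = b2 + c2 - a2;        // = 2*b*c*cos(alpha) > 0
        // Compare cos^2(alpha) = num^2 / (4*b2*c2) against cos^2(beta)
        // without dividing or taking any square roots.
        return num * num > 4.0 * b2 * c2 * cos2beta;
    }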

5 Optimizations applied to PCDT

The previous section described the implementation of MPCDT in detail. If we compare the execution time of MPCDT (Section 4.6) with the execution time of PCDT (Section 2.3), we can see that MPCDT is about 1.8 times faster. However, several design solutions used in the MPCDT implementation can also be applied to PCDT. This section describes the effect that some of the optimizations from the previous section had on PCDT.

5.1 Using static arrays instead of STL structs

As described in Section 4.6, after a cavity is created, all triangles that belong to the cavity are kept in the STL vector trisselected, and all the triangles that do not belong to the cavity but are adjacent to triangles that do are kept in the STL vector trisrejected. Replacing these two vectors with statically allocated arrays improves the execution time of PCDT by approximately 3-6%. The new execution time of PCDT, when the pipe domain is meshed and 10 million triangles are created, is shown in Table 14.

Table 14: Execution time of PCDT when the STL vectors used for saving the information about a cavity are replaced by statically allocated arrays.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                52.4s
  2 MPI processes, 1 physical CPU used         43.9s
  4 MPI processes, 2 physical CPUs used        22s
  8 MPI processes, 4 physical CPUs used        10.9s

5.2 Removing STL structs from the cavity expansion phase

As described in Section 4.6, in the cavity expansion phase all triangles touched during the breadth-first search are stored in one STL vector, and in the cavity re-triangulation phase all new triangles inserted in the cavity are stored in another STL vector. As in MPCDT, these two vectors are replaced with statically allocated arrays. Again, if the size of a cavity exceeds the size of the statically allocated array, a mechanism extends the existing array. This optimization improves the execution time of PCDT by approximately 35%. The new execution time of PCDT is shown in Table 15.

Table 15: Execution times of PCDT when the STL structs used during cavity creation and re-triangulation are replaced by statically allocated arrays.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                34s
  2 MPI processes, 1 physical CPU used         28.4s
  4 MPI processes, 2 physical CPUs used        14.3s
  8 MPI processes, 4 physical CPUs used        7.25s

5.3 Memory Management

Instead of using the original memory management of PCDT, we built in the memory management designed for MPCDT: rather than obtaining memory from the system every time a new triangle is created, the memory for new triangles is allocated from a big pool of memory. The size of the pool is 4KB. When all the memory from a pool is used, a new pool is allocated from the system. This optimization improves the execution time of PCDT by approximately 6%. The new execution time of PCDT is shown in Table 16.

Table 16: Execution times of PCDT with the MPCDT memory management.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                32s
  2 MPI processes, 1 physical CPU used         26.4s
  4 MPI processes, 2 physical CPUs used        13.5s
  8 MPI processes, 4 physical CPUs used        6.7s

5.4 minangle() function

As in MPCDT, we changed the function that calculates the smallest angle of a triangle. The goal was to reduce the number of floating point operations that can conflict in the functional units of a single SMT processor. The changes to the minangle() function were already described in Section 4.6. This optimization improves the execution time of PCDT by approximately 9%. The new execution time of PCDT is shown in Table 17.

Table 17: Execution times of PCDT when the minangle() function is optimized.

  PCDT, pipe, 32 sub-domains, 10^7 elements    Execution Time
  1 MPI process                                26.88s
  2 MPI processes, 1 physical CPU used         23.5s
  4 MPI processes, 2 physical CPUs used        11.8s
  8 MPI processes, 4 physical CPUs used        5.9s

After all the optimizations, the total execution time of PCDT is improved by a factor of 2. This significant improvement brought the performance of PCDT very close to Triangle. To create a mesh of approximately 10 million triangles for the pipe domain on the Intel Xeon architecture described in Section 2.3, Triangle needs 24.9 seconds; PCDT run with only one MPI process needs 26.88 seconds for the same problem size. However, using both contexts of a single SMT processor, PCDT outperforms Triangle by about 6%.

6 MPCDT Overhead

After these intensive optimizations, the PCDT software again performed better than MPCDT. If we compare the results from Table 17 and Table 13, we can see that the difference in execution times is about 9%. This section describes the problems that caused MPCDT to run slower than PCDT.

MPCDT is a multi-threaded code, and it contains variables and objects that are shared among the threads running inside a single MPI process. For example, as described in Section 4.2, every update of the taken flag has to be atomic. Also, the MPI library used by PCDT does not support multi-threading; therefore, all communication among processes in this code has to be protected by locks.


More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder]

Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder] Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder] Preliminaries Recall: Given a smooth function f:r R, the function

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Polygon Partitioning. Lecture03

Polygon Partitioning. Lecture03 1 Polygon Partitioning Lecture03 2 History of Triangulation Algorithms 3 Outline Monotone polygon Triangulation of monotone polygon Trapezoidal decomposition Decomposition in monotone mountain Convex decomposition

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Lecture 2: Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload) Visual Computing Systems Analyzing a 3D Graphics Workload Where is most of the work done? Memory Vertex

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures

Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures Andrey N. Chernikov 1 Christos D. Antonopoulos 2 Nikos P. Chrisochoides 1 Scott Schneider 3 Dimitrios S. Nikolopoulos

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

A Parallel 2T-LE Algorithm Refinement with MPI

A Parallel 2T-LE Algorithm Refinement with MPI CLEI ELECTRONIC JOURNAL, VOLUME 12, NUMBER 2, PAPER 5, AUGUST 2009 A Parallel 2T-LE Algorithm Refinement with MPI Lorna Figueroa 1, Mauricio Solar 1,2, Ma. Cecilia Rivara 3, Ma.Clicia Stelling 4 1 Universidad

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Operating System Concepts

Operating System Concepts Chapter 9: Virtual-Memory Management 9.1 Silberschatz, Galvin and Gagne 2005 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Parallel Computer Architecture and Programming Written Assignment 3

Parallel Computer Architecture and Programming Written Assignment 3 Parallel Computer Architecture and Programming Written Assignment 3 50 points total. Due Monday, July 17 at the start of class. Problem 1: Message Passing (6 pts) A. (3 pts) You and your friend liked the

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou ( Zhejiang University

CPU Scheduling. Operating Systems (Fall/Winter 2018) Yajin Zhou (  Zhejiang University Operating Systems (Fall/Winter 2018) CPU Scheduling Yajin Zhou (http://yajin.org) Zhejiang University Acknowledgement: some pages are based on the slides from Zhi Wang(fsu). Review Motivation to use threads

More information

Lecture 3: Art Gallery Problems and Polygon Triangulation

Lecture 3: Art Gallery Problems and Polygon Triangulation EECS 396/496: Computational Geometry Fall 2017 Lecture 3: Art Gallery Problems and Polygon Triangulation Lecturer: Huck Bennett In this lecture, we study the problem of guarding an art gallery (specified

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Main Points of the Computer Organization and System Software Module

Main Points of the Computer Organization and System Software Module Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

CS307: Operating Systems

CS307: Operating Systems CS307: Operating Systems Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building 3-513 wuct@cs.sjtu.edu.cn Download Lectures ftp://public.sjtu.edu.cn

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading

More information

Chapter 9: Virtual Memory. Operating System Concepts 9th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9th Edition Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Voronoi Diagram. Xiao-Ming Fu

Voronoi Diagram. Xiao-Ming Fu Voronoi Diagram Xiao-Ming Fu Outlines Introduction Post Office Problem Voronoi Diagram Duality: Delaunay triangulation Centroidal Voronoi tessellations (CVT) Definition Applications Algorithms Outlines

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Chapter 10: Virtual Memory

Chapter 10: Virtual Memory Chapter 10: Virtual Memory Chapter 10: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

Analytical Modeling of Parallel Programs

Analytical Modeling of Parallel Programs 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Analytical Modeling of Parallel Programs Hardik K. Molia Master of Computer Engineering, Department of Computer Engineering Atmiya Institute of Technology &

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Preview. Memory Management

Preview. Memory Management Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition

Chapter 9: Virtual Memory. Operating System Concepts 9 th Edition Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Intel Thread Building Blocks, Part IV

Intel Thread Building Blocks, Part IV Intel Thread Building Blocks, Part IV SPD course 2017-18 Massimo Coppola 13/04/2018 1 Mutexes TBB Classes to build mutex lock objects The lock object will Lock the associated data object (the mutex) for

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Performance and Optimization Issues in Multicore Computing

Performance and Optimization Issues in Multicore Computing Performance and Optimization Issues in Multicore Computing Minsoo Ryu Department of Computer Science and Engineering 2 Multicore Computing Challenges It is not easy to develop an efficient multicore program

More information

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition

Chapter 8: Virtual Memory. Operating System Concepts Essentials 2 nd Edition Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction

COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction COMP/CS 605: Introduction to Parallel Computing Topic: Parallel Computing Overview/Introduction Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University

More information

PowerVR Series5. Architecture Guide for Developers

PowerVR Series5. Architecture Guide for Developers Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Mesh Generation through Delaunay Refinement

Mesh Generation through Delaunay Refinement Mesh Generation through Delaunay Refinement 3D Meshes Mariette Yvinec MPRI 2009-2010, C2-14-1, Lecture 4b ENSL Winter School, january 2010 Input : C : a 2D PLC in R 3 (piecewise linear complex) Ω : a bounded

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

10.1 Overview. Section 10.1: Overview. Section 10.2: Procedure for Generating Prisms. Section 10.3: Prism Meshing Options

10.1 Overview. Section 10.1: Overview. Section 10.2: Procedure for Generating Prisms. Section 10.3: Prism Meshing Options Chapter 10. Generating Prisms This chapter describes the automatic and manual procedure for creating prisms in TGrid. It also discusses the solution to some common problems that you may face while creating

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals

More information

a process may be swapped in and out of main memory such that it occupies different regions

a process may be swapped in and out of main memory such that it occupies different regions Virtual Memory Characteristics of Paging and Segmentation A process may be broken up into pieces (pages or segments) that do not need to be located contiguously in main memory Memory references are dynamically

More information