
Scalable and transparent parallelization of multiplayer games

by

Bogdan Simion

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2009 by Bogdan Simion

Abstract

Scalable and transparent parallelization of multiplayer games
Bogdan Simion
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2009

In this thesis, we study the parallelization of multiplayer games using Software Transactional Memory (STM) support. We show that STM provides not only ease of programming, but also better scalability than achievable with state-of-the-art lock-based programming for this realistic, high-impact application. We evaluate and compare two parallel implementations of a simplified version (named SynQuake) of the popular game Quake. While in STM SynQuake support for maintaining the consistency of each potentially complex game action is automatic, lock-based SynQuake inherently needs to conservatively lock the surrounding objects within a bounding box for the duration of the game action. This leads to higher scalability of STM SynQuake versus lock-based SynQuake, due to increased false sharing in the latter. Task assignment to threads has a second-order effect on the scalability of STM SynQuake, impacting the application's true sharing patterns. We show that a locality-aware task assignment provides the best trade-off between load balancing and conflict reduction.

Acknowledgements

First and foremost, I would like to offer my gratitude to my supervisor, Cristiana Amza, who has supported and guided me closely throughout my thesis research. I appreciate her patience and her knowledgeable, useful advice, and realize that without her assistance this thesis would not have been possible. I would also like to thank the members of my examination committee, Professor Angela Demke Brown and Professor Ashvin Goel, for all their help and assistance.

During my research program and throughout my graduate experience, I have been lucky to work with great colleagues, to whom I would like to address a special thank you, especially Daniel Lupei, for all the interesting discussions and late-night work we have done together. I would also like to thank my family for all their help, understanding and support, especially since the geographical distance separating us has made our reunions very rare in the past two years.

Finally, I would like to thank the Department of Electrical and Computer Engineering at the University of Toronto for their financial support and the facilities made available for my research.

Contents

1 Introduction
2 Application and Programming Environment
   2.1 Application Environment: SynQuake
      2.1.1 SynQuake game
      2.1.2 Game architecture and data structures
      2.1.3 Game server structure
   2.2 Programming environment
      2.2.1 libtm library interface
      2.2.2 Access tracking and conflict detection in libtm
3 Parallelization of SynQuake
   3.1 Synchronization issues for player actions: false sharing
   3.2 Synchronization algorithms for request processing
   3.3 True sharing patterns
   3.4 Load balancing in SynQuake
      Load balancing algorithms
4 Experimental Results
   4.1 Experimental setup
   4.2 STM vs Lock-based SynQuake - scaling and processing time
   4.3 STM vs Lock-based SynQuake processing time with physics computation
   The effect of load balancing on scaling
   Load balancing policies
   Influence of locality-awareness
   Quest scenarios
   Discussion
   Load balancing vs. synchronization
5 Related Work
6 Conclusion
Bibliography

List of Figures

1.1 Processing player move: Radius-based locking around player position
2.1 Screen shot of SynQuake. Brick-texture blocks represent walls / obstacles, while resources (food) are represented by apples
2.2 Areanode tree structure. Each node maintains a list of game entities fully contained in its corresponding game region. A game object may be located in a leaf if it is fully contained geometrically in the corresponding game map region, or located in the common ancestor of the leaves it overlaps
2.3 SynQuake: server frame structure and component stages. The server contains a request processing stage, an administrative stage and a reply stage. Server stages are separated by synchronization barriers represented by dotted lines
3.1 Areas of interest for a move followed by an attack. The figure depicts the areas of interest acquired with STM and locks, as well as the move and shoot ranges. Locks are more conservative, acquiring a larger area of interest, while STM gradually acquires ownership of the area that may be affected by the action
3.2 Pseudo-code for processing actions in SynQuake. a) shows the locks synchronization mechanism; b) shows the way TM acquires ownership of game objects affected by the action
3.3 Load Balancing Policies. The first policy assigns players to threads in a simple round-robin fashion. In the spread algorithm, the players located within a grid unit are assigned to the same thread. The locality-aware policy assigns a set of regions to the same thread, according to a given algorithm
4.1 Quest scenarios offering different levels of contention
4.2 Comparison between the STM-based and lock-based versions of SynQuake in different quest scenarios; a) shows the scalability of STM and Locks when increasing the number of threads from 1 to 4; b) shows the STM overheads compared to the Locks version, in terms of absolute processing time. Although STM scales better, the overheads of access tracking result in worse absolute times compared to the locks
4.3 Comparison between STM and Locks when physics computation is modeled. The physics computation hides the high STM overheads and thus improves performance for STM, which overcomes Locks in terms of absolute processing time when running with 2 and 4 threads
4.4 The effect of load balancing on scaling in STM vs Locks, for short and long range actions. A good load balancing scheme, such as spread and locality-aware, influences STM by reducing true sharing between concurrent threads. For locks, due to a higher degree of false sharing than in the case of STM, load balancing has almost no impact on performance
4.5 Load balancing policies. This picture depicts the performance of our load balancing algorithms in terms of total processing time, under several quest scenarios. In the first two scenarios, we show the benefit of using a locality-aware policy, both static and dynamic performing significantly better. In the third scenario, we show the shortcomings of a static policy when quest locations are not known beforehand, as opposed to a dynamic policy which detects and adapts to transient quest locations. In the last scenario, we present the performance of our algorithms when quests migrate randomly during gameplay, showing that the dynamic policy renders the best overall performance
4.6 Scenario with 4 quests located in the center of each map quadrant. This figure depicts the players flocking around 4 quest locations and the thread assignment when using the static and the dynamic locality-aware load balancing algorithms. Due to detecting player agglomerations dynamically, the assignment of the dynamic policy is identical to the manual ideal assignment of the static algorithm
4.7 Scenario with 4 quests located on the first major splits in the areanode tree. This figure depicts the players flocking around 4 quest locations and the thread assignment when using the static and the dynamic locality-aware load balancing algorithms. By detecting player agglomerations dynamically, the dynamic policy results in a better assignment than the static one
4.8 Scenario with 4 quests located in the south-eastern map quadrant. This figure depicts the players flocking around 4 quest locations located in a single map quadrant. This presents a worst-case scenario for the static assignment. Basically, when quest locations are not known from the beginning, the static policy results in a poor workload distribution, in this case degenerating into all the players getting assigned to the same thread. By dynamically detecting player agglomerations, the dynamic locality-aware policy does not experience this problem
4.9 Load balancing vs. synchronization. This figure shows the tradeoff between gaining more parallelism and introducing more synchronization. In the scenario with a quest centrally placed on the map, one player crowding is detected. If this component is assigned to multiple threads, the extra parallelism gained overcomes the extra synchronization required

List of Tables

4.1 This table presents the player distribution for the load balancing policies tested in Scenario 1. The players are guided equally towards four quests located in the center of the four map quadrants. The per-processor player quota is given as a percentage of the total number of players
4.2 This table presents the player distribution for the load balancing policies tested in Scenario 2. The players are guided equally towards four quests located in pairs on the first two major bisections in the areanode tree. The per-processor player quota is given as a percentage of the total number of players
4.3 This table presents the player distribution for the load balancing policies tested in Scenario 3. The players are guided towards four quests located in the south-west quadrant of the game map. The per-processor player quota is given as a percentage of the total number of players

Chapter 1. Introduction

In recent years, interactive multiplayer online games have emerged as an important application domain, influencing developments in computational technology from better hardware for graphics processing to better software and optimized algorithms. The popularity of these online games has been directly linked to their ability to support a large number of players. Handling a high volume of real-time interactions between players can prove to be a very difficult task. All player communication is generally conducted through messages containing requests or updates sent to or received from a central server. This exposes the game server as the main bottleneck.

Parallel computation techniques have been accepted as a way to alleviate the pressure on the server. Processing server work in parallel can significantly reduce the response time of game servers and thus increase the maximum number of players that can be handled, as well as ensure a better gaming experience on the client side. However, efficiently restructuring game code and providing guarantees for data consistency (e.g., synchronizing accesses to shared data) are hard to achieve, due to dynamic artifacts and code complexity. In the recent past, efforts directed towards the parallelization of multiplayer online games have been scarce. An attempt at parallelizing the popular first-person-shooter game

Quake [1] concluded that providing synchronization guarantees while at the same time achieving good scalability is a highly complex task, especially because of the conservative locking inherently required by the game processing.

In this dissertation, we study the parallelization of multiplayer online games using two different approaches: lock-based and Software Transactional Memory (STM)-based synchronization. Transactional Memory (TM) is an emerging alternative paradigm for parallel programming of generic applications, which promises to facilitate more efficient, programmer-friendly use of the plentiful parallelism available in chip multiprocessors and on cluster farms. The main idea is to simplify application programming in parallel and distributed environments through the use of transactions. TM allows transactions on different processors to manipulate shared in-memory data structures concurrently in an atomic and serializable, i.e., correct, manner. Many commercial and research prototypes for supporting TM in software have been introduced recently, including Intel's freely available STM compiler [8], RSTM [10], TL2 [6], etc.

However, STM uptake by the wider programming community has been slow for two main reasons. First, demonstrable efforts from the research or commercial communities towards parallelizing realistic applications using TM have been scarce. This is at odds with the claim that TM makes parallelization easy. Second, the existing STM-based efforts have shown that the software overheads of run-time transactional support do not currently allow efficient parallelization for any known application; STM-based parallelizations typically perform substantially worse than simple single-mutex-based implementations.

Our case study is the first realistic, high-impact application where TM support provides both ease of programming and better performance than that achievable with state-of-the-art lock-based programming. Specifically, we study parallelizing multiplayer online game server code on a game benchmark modeled after Quake. Towards this, we leverage

an existing software TM library, libtm [9], that can be used in conjunction with generic C or C++ programs.

Parallelization of multiplayer game code for the purpose of scaling the game server is inherently difficult. Game code is typically complex, and can include the use of spatial data structures for collision detection, as well as other dynamic artifacts that require conservative synchronization. The nature of the code may thus induce substantial contention due to false sharing, as well as true sharing, between threads in a parallel lock-based game implementation.

For example, substantial false sharing in both space and time can occur in a parallel implementation of the popular first-person shooter game Quake [1], as follows. Each Quake player action usually includes dynamically evolving sub-actions; a person may move while shifting items in their backpack, throwing an object at a distance, grabbing a nearby object, and/or shooting, which together constitute a single player action. Since the terrain within the potentially affected area may contain mutable objects, all sub-actions need to be processed together as an atomic, consistent unit for the purposes of collision detection with other player actions. In Quake, each player action is thus processed based on a bounding-box estimate, e.g., given as a sphere around the initial player position, for the possible range of the intended action, as shown in Figure 1.1. In a parallel lock-based server code implementation [1], this translates into eagerly acquiring ownership of all potentially affected objects of the game map within this bounding box before processing the action. This conservative locking induces unnecessary conflicts, by locking more objects than necessary and holding these locks for longer than needed. In contrast, with Transactional Memory support, a player action can be split into time-steps, or its constituent sub-actions, with collision detection performed dynamically, and hence more accurately, as the avatar encounters various obstacles which potentially change its direction of movement, as shown by the intermediate steps in Figure 1.1.

Figure 1.1: Processing player move: Radius-based locking around player position.

The atomicity and consistency of the whole player action are automatically provided by the underlying transactional support. STM support thus results in reduced false sharing overall, in both space and time.

Aside from false sharing, game code scalability is also orthogonally affected by the true sharing induced by player actions on objects located at the boundary between partitions of the game world assigned to different threads. Consequently, task-to-thread assignment for load balancing purposes needs to be done in a locality-aware fashion in the game world, thus reducing the number of boundaries and hence the potential conflicts on boundary objects.

In order to facilitate experimentation, and to study a range of game genres and in-game scenarios, we have developed an in-house game benchmark, called SynQuake. SynQuake is primarily based on the popular open-source Quake game, but it is a full-

fledged multiplayer game in its own right. SynQuake extracts representative features of many first-person shooter games, or of strategy games involving a mix of short-range and long-range interactions. SynQuake can be driven either by human players or by robot players, and can emulate synthetically generated quest scenarios, i.e., hot-spots in the game world, to facilitate experimental evaluation. We use the Quake 3 areanode tree as a standard game code spatial data structure facilitating the storage and retrieval of the location and attributes of game objects on the game map. While we paid close attention to accurately modeling realistic game server data structures and representative access patterns to these data structures, the graphics are simple and 2D, and game world physics computation is limited to 2D collision detection and path finding around obstacles. We argue that the complex physics computations present in many games would involve only non-transactional data accesses, and hence would hide the overheads of STM. Thus, with SynQuake, we showcase the worst-case scenario for the performance of STM-based versus lock-based parallelization of any multiplayer game.

Our detailed comparative performance evaluation shows that the main factor affecting overall scaling is the false sharing inherent in the parallelization scheme. Since STM reduces the degree of false sharing in both space and time, we experimentally show that the scaling of STM-based SynQuake from one to four server threads is better than the scaling of lock-based SynQuake in all game scenarios we study. We further show that the better scaling of STM results in better overall performance of STM-based SynQuake compared to lock-based SynQuake at four server threads in all scenarios involving a minimum of physics computation.

We explored several load balancing techniques, ranging from a policy that focuses entirely on equally distributing tasks among threads, while sacrificing locality, to an approach that minimizes true sharing but offers no guarantee with respect to load distribution. We observed that a locality-aware thread assignment policy is critical for minimizing true sharing.

Our performance evaluation also shows that the load balancing scheme makes for significant improvements in scalability only for STM-SynQuake. For a lock-based parallelization, due to the dominant impact of false sharing, decreasing the degree of true sharing through an appropriate load balancing scheme makes little or no difference. Finally, parallelization with STM automatically hides the details of game consistency maintenance across the boundaries of an underlying partitioned world, thus offering players the view of a large seamless world which could be hosted on parallel, distributed, or hybrid platforms, completely transparently.

In summary, our main contributions consist of analyzing the factors that affect the scaling and overheads of the STM-based and lock-based implementations of SynQuake. Specifically, we study the following factors: i) the effect of false sharing patterns, ii) the influence of load balancing on reducing true sharing patterns, and iii) the performance gain of a locality-aware adaptive load balancing policy, in terms of maximizing parallelism and minimizing synchronization. We show that the STM-based implementation outperforms the traditional lock-based parallelization for all realistic game scenarios analyzed. Analyzing the aspects enumerated above contributes substantially to providing support for scalable and transparent parallelization of multiplayer game servers.

In Chapter 2 we outline the design of our game benchmark, SynQuake, as well as its architectural features and relevant data structures. We then examine the synchronization and load balancing challenges of parallelizing SynQuake in Chapter 3. Chapter 4 presents the experimental results comparing the performance of the lock-based and STM-based versions of SynQuake, followed by a discussion of related work in Chapter 5. Chapter 6 concludes the dissertation.

Chapter 2. Application and Programming Environment

2.1 Application Environment: SynQuake

This chapter presents the design and implementation details of our SynQuake game. First, we describe the features of the SynQuake game, specifically the game world, the game entities (players, food items and walls), the game actions, and additional features such as quests. Our game simulates real-game features and player interactions. For testing purposes, we have additionally created the option of robot players, which perform actions according to a simple AI algorithm.

Second, we describe the game architecture and data structures. As in the famous Quake game, we use an areanode tree data structure for representing the game map and entities. This structure uses a binary space partitioning technique which facilitates quick retrieval of game objects based on their locations in the game world.

Further, we describe the SynQuake game server. Adopting the same model as Quake, we have implemented the SynQuake server as a succession of server frames, each frame decomposed into a set of stages separated by synchronization barriers.

Each stage has a specific role (e.g., receiving and processing client requests, forming and sending replies back, etc.). The stages which constitute a significant part of the processing time require parallelization. We shall discuss this in detail in the next chapter.

2.1.1 SynQuake game

Multiplayer Online Games (MOGs) are not only difficult to build, but also very expensive to maintain and administer. Except in isolated cases, such as the well-known first-person shooter game Quake, popular MOGs do not share their server code. Furthermore, only a few game scenarios are available with Quake 3, and the game map used in a previous study on Quake parallelization [1] is too small, resulting in excessive player crowding when scaling the player population to drive a larger number of server threads. To solve these problems, we have built a simplified version of Quake, called SynQuake.

SynQuake models three types of game entities: players, resources (represented by apples) and walls. A typical game map for SynQuake is presented in Figure 2.1. The game map is created offline and does not change during gameplay.

Figure 2.1: Screen shot of SynQuake. Brick-texture blocks represent walls / obstacles, while resources (food) are represented by apples.

Each game entity is defined by its position on the game map and by a set of attributes specific to its type. For example, besides its position on the game map, a player is described by its life (or health) level, its speed and its direction. Food items (apples) are described by their units of nutrition. A player can take only a limited amount of food at a time, within a short range around the player; eating food increases the player's health level, which has the visual result of increasing the player's stature/size in the game, depending on the number of nutrition units in the food item (apple). Once the food in one location is gone, it can be regenerated, but it is likely to appear in a different location. The player with the larger stature/size, i.e., who has eaten more food, wins a fight between two players.

Players are mutable game entities that can have both their position and their attributes modified as a result of game interactions. For example, an attack decreases a player's life, while consuming a resource increases it. Resources are partly mutable, e.g., apples can have their attributes affected by game play, but not their position, while walls are immutable entities. As a result, each of these game objects requires a different level of synchronization in parallel SynQuake, from full synchronization protection for players to no protection for walls.

As part of game play, players perform short-range actions, such as moving, consuming resources and fighting with other players, or long-range actions, such as simultaneously moving and shooting. Each of these actions can cause conflicts between different threads processing player actions concurrently. For example, conflicts occur when two players try to move to the same spot, or when one player gets attacked while consuming a resource.
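To make the entity model above concrete, the sketch below shows one plausible way to represent the three entity types and the attributes listed here; the type and field names are illustrative assumptions, not SynQuake's actual declarations.

// Illustrative sketch of the SynQuake entity model described above.
// All identifiers are hypothetical; only the attributes follow the text.
struct Position { float x, y; };

enum class EntityKind { Player, Resource, Wall };

struct Entity {
    EntityKind kind;
    Position   pos;          // every entity has a position on the game map
};

struct Player : Entity {     // fully mutable: position and attributes change
    int   life;              // health level; also determines stature/size
    float speed;
    float direction;
    int   owner_thread;      // thread currently assigned to process this player
};

struct Resource : Entity {   // partly mutable: attributes change, position does not
    int nutrition_units;     // eating increases the player's life/stature
};

struct Wall : Entity {};     // immutable: requires no synchronization

The owner_thread field is not mentioned in the text; it is added here only so that the load-balancing sketch in Chapter 3 has somewhere to record thread assignments.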

To simulate areas of high interest in the game, and the associated pattern of players flocking to a particular area of the map, we have added quests, which attract players towards that area with high probability. These correspond to the standard areas attracting players in Quake 3, and also in strategy games, such as a camp site, a weaponry location, or a health area. Complex game scenarios can be recreated in SynQuake by varying the distribution of quests in time and space.

The SynQuake game server thus captures all the basic interactions in multiplayer games, and accurately represents the network and system features of a game server. Furthermore, we can emulate various game genres by varying the size and number of objects, as well as the update frequency. In order to replicate massive multiplayer scenarios, players can be driven by a simple AI algorithm that has players moving with high probability towards a quest, if one is present, eating if hungry, fighting with other players, or fleeing if chased by a stronger opponent.
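The robot-player behaviour just described amounts to a small priority-ordered decision routine; the sketch below is only one plausible rendering of it, with the World type, every helper function and the probability value assumed.

// Hypothetical sketch of the simple robot AI described above: head for a
// quest with high probability, otherwise eat, fight or flee.
void decide_action(Player& p, World& w) {
    if (w.has_quest() && with_probability(0.8))   // probability value assumed
        move_towards(p, w.nearest_quest(p));
    else if (is_hungry(p) && w.food_nearby(p))
        eat_nearest_food(p, w);
    else if (Player* opponent = w.nearby_opponent(p)) {
        if (opponent->life > p.life)
            flee_from(p, *opponent);              // chased by a stronger player
        else
            attack(p, *opponent);
    } else {
        wander(p);
    }
}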

2.1.2 Game architecture and data structures

Our server stores a representation of the game world and all the game entities. Since it is critical for the server to quickly generate the list of entities that a player interacts with in any given action, we need an efficient representation of the game map. Therefore, similar to Quake [1], in order to facilitate an efficient search for all entities involved in a particular action, we represent the virtual game map using a spatial data structure called the areanode tree (see Figure 2.2b).

Figure 2.2: Areanode tree structure: a) game map, b) areanode tree. Each node maintains a list of game entities fully contained in its corresponding game region. A game object may be located in a leaf if it is fully contained geometrically in the corresponding game map region, or in the common ancestor of the leaves it overlaps.

The areanode tree is a binary space partitioning tree, where each node represents a specific region of the game map. The tree is constructed by recursively dividing the map into sub-units, starting with the root node, which corresponds to the entire game world. Nodes on subsequent levels of the tree are created by splitting the region corresponding to the parent node along its median segment. The splits are performed alternately along the x and y axes, until a predefined tree depth (or the maximum possible split granularity) is reached. The leaves of the areanode tree form a grid of equal-sized regions on the game map. For the rest of this dissertation, we will use the terms grid unit and tree leaf interchangeably.
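A minimal sketch of the recursive construction just described is shown below, splitting alternately along the x and y axes down to a fixed depth; the node layout and all names are assumptions rather than the actual Quake/SynQuake data structures, and the Entity type is the one from the earlier sketch.

// Sketch: building an areanode tree by recursive binary space partitioning.
// Splits alternate between the x and y axes until max_depth is reached.
#include <list>

struct Rect { float min_x, min_y, max_x, max_y; };

struct AreaNode {
    Rect               region;
    AreaNode*          child[2] = {nullptr, nullptr};  // null in leaves (grid units)
    std::list<Entity*> entities;      // entities fully contained in this region
};

AreaNode* build_tree(const Rect& region, int depth, int max_depth) {
    AreaNode* node = new AreaNode{region};
    if (depth == max_depth)
        return node;                  // leaf: one grid unit of the map

    Rect lower = region, upper = region;
    if (depth % 2 == 0) {             // even depth: split along the x axis
        float mid = (region.min_x + region.max_x) / 2;
        lower.max_x = upper.min_x = mid;
    } else {                          // odd depth: split along the y axis
        float mid = (region.min_y + region.max_y) / 2;
        lower.max_y = upper.min_y = mid;
    }
    node->child[0] = build_tree(lower, depth + 1, max_depth);
    node->child[1] = build_tree(upper, depth + 1, max_depth);
    return node;
}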

Each entity on the map is maintained inside the finest-grained tree node whose corresponding region completely overlaps it. This translates into leaf nodes maintaining the objects that are fully contained inside grid units, while the rest of the entities are placed in the common ancestor of all the grid units they overlap. That is, each leaf in the areanode tree maintains a list of entities fully contained inside the game map area represented by the leaf, while game entities overlapping two or more regions are placed in a list maintained in a unique parent node of the overlapped leaves. This parent node can be one level up (if the entity overlaps its immediate descendants) or several levels higher (the closest common ancestor of the crossed leaves). For example, an entity crossing the first split segment of the game world will be placed in the root node.
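The placement rule above (each entity lives in the finest node whose region fully contains it) can be sketched as a short recursive descent over the same hypothetical structures used in the previous sketches; an entity's bounding box is passed in explicitly.

// Sketch: insert an entity into the finest-grained node that fully contains
// its bounding box; otherwise it stays at the closest common ancestor.
bool fully_contains(const Rect& region, const Rect& box) {
    return box.min_x >= region.min_x && box.max_x <= region.max_x &&
           box.min_y >= region.min_y && box.max_y <= region.max_y;
}

void insert_entity(AreaNode* node, Entity* e, const Rect& box) {
    for (AreaNode* child : node->child) {
        if (child != nullptr && fully_contains(child->region, box)) {
            insert_entity(child, e, box);   // descend into the finer region
            return;
        }
    }
    // No child fully contains the entity (it crosses a split, or this is a
    // leaf), so keep it in this node's entity list.
    node->entities.push_back(e);
}

An entity that crosses the very first split, for instance, fails the containment test for both children of the root and therefore ends up in the root's list, matching the example above.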

Figure 2.2 shows a map for which game objects are indexed using a two-level areanode tree. Figure 2.2a shows the resulting grid units corresponding to each leaf, while Figure 2.2b presents the tree structure resulting from splitting the root node vertically and its children horizontally. The map is populated with three objects: one completely contained inside leaf B2, and correspondingly placed in B2's list inside the tree; a second one found on the border between grid units A1 and A2, and as a result maintained in A's list; and a third located on the border between A1 and B1 and kept inside the root node, since the root's region is the smallest one that completely overlaps it.

The areanode splits are not in any way related to particular features of the game world: they are not performed along walls, nor are they chosen using specific knowledge of the game map (e.g., quest locations, resource locations). Since entities migrate during game play (e.g., players move, apples get respawned after being consumed), the areanode tree must be updated accordingly in order to reflect the new position of each game entity. This accounts for the most significant source of contention among processing threads when parallelizing the game server.

2.1.3 Game server structure

In this section, we describe the structure of the SynQuake game server code and its components. In the original parallel Quake server code, server processing consists of three stages: world physics update, request processing and reply processing. Following the same design, the SynQuake server loops through three stages which form a server frame, or iteration. These stages are: request processing, administrative tasks and reply processing (see Figure 2.3).

Figure 2.3: SynQuake: server frame structure and component stages (1: Receive & Process Requests; 2: Admin, single thread; 3: Form & Send Replies). The server contains a request processing stage, an administrative stage and a reply stage. Server stages are separated by synchronization barriers, represented by dotted lines.

In the first stage, the server receives requests from clients (players), distributes them to the worker threads according to a load balancing policy, and processes them in parallel. The novelty in our implementation is the admin stage: here we rebalance the workload by reassigning players to threads where needed, since the game configuration changes after each set of player action requests. Because this stage sits between receiving requests and sending replies, we ensure that the time spent on load balancing is short enough that no lag is experienced on the client side. Finally, in the reply processing stage, the server threads send updates to their assigned clients. After all threads reach the final synchronization barrier, a new server frame starts.
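As a rough skeleton of the frame structure in Figure 2.3, each worker thread could loop over the three stages with pthread barriers between them, the admin stage being executed by a single thread; the stage bodies below are placeholders, not the actual server code.

// Skeleton of one SynQuake-style server frame per worker thread.
// The three stage functions are placeholders for the real server logic.
#include <pthread.h>

pthread_barrier_t stage_barrier;          // initialized for NUM_THREADS elsewhere

void* worker_loop(void* arg) {
    long tid = (long)arg;
    for (;;) {
        process_assigned_requests(tid);             // stage 1: in parallel
        pthread_barrier_wait(&stage_barrier);

        if (tid == 0)                               // stage 2: admin, single thread
            rebalance_players_across_threads();
        pthread_barrier_wait(&stage_barrier);

        send_replies_to_assigned_clients(tid);      // stage 3: read-only replies
        pthread_barrier_wait(&stage_barrier);       // frame ends; next one starts
    }
    return nullptr;
}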

Similar to the existing parallel implementation of Quake [1], parallelization in SynQuake is performed at the granularity of individual stages, while enforcing serial execution of consecutive stages through synchronization barriers (Figure 2.3). We have imposed the same invariants as in parallel Quake: a) each server frame is distinct and does not overlap with other server frames, and b) each server frame executes its component stages in the original order. This ensures consistency and avoids any possible complications due to phase reordering or overlapping. While these constraints may be relaxed, this problem is beyond the scope of this work.

Synchronization is not necessary during the read-only reply stage, since its execution does not overlap any of the other stages. The administrative phase does not require any synchronization either, since its tasks, e.g., load balancing, are executed by a single thread. Consequently, the stage handling client requests is the only stage requiring protection against concurrent accesses to shared data. When leveraging parallelism in this stage, we must ensure data consistency by using synchronization methods. The standard synchronization method is to acquire locks for shared data until the current owner thread finishes its operation and exits the critical section. An alternative to lock-based synchronization is a transactional memory runtime system. We use a software transactional memory library called libtm, which makes it possible to detect data races and ensures correct execution of the parallel game server. In the next section, we present details regarding both programming environments.

2.2 Programming environment

We use the standard pthreads interface for our lock-based parallelization, and an existing Software Transactional Memory library, libtm, for the STM-based parallelization of SynQuake. In our parallel SynQuake, the number of threads is constant throughout all server execution phases and matches the number of processors available. Each thread is assigned to a separate processor, in order to avoid cache invalidations; we use the sched_setaffinity call to bind the execution of a given thread to a particular processor using the appropriate bitmask. The worker threads are spawned at the server initialization stage, and inter-thread communication is achieved using a global shared memory model. We have implemented two versions of the game: one using locks to synchronize shared memory accesses, and a second using an STM library for memory access tracking.
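For reference, the thread-to-processor binding mentioned above can be done with a few lines around sched_setaffinity on Linux; the wrapper below is a minimal sketch (error handling omitted).

// Pin the calling thread to a given CPU using sched_setaffinity (Linux).
#define _GNU_SOURCE
#include <sched.h>

void pin_to_cpu(int cpu_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_id, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);   // pid 0 means the calling thread
}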

The STM allows transactions on different processors to manipulate shared in-memory data structures concurrently in a data-race-free manner. Instead of explicit fine-grained locking of data items, the programmer specifies the beginning and end of parallel regions with transaction delimiters. Runtime support provided by the library automatically detects data races between concurrent transactions and ensures correct parallel execution. Any detected incorrect execution resulting from a data race causes one or more transactions to be rolled back and restarted. The run-time system automatically detects which memory regions are read and written by a transaction, and maintains the recoverability of data for the written ranges of memory.

2.2.1 libtm library interface

In a TM program supported by the libtm library, transactions need to be delineated with begin_transaction and commit_transaction statements. Furthermore, shared data needs to be distinguished from private per-thread data accessed inside transactions. For this purpose, transactional shared and private variables should be declared using the meta-types tm_shared and tm_private, respectively. For example, a shared variable int x in the original program needs to be declared as tm_shared<int> x. Each of these meta-types is defined in our library as a C++ template that takes the original type of the variable as a parameter (e.g., tm_shared<original_type>).

Uses of Type Declarations in libtm: Declarations of tm types are used in libtm for run-time access tracking through operator overloading. Specifically, any read or write accesses to shared and private transactional variables are tracked inside the implementation of the overloaded conversion or assignment operators. Furthermore, libtm maintains recovery data for both tm_shared and tm_private variables updated in transactions, while performing conflict detection and resolution only for tm_shared variables.
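As an illustration of this interface, the fragment below declares a shared variable with the tm_shared meta-type and updates it inside a transaction. The header name is an assumption, and the delimiter spelling follows the BEGIN_TRANSACTION()/COMMIT_TRANSACTION() construct mentioned in Chapter 3 rather than libtm's verbatim API.

// Sketch of libtm-style usage following the description above; the header
// name and macro spellings are assumptions, not libtm's exact interface.
#include "libtm.h"                     // hypothetical header

tm_shared<int> player_life;            // shared: tracked, conflict-checked, recoverable

void consume_food(int nutrition_units) {
    tm_private<int> new_life;          // private per-thread, still made recoverable
    BEGIN_TRANSACTION();
    // Reads and writes of tm_* variables are intercepted by the overloaded
    // conversion and assignment operators and logged by the runtime.
    new_life    = player_life + nutrition_units;
    player_life = new_life;
    COMMIT_TRANSACTION();
}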

2.2.2 Access tracking and conflict detection in libtm

Conflicts are detected based on the meta-data information encapsulated in each tm_shared object. While this layout allows access tracking to take place at word-level granularity, libtm also provides the capability of varying the granularity of access tracking dynamically at runtime. This is achieved through a redirection mechanism that can remap a tm_shared variable to any meta-data object indicated by the programmer. By remapping several semantically linked tm_shared variables to the same meta-data object, we can effectively coarsen the granularity of access tracking, and hence reduce the bookkeeping overheads inside the library. This approach offers several benefits compared to a strict object-based granularity. First, remapping can take place at runtime and can factor in semantic aspects that may vary over time. Second, it can handle dynamic data structures, as well as variables that are not co-located in memory, whereas objects need to be statically determined at compile time.

The conflict detection mechanism used by libtm is a variation on the classic two-phase locking concurrency control algorithm, resulting in a blocking implementation of transactional memory semantics. libtm uses an optimistic approach for detecting conflicts between transactions: it acquires ownership of written data at commit time, while allowing readers access to the old values until then. Any read-write conflicts detected at commit time are resolved by aborting the conflicting reader transactions.
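To illustrate the flavour of this scheme (not libtm's actual data structures or algorithm), the following conceptual sketch shows a writer acquiring ownership of its write set at commit time and aborting any transactions that read the old values.

// Conceptual sketch only: commit-time ownership acquisition with reader
// aborts, in the spirit of the scheme described above. Real implementations
// also need a global lock order (or deadlock avoidance) and safe read-set
// bookkeeping, which are omitted here.
#include <mutex>
#include <set>
#include <vector>

struct Transaction;

struct MetaData {
    std::mutex             owner_lock;       // taken by the committing writer
    std::set<Transaction*> active_readers;   // transactions reading the old value
};

struct WriteEntry {
    MetaData* meta;        // meta-data object guarding the written location
    int*      location;    // written memory word
    int       new_value;   // buffered value, not yet visible to readers
};

struct Transaction {
    std::vector<WriteEntry> write_set;
    void abort_and_restart();                // roll back and re-execute (not shown)
};

void commit(Transaction& tx) {
    // 1. Acquire ownership of all written data at commit time.
    for (WriteEntry& w : tx.write_set)
        w.meta->owner_lock.lock();

    // 2. Read-write conflicts are resolved by aborting the conflicting readers.
    for (WriteEntry& w : tx.write_set)
        for (Transaction* reader : w.meta->active_readers)
            if (reader != &tx)
                reader->abort_and_restart();

    // 3. Install the buffered values and release ownership.
    for (WriteEntry& w : tx.write_set) {
        *w.location = w.new_value;
        w.meta->owner_lock.unlock();
    }
}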

In the next chapter, we discuss the parallelization-related challenges we encountered in SynQuake in these two programming environments.

Chapter 3. Parallelization of SynQuake

We present two alternative parallelization techniques for SynQuake: a parallel version using lock-based synchronization and a parallel version leveraging a transactional memory runtime system. We use the software transactional memory library libtm, presented in Section 2.2, which enables the detection of conflicts on shared memory accesses and guarantees correct execution of the game server code.

In the following, we describe the parallelization of the request stage in detail, including the parallelization challenges and our strategy for both a lock-based implementation and a transactional memory implementation. In terms of parallelization challenges, we discuss in detail the false sharing effects, which are significant in the lock-based version because of the conservative locking needed when performing collision detection, and reduced in the case of the STM approach. We compare the two parallel game code implementations, showing that STM facilitates the programmer's task by using simple constructs to mark critical sections, without the need to worry about problems such as deadlocks. We then present the importance of load balancing in reducing true sharing patterns and thus minimizing the synchronization occurring among processing threads. Furthermore, we stress the importance of locality-awareness in balancing the workload, and introduce a set of load balancing algorithms, including a new dynamic locality-aware policy.

3.1 Synchronization issues for player actions: false sharing

For processing a player action, the server needs to perform collision detection against all game objects intersecting the player's trajectory. Since the avatar's direction can be altered by collisions with entities situated in its path, the avatar's trajectory and its final position are impossible to predict at the beginning of the action. However, the whole action and its effects on the game world need to appear to the players as a consistent, atomic unit. Therefore, processing a client request in Quake consists of: i) computing the potential area of the game map impacted by the action, which we will call the area of interest of the action, and then ii) performing the game action, by determining its effects upon game entities.

This pre-existing request processing scheme leads to a potentially significant reduction in the degree of false sharing for a TM-based versus a lock-based game parallelization scheme, in terms of both i) the number of objects involved in collision detection and ii) the duration of the potential conflicts for these objects, as we explain in the following.

In a lock-based implementation, the entire area of interest of the action needs to be conservatively locked for the duration of all processing related to the action. For example, let's consider the scenario in Figure 3.1, where the player executes a shoot-after-move compound action. Since the player may stop short of its intended destination, or change direction during the move, e.g., due to encountering an obstacle, the long-range (shooting) interaction may occur at any point in time during the move itself. Hence, the lock-based implementation needs to compute an area of interest, and conservatively lock all objects corresponding to the long-range interaction at all possible points of the player's trajectory.

Figure 3.1: Areas of interest for a move followed by an attack. The figure depicts the areas of interest acquired with STM and with locks, as well as the move and shoot ranges. Locks are more conservative, acquiring a larger area of interest, while STM gradually acquires ownership of the area that may be affected by the action.

This could incur substantial false sharing between threads, in both space and time. In contrast, a TM-based implementation implicitly acquires access to objects gradually, as the server progresses through the execution of the action. The move can thus be decomposed into sub-actions, and collision detection can be performed for each sub-action as it dynamically happens, resulting in a substantially smaller bounding box for the overall action, as shown in Figure 3.1, as well as a shorter total time of protected access to the objects involved.
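As a small illustration of how conservative the lock-based estimate has to be, the sketch below computes one rectangle covering every point of the intended move expanded by the shooting range, which is what would be locked up front; the geometry is an assumption and the helper types reuse the hypothetical Rect/Position structures from Chapter 2.

// Sketch: conservative area of interest for a shoot-after-move action.
// The whole rectangle is locked up front in the lock-based version, even
// though the player may stop or turn anywhere along the move.
#include <algorithm>

Rect area_of_interest_with_locks(const Position& start,
                                 const Position& intended_dest,
                                 float shoot_range) {
    Rect box;
    box.min_x = std::min(start.x, intended_dest.x) - shoot_range;
    box.max_x = std::max(start.x, intended_dest.x) + shoot_range;
    box.min_y = std::min(start.y, intended_dest.y) - shoot_range;
    box.max_y = std::max(start.y, intended_dest.y) + shoot_range;
    return box;
}
// The TM version instead computes a much smaller box per sub-action (one move
// step, or one shot from the position actually reached) as the action unfolds.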

3.2 Synchronization algorithms for request processing

In the following, we describe the synchronization protocol used in the processing stage of the server. We present our two designs for maintaining the consistency of the game map: lock-based and STM-based. The pseudocode for both versions is included in Figure 3.2.

In the lock-based version (Figure 3.2a), we first compute the area of interest corresponding to the currently executing action. Ownership of this entire area of interest is then acquired by locking all the areanode tree leaves overlapping it, in a predefined order, such that deadlocks are avoided. For simplicity, this ordering is provided by a depth-first traversal of the tree. Next, we process all entities overlapping the area of interest that are currently maintained at the leaf level of the areanode tree. For entities maintained at higher levels within the tree, i.e., entities intersecting multiple grid units, we need to additionally lock their corresponding (higher-level) nodes. Locking these higher-level nodes is necessary in order to maintain the integrity of their entity lists; these lists may be modified concurrently by other player actions accessing objects co-located in the parent node but outside our area of interest. Since parent nodes are sensitive to contention, their locks are released as soon as the processing of their entities is complete. Finally, we relinquish ownership of the area of interest by releasing the locked leaves.

Figure 3.2b presents the transactional version of the synchronization protocol for action processing. As before, we first compute the area of interest associated with the current action. We then process all entities located within both the leaves and the parent nodes overlapping the area of interest, with consistency being seamlessly ensured by the underlying TM library. For the transactional approach, the programmer only has to mark the critical section

with a BEGIN_TRANSACTION() - COMMIT_TRANSACTION() construct, without having to worry about any integrity issues related to game play or to the data structures used. As a result, the STM-based approach offers a simpler programming interface, transparently tracking memory conflicts on shared data and ensuring correct execution by aborting and rolling back the conflicting transactions.

/* Algorithm: Action processing in SynQuake using locks */
function processAction( actionPlan, player )
{
    // Compute expanded area of interest for the entire action plan
    expanded_range = computeExpandedAreaOfInterest( actionPlan, player );

    /* Lock expanded area of interest */
    // Get areanode leaves overlapping the expanded area of interest
    expandedLeavesSet = getOverlappingLeaves( area_tree, expanded_range );
    foreach leaf in expandedLeavesSet
        Lock( leaf );

    foreach sub-action in actionPlan
    {
        // Get areanode leaves overlapping the sub-action's area of interest
        range = computeAreaOfInterest( sub-action, player );
        leavesSet = getOverlappingLeaves( area_tree, range );

        /* Processing sub-action in leaf nodes */
        foreach leaf in leavesSet
            foreach entity in leaf.entitySet
                perform( sub-action, entity );

        /* Processing sub-action in parent nodes */
        // Get areanode parents overlapping the area of interest
        parentsSet = getOverlappingParents( area_tree, range );
        foreach parent in parentsSet
        {
            // Temporarily lock parent
            Lock( parent );
            foreach entity in parent.entitySet
                perform( sub-action, entity );
            Unlock( parent );
        }
    }

    /* Unlock expanded area of interest */
    foreach leaf in expandedLeavesSet
        Unlock( leaf );
}

(a) Lock-based version

/* Algorithm: Action processing in SynQuake using TM */
function processAction( actionPlan, player )
{
    BEGIN_TRANSACTION();

    foreach sub-action in actionPlan
    {
        // Compute area of interest for the current sub-action
        range = computeAreaOfInterest( sub-action, player );

        /* Processing sub-action in leaf nodes */
        // Get areanode leaves overlapping the area of interest
        leavesSet = getOverlappingLeaves( area_tree, range );
        foreach leaf in leavesSet
            foreach entity in leaf.entitySet
                perform( sub-action, entity );

        /* Processing sub-action in parent nodes */
        // Get areanode parents overlapping the area of interest
        parentsSet = getOverlappingParents( area_tree, range );
        foreach parent in parentsSet
            foreach entity in parent.entitySet
                perform( sub-action, entity );
    }

    END_TRANSACTION();
}

(b) STM-based version

Figure 3.2: Pseudo-code for processing actions in SynQuake. a) shows the lock-based synchronization mechanism; b) shows the way TM acquires ownership of the game objects affected by the action.

In summary, in this chapter we presented the synchronization issues and the advantage of the STM-based implementation in terms of reduced false sharing, which results from acquiring ownership of the action bounding box on the fly. Additionally, the STM library facilitates programming by providing a runtime system which seamlessly guarantees correctness and consistency.

3.3 True sharing patterns

While concurrency in multiplayer game servers can be severely limited as a result of false sharing, true sharing patterns can also degrade application performance. Since two different threads may handle players performing actions that affect the same game entities, or players interacting directly with one another, true sharing may occur on entities located at the boundary of thread assignments. Therefore, to reduce the synchronization costs resulting from true sharing, the load balancing policy should take into consideration the spatial locality of players in the game.

True sharing can thus be mitigated by using locality-aware thread-assignment policies, which try to allocate the processing of entities close to each other to the same thread. However, using locality as the sole criterion in assigning tasks can overload threads processing highly populated areas. Thread overload results in increased response times and a degraded user experience. Moreover, load imbalance results in idle time at barriers for underloaded threads. Consequently, the load balancing policy should achieve a good compromise between balanced load and reduced synchronization.

3.4 Load balancing in SynQuake

We analyzed the tradeoff between balancing load and reducing synchronization among server threads across three policies. When optimizing for an equal distribution of the workload, a naive load balancing policy assigns players to processing threads in a round-robin fashion. However, since this offers very poor spatial locality, players situated next to one another could be handled by different threads (see Figure 3.3a), causing high levels of true sharing. In order to leverage the spatial locality of players, load balancing needs to be performed at a coarser granularity than individual players. This is the case for the spread policy, which achieves uniform load across threads by mapping entire grid units to specific threads.
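To make the first two policies concrete, the sketch below contrasts pure round-robin assignment of players with assigning whole grid units (areanode leaves) to threads; it reuses the hypothetical structures from Chapter 2, and the real spread policy additionally balances the per-thread player counts, which is omitted here.

// Sketch of the two simpler assignment policies described above (simplified).
#include <vector>

// Round-robin: perfectly even counts, but neighbouring players may end up on
// different threads, causing heavy true sharing.
void assign_round_robin(std::vector<Player*>& players, int num_threads) {
    for (size_t i = 0; i < players.size(); i++)
        players[i]->owner_thread = (int)(i % num_threads);
}

// Spread-style: all players inside the same grid unit (areanode leaf) go to
// the same thread, preserving spatial locality at grid-unit granularity.
void assign_by_grid_unit(std::vector<AreaNode*>& leaves, int num_threads) {
    int next = 0;
    for (AreaNode* leaf : leaves) {
        for (Entity* e : leaf->entities)
            if (e->kind == EntityKind::Player)
                static_cast<Player*>(e)->owner_thread = next;
        next = (next + 1) % num_threads;   // real policy also balances counts
    }
}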


Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems Haris Volos 1, Adam Welc 2, Ali-Reza Adl-Tabatabai 2, Tatiana Shpeisman 2, Xinmin Tian 2, and Ravi Narayanaswamy

More information

Monitors; Software Transactional Memory

Monitors; Software Transactional Memory Monitors; Software Transactional Memory Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 18, 2012 CPD (DEI / IST) Parallel and

More information

A Practical Scalable Distributed B-Tree

A Practical Scalable Distributed B-Tree A Practical Scalable Distributed B-Tree CS 848 Paper Presentation Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah PVLDB 08 March 8, 2010 Presenter: Evguenia (Elmi) Eflov Presentation Outline 1 Background

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

An Automated Framework for Decomposing Memory Transactions to Exploit Partial Rollback

An Automated Framework for Decomposing Memory Transactions to Exploit Partial Rollback An Automated Framework for Decomposing Memory Transactions to Exploit Partial Rollback Aditya Dhoke Virginia Tech adityad@vt.edu Roberto Palmieri Virginia Tech robertop@vt.edu Binoy Ravindran Virginia

More information

The goal of the Pangaea project, as we stated it in the introduction, was to show that

The goal of the Pangaea project, as we stated it in the introduction, was to show that Chapter 5 Conclusions This chapter serves two purposes. We will summarize and critically evaluate the achievements of the Pangaea project in section 5.1. Based on this, we will then open up our perspective

More information

Concurrency Control CHAPTER 17 SINA MERAJI

Concurrency Control CHAPTER 17 SINA MERAJI Concurrency Control CHAPTER 17 SINA MERAJI Announcement Sign up for final project presentations here: https://docs.google.com/spreadsheets/d/1gspkvcdn4an3j3jgtvduaqm _x4yzsh_jxhegk38-n3k/edit#gid=0 Deadline

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains: The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Data-Race Detection in Transactions- Everywhere Parallel Programming

Data-Race Detection in Transactions- Everywhere Parallel Programming Data-Race Detection in Transactions- Everywhere Parallel Programming by Kai Huang B.S. Computer Science and Engineering, B.S. Mathematics Massachusetts Institute of Technology, June 2002 Submitted to the

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items.

Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items. Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items. If d i d j then any transaction accessing both d i and d j

More information

Concurrency Control. Transaction Management. Lost Update Problem. Need for Concurrency Control. Concurrency control

Concurrency Control. Transaction Management. Lost Update Problem. Need for Concurrency Control. Concurrency control Concurrency Control Process of managing simultaneous operations on the database without having them interfere with one another. Transaction Management Concurrency control Connolly & Begg. Chapter 19. Third

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

IN5050: Programming heterogeneous multi-core processors Thinking Parallel

IN5050: Programming heterogeneous multi-core processors Thinking Parallel IN5050: Programming heterogeneous multi-core processors Thinking Parallel 28/8-2018 Designing and Building Parallel Programs Ian Foster s framework proposal develop intuition as to what constitutes a good

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

Conflict Detection and Validation Strategies for Software Transactional Memory

Conflict Detection and Validation Strategies for Software Transactional Memory Conflict Detection and Validation Strategies for Software Transactional Memory Michael F. Spear, Virendra J. Marathe, William N. Scherer III, and Michael L. Scott University of Rochester www.cs.rochester.edu/research/synchronization/

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

ASSIGNMENT- I Topic: Functional Modeling, System Design, Object Design. Submitted by, Roll Numbers:-49-70

ASSIGNMENT- I Topic: Functional Modeling, System Design, Object Design. Submitted by, Roll Numbers:-49-70 ASSIGNMENT- I Topic: Functional Modeling, System Design, Object Design Submitted by, Roll Numbers:-49-70 Functional Models The functional model specifies the results of a computation without specifying

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1

Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1 Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1 Dongman Lee ICU Class Overview Transactions Why Concurrency Control Concurrency Control Protocols pessimistic optimistic time-based

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper Profile of CopperEye Indexing Technology A CopperEye Technical White Paper September 2004 Introduction CopperEye s has developed a new general-purpose data indexing technology that out-performs conventional

More information

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9 Transactions Kathleen Durant PhD Northeastern University CS3200 Lesson 9 1 Outline for the day The definition of a transaction Benefits provided What they look like in SQL Scheduling Transactions Serializability

More information

Point Cloud Filtering using Ray Casting by Eric Jensen 2012 The Basic Methodology

Point Cloud Filtering using Ray Casting by Eric Jensen 2012 The Basic Methodology Point Cloud Filtering using Ray Casting by Eric Jensen 01 The Basic Methodology Ray tracing in standard graphics study is a method of following the path of a photon from the light source to the camera,

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Research Collection. Cluster-Computing and Parallelization for the Multi-Dimensional PH-Index. Master Thesis. ETH Library

Research Collection. Cluster-Computing and Parallelization for the Multi-Dimensional PH-Index. Master Thesis. ETH Library Research Collection Master Thesis Cluster-Computing and Parallelization for the Multi-Dimensional PH-Index Author(s): Vancea, Bogdan Aure Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010437712

More information

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, lazy implementation 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Shared Virtual Memory. Programming Models

Shared Virtual Memory. Programming Models Shared Virtual Memory Arvind Krishnamurthy Fall 2004 Programming Models Shared memory model Collection of threads Sharing the same address space Reads/writes on shared address space visible to all other

More information

Tradeoff Evaluation. Comparison between CB-R and CB-A as they only differ for this aspect.

Tradeoff Evaluation. Comparison between CB-R and CB-A as they only differ for this aspect. Tradeoff Evaluation Comparison between C2PL and CB-A, as both: Allow intertransaction caching Don t use propagation Synchronously activate consistency actions Tradeoff Evaluation Comparison between CB-R

More information

Concurrency Control. R &G - Chapter 19

Concurrency Control. R &G - Chapter 19 Concurrency Control R &G - Chapter 19 Smile, it is the key that fits the lock of everybody's heart. Anthony J. D'Angelo, The College Blue Book Review DBMSs support concurrency, crash recovery with: ACID

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Enabling Fine-grained Access Control in Flexible Distributed Object-aware Process Management Systems

Enabling Fine-grained Access Control in Flexible Distributed Object-aware Process Management Systems Enabling Fine-grained Access Control in Flexible Distributed Object-aware Process Management Systems Kevin Andrews, Sebastian Steinau, and Manfred Reichert Institute of Databases and Information Systems

More information

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today

More information

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Chapter 3. Design of Grid Scheduler. 3.1 Introduction Chapter 3 Design of Grid Scheduler The scheduler component of the grid is responsible to prepare the job ques for grid resources. The research in design of grid schedulers has given various topologies

More information

Lecture 21 Concurrency Control Part 1

Lecture 21 Concurrency Control Part 1 CMSC 461, Database Management Systems Spring 2018 Lecture 21 Concurrency Control Part 1 These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used from

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

Advances in Data Management Transaction Management A.Poulovassilis

Advances in Data Management Transaction Management A.Poulovassilis 1 Advances in Data Management Transaction Management A.Poulovassilis 1 The Transaction Manager Two important measures of DBMS performance are throughput the number of tasks that can be performed within

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

DBT Tool. DBT Framework

DBT Tool. DBT Framework Thread-Safe Dynamic Binary Translation using Transactional Memory JaeWoong Chung,, Michael Dalton, Hari Kannan, Christos Kozyrakis Computer Systems Laboratory Stanford University http://csl.stanford.edu

More information

Practical Database Design Methodology and Use of UML Diagrams Design & Analysis of Database Systems

Practical Database Design Methodology and Use of UML Diagrams Design & Analysis of Database Systems Practical Database Design Methodology and Use of UML Diagrams 406.426 Design & Analysis of Database Systems Jonghun Park jonghun@snu.ac.kr Dept. of Industrial Engineering Seoul National University chapter

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Storage Hierarchy Management for Scientific Computing

Storage Hierarchy Management for Scientific Computing Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,

More information

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction : A Distributed Shared Memory System Based on Multiple Java Virtual Machines X. Chen and V.H. Allan Computer Science Department, Utah State University 1998 : Introduction Built on concurrency supported

More information

Programming Languages Third Edition. Chapter 7 Basic Semantics

Programming Languages Third Edition. Chapter 7 Basic Semantics Programming Languages Third Edition Chapter 7 Basic Semantics Objectives Understand attributes, binding, and semantic functions Understand declarations, blocks, and scope Learn how to construct a symbol

More information

Tradeoffs in Transactional Memory Virtualization

Tradeoffs in Transactional Memory Virtualization Tradeoffs in Transactional Memory Virtualization JaeWoong Chung Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi,, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun Computer Systems Lab Stanford

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Lecture 12 Transactional Memory

Lecture 12 Transactional Memory CSCI-UA.0480-010 Special Topics: Multicore Programming Lecture 12 Transactional Memory Christopher Mitchell, Ph.D. cmitchell@cs.nyu.edu http://z80.me Database Background Databases have successfully exploited

More information

Hierarchical Chubby: A Scalable, Distributed Locking Service

Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Concurrency Control. [R&G] Chapter 17 CS432 1

Concurrency Control. [R&G] Chapter 17 CS432 1 Concurrency Control [R&G] Chapter 17 CS432 1 Conflict Serializable Schedules Two schedules are conflict equivalent if: Involve the same actions of the same transactions Every pair of conflicting actions

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information