
Scalable and transparent parallelization of multiplayer games

by

Bogdan Simion

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2009 by Bogdan Simion

Abstract

Scalable and transparent parallelization of multiplayer games
Bogdan Simion
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2009

In this thesis, we study the parallelization of multiplayer games using Software Transactional Memory (STM) support. We show that STM provides not only ease of programming, but also better scalability than achievable with state-of-the-art lock-based programming for this realistic, high-impact application. We evaluate and compare two parallel implementations of a simplified version (named SynQuake) of the popular game Quake. While in STM SynQuake support for maintaining the consistency of each potentially complex game action is automatic, lock-based SynQuake inherently needs to conservatively lock the surrounding objects within a bounding box for the duration of the game action. This leads to higher scalability of STM SynQuake versus lock-based SynQuake, due to increased false sharing in the latter. Task assignment to threads has a second-order effect on the scalability of STM SynQuake, impacting the application's true sharing patterns. We show that a locality-aware task assignment provides the best trade-off between load balancing and conflict reduction.

Acknowledgements

First and foremost, I would like to offer my gratitude to my supervisor, Cristiana Amza, who has supported and guided me closely throughout my thesis research. I appreciate her patience and her knowledgeable, useful advice, and realize that without her assistance this thesis would not have been possible. I would also like to thank the members of my examination committee, Professor Angela Demke Brown and Professor Ashvin Goel, for all their help and assistance.

During my research program and throughout my graduate experience, I have been lucky to work with great colleagues, to whom I would like to address a special thank you, especially Daniel Lupei, for all the interesting discussions and late-night work we have done together. I would also like to thank my family for all their help, understanding and support, especially since the geographical distance separating us has made our reunions very rare in the past two years.

Finally, I would like to thank the Department of Electrical and Computer Engineering at the University of Toronto for their financial support and the facilities made available for my research.

Contents

1 Introduction
2 Application and Programming Environment
   2.1 Application Environment: SynQuake
      2.1.1 SynQuake game
      2.1.2 Game architecture and data structures
      2.1.3 Game server structure
   2.2 Programming environment
      2.2.1 libtm library interface
      2.2.2 Access tracking and conflict detection in libtm
3 Parallelization of SynQuake
   3.1 Synchronization issues for player actions: false sharing
   3.2 Synchronization algorithms for request processing
   3.3 True sharing patterns
   3.4 Load balancing in SynQuake
      Load balancing algorithms
4 Experimental Results
   4.1 Experimental setup
   4.2 STM vs Lock-based SynQuake - scaling and processing time
   4.3 STM vs Lock-based SynQuake processing time with physics computation
   The effect of load balancing on scaling
   Load balancing policies
   Influence of locality-awareness
   Quest scenarios
   Discussion
   Load balancing vs. synchronization
5 Related Work
6 Conclusion
Bibliography

List of Figures

1.1 Processing player move: Radius-based locking around player position
2.1 Screen shot of SynQuake. Brick-texture blocks represent walls / obstacles, while resources (food) are represented by apples
2.2 Areanode tree structure. Each node maintains a list of game entities fully contained in its corresponding game region. A game object may be located in a leaf if it is fully contained geometrically in the corresponding game map region, or located in the common ancestor of the leaves it overlaps
2.3 SynQuake: server frame structure and component stages. The server contains a request processing stage, an administrative stage and a reply stage. Server stages are separated by synchronization barriers represented by dotted lines
3.1 Areas of interest for a move followed by an attack. The figure depicts the areas of interest acquired with STM and locks, as well as the move and shoot ranges. Locks are more conservative, acquiring a larger area of interest, while STM gradually acquires ownership of the area that may be affected by the action
3.2 Pseudo-code for processing actions in SynQuake. a) shows the locks synchronization mechanism; b) shows the way TM acquires ownership of game objects affected by the action
3.3 Load Balancing Policies. The first policy assigns players to threads in a simple round-robin fashion. In the spread algorithm, the players located within a grid unit are assigned to the same thread. The locality-aware policy assigns a set of regions to the same thread, according to a given algorithm
4.1 Quest scenarios offering different levels of contention
4.2 Comparison between the STM-based and lock-based versions of SynQuake in different quest scenarios; a) shows the scalability of STM and Locks when increasing the number of threads from 1 to 4; b) shows the STM overheads compared to the Locks version, in terms of absolute processing time. Although STM scales better, the overheads of access tracking result in worse absolute times compared to the locks
4.3 Comparison between STM and Locks when physics computation is modeled. The physics computation hides the high STM overheads and thus improves performance for STM, which overcomes Locks in terms of absolute processing time when running with 2 and 4 threads
4.4 The effect of load balancing on scaling in STM vs Locks, for short and long range actions. A good load balancing scheme, such as spread and locality-aware, influences STM by reducing true sharing between concurrent threads. For locks, due to a higher degree of false sharing than in the case of STM, load balancing has almost no impact on performance
4.5 Load balancing policies. This picture depicts the performance of our load balancing algorithms in terms of total processing time, under several quest scenarios. In the first two scenarios, we show the benefit of using a locality-aware policy, both static and dynamic performing significantly better. In the third scenario, we show the shortcomings of a static policy when quest locations are not known beforehand, as opposed to a dynamic policy which detects and adapts to transient quest locations. In the last scenario, we present the performance of our algorithms when quests migrate randomly during gameplay, showing that the dynamic policy renders the best overall performance
4.6 Scenario with 4 quests located in the center of each map quadrant. This figure depicts the players flocking around 4 quest locations and the thread assignment when using the static and the dynamic locality-aware load balancing algorithms. Due to detecting player agglomerations dynamically, the assignment of the dynamic policy is identical to the manual ideal assignment of the static algorithm
4.7 Scenario with 4 quests located on the first major splits in the areanode tree. This figure depicts the players flocking around 4 quest locations and the thread assignment when using the static and the dynamic locality-aware load balancing algorithms. By detecting player agglomerations dynamically, the dynamic policy results in a better assignment than the static one
4.8 Scenario with 4 quests located in the south-eastern map quadrant. This figure depicts the players flocking around 4 quest locations located in a single map quadrant. This presents a worst-case scenario for the static assignment. Basically, when quest locations are not known from the beginning, the static policy results in a poor workload distribution, in this case degenerating into all the players getting assigned to the same thread. By dynamically detecting player agglomerations, the dynamic locality-aware policy does not experience this problem
4.9 Load balancing vs. synchronization. This figure shows the tradeoff between gaining more parallelism and introducing more synchronization. In the scenario with a quest centrally placed on the map, one player crowding is detected. If this component is assigned to multiple threads, the extra parallelism gained overcomes the extra synchronization required

List of Tables

4.1 This table presents the player distribution for the load balancing policies tested in Scenario 1. The players are guided equally towards four quests located in the center of the four map quadrants. The per-processor player quota is given as a percentage of the total number of players
4.2 This table presents the player distribution for the load balancing policies tested in Scenario 2. The players are guided equally towards four quests located in pairs on the first two major bisections in the areanode tree. The per-processor player quota is given as a percentage of the total number of players
4.3 This table presents the player distribution for the load balancing policies tested in Scenario 3. The players are guided towards four quests located in the south-west quadrant of the game map. The per-processor player quota is given as a percentage of the total number of players

Chapter 1. Introduction

In recent years, interactive multiplayer online games have emerged as an important application domain, influencing developments in computational technology from better hardware for graphics processing to better software and optimized algorithms. The popularity of these online games has been directly linked to their ability to support a large number of players. Handling a high volume of real-time interactions between players can prove to be a very difficult task. All player communication is generally conducted through messages containing requests or updates sent to or received from a central server. This exposes the game server as the main bottleneck.

Parallel computation techniques have been accepted as a way to alleviate the pressure on the server. Processing server work in parallel can significantly reduce the response time of game servers and thus increase the maximum number of players that can be handled, as well as ensure a better gaming experience on the client side. However, efficiently restructuring game code and providing guarantees for data consistency (e.g., synchronizing accesses to shared data) are hard to achieve, due to dynamic artifacts and code complexity. In the recent past, efforts directed towards the parallelization of multiplayer online games have been scarce. An attempt at parallelizing the popular first-person-shooter game

Quake [1] concluded that providing synchronization guarantees while at the same time achieving good scalability is a highly complex task, especially because of the conservative locking inherently required by the game processing.

In this dissertation, we study the parallelization of multiplayer online games using two different approaches: lock-based and Software Transactional Memory (STM)-based synchronization. Transactional Memory (TM) is an emerging alternative paradigm for parallel programming of generic applications, which promises to facilitate more efficient, programmer-friendly use of the plentiful parallelism available in chip multiprocessors and on cluster farms. The main idea is to simplify application programming in parallel and distributed environments through the use of transactions. TM allows transactions on different processors to manipulate shared in-memory data structures concurrently in an atomic and serializable, i.e., correct, manner. Many commercial and research prototypes for supporting TM in software have been introduced recently, including Intel's freely available STM compiler [8], RSTM [10], TL2 [6], etc.

However, STM uptake by the wider programming community has been slow for two main reasons. First, demonstrable efforts from the research or commercial communities towards parallelizing realistic applications using TM have been scarce. This is at odds with the claim that TM makes parallelization easy. Second, the existing STM-based efforts have shown that the software overheads of run-time transactional support do not currently allow efficient parallelization for any known application; STM-based parallelizations typically perform substantially worse than simple single-mutex-based implementations.

Our case study is the first realistic, high-impact application where TM support provides both ease of programming and better performance than that achievable with state-of-the-art lock-based programming. Specifically, we study parallelizing multiplayer online game server code on a game benchmark modeled after Quake. Towards this, we leverage

an existing software TM library, libtm [9], that can be used in conjunction with generic C or C++ programs.

Parallelization of multiplayer game code for the purpose of scaling the game server is inherently difficult. Game code is typically complex, and can include the use of spatial data structures for collision detection, as well as other dynamic artifacts that require conservative synchronization. The nature of the code may thus induce substantial contention due to false sharing, as well as true sharing, between threads in a parallel lock-based game implementation.

For example, substantial false sharing in both space and time can occur in a parallel implementation of the popular first-person shooter game Quake [1], as follows. Each Quake player action usually includes dynamically evolving sub-actions; a person may move while shifting items in their backpack, throwing an object at a distance, grabbing a nearby object, and/or shooting, which together constitute a single player action. Since the terrain within the potentially affected area may contain mutable objects, all sub-actions need to be processed together as an atomic, consistent unit for the purposes of collision detection with other player actions. In Quake, each player action is thus processed based on a bounding-box estimate, e.g., given as a sphere around the initial player position, for the possible range of the intended action, as shown in Figure 1.1. In a parallel lock-based server code implementation [1], this translates into eagerly acquiring ownership of all potentially affected objects of the game map within this bounding box before processing the action. This conservative locking induces unnecessary conflicts, by locking more objects than necessary and holding these locks for longer than needed. In contrast, with Transactional Memory support, a player action can be split into time-steps, or its constituent sub-actions, with collision detection performed dynamically, and hence more accurately, as the avatar encounters various obstacles which potentially change its direction of movement, as shown by the intermediate steps in Figure 1.1.

Figure 1.1: Processing player move: Radius-based locking around player position.

The atomicity and consistency of the whole player action are automatically provided by the underlying transactional support. STM support thus results in reduced false sharing overall, in both space and time.

Aside from false sharing, game code scalability is also orthogonally affected by the true sharing induced by player actions on objects located at the boundary between partitions of the game world assigned to different threads. Consequently, task-to-thread assignment for load balancing purposes needs to be done in a locality-aware fashion in the game world, thus reducing the number of boundaries and hence the potential conflicts on boundary objects.

In order to facilitate experimentation, and to study a range of game genres and in-game scenarios, we have developed an in-house game benchmark, called SynQuake. SynQuake is primarily based on the popular open-source Quake game, but it is a full-

fledged multiplayer game in its own right. SynQuake extracts representative features of many first-person shooter games, or of strategy games involving a mix of short-range and long-range interactions. SynQuake can be driven either by human players or by robot players, and can emulate synthetically generated quest scenarios, i.e., hot-spots in the game world, to facilitate experimental evaluation. We use the Quake 3 areanode tree as a standard game code spatial data structure facilitating the storage and retrieval of the location and attributes of game objects on the game map. While we paid close attention to accurately modeling realistic game server data structures and representative access patterns to these data structures, the graphics are simple and 2D, and game world physics computation is limited to 2D collision detection and path finding around obstacles. We argue that the complex physics computations present in many games would involve only non-transactional data accesses, and hence would hide the overheads of STM. Thus, with SynQuake, we showcase the worst-case scenario for the performance of STM-based versus lock-based parallelization of any multiplayer game.

Our detailed comparative performance evaluation shows that the main factor affecting overall scaling is the false sharing inherent in the parallelization scheme. Since STM reduces the degree of false sharing in both space and time, we experimentally show that the scaling of STM-based SynQuake from one to four server threads is better than the scaling of lock-based SynQuake in all game scenarios we study. We further show that the better scaling of STM results in better overall performance of STM-based SynQuake compared to lock-based SynQuake at four server threads in all scenarios involving a minimum of physics computation.

We explored several load balancing techniques, ranging from a policy that focuses entirely on equally distributing tasks among threads, while sacrificing locality, to an approach that minimizes true sharing but offers no guarantee with respect to load distribution. We observed that a locality-aware thread assignment policy is critical for minimizing true sharing.

Our performance evaluation also shows that the load balancing scheme makes for significant improvements in scalability only for STM-SynQuake. For a lock-based parallelization, due to the dominant impact of false sharing, decreasing the degree of true sharing through an appropriate load balancing scheme makes little or no difference. Finally, parallelization with STM automatically hides the details of game consistency maintenance across the boundaries of an underlying partitioned world, thus offering players the view of a large seamless world which could be hosted on parallel, distributed, or hybrid platforms, completely transparently.

In summary, our main contributions consist of analyzing the factors that affect the scaling and overheads of the STM-based and lock-based implementations of SynQuake. Specifically, we study the following factors: i) the effect of false sharing patterns, ii) the influence of load balancing on reducing true sharing patterns, and iii) the performance gain of a locality-aware adaptive load balancing policy, in terms of maximizing parallelism and minimizing synchronization. We show that the STM-based implementation outperforms the traditional lock-based parallelization for all realistic game scenarios analyzed. Analyzing the aspects enumerated above contributes substantially to providing support for scalable and transparent parallelization of multiplayer game servers.

In Chapter 2 we outline the design of our game benchmark, SynQuake, as well as its architectural features and relevant data structures. We then examine the synchronization and load balancing challenges of parallelizing SynQuake in Chapter 3. Chapter 4 presents the experimental results comparing the performance of the lock-based and STM-based versions of SynQuake, followed by a discussion of related work in Chapter 5. Chapter 6 concludes the dissertation.

Chapter 2. Application and Programming Environment

2.1 Application Environment: SynQuake

This chapter presents the design and implementation details of our SynQuake game. First, we describe the features of the SynQuake game, specifically the game world, the game entities (players, food items and walls), the game actions, and additional features such as quests. Our game simulates real-game features and player interactions. For testing purposes, we have additionally created the option of robot players, which perform actions according to a simple AI algorithm.

Second, we describe the game architecture and data structures. As in the famous Quake game, we use an areanode tree data structure for representing the game map and entities. This structure uses a binary space partitioning technique which facilitates quick retrieval of game objects based on their locations in the game world.

Further, we describe the SynQuake game server. Adopting the same model as Quake, we have implemented the SynQuake server as a succession of server frames, each frame decomposed into a set of stages separated by synchronization barriers.

Each stage has a specific role (e.g., receiving and processing client requests, forming and sending replies back, etc.). The stages which constitute a significant part of the processing time require parallelization. We shall discuss this in detail in the next chapter.

2.1.1 SynQuake game

Multiplayer Online Games (MOGs) are not only difficult to build, but also very expensive to maintain and administer. Except in isolated cases, such as the well-known first-person shooter game Quake, popular MOGs do not share their server code. Furthermore, only a few game scenarios are available with Quake 3, and the game map used in a previous study on Quake parallelization [1] is too small, resulting in excessive player crowding when scaling the player population to drive a larger number of server threads. To solve these problems, we have built a simplified version of Quake, called SynQuake.

SynQuake models three types of game entities: players, resources (represented by apples) and walls. A typical game map for SynQuake is presented in Figure 2.1. The game map is created offline and does not change during gameplay.

Figure 2.1: Screen shot of SynQuake. Brick-texture blocks represent walls / obstacles, while resources (food) are represented by apples.

Each game entity is defined by its position on the game map and by a set of attributes specific to its type. For example, besides its position on the game map, a player is described by its life (or health) level, its speed and its direction. Food items (apples) are described by their units of nutrition. A player can take only a limited amount of food at a time, within a short range around the player; eating food increases the player's health level, which has the visual result of increasing the player's stature/size in the game, depending on the number of nutrition units in the food item (apple). Once the food in one location is gone, it can be regenerated, but it is likely to appear in a different location. The player with the larger stature/size, i.e., who has eaten more food, wins a fight between two players.

Players are mutable game entities that can have both their position and their attributes modified as a result of game interactions. For example, an attack decreases a player's life, while consuming a resource increases it. Resources are partly mutable, e.g., apples can have their attributes affected by game play, but not their position, while walls are immutable entities. As a result, each of these game objects requires a different level of synchronization in parallel SynQuake, from full synchronization protection for players to no protection for walls.

As part of game play, players perform short-range actions, such as moving, consuming resources and fighting with other players, or long-range actions, such as simultaneously moving and shooting. Each of these actions can cause conflicts between different threads processing player actions concurrently. For example, conflicts occur when two players try to move to the same spot, or when one player gets attacked while consuming a resource.
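To make the entity model above concrete, the sketch below shows one plausible way to represent the three entity types and the attributes listed here; the type and field names are illustrative assumptions, not SynQuake's actual declarations.

// Illustrative sketch of the SynQuake entity model described above.
// All identifiers are hypothetical; only the attributes follow the text.
struct Position { float x, y; };

enum class EntityKind { Player, Resource, Wall };

struct Entity {
    EntityKind kind;
    Position   pos;          // every entity has a position on the game map
};

struct Player : Entity {     // fully mutable: position and attributes change
    int   life;              // health level; also determines stature/size
    float speed;
    float direction;
    int   owner_thread;      // thread currently assigned to process this player
};

struct Resource : Entity {   // partly mutable: attributes change, position does not
    int nutrition_units;     // eating increases the player's life/stature
};

struct Wall : Entity {};     // immutable: requires no synchronization

The owner_thread field is not mentioned in the text; it is added here only so that the load-balancing sketch in Chapter 3 has somewhere to record thread assignments.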

To simulate areas of high interest in the game, and the associated pattern of players flocking to a particular area of the map, we have added quests, which attract players towards that area with high probability. These correspond to the standard areas attracting players in Quake 3, and also in strategy games, such as a camp site, a weaponry location, or a health area. Complex game scenarios can be recreated in SynQuake by varying the distribution of quests in time and space.

The SynQuake game server thus captures all the basic interactions in multiplayer games, and accurately represents the network and system features of a game server. Furthermore, we can emulate various game genres by varying the size and number of objects, as well as the update frequency. In order to replicate massive multiplayer scenarios, players can be driven by a simple AI algorithm that has players moving with high probability towards a quest, if one is present, eating if hungry, fighting with other players, or fleeing if chased by a stronger opponent.
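The robot-player behaviour just described amounts to a small priority-ordered decision routine; the sketch below is only one plausible rendering of it, with the World type, every helper function and the probability value assumed.

// Hypothetical sketch of the simple robot AI described above: head for a
// quest with high probability, otherwise eat, fight or flee.
void decide_action(Player& p, World& w) {
    if (w.has_quest() && with_probability(0.8))   // probability value assumed
        move_towards(p, w.nearest_quest(p));
    else if (is_hungry(p) && w.food_nearby(p))
        eat_nearest_food(p, w);
    else if (Player* opponent = w.nearby_opponent(p)) {
        if (opponent->life > p.life)
            flee_from(p, *opponent);              // chased by a stronger player
        else
            attack(p, *opponent);
    } else {
        wander(p);
    }
}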

2.1.2 Game architecture and data structures

Our server stores a representation of the game world and all the game entities. Since it is critical for the server to quickly generate the list of entities that a player interacts with in any given action, we need an efficient representation of the game map. Therefore, similar to Quake [1], in order to facilitate an efficient search for all entities involved in a particular action, we represent the virtual game map using a spatial data structure called the areanode tree (see Figure 2.2b).

Figure 2.2: Areanode tree structure: a) game map, b) areanode tree. Each node maintains a list of game entities fully contained in its corresponding game region. A game object may be located in a leaf if it is fully contained geometrically in the corresponding game map region, or in the common ancestor of the leaves it overlaps.

The areanode tree is a binary space partitioning tree, where each node represents a specific region of the game map. The tree is constructed by recursively dividing the map into sub-units, starting with the root node, which corresponds to the entire game world. Nodes on subsequent levels of the tree are created by splitting the region corresponding to the parent node along its median segment. The splits are performed alternately along the x and y axes, until a predefined tree depth (or the maximum possible split granularity) is reached. The leaves of the areanode tree form a grid of equal-sized regions on the game map. For the rest of this dissertation, we will use the terms grid unit and tree leaf interchangeably.
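A minimal sketch of the recursive construction just described is shown below, splitting alternately along the x and y axes down to a fixed depth; the node layout and all names are assumptions rather than the actual Quake/SynQuake data structures, and the Entity type is the one from the earlier sketch.

// Sketch: building an areanode tree by recursive binary space partitioning.
// Splits alternate between the x and y axes until max_depth is reached.
#include <list>

struct Rect { float min_x, min_y, max_x, max_y; };

struct AreaNode {
    Rect               region;
    AreaNode*          child[2] = {nullptr, nullptr};  // null in leaves (grid units)
    std::list<Entity*> entities;      // entities fully contained in this region
};

AreaNode* build_tree(const Rect& region, int depth, int max_depth) {
    AreaNode* node = new AreaNode{region};
    if (depth == max_depth)
        return node;                  // leaf: one grid unit of the map

    Rect lower = region, upper = region;
    if (depth % 2 == 0) {             // even depth: split along the x axis
        float mid = (region.min_x + region.max_x) / 2;
        lower.max_x = upper.min_x = mid;
    } else {                          // odd depth: split along the y axis
        float mid = (region.min_y + region.max_y) / 2;
        lower.max_y = upper.min_y = mid;
    }
    node->child[0] = build_tree(lower, depth + 1, max_depth);
    node->child[1] = build_tree(upper, depth + 1, max_depth);
    return node;
}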

Each entity on the map is maintained inside the finest-grained tree node whose corresponding region completely overlaps it. This translates into leaf nodes maintaining the objects that are fully contained inside grid units, while the rest of the entities are placed in the common ancestor of all the grid units they overlap. That is, each leaf in the areanode tree maintains a list of entities fully contained inside the game map area represented by the leaf, while game entities overlapping two or more regions are placed in a list maintained in a unique parent node of the overlapped leaves. This parent node can be one level up (if the entity overlaps its immediate descendants) or several levels higher (the closest common ancestor of the crossed leaves). For example, an entity crossing the first split segment of the game world will be placed in the root node.
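The placement rule above (each entity lives in the finest node whose region fully contains it) can be sketched as a short recursive descent over the same hypothetical structures used in the previous sketches; an entity's bounding box is passed in explicitly.

// Sketch: insert an entity into the finest-grained node that fully contains
// its bounding box; otherwise it stays at the closest common ancestor.
bool fully_contains(const Rect& region, const Rect& box) {
    return box.min_x >= region.min_x && box.max_x <= region.max_x &&
           box.min_y >= region.min_y && box.max_y <= region.max_y;
}

void insert_entity(AreaNode* node, Entity* e, const Rect& box) {
    for (AreaNode* child : node->child) {
        if (child != nullptr && fully_contains(child->region, box)) {
            insert_entity(child, e, box);   // descend into the finer region
            return;
        }
    }
    // No child fully contains the entity (it crosses a split, or this is a
    // leaf), so keep it in this node's entity list.
    node->entities.push_back(e);
}

An entity that crosses the very first split, for instance, fails the containment test for both children of the root and therefore ends up in the root's list, matching the example above.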

Figure 2.2 shows a map for which game objects are indexed using a two-level areanode tree. Figure 2.2a shows the resulting grid units corresponding to each leaf, while Figure 2.2b presents the tree structure resulting from splitting the root node vertically and its children horizontally. The map is populated with three objects: one completely contained inside leaf B2, and correspondingly placed in B2's list inside the tree; a second one found on the border between grid units A1 and A2, and as a result maintained in A's list; and a third located on the border between A1 and B1 and kept inside the root node, since the root's region is the smallest one that completely overlaps it.

The areanode splits are not in any way related to particular features of the game world: they are not performed along walls, nor are they chosen using specific knowledge of the game map (e.g., quest locations, resource locations). Since entities migrate during game play (e.g., players move, apples get respawned after being consumed), the areanode tree must be updated accordingly in order to reflect the new position of each game entity. This accounts for the most significant source of contention among processing threads when parallelizing the game server.

2.1.3 Game server structure

In this section, we describe the structure of the SynQuake game server code and its components. In the original parallel Quake server code, server processing consists of three stages: world physics update, request processing and reply processing. Following the same design, the SynQuake server loops through three stages which form a server frame, or iteration. These stages are: request processing, administrative tasks and reply processing (see Figure 2.3).

Figure 2.3: SynQuake: server frame structure and component stages (1: Receive & Process Requests; 2: Admin, single thread; 3: Form & Send Replies). The server contains a request processing stage, an administrative stage and a reply stage. Server stages are separated by synchronization barriers, represented by dotted lines.

In the first stage, the server receives requests from clients (players), distributes them to the worker threads according to a load balancing policy, and processes them in parallel. The novelty in our implementation is the admin stage: here we rebalance the workload by reassigning players to threads where needed, since the game configuration changes after each set of player action requests. Because this stage sits between receiving requests and sending replies, we ensure that the time spent on load balancing is short enough that no lag is experienced on the client side. Finally, in the reply processing stage, the server threads send updates to their assigned clients. After all threads reach the final synchronization barrier, a new server frame starts.
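As a rough skeleton of the frame structure in Figure 2.3, each worker thread could loop over the three stages with pthread barriers between them, the admin stage being executed by a single thread; the stage bodies below are placeholders, not the actual server code.

// Skeleton of one SynQuake-style server frame per worker thread.
// The three stage functions are placeholders for the real server logic.
#include <pthread.h>

pthread_barrier_t stage_barrier;          // initialized for NUM_THREADS elsewhere

void* worker_loop(void* arg) {
    long tid = (long)arg;
    for (;;) {
        process_assigned_requests(tid);             // stage 1: in parallel
        pthread_barrier_wait(&stage_barrier);

        if (tid == 0)                               // stage 2: admin, single thread
            rebalance_players_across_threads();
        pthread_barrier_wait(&stage_barrier);

        send_replies_to_assigned_clients(tid);      // stage 3: read-only replies
        pthread_barrier_wait(&stage_barrier);       // frame ends; next one starts
    }
    return nullptr;
}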

Similar to the existing parallel implementation of Quake [1], parallelization in SynQuake is performed at the granularity of individual stages, while enforcing serial execution of consecutive stages through synchronization barriers (Figure 2.3). We have imposed the same invariants as in parallel Quake: a) each server frame is distinct and does not overlap with other server frames, and b) each server frame executes its component stages in the original order. This ensures consistency and avoids any possible complications due to phase reordering or overlapping. While these constraints may be relaxed, this problem is beyond the scope of this work.

Synchronization is not necessary during the read-only reply stage, since its execution does not overlap any of the other stages. The administrative phase does not require any synchronization either, since its tasks, e.g., load balancing, are executed by a single thread. Consequently, the stage handling client requests is the only stage requiring protection against concurrent accesses to shared data. When leveraging parallelism in this stage, we must ensure data consistency by using synchronization methods. The standard synchronization method is to acquire locks for shared data until the current owner thread finishes its operation and exits the critical section. An alternative to lock-based synchronization is a transactional memory runtime system. We use a software transactional memory library called libtm, which makes it possible to detect data races and ensures correct execution of the parallel game server. In the next section, we present details regarding both programming environments.

2.2 Programming environment

We use the standard pthreads interface for our lock-based parallelization, and an existing Software Transactional Memory library, libtm, for the STM-based parallelization of SynQuake. In our parallel SynQuake, the number of threads is constant throughout all server execution phases and matches the number of processors available. Each thread is assigned to a separate processor, in order to avoid cache invalidations; we use the sched_setaffinity call to bind the execution of a given thread to a particular processor using the appropriate bitmask. The worker threads are spawned at the server initialization stage, and inter-thread communication is achieved using a global shared memory model. We have implemented two versions of the game: one using locks to synchronize shared memory accesses, and a second using an STM library for memory access tracking.
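For reference, the thread-to-processor binding mentioned above can be done with a few lines around sched_setaffinity on Linux; the wrapper below is a minimal sketch (error handling omitted).

// Pin the calling thread to a given CPU using sched_setaffinity (Linux).
#define _GNU_SOURCE
#include <sched.h>

void pin_to_cpu(int cpu_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_id, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);   // pid 0 means the calling thread
}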

The STM allows transactions on different processors to manipulate shared in-memory data structures concurrently in a data-race-free manner. Instead of explicit fine-grained locking of data items, the programmer specifies the beginning and end of parallel regions with transaction delimiters. Runtime support provided by the library automatically detects data races between concurrent transactions and ensures correct parallel execution. Any detected incorrect execution resulting from a data race causes one or more transactions to be rolled back and restarted. The run-time system automatically detects which memory regions are read and written by a transaction, and maintains the recoverability of data for the written ranges of memory.

2.2.1 libtm library interface

In a TM program supported by the libtm library, transactions need to be delineated with begin_transaction and commit_transaction statements. Furthermore, shared data needs to be distinguished from private per-thread data accessed inside transactions. For this purpose, transactional shared and private variables should be declared using the meta-types tm_shared and tm_private, respectively. For example, a shared variable int x in the original program needs to be declared as tm_shared<int> x. Each of these meta-types is defined in our library as a C++ template that takes the original type of the variable as a parameter (e.g., tm_shared<original_type>).

Uses of Type Declarations in libtm: Declarations of tm types are used in libtm for run-time access tracking through operator overloading. Specifically, any read or write accesses to shared and private transactional variables are tracked inside the implementation of the overloaded conversion or assignment operators. Furthermore, libtm maintains recovery data for both tm_shared and tm_private variables updated in transactions, while performing conflict detection and resolution only for tm_shared variables.
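As an illustration of this interface, the fragment below declares a shared variable with the tm_shared meta-type and updates it inside a transaction. The header name is an assumption, and the delimiter spelling follows the BEGIN_TRANSACTION()/COMMIT_TRANSACTION() construct mentioned in Chapter 3 rather than libtm's verbatim API.

// Sketch of libtm-style usage following the description above; the header
// name and macro spellings are assumptions, not libtm's exact interface.
#include "libtm.h"                     // hypothetical header

tm_shared<int> player_life;            // shared: tracked, conflict-checked, recoverable

void consume_food(int nutrition_units) {
    tm_private<int> new_life;          // private per-thread, still made recoverable
    BEGIN_TRANSACTION();
    // Reads and writes of tm_* variables are intercepted by the overloaded
    // conversion and assignment operators and logged by the runtime.
    new_life    = player_life + nutrition_units;
    player_life = new_life;
    COMMIT_TRANSACTION();
}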

2.2.2 Access tracking and conflict detection in libtm

Conflicts are detected based on the meta-data information encapsulated in each tm_shared object. While this layout allows access tracking to take place at word-level granularity, libtm also provides the capability of varying the granularity of access tracking dynamically at runtime. This is achieved through a redirection mechanism that can remap a tm_shared variable to any meta-data object indicated by the programmer. By remapping several semantically linked tm_shared variables to the same meta-data object, we can effectively coarsen the granularity of access tracking, and hence reduce the bookkeeping overheads inside the library. This approach offers several benefits compared to a strict object-based granularity. First, remapping can take place at runtime and can factor in semantic aspects that may vary over time. Second, it can handle dynamic data structures, as well as variables that are not co-located in memory, whereas objects need to be statically determined at compile time.

The conflict detection mechanism used by libtm is a variation on the classic two-phase locking concurrency control algorithm, resulting in a blocking implementation of transactional memory semantics. libtm uses an optimistic approach for detecting conflicts between transactions: it acquires ownership of written data at commit time, while allowing readers access to the old values until then. Any read-write conflicts detected at commit time are resolved by aborting the conflicting reader transactions.
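To illustrate the flavour of this scheme (not libtm's actual data structures or algorithm), the following conceptual sketch shows a writer acquiring ownership of its write set at commit time and aborting any transactions that read the old values.

// Conceptual sketch only: commit-time ownership acquisition with reader
// aborts, in the spirit of the scheme described above. Real implementations
// also need a global lock order (or deadlock avoidance) and safe read-set
// bookkeeping, which are omitted here.
#include <mutex>
#include <set>
#include <vector>

struct Transaction;

struct MetaData {
    std::mutex             owner_lock;       // taken by the committing writer
    std::set<Transaction*> active_readers;   // transactions reading the old value
};

struct WriteEntry {
    MetaData* meta;        // meta-data object guarding the written location
    int*      location;    // written memory word
    int       new_value;   // buffered value, not yet visible to readers
};

struct Transaction {
    std::vector<WriteEntry> write_set;
    void abort_and_restart();                // roll back and re-execute (not shown)
};

void commit(Transaction& tx) {
    // 1. Acquire ownership of all written data at commit time.
    for (WriteEntry& w : tx.write_set)
        w.meta->owner_lock.lock();

    // 2. Read-write conflicts are resolved by aborting the conflicting readers.
    for (WriteEntry& w : tx.write_set)
        for (Transaction* reader : w.meta->active_readers)
            if (reader != &tx)
                reader->abort_and_restart();

    // 3. Install the buffered values and release ownership.
    for (WriteEntry& w : tx.write_set) {
        *w.location = w.new_value;
        w.meta->owner_lock.unlock();
    }
}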

In the next chapter, we discuss the parallelization-related challenges we encountered in SynQuake in these two programming environments.

Chapter 3. Parallelization of SynQuake

We present two alternative parallelization techniques for SynQuake: a parallel version using lock-based synchronization and a parallel version leveraging a transactional memory runtime system. We use the software transactional memory library libtm, presented in Section 2.2, which enables the detection of conflicts on shared memory accesses and guarantees correct execution of the game server code.

In the following, we describe the parallelization of the request stage in detail, including the parallelization challenges and our strategy for both a lock-based implementation and a transactional memory implementation. In terms of parallelization challenges, we discuss in detail the false sharing effects, which are significant in the lock-based version because of the conservative locking needed when performing collision detection, and reduced in the case of the STM approach. We compare the two parallel game code implementations, showing that STM facilitates the programmer's task by using simple constructs to mark critical sections, without the need to worry about problems such as deadlocks. We then present the importance of load balancing in reducing true sharing patterns and thus minimizing the synchronization occurring among processing threads. Furthermore, we stress the importance of locality-awareness in balancing the workload, and introduce a set of load balancing algorithms, including a new dynamic locality-aware policy.

3.1 Synchronization issues for player actions: false sharing

For processing a player action, the server needs to perform collision detection against all game objects intersecting the player's trajectory. Since the avatar's direction can be altered by collisions with entities situated in its path, the avatar's trajectory and its final position are impossible to predict at the beginning of the action. However, the whole action and its effects on the game world need to appear to the players as a consistent, atomic unit. Therefore, processing a client request in Quake consists of: i) computing the potential area of the game map impacted by the action, which we will call the area of interest of the action, and then ii) performing the game action, by determining its effects upon game entities.

This pre-existing request processing scheme leads to a potentially significant reduction in the degree of false sharing for a TM-based versus a lock-based game parallelization scheme, in terms of both i) the number of objects involved in collision detection and ii) the duration of the potential conflicts for these objects, as we explain in the following.

In a lock-based implementation, the entire area of interest of the action needs to be conservatively locked for the duration of all processing related to the action. For example, let's consider the scenario in Figure 3.1, where the player executes a shoot-after-move compound action. Since the player may stop short of its intended destination, or change direction during the move, e.g., due to encountering an obstacle, the long-range (shooting) interaction may occur at any point in time during the move itself. Hence, the lock-based implementation needs to compute an area of interest, and conservatively lock all objects corresponding to the long-range interaction at all possible points of the player's trajectory.

Figure 3.1: Areas of interest for a move followed by an attack. The figure depicts the areas of interest acquired with STM and with locks, as well as the move and shoot ranges. Locks are more conservative, acquiring a larger area of interest, while STM gradually acquires ownership of the area that may be affected by the action.

This could incur substantial false sharing between threads, in both space and time. In contrast, a TM-based implementation implicitly acquires access to objects gradually, as the server progresses through the execution of the action. The move can thus be decomposed into sub-actions, and collision detection can be performed for each sub-action as it dynamically happens, resulting in a substantially smaller bounding box for the overall action, as shown in Figure 3.1, as well as a shorter total time of protected access to the objects involved.
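As a small illustration of how conservative the lock-based estimate has to be, the sketch below computes one rectangle covering every point of the intended move expanded by the shooting range, which is what would be locked up front; the geometry is an assumption and the helper types reuse the hypothetical Rect/Position structures from Chapter 2.

// Sketch: conservative area of interest for a shoot-after-move action.
// The whole rectangle is locked up front in the lock-based version, even
// though the player may stop or turn anywhere along the move.
#include <algorithm>

Rect area_of_interest_with_locks(const Position& start,
                                 const Position& intended_dest,
                                 float shoot_range) {
    Rect box;
    box.min_x = std::min(start.x, intended_dest.x) - shoot_range;
    box.max_x = std::max(start.x, intended_dest.x) + shoot_range;
    box.min_y = std::min(start.y, intended_dest.y) - shoot_range;
    box.max_y = std::max(start.y, intended_dest.y) + shoot_range;
    return box;
}
// The TM version instead computes a much smaller box per sub-action (one move
// step, or one shot from the position actually reached) as the action unfolds.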

3.2 Synchronization algorithms for request processing

In the following, we describe the synchronization protocol used in the processing stage of the server. We present our two designs for maintaining the consistency of the game map: lock-based and STM-based. The pseudocode for both versions is included in Figure 3.2.

In the lock-based version (Figure 3.2a), we first compute the area of interest corresponding to the currently executing action. Ownership of this entire area of interest is then acquired by locking all the areanode tree leaves overlapping it, in a predefined order, such that deadlocks are avoided. For simplicity, this ordering is provided by a depth-first traversal of the tree. Next, we process all entities overlapping the area of interest that are currently maintained at the leaf level of the areanode tree. For entities maintained at higher levels within the tree, i.e., entities intersecting multiple grid units, we need to additionally lock their corresponding (higher-level) nodes. Locking these higher-level nodes is necessary in order to maintain the integrity of their entity lists; these lists may be modified concurrently by other player actions accessing objects co-located in the parent node but outside our area of interest. Since parent nodes are sensitive to contention, their locks are released as soon as the processing of their entities is complete. Finally, we relinquish ownership of the area of interest by releasing the locked leaves.

Figure 3.2b presents the transactional version of the synchronization protocol for action processing. As before, we first compute the area of interest associated with the current action. We then process all entities located within both the leaves and the parent nodes overlapping the area of interest, with consistency being seamlessly ensured by the underlying TM library. For the transactional approach, the programmer only has to mark the critical section

with a BEGIN_TRANSACTION() - COMMIT_TRANSACTION() construct, without having to worry about any integrity issues related to game play or to the data structures used. As a result, the STM-based approach offers a simpler programming interface, transparently tracking memory conflicts on shared data and ensuring correct execution by aborting and rolling back the conflicting transactions.

/* Algorithm: Action processing in SynQuake using locks */
function processAction( actionPlan, player )
{
    // Compute expanded area of interest for the entire action plan
    expanded_range = computeExpandedAreaOfInterest( actionPlan, player );

    /* Lock expanded area of interest */
    // Get areanode leaves overlapping the expanded area of interest
    expandedLeavesSet = getOverlappingLeaves( area_tree, expanded_range );
    foreach leaf in expandedLeavesSet
        Lock( leaf );

    foreach sub-action in actionPlan
    {
        // Get areanode leaves overlapping the sub-action's area of interest
        range = computeAreaOfInterest( sub-action, player );
        leavesSet = getOverlappingLeaves( area_tree, range );

        /* Processing sub-action in leaf nodes */
        foreach leaf in leavesSet
            foreach entity in leaf.entitySet
                perform( sub-action, entity );

        /* Processing sub-action in parent nodes */
        // Get areanode parents overlapping the area of interest
        parentsSet = getOverlappingParents( area_tree, range );
        foreach parent in parentsSet
        {
            // Temporarily lock parent
            Lock( parent );
            foreach entity in parent.entitySet
                perform( sub-action, entity );
            Unlock( parent );
        }
    }

    /* Unlock expanded area of interest */
    foreach leaf in expandedLeavesSet
        Unlock( leaf );
}

(a) Lock-based version

/* Algorithm: Action processing in SynQuake using TM */
function processAction( actionPlan, player )
{
    BEGIN_TRANSACTION();

    foreach sub-action in actionPlan
    {
        // Compute area of interest for the current sub-action
        range = computeAreaOfInterest( sub-action, player );

        /* Processing sub-action in leaf nodes */
        // Get areanode leaves overlapping the area of interest
        leavesSet = getOverlappingLeaves( area_tree, range );
        foreach leaf in leavesSet
            foreach entity in leaf.entitySet
                perform( sub-action, entity );

        /* Processing sub-action in parent nodes */
        // Get areanode parents overlapping the area of interest
        parentsSet = getOverlappingParents( area_tree, range );
        foreach parent in parentsSet
            foreach entity in parent.entitySet
                perform( sub-action, entity );
    }

    END_TRANSACTION();
}

(b) STM-based version

Figure 3.2: Pseudo-code for processing actions in SynQuake. a) shows the lock-based synchronization mechanism; b) shows the way TM acquires ownership of the game objects affected by the action.

In summary, in this chapter we presented the synchronization issues and the advantage of the STM-based implementation in terms of reduced false sharing, which results from acquiring ownership of the action bounding box on the fly. Additionally, the STM library facilitates programming by providing a runtime system which seamlessly guarantees correctness and consistency.

3.3 True sharing patterns

While concurrency in multiplayer game servers can be severely limited as a result of false sharing, true sharing patterns can also degrade application performance. Since two different threads may handle players performing actions that affect the same game entities, or players interacting directly with one another, true sharing may occur on entities located at the boundary of thread assignments. Therefore, to reduce the synchronization costs resulting from true sharing, the load balancing policy should take into consideration the spatial locality of players in the game.

True sharing can thus be mitigated by using locality-aware thread-assignment policies, which try to allocate the processing of entities close to each other to the same thread. However, using locality as the sole criterion in assigning tasks can overload threads processing highly populated areas. Thread overload results in increased response times and a degraded user experience. Moreover, load imbalance results in idle time at barriers for underloaded threads. Consequently, the load balancing policy should achieve a good compromise between balanced load and reduced synchronization.

3.4 Load balancing in SynQuake

We analyzed the tradeoff between balancing load and reducing synchronization among server threads across three policies. When optimizing for an equal distribution of the workload, a naive load balancing policy assigns players to processing threads in a round-robin fashion. However, since this offers very poor spatial locality, players situated next to one another could be handled by different threads (see Figure 3.3a), causing high levels of true sharing. In order to leverage the spatial locality of players, load balancing needs to be performed at a coarser granularity than individual players. This is the case for the spread policy, which achieves uniform load across threads by mapping entire grid units to specific threads.
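To make the first two policies concrete, the sketch below contrasts pure round-robin assignment of players with assigning whole grid units (areanode leaves) to threads; it reuses the hypothetical structures from Chapter 2, and the real spread policy additionally balances the per-thread player counts, which is omitted here.

// Sketch of the two simpler assignment policies described above (simplified).
#include <vector>

// Round-robin: perfectly even counts, but neighbouring players may end up on
// different threads, causing heavy true sharing.
void assign_round_robin(std::vector<Player*>& players, int num_threads) {
    for (size_t i = 0; i < players.size(); i++)
        players[i]->owner_thread = (int)(i % num_threads);
}

// Spread-style: all players inside the same grid unit (areanode leaf) go to
// the same thread, preserving spatial locality at grid-unit granularity.
void assign_by_grid_unit(std::vector<AreaNode*>& leaves, int num_threads) {
    int next = 0;
    for (AreaNode* leaf : leaves) {
        for (Entity* e : leaf->entities)
            if (e->kind == EntityKind::Player)
                static_cast<Player*>(e)->owner_thread = next;
        next = (next + 1) % num_threads;   // real policy also balances counts
    }
}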


Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems

NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems Haris Volos 1, Adam Welc 2, Ali-Reza Adl-Tabatabai 2, Tatiana Shpeisman 2, Xinmin Tian 2, and Ravi Narayanaswamy

More information

Monitors; Software Transactional Memory

Monitors; Software Transactional Memory Monitors; Software Transactional Memory Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 18, 2012 CPD (DEI / IST) Parallel and

More information

A Practical Scalable Distributed B-Tree

A Practical Scalable Distributed B-Tree A Practical Scalable Distributed B-Tree CS 848 Paper Presentation Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah PVLDB 08 March 8, 2010 Presenter: Evguenia (Elmi) Eflov Presentation Outline 1 Background

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

An Automated Framework for Decomposing Memory Transactions to Exploit Partial Rollback

An Automated Framework for Decomposing Memory Transactions to Exploit Partial Rollback An Automated Framework for Decomposing Memory Transactions to Exploit Partial Rollback Aditya Dhoke Virginia Tech adityad@vt.edu Roberto Palmieri Virginia Tech robertop@vt.edu Binoy Ravindran Virginia

More information

The goal of the Pangaea project, as we stated it in the introduction, was to show that

The goal of the Pangaea project, as we stated it in the introduction, was to show that Chapter 5 Conclusions This chapter serves two purposes. We will summarize and critically evaluate the achievements of the Pangaea project in section 5.1. Based on this, we will then open up our perspective

More information

Concurrency Control CHAPTER 17 SINA MERAJI

Concurrency Control CHAPTER 17 SINA MERAJI Concurrency Control CHAPTER 17 SINA MERAJI Announcement Sign up for final project presentations here: https://docs.google.com/spreadsheets/d/1gspkvcdn4an3j3jgtvduaqm _x4yzsh_jxhegk38-n3k/edit#gid=0 Deadline

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains: The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Data-Race Detection in Transactions- Everywhere Parallel Programming

Data-Race Detection in Transactions- Everywhere Parallel Programming Data-Race Detection in Transactions- Everywhere Parallel Programming by Kai Huang B.S. Computer Science and Engineering, B.S. Mathematics Massachusetts Institute of Technology, June 2002 Submitted to the

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items.

Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items. Graph-based protocols are an alternative to two-phase locking Impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items. If d i d j then any transaction accessing both d i and d j

More information

Concurrency Control. Transaction Management. Lost Update Problem. Need for Concurrency Control. Concurrency control

Concurrency Control. Transaction Management. Lost Update Problem. Need for Concurrency Control. Concurrency control Concurrency Control Process of managing simultaneous operations on the database without having them interfere with one another. Transaction Management Concurrency control Connolly & Begg. Chapter 19. Third

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

IN5050: Programming heterogeneous multi-core processors Thinking Parallel

IN5050: Programming heterogeneous multi-core processors Thinking Parallel IN5050: Programming heterogeneous multi-core processors Thinking Parallel 28/8-2018 Designing and Building Parallel Programs Ian Foster s framework proposal develop intuition as to what constitutes a good

More information

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads.

Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Executive Summary. It is important for a Java Programmer to understand the power and limitations of concurrent programming in Java using threads. Poor co-ordination that exists in threads on JVM is bottleneck

More information

Conflict Detection and Validation Strategies for Software Transactional Memory

Conflict Detection and Validation Strategies for Software Transactional Memory Conflict Detection and Validation Strategies for Software Transactional Memory Michael F. Spear, Virendra J. Marathe, William N. Scherer III, and Michael L. Scott University of Rochester www.cs.rochester.edu/research/synchronization/

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section

More information

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

More information

ASSIGNMENT- I Topic: Functional Modeling, System Design, Object Design. Submitted by, Roll Numbers:-49-70

ASSIGNMENT- I Topic: Functional Modeling, System Design, Object Design. Submitted by, Roll Numbers:-49-70 ASSIGNMENT- I Topic: Functional Modeling, System Design, Object Design Submitted by, Roll Numbers:-49-70 Functional Models The functional model specifies the results of a computation without specifying

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1

Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1 Distributed Systems (ICE 601) Transactions & Concurrency Control - Part1 Dongman Lee ICU Class Overview Transactions Why Concurrency Control Concurrency Control Protocols pessimistic optimistic time-based

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper Profile of CopperEye Indexing Technology A CopperEye Technical White Paper September 2004 Introduction CopperEye s has developed a new general-purpose data indexing technology that out-performs conventional

More information

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9 Transactions Kathleen Durant PhD Northeastern University CS3200 Lesson 9 1 Outline for the day The definition of a transaction Benefits provided What they look like in SQL Scheduling Transactions Serializability

More information

Point Cloud Filtering using Ray Casting by Eric Jensen 2012 The Basic Methodology

Point Cloud Filtering using Ray Casting by Eric Jensen 2012 The Basic Methodology Point Cloud Filtering using Ray Casting by Eric Jensen 01 The Basic Methodology Ray tracing in standard graphics study is a method of following the path of a photon from the light source to the camera,

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Research Collection. Cluster-Computing and Parallelization for the Multi-Dimensional PH-Index. Master Thesis. ETH Library

Research Collection. Cluster-Computing and Parallelization for the Multi-Dimensional PH-Index. Master Thesis. ETH Library Research Collection Master Thesis Cluster-Computing and Parallelization for the Multi-Dimensional PH-Index Author(s): Vancea, Bogdan Aure Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010437712

More information

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation

Lecture 7: Transactional Memory Intro. Topics: introduction to transactional memory, lazy implementation Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, lazy implementation 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Shared Virtual Memory. Programming Models

Shared Virtual Memory. Programming Models Shared Virtual Memory Arvind Krishnamurthy Fall 2004 Programming Models Shared memory model Collection of threads Sharing the same address space Reads/writes on shared address space visible to all other

More information

Tradeoff Evaluation. Comparison between CB-R and CB-A as they only differ for this aspect.

Tradeoff Evaluation. Comparison between CB-R and CB-A as they only differ for this aspect. Tradeoff Evaluation Comparison between C2PL and CB-A, as both: Allow intertransaction caching Don t use propagation Synchronously activate consistency actions Tradeoff Evaluation Comparison between CB-R

More information

Concurrency Control. R &G - Chapter 19

Concurrency Control. R &G - Chapter 19 Concurrency Control R &G - Chapter 19 Smile, it is the key that fits the lock of everybody's heart. Anthony J. D'Angelo, The College Blue Book Review DBMSs support concurrency, crash recovery with: ACID

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Enabling Fine-grained Access Control in Flexible Distributed Object-aware Process Management Systems

Enabling Fine-grained Access Control in Flexible Distributed Object-aware Process Management Systems Enabling Fine-grained Access Control in Flexible Distributed Object-aware Process Management Systems Kevin Andrews, Sebastian Steinau, and Manfred Reichert Institute of Databases and Information Systems

More information

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)

Cost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today

More information

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Chapter 3. Design of Grid Scheduler. 3.1 Introduction Chapter 3 Design of Grid Scheduler The scheduler component of the grid is responsible to prepare the job ques for grid resources. The research in design of grid schedulers has given various topologies

More information

Lecture 21 Concurrency Control Part 1

Lecture 21 Concurrency Control Part 1 CMSC 461, Database Management Systems Spring 2018 Lecture 21 Concurrency Control Part 1 These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used from

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

Advances in Data Management Transaction Management A.Poulovassilis

Advances in Data Management Transaction Management A.Poulovassilis 1 Advances in Data Management Transaction Management A.Poulovassilis 1 The Transaction Manager Two important measures of DBMS performance are throughput the number of tasks that can be performed within

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

DBT Tool. DBT Framework

DBT Tool. DBT Framework Thread-Safe Dynamic Binary Translation using Transactional Memory JaeWoong Chung,, Michael Dalton, Hari Kannan, Christos Kozyrakis Computer Systems Laboratory Stanford University http://csl.stanford.edu

More information

Practical Database Design Methodology and Use of UML Diagrams Design & Analysis of Database Systems

Practical Database Design Methodology and Use of UML Diagrams Design & Analysis of Database Systems Practical Database Design Methodology and Use of UML Diagrams 406.426 Design & Analysis of Database Systems Jonghun Park jonghun@snu.ac.kr Dept. of Industrial Engineering Seoul National University chapter

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management

WHITE PAPER: ENTERPRISE AVAILABILITY. Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management WHITE PAPER: ENTERPRISE AVAILABILITY Introduction to Adaptive Instrumentation with Symantec Indepth for J2EE Application Performance Management White Paper: Enterprise Availability Introduction to Adaptive

More information

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM

Lecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Storage Hierarchy Management for Scientific Computing

Storage Hierarchy Management for Scientific Computing Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,

More information

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction

MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines. MultiJav: Introduction : A Distributed Shared Memory System Based on Multiple Java Virtual Machines X. Chen and V.H. Allan Computer Science Department, Utah State University 1998 : Introduction Built on concurrency supported

More information

Programming Languages Third Edition. Chapter 7 Basic Semantics

Programming Languages Third Edition. Chapter 7 Basic Semantics Programming Languages Third Edition Chapter 7 Basic Semantics Objectives Understand attributes, binding, and semantic functions Understand declarations, blocks, and scope Learn how to construct a symbol

More information

Tradeoffs in Transactional Memory Virtualization

Tradeoffs in Transactional Memory Virtualization Tradeoffs in Transactional Memory Virtualization JaeWoong Chung Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi,, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun Computer Systems Lab Stanford

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors *

A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * A New Approach to Determining the Time-Stamping Counter's Overhead on the Pentium Pro Processors * Hsin-Ta Chiao and Shyan-Ming Yuan Department of Computer and Information Science National Chiao Tung University

More information

Lecture 12 Transactional Memory

Lecture 12 Transactional Memory CSCI-UA.0480-010 Special Topics: Multicore Programming Lecture 12 Transactional Memory Christopher Mitchell, Ph.D. cmitchell@cs.nyu.edu http://z80.me Database Background Databases have successfully exploited

More information

Hierarchical Chubby: A Scalable, Distributed Locking Service

Hierarchical Chubby: A Scalable, Distributed Locking Service Hierarchical Chubby: A Scalable, Distributed Locking Service Zoë Bohn and Emma Dauterman Abstract We describe a scalable, hierarchical version of Google s locking service, Chubby, designed for use by systems

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Concurrency Control. [R&G] Chapter 17 CS432 1

Concurrency Control. [R&G] Chapter 17 CS432 1 Concurrency Control [R&G] Chapter 17 CS432 1 Conflict Serializable Schedules Two schedules are conflict equivalent if: Involve the same actions of the same transactions Every pair of conflicting actions

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information