arxiv: v2 [cs.dc] 8 Feb 2018

Size: px

Start display at page:

Download "arxiv: v2 [cs.dc] 8 Feb 2018"

Jocelin McDonald
5 years ago
Views:

SEVENTH FRAMEWORK PROGRAMME THEME ICT-2013.3.4 Avance Computing, Embee an Control Systems arxiv:1801.

dc] 8 Feb 2018 Execution Moels for Energy-Efficient Computing Systems Project ID: 611183 D2.

Aras Atalar, Aners Gienstam, Paul Renau-Gou, Philippas Tsigas Date of preparation (latest version):

1 SEVENTH FRAMEWORK PROGRAMME THEME ICT Avance Computing, Embee an Control Systems arxiv: v2 [cs.dc] 8 Feb 2018 Execution Moels for Energy-Efficient Computing Systems Project ID: D2.2 White-box methoologies, programming abstractions an libraries Phuong Ha, Vi Tran, Ibrahim Umar, Aras Atalar, Aners Gienstam, Paul Renau-Gou, Philippas Tsigas Date of preparation (latest version): Copyright c The EXCESS Consortium The opinions of the authors expresse in this ocument o not necessarily reflect the official opinion of EXCESS partners or of the European Commission.

2 D2.2: White-box methoologies, programming abstractions an libraries 2 DOCUMENT INFORMATION Deliverable Number D2.2 Deliverable Name White-box methoologies, programming abstractions an libraries Authors Phuong Ha Vi Tran Ibrahim Umar Aras Atalar Aners Gienstam Paul Renau-Gou Philippas Tsigas Responsible Author Phuong Ha phuong.hoai.ha@uit.no Phone: Keywors High Performance Computing; Energy Efficiency WP/Task WP2/Task 2.2 Nature R Dissemination Level PU Planne Date Final Version Date Reviewe by Christoph Kessler (LIU), Michael Gienger (HLRS) MGT Boar Approval YES

3 D2.2: White-box methoologies, programming abstractions an libraries 3 DOCUMENT HISTORY Partner Date Comment Version UiT (P. Ha, V. Tran) Deliverable skeleton 0.1 UiT (P. Ha, V. Tran, I. Umar) First version sent to WP2 partners 0.2 CTH (P. Renau-Gou) Chalmers part inclue 0.3 UiT (P. Ha, V. Tran, I. Umar) Consoliate version sent to internal 0.4 reviewers UiT (P. Ha, V. Tran) Final version 1.0

4 D2.2: White-box methoologies, programming abstractions an libraries 4 Abstract This eliverable reports the results of white-box methoologies an early results of the first prototype of libraries an programming abstractions as available by project month 18 by Work Package 2 (WP2). It reports i) the latest results of Task 2.2 on white-box methoologies, programming abstractions an libraries for eveloping energy-efficient ata structures an algorithms an ii) the improve results of Task 2.1 on investigating an moeling the trae-off between energy an performance of concurrent ata structures an algorithms. The work has been conucte on two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs an Moviius Myria1 platform. Regaring white-box methoologies, we have evise new relaxe cache-oblivious moels an propose a new power moel for Myria1 platform an an energy moel for lock-free queues on CPU platforms. For Myria1 platform, the improve moel now consiers both computation an ata movement cost as well as architecture an application properties. The moel has been evaluate with a set of micro-benchmarks an application benchmarks. For Intel platforms, we have generalize the moel for concurrent queues on CPU platforms to offer more flexibility accoring to the workers calling the ata structure (parallel section sizes of enqueuers an equeuers are ecouple). Regaring programming abstractions an libraries, we have continue investigating the trae-offs between energy consumption an performance of ata structures such as concurrent queues an concurrent search trees base on the early results of Task 2.1. Base on the investigation, we have implemente a set of libraries an programming abstractions incluing concurrent queues an concurrent search trees that are energy-aware. The preliminary results show that our concurrent trees are faster an more energy efficient than the state-of-the-art on commoity HPC an embee platforms.

5 D2.2: White-box methoologies, programming abstractions an libraries 5 Executive Summary Computing technology is currently at the beginning of the isruptive transition from petascale to exascale computing ( ), posing a great challenge on energy efficiency. High performance computing (HPC) in 2020 will be characterize by ata-centric workloas that, unlike those in traitional sequential/parallel computing, are comprise of big, ivergent, fast an complex ata. In orer to aress energy challenges in HPC, the new ata must be organize an accesse in an energy-efficient manner through novel funamental ata structures an algorithms that strive for the energy limit. Moreover, the general application- an technology-tren inicates finer-graine execution (i.e. smaller chunks of work per compute core) an more frequent communication an synchronization between cores an uncore components (e.g. memory) in HPC applications. Therefore, not only concurrent ata structures an memory access algorithms but also synchronization is essential to optimize the energy consumption of HPC applications. However, previous concurrent ata structures, memory access algorithms an synchronization algorithms were esigne without energy consumption in min. The esign of energy-efficient funamental concurrent ata structures an algorithms for inter-process communication in HPC remains a largely unexplore area an requires significant efforts to be successful. Work package 2 (WP2) aims to evelop interfaces an libraries for energy-efficient interprocess communication an ata sharing on the new EXCESS platforms integrating Moviius embee processors. In orer to set the stage for these tasks, WP2 nees to investigate an moel the trae-offs between energy consumption an performance of ata structures an algorithms for inter-process communication. WP2 also concerns supporting energy-efficient massive parallelism through scalable concurrent ata structures an algorithms that strive for the energy limit, an minimizing inter-component communication through locality- an heterogeneity-aware ata structures an algorithms. The latest results of Task 2.2 (PM7 - PM36) on white-box methoologies, programming abstractions an libraries available by project month 18 as well as the improve results of Task 2.1 on investigating an moeling the trae-off between energy an performance [41], are summarize in this report. White-box methoologies The white-box methoologies presente in this report inclue a new cache-oblivious methoology an new energy moels that help evelop energy-efficient ata structures an algorithms: We have evise a new relaxe cache oblivious methoology that is appropriate for eveloping energy-efficient concurrent ata structures an algorithms. We have propose a new power moel for Moviius embee platform (Myria1) which is able to preict the power consume by a program running on a specific number of cores. We have valiate the moel with a set of micro-benchmarks an

6 D2.2: White-box methoologies, programming abstractions an libraries 6 real applications such as sparse/ense linear algebra kernels an the graph application kernels (e.g., the Graph 500 kernels 1 ). We have propose a way to moel the energy behavior of lock-free queue implementations an parallel applications that use them on CPU platforms. Focusing on steay state behavior, we have ecompose energy behavior into throughput an power issipation which can be moele separately an later recombine into several useful metrics, such as energy per operation. Programming abstractions an libraries We escribe a set of implemente concurrent search trees an queues as well as their energy an performance analyses: We have evelope a concurrent search tree library that contains several state-of-theart concurrent search trees such as the non-blocking binary search tree, the Software Transactional Memory (STM) base re-black tree, AVL tree, an speculation-frienly tree, the fast concurrent B-tree, an the static cache-oblivious binary search tree. A family of novel locality-aware an energy efficient concurrent search trees, namely the DeltaTree, the Balance DeltaTree, an Heterogeneous DeltaTree, are also enclose in the concurrent search tree library. The DeltaTrees are platform-inepenent an up to 140% faster an 220% more energy efficient than the state-of-the-art on commoity HPC an embee platforms. On lock-free queues on CPU platforms, we have automatize the process of estimating the performance an the power issipation of any queue implementation, an integrate it in the EXCESS software. This report is organize as follows. Section 1 provies the backgroun an motivations of the work presente in this eliverable. Section 2 escribes two EXCESS platforms an their setting to measure power consumption. The white-box methoologies such as relaxe cacheoblivious methoology an power moels are presente in Section 3. Section 4 escribes the first prototype of EXCESS libraries an programming abstractions incluing numerous concurrent search tree an queue implementations as well as their performance an energy analyses. Section 5 conclues the report with future works. 1

7 D2.2: White-box methoologies, programming abstractions an libraries 7 Contents 1 Introuction Purpose White-box Methoologies Cache-oblivious Methoology Power an Energy Moels Libraries an Programming Abstractions Contributions EXCESS Platforms an Energy Measurement Settings System A: CPU-base Platform System Description Measurement Methoology for Energy Consumption System B: Moviius Myria1 Embee Platform Moviius Myria1 Architecture Myria1 Measurement Set-up White-box Methoologies Cache-oblivious Methoology Preliminaries Cache-oblivious Algorithms Cache-oblivious Data Structures New Relaxe Cache-oblivious Moel New Concurrency-aware van Eme Boas Layout Power Moel for Computational Algorithms on Moviius Platform Energy Moel Description Moel Valiation Energy Moel for Lock-Free Queues on CPU Platform Motivation an Preliminaries Framework Throughput Estimation Power Estimation White-box Methoology for Instantiating the Energy Moel of Queues on CPU Platform Instantiating the Throughput Moel Instantiating the Power Moel Summary Programming Abstractions an Libraries Concurrent Search Trees Energy-efficient Concurrent Search Trees

8 D2.2: White-box methoologies, programming abstractions an libraries Libraries of Concurrent Search Trees Performance an Energy Analysis of Concurrent Search Trees Concurrent Lock-Free Queues Description of the Implementations Experiments: Preictions an Measurements Conclusions 97

9 D2.2: White-box methoologies, programming abstractions an libraries 9 1 Introuction 1.1 Purpose In orer to aress energy challenges in HPC an embee computing, ata must be organize an accesse in an energy-efficient manner through novel funamental ata structures an algorithms that strive for the energy limit. Due to more frequent communication an synchronization between cores an memory components in HPC an embee computing, not only efficient esign of concurrent ata structures an memory access algorithms but also synchronization is essential to optimize the energy consumption. However, previous concurrent ata structures, memory access algorithms an synchronization algorithms were esigne without consiering energy consumption. Although there are existing stuies on the energy utilization of concurrent ata structures emonstrating non-optimal results on energy consumption, the esign of energy-efficient funamental concurrent ata structures an algorithms for inter-process communication in HPC an embee computing is not yet wiely explore an becomes a challenging an interesting research irection. EXCESS aims to investigate the trae-offs between energy consumption an performance of concurrent ata structures an algorithms as well as inter-process communication in HPC an embee computing. By analyzing the non-intuitive results, EXCESS will evise a comprehensive moel for energy consumption of concurrent ata structures an algorithms for inter-process communication, especially in the presence of component composition. The new energy-efficient technology will be elivere through novel execution moels for the energy-efficient computing paraigm, which consist of complete energy-aware software stacks (incluing energy-aware component moels, programming moels, libraries/algorithms an runtimes) an configurable energy-aware simulation systems for future energy-efficient architectures. The goal of Work package 2 (WP2) is to evelop interfaces an libraries for inter-process communication an ata sharing on EXCESS platforms integrating Moviius embee processors, along with investigating an moeling the trae-offs between energy consumption an performance of ata structures an algorithms for inter-process communication. WP2 also concerns supporting energy-efficient massive parallelism through scalable concurrent ata structures an algorithms that strive for the energy limit, an minimizing intercomponent communication through locality- an heterogeneity-aware ata structures an algorithms. In aition to investigating the trae-offs an evising comprehensive moels for energy consumption of concurrent ata structures an algorithms (Task 2.1), WP2 also ientifies essential concurrent ata structures an algorithms for inter-process communication in HPC with the focus on how to customize them (Task 2.2). We exploit common ata-flow patterns to create generalize communication abstractions with which application esigners can easily create an exploit the customization for the ata-flow patterns. This task constitutes the interfaces an libraries for inter-process communication an ata sharing on EXCESS platforms. The results also constitute a white-box methoology for tuning energy efficiency

10 D2.2: White-box methoologies, programming abstractions an libraries 10 an performance of concurrent ata structures an algorithms. This report summarizes i) the early results of Task 2.2 on white-box methoologies, programming abstractions an libraries an ii) the improve results of Task 2.1 on investigating an moeling the trae-off between energy an performance of concurrent ata structures an algorithms. The improve results of Task 2.1 constitute the theoretical basis for the whole work package. 1.2 White-box Methoologies White-box methoology is a general stuy or a theoretical analysis of the principles an methos applie into a fiel of stuy or research to outline how the stuy or research shoul be taken. The term white-box means that the principles an methos in the methoology have prior knowlege of the inner workings an structures of the objects involve in the stuy. In the scope of EXCESS project, the energy efficiency, architecture an inner workings of a system must be well unerstoo, an likewise for algorithms an ata-sets. The white-box methoologies stuie in this report inclue cache-oblivious methoology, a power moel for Myria1 platform an an energy moel for lock-free concurrent queues Cache-oblivious Methoology Energy efficiency is one of the most important factors in esigning high performance systems. As a result, ata must be organize an accesse in an energy-efficient manner through novel funamental ata structures an algorithms that strive for the energy limit. Unlike conventional locality-aware algorithms that only concern about whether the ata is on-chip (e.g., cache) or not (e.g., DRAM), new energy-efficient ata structures an algorithms must consier ata locality in finer-granularity: where on chip the ata is. Dally [25] preicte that for chips using the 10nm technology, the energy require between accessing ata in nearby on-chip memory an accessing ata across the chip will iffer as much as 75x (2pJ versus 150pJ), whereas the energy require between accessing the on-chip ata an accessing the off-chip ata will only iffer by 2x (150pJ versus 300pJ). Therefore, in orer to construct energy efficient software systems, ata structures an algorithms must support not only high parallelism but also fine-graine ata locality [25]. In orer to evise locality-aware algorithms, we nee theoretical execution moels that promote ata locality. One example of such moels is the the cache-oblivious (CO) moels [34], which enable the analysis of ata transfer between two levels of the memory hierarchy. CO moels are using the same analysis as the wiely known I/O moels [3] except in CO moels an optimal replacement is assume. Lower ata transfer complexity implies better ata locality an higher energy efficiency as energy consumption cause by ata transfer ominates the total energy consumption [25]. These moels require the knowlege of the algorithm an some parameters of the architecture to be known beforehan, hence they are white-box methos. The cache-oblivious (CO) moels (cf. Section ) support not only fine-graine ata locality but also portability. A CO algorithm that is optimize for 2-level memory, is

11 D2.2: White-box methoologies, programming abstractions an libraries 11 asymptotically optimize for unknown multilevel memory (e.g., register, L1C, L2C,..., LLC, memory), enabling fine-graine ata locality (e.g., minimizing ata movement between L1C an L2C). As cache sizes an block sizes in the CO moels are unknown, CO algorithms are expecte to be portable across ifferent systems. For example, the memory transfer cost of an algorithm (e.g., how many ata blocks nee to be transferre between two level of memory), which is analyze using the CO moel, will be applicable on both HPC machines an embee platforms (e.g., Myria1/2 platforms), irrespective of the variations in the harware parameters such as memory hierarchy, specifications an sizes. The performance portability is useful for analyzing the ata movement an energy consumption of an algorithm in a platform-inepenent manner. The memory transfer cost of an algorithm obtaine using the CO moel can be regare as a first piece of information that can enable software esigners to rapily analyze the performance an energy consumption of their algorithms. After all, memory transfer is one of the parameters that ominate the total energy consumption. As for the next step, the transfer cost can be fe irectly into the energy moel of a specific platform to get a goo approximation on the energy consumption of the algorithm on the platform. Algorithms an ata structures analyze using the cache-oblivious moels [34] are foun to be cache-efficient an isk-efficient [14, 28], making them suitable for improving energy efficiency in moern high performance systems. Nowaays, multilevel memory hierarchies in commoity systems are becoming more prominent as moern CPUs ten to have at least 3 level of caches an isks start to incorporate hybri-ssd cache memories. With minimal effort, cache-oblivious algorithms are expecte to be always locality-optimize irrespective of variations in memory hierarchies, enabling less ata transfers between memory levels that irectly translate into runtime energy savings. Since their inception, cache-oblivious moels have been extensively use for esigning locality-aware funamental algorithms an ata structures [14, 28, 33]. Among those algorithms are scanning algorithms (e.g., traversals, aggregates, an array reversals), ivie an conquer algorithms (e.g., meian an selection, an matrix multiplication), an sorting algorithms (e.g., mergesort an funnel-sort [34]). Several static ata structures (e.g., static search trees, an funnels) an ynamic ata structures (e.g., orere files, b-trees, priority queues, an linke-list) have been also analyze using the cache-oblivious moels. Performance of the sai cache-oblivious algorithms an ata structures have been reporte similar to or sometimes better than the performance of their traitional cache-aware counterparts Power an Energy Moels The energy consume by worlwie computing systems increases 7% annually an becomes a major concern in information technology society. In orer to tackle this issue, the research community an inustry have propose several research approaches to reuce the energy consumption of IT systems [69]. One of the key research irections to improve energy-efficiency is to unerstan how much energy a computing system consumes an characterize the energy consume by an iniviual component. By knowing the energy consumption of an algorithm on a specific computing

12 D2.2: White-box methoologies, programming abstractions an libraries 12 architecture, researchers an practitioners can esign an implement new approaches to reuce the energy consume by a certain algorithm on a specific platform. The energy an power consumption of computing systems can be either measure by integrate sensors an external multimeters or estimate by moels. Energy an power measurement equipment an sensors are not always available an can be costly to eploy an set up. Therefore, energy an power moels are an alternative an convenient metho to estimate the energy consumption of a computing component or a whole computing system. The moels, however, nee to be simple to use an shoul not interfere with the energy estimate results [69]. We have conucte a stuy on a power moel of Myria1 platform (cf. Section 3.2)an an energy moel for lock-free queues on CPU platform (cf. Section 3.4). 1.3 Libraries an Programming Abstractions Concurrent ata structures (e.g., search trees an queues) an algorithms use as the builing blocks of energy-efficient software systems must support high parallelism an fine-graine ata locality. In this work we focus on two libraries of the most wiely use concurrent ata structures, namely the search trees an queues. Concurrent search trees are funamental ata structures use wiely in high performance file systems (e.g., TokuFS an XFS) an atabase systems (e.g., InnoDB, MyISAM, PostgreSQL an TokuDB). Concurrent search trees are usually use as the back en of ictionaries supporting search, insertion an eletion of recors. Concurrent FIFO queues are funamental ata structures that are key components in applications, algorithms, run-time an operating systems. The proucer/consumer pattern, e.g., is a common approach to parallelizing applications where threas act as either proucers or consumers an synchronize an stream ata items between them using a share collection. A concurrent queue, a.k.a. share first-in, first-out or FIFO buffer, is a share collection of elements which supports at least the basic operations Enqueue (as an element) an Dequeue (removes the olest element). Dequeue returns the element remove or, if the queue is empty, Null. A large number of lock-free (an wait-free) queue implementations have appeare in the literature, e.g., [81, 65, 79, 66, 52, 37] being some of the most influential or most efficient results. Each implementation of a lock-free queue has obviously its strong an weak points so the impact on performance an energy when choosing one particular implementation for any given situation may not be obvious. As the number of known implementations of lock-free concurrent queues is growing, it is of great interest to escribe a framework within which the ifferent implementations can be ranke, accoring to the parameters that characterize the situation. In this eliverable, we report our results on concurrent ata structures such as concurrent search trees an lock-free queues (cf. Section 4). 1.4 Contributions The main achievements in this report are summarize in two main parts as below.

13 D2.2: White-box methoologies, programming abstractions an libraries 13 White-box methoologies inclue new cache-oblivious methoology an energy moels. We have evise a new relaxe cache oblivious moel that are appropriate for eveloping energy-efficient concurrent ata structures an algorithms. We have propose an power moel which is able to preict the power consume by a program running on a specific number of cores. Given a certain platform an the computation intensity, the moel can preict the power consume by an algorithm, answering the question how many cores are require to run a program to achieve the optimize energy consumption. The moel consiers both platform an algorithm properties, giving more insights into how to esign the algorithm to achieve better energy efficiency. The moel has been valiate by a set of microbenchmarks an application kernels such as sparse/ense linear algebra kernels an graph kernels on Moviius embee platform (Myria1). We have continue the work one in D2.1 [44] on the moeling of queue implementations. We have move from a grey-box moel, where the performance an the power consumption of enqueue an equeue operations were hien, to a white-box moel, where the impact of those operations are stuie separately, an combine at the en. Aitionally, we have generalize the moel to offer more flexibility accoring to the workers calling the ata structure (parallel section sizes of enqueuers an equeuers are ecouple). Programming abstractions an libraries provie a set of implemente concurrent search trees an their energy an performance analyses, as well as a set of implemente lockfree queues, together with a comparison between the preicte an measure throughput an power. We have evelope a concurrent search tree library that contains several state-ofthe-art concurrent search trees such as the non-blocking binary search tree, Software Transactional Memory (STM) base re-black tree, AVL tree, an speculationfrienly tree, fast concurrent B-tree, an static cache-oblivious binary search tree. A family of novel locality-aware an energy efficient concurrent search trees, namely DeltaTree, Balance DeltaTree, an Heterogeneous DeltaTree are also enclose in the concurrent search tree library. All the components in this library support the stan-alone benchmark program moe for the purpose of micro-benchmarking an library moe that is pluggable into any C/C++ base programs. The DeltaTrees are platform-inepenent an up to 140% faster an 220% more energy efficient than the state-of-the-art on commoity HPC an embee platforms. In single threa evaluation, Heterogeneous DeltaTree is 38% faster than st::set of GCC stanar library in the theoretical worst-case scenario of inserting a sorte sequence of keys; an is 50% faster than st::set for inserting a ranom sequence of keys the theoretical average-case scenario.

14 D2.2: White-box methoologies, programming abstractions an libraries 14 On lock-free queues on CPU platforms, we have automatize the process of estimating the performance an the power issipation of any queue implementation, an integrate it in the EXCESS software.

15 D2.2: White-box methoologies, programming abstractions an libraries 15 2 EXCESS Platforms an Energy Measurement Settings In this section, we introuce briefly two EXCESS platforms that we work with. The system escriptions, which have been mentione in Deliverable D2.1, are presente here to make this eliverable self-containe. As compare to Deliverable D2.1, this section is ae with more upate an etaile information on measurement set-up for Myria1 platform. 2.1 System A: CPU-base Platform System Description CPU: Intel(R) Xeon(R) CPU E5-2687W v2 2 sockets, 8 cores each Max frequency: 3.4GHz, Min frequency: 1.2GHz, frequency speestep by DVFS: GHz. Turbo moe: 4.0GHz. Hyperthreaing (isable) L3 cache: 25M, internal write-back unifie, L2 cache: 256K, internal write-back unifie. L1 cache (ata): 32K internal write-back DRAM: 16GB in 4 4GB DDR3 REG ECC PC moules run at 1600MTransfers/s. Each socket has 4 DDR3 channels, each supporting 2 moules. In this case 1 channel per socket is use. Motherboar: Intel Workstation W2600CR, BIOS version: /22/2013 Har rive: Seagate ST10000DM003-9YN162 1TB SATA Measurement Methoology for Energy Consumption The energy measurement equipment for System A at CTH, escribe in Section 2.1.1, is shown in Figure 1 an outline below. It has previously been escribe in etail in EXCESS D1.1 [56] an D5.1 [78]. The system is equippe with external harware sensors for two levels of energy monitoring as well as built in energy sensors: At the system level using an external Watts Up.Net [29] power meter, which is connecte between the wall socket an the system. At the component level using shunt resistors inserte between the power supply unit an the various components, such as CPU, DRAM an motherboar. The signals from the shunt resistors are capture with an Alink USB-1901 [1] ata acquisition unit (DAQ) using a custom utility.

D2.2: White-box methoologies, programming abstractions an libraries 16 power connector Shunt resistor probe lines Polling system state from PAPI via SSH Polling Wattsup via USB Polling

A custom utility base on the PAPI library [17, 84] is use to recor these counters an other system state parameters of interest.

2 System B: Moviius Myria1 Embee Platform 2.2.1 Moviius Myria1 Architecture The Myria1 platform evelope by Moviius contains eight separate SHAVE (Streaming Hybri Architecture Vector Engine) processors an one RISC core namely LEON.

In this work, we consier the following registers an functional units as escribe below. Integer Register File (IRF) the SAU.

16 D2.2: White-box methoologies, programming abstractions an libraries 16 power connector Shunt resistor probe lines Polling system state from PAPI via SSH Polling Wattsup via USB Polling USB-1901 via USB Figure 1: Deployment of energy measurement evices for System A. Intel s RAPL energy counters are also available for the CPU an DRAM components. A custom utility base on the PAPI library [17, 84] is use to recor these counters an other system state parameters of interest. For the work presente in this report the component level harware sensors an the RAPL energy counters have mainly been use. 2.2 System B: Moviius Myria1 Embee Platform Moviius Myria1 Architecture The Myria1 platform evelope by Moviius contains eight separate SHAVE (Streaming Hybri Architecture Vector Engine) processors an one RISC core namely LEON. Each SHAVE one resies on one solitary power islan. The SHAVE processor contains a set of register files an a set of arithmetic units as Figure 2. In this work, we consier the following registers an functional units as escribe below. Integer Register File (IRF) the SAU. Register file for storing integers from either the IAU or Scalar Register File (SRF) Register file for storing integers from either the IAU or the SAU. Vector Register File (VRF) Register file for storing integers from either the VAU. Integer Arithmetic Unit (IAU) Performs all arithmetic instructions that operate on integer numbers, accesses the IRF. Scalar Arithmetic Unit (SRF) Performs all Scalar integer/floating point arithmetic.

D2.2: White-box methoologies, programming abstractions an libraries 17 Figure 2: SHAVE Instruction Units Vector Arithmetic Unit (VAU) Performs all Vector integer/floating point arithmetic.

17 D2.2: White-box methoologies, programming abstractions an libraries 17 Figure 2: SHAVE Instruction Units Vector Arithmetic Unit (VAU) Performs all Vector integer/floating point arithmetic. Loa Store Unit (LSU) There are two LSUs (LSU0 & LSU1) an they perform any memory access an IO instructions. Control Move Unit (CMU) This unit interacts with all register files, an allows for comparing an moving between the register files. The memory architecture of Myria1 obtaine from eliverable D4.1 is shown in Figure 3. Eight SHAVE cores can access Double Data Rate Ranom Access Memory (DDR RAM) via L2 cache or bypass L2 cache. Except from DDR RAM, Moviius introuces a new memory component on-chip RAM containing one megabyte of internal memory with high banwith local storage of ata an instruction coe for the SHAVE processors. This memory component is name CMX. The CMX is constructe to allow eight SHAVEs parallel access to ata an program coe memory without stalling. Each SHAVE can access ata on its own slice of CMX. It can also access other CMX slices of other SHAVE cores with slower time as the trae-off. In the later sections, the energy moel is valiate with both memory components CMX an DDR.

D2.2: White-box methoologies, programming abstractions an libraries 18 Figure 3: Myria1 Memory Hierarchy Figure 4: Myria1 Power Supply Moification 2.2.2 Myria1 Measurement Set-up The platform supports the measurement of the power consume only by the Myria1 chip.

2V core voltage an one HAMEG multimeter measuring all the voltage, current an consume power values.

18 D2.2: White-box methoologies, programming abstractions an libraries 18 Figure 3: Myria1 Memory Hierarchy Figure 4: Myria1 Power Supply Moification Myria1 Measurement Set-up The platform supports the measurement of the power consume only by the Myria1 chip. We use a bench setup consisting of Myria1 MV153 boar, a DC step own converter ownregulating the 5V wall PSU to the 1.2V core voltage an one HAMEG multimeter measuring all the voltage, current an consume power values. The moifications were mae to the MV153 boar to bypass the on-boar voltage regulator which own-regulates the 5V wall PSU to the 1.2V core voltage require by Myria1. That allows an external bench power-supply to be use in its place as shown in Figure 4. The MV153 boar is moifie in orer to measure the voltage, current an the consume power of only Myria1 chip instea of the whole boar. The HAMEG multimeter provies the measure ata at a rate of 50 times per secon which is able to capture the measurements for benchmarks with execution time longer than 20 millisecons.

19 D2.2: White-box methoologies, programming abstractions an libraries 19 3 White-box Methoologies In this section, we present our stuies on white-box methoologies incluing cache-oblivious methoology, a power moel for Myria1 platform an an energy moel for lock-free concurrent queues. 3.1 Cache-oblivious Methoology This section presents the cache-oblivious (CO) methoology. As pointe in Section 1.2.1, theoretical execution moels that promote ata locality are neee in orer to evise energyefficient concurrent algorithms an ata structures. Better ata locality will result in higher energy efficiency since energy consumption cause by ata transfer is preicte to ominate the total energy consumption [25]. We first present two memory moels: 1) the I/O moel [3] an 2) cache-oblivious moel [34], which enable the analysis of ata transfer between two levels of the memory hierarchy. We then present some examples of CO algorithms an ata structures that are analyze using the CO moels in Section an 3.1.3, respectively. Section presents a new relaxe cache-oblivious moel on which a new concurrency-aware van Eme Boas (veb) layout is evise (cf. Section 3.1.5) Preliminaries I/O moel. The I/O 2 moel was introuce by Aggarwal an Vitter [3]. In their seminal paper, Aggarwal an Vitter postulate that the memory hierarchy consists of two levels, an internal memory with size M (e.g., DRAM) an an external storage of infinite size (e.g., isks). Data is transferre in B-size blocks between those two levels of memory an the CPU can only access ata that are available in the internal memory. In the I/O moel, an algorithm s time complexity is assume to be ominate by how many block transfers are require, as loaing ata from isk to memory takes much more time than processing the ata. For this I/O moel, B-tree [7] is an optimal search tree [23]. B-trees an its concurrent variants [12, 21, 38, 39] are optimize for a known memory block size B (e.g., page size) to minimize the number of memory blocks accesse by the CPU uring a search, thereby improving ata locality. The I/O transfer complexity of B-tree is O(log B N), the optimal. However, the I/O moel has its rawbacks. Firstly, to use this moel, an algorithm has to know the B an M (memory size) parameters in avance. The problem is that these parameters are sometimes unknown (e.g., when memory is share with other applications) an most importantly not portable between ifferent platforms. Seconly, in reality there are ifferent block sizes at ifferent levels of the memory hierarchy that can be use in the esign of locality-aware ata layout for search trees. For example in [57, 72], Intel engineers 2 The term I/O is from now on use a shorthan for block I/O operations

20 D2.2: White-box methoologies, programming abstractions an libraries 20 have come out with very fast search trees by crafting a platform-epenent ata layout base on the register size, SIMD with, cache line size, an page size. Existing B-trees limit spatial locality optimization to the memory level with block size B, leaving access to other memory levels with ifferent block size unoptimize. For example a traitional B-tree that is optimize for searching ata in isks (i.e., B is page size), where each noe is an array of sorte keys, is optimal for transfers between a isk an RAM (cf. Figure 5c). However, ata transfers between RAM an last level cache (LLC) are no longer optimal. For searching a key insie each B-size block in RAM, the transfer complexity is Θ(log(B/L)) transfers between RAM an LLC, where L is the cache line size. Note that a search with optimal cache line transfers of O(log L B) is achievable by using the van Eme Boas layout [13]. This layout has been prove to be optimal for search using the cache-oblivious moel [34]. Cache-oblivious moel The cache-oblivious moel was introuce by Frigo et al. in [34], which is similar to the I/O moel except that the block size B an memory size M are unknown. Using the same analysis of the Aggarwal an Vitter s two-level I/O moel, an algorithm is categorize as cache-oblivious if it has no variables that nee to be tune with respect to harware parameters, such as cache size an cache-line length in orer to achieve optimality, assuming that I/Os are performe by an optimal off-line cache replacement strategy. If a cache-oblivious algorithm is optimal for arbitrary two-level memory, the algorithm is also optimal for any ajacent pair of available levels of the memory hierarchy. Therefore without knowing anything about memory level hierarchy an the size of each level, a cacheoblivious algorithm can automatically aapt to multiple levels of the memory hierarchy. In [14], cache-oblivious algorithms were reporte performing better on multiple levels of memory hierarchy an more robust espite changes in memory size parameters compare to the cache-aware algorithms. One simple example is that in the cache-oblivious moel, B-tree is no longer optimal because of the unknown B. Instea, the van Eme Boas (veb) layout-base trees that are escribe by Bener [9, 10, 11] an Broal, [13], are optimal. We woul like to refer the reaers to [14, 34] for a more comprehensive overview of the I/O moel an cache-oblivious moel. We provie some of the examples of cache-oblivious algorithms an cache oblivious ata structures in the following texts Cache-oblivious Algorithms Scanning algorithms an their erivatives One example of a naive cache-oblivious (CO) algorithm is the linear scanning of an N element array that requires Θ(N/B) I/Os or transfers. Bentley s array reversal algorithm an Blum s linear time selection algorithm are primarily base on the scanning algorithm, therefore they also perform in Θ(N/B) I/Os [14, 28].

21 D2.2: White-box methoologies, programming abstractions an libraries Divie an conquer algorithms. Another example of CO algorithms in ivie an conquer algorithms is the matrix operation algorithms. Frigo et al. prove that transposition of an n m matrix was optimally solve in O(mn/B) I/Os an the multiplication of an m n-matrix an an n p-matrix was solve using O((mn + np + mp)/b + mnp/(b M)) I/Os, where M is the memory size [34]. As for square matrices (e.g., N N), using the Strassen s algorithm, the require I/O boun has been prove to be O(N 2 /B + N lg 7 /B M), the optimal Sorting algorithms. Demaine gave two examples of cache-oblivious sorting algorithm in his brief survey paper [28], namely the mergesort an funnelsort [34]. In the same text he also wrote that both sorting algorithms achieve the optimal Θ( N log B 2 N ) I/Os, matching those in the original B analysis of Aggarwal an Vitter [3] Cache-oblivious Data Structures Static ata structures One of the examples of cache-oblivious (CO) static ata structures is the CO search trees that can be achieve using the van Eme Boas (veb) layout [71, 83]. The veb-base trees recursively arrange relate ata in contiguous memory locations, minimizing ata transfer between any two ajacent levels of the memory hierarchy (cf. Figure 7). Figure 5 illustrates the veb layout, where the size B of memory blocks transferre between 2-level memory in the I/O moel [3] is 3 (cf. Section ). Traversing a complete binary tree with the Breath First Search layout (or BFS tree for short) (cf. Figure 5a) with height 4 will nee three memory transfers to locate the key at leaf-noe 13. The first two levels with three noes (1, 2, 3) fit within a single block transfer while the next two levels nee to be loae in two separate block transfers that contain noes (6, 7, 8) 3 an noes (13, 14, 15), respectively. Generally, the number of memory transfers for a BFS tree of size N is (log 2 N log 2 B) = log 2 N/B log 2 N for N B. For a veb tree with the same height, the require memory transfers is only two. As shown in Figure 5b, locating the key in leaf-noe 12 requires only a transfer of noes (1, 2, 3) followe by a transfer of noes (10, 11, 12). Generally, the memory transfer complexity for searching for a key in a tree of size N is now reuce to log 2 N = log log 2 B B N, simply by using an efficient tree layout so that nearby noes are locate in ajacent memory locations. If B = 1024, searching a BFS tree for a key at a leaf requires 10x (or log 2 B) more I/Os than searching a veb tree with the same size N where N B. On commoity machines with multi-level memory, the veb layout is even more efficient. So far the veb layout is shown to have log 2 B less I/Os for two-level memory. In a typical machine having three levels of cache (with cache line size of 64B), a RAM (with page size 3 For simplicity, we assume that the memory controller transfers a memory block of 3 noes starting at the aress of the noe requeste.

22 D2.2: White-box methoologies, programming abstractions an libraries (a) (b) (c) L1C L2C LLC RAM B 1 =16 4x B 2 =16 4x B 3 =16 4x B 4 = x DISK Figure 5: Illustration of require memory transfers in searching for (a) key 13 in BFS tree an (b) key 12 in veb tree, where noe s value is its memory location. An example of multi-level memory is shown in (c), where B x is the block size B between levels of memory. of 4KB) an a isk, searching a veb tree can achieve up to 640x less I/Os than searching a BFS tree, assuming the noe size is 4 bytes (Figure 5c) Dynamic ata structures. In a stanar linke-list structure supporting traversals, insertions an eletions, the bestknown cache-oblivious solution was O((lg 2 N)/B) I/Os for upates an O(K/B) for traversing K elements in the list [28]. The first cache-oblivious priority queue was ue to Arge et al. [5] an it supports inserts an elete-min operations in O( 1 / B log N M/B / B ) I/Os. The veb layout in static cache-oblivious search tree has inspire many cache-oblivious ynamic search trees such as cache-oblivious B-trees [9, 10, 11] an cache-oblivious binary trees [13]. All of these search tree implementations have been prove having the optimal bouns of O(log B N) in searches an require amortize O(log B N) I/Os for upates. However, veb-base trees poorly support concurrent upate operations. Inserting or eleting a noe may result in relocating a large part of the tree in orer to maintain the veb layout (cf. Section 3.1.5). Bener et al. [11] iscusse the problem an provie important theoretical esigns of concurrent veb-base B-trees. Nevertheless, we have foun that the theoretical esigns are not very efficient in practice ue to the actual overhea of maintaining necessary pointers as well as their large memory footprint.

23 D2.2: White-box methoologies, programming abstractions an libraries New Relaxe Cache-oblivious Moel We observe that it is unnecessary to keep a veb-base tree in a contiguous block of memory whose size is greater than some upper boun. In fact, allocating a contiguous block of memory for a veb-base tree oes not guarantee a contiguous block of physical memory. Moern OSes an systems utilize ifferent sizes of continuous physical memory blocks, for example, in the form of pages an cache-lines. A contiguous block in virtual memory might be translate into several blocks with gaps in RAM; also, a page might be cache by several cache lines with gaps at any level of cache. This is one of the motivations for the new relaxe cache oblivious moel propose. We efine relaxe cache oblivious algorithms to be cache-oblivious (CO) algorithms with the restriction that an upper boun UB on the unknown memory block size B is known in avance. As long as an upper boun on all the block sizes of multilevel memory is known, the new relaxe CO moel maintains the key feature of the original CO moel [34]. First, temporal locality is exploite perfectly as there are no constraints on cache size M in the moel. As a result, an optimal offline cache replacement policy can be assume. In practice, the Least Recently Use (LRU) policy with memory of size (1 + ɛ)m, where ɛ > 0, is nearly as goo as the optimal replacement policy with memory of size M [74]. Secon, analysis for a simple two-level memory are applicable for an unknown multilevel memory (e.g., registers, L1/L2/L3 caches an memory). Namely, an algorithm that is optimal in terms of ata movement for a simple two-level memory is asymptotically optimal for an unknown multilevel memory. This feature enables algorithm esigns that can utilize fine-graine ata locality in the multilevel memory hierarchy of moern architectures. The upper boun on the contiguous block size can be obtaine easily from any system (e.g., page-size or any values greater than that), which is platform-inepenent. In fact, the search performance in the new relaxe cache oblivious moel is resilient to ifferent upper boun values (cf. Lemma 3.1 in Section 3.1.5) New Concurrency-aware van Eme Boas Layout We propose improvements to the conventional van Eme Boas (veb) layout to support high performance an high concurrency, which results in new concurrency-aware ynamic veb layout. We first efine the following notations that will be use to elaborate on the improvements: b i (unknown): block size in terms of the number of noes at level i of the memory hierarchy (like B in the I/O moel [3]), which is unknown as in the cache-oblivious moel [34]. When the specific level i of the memory hierarchy is irrelevant, we use notation B instea of b i in orer to be consistent with the I/O moel. UB (known): the upper boun (in terms of the number of noes) on the block size b i of all levels i of the memory hierarchy. Noe: the largest recursive subtree of a van Eme Boas-base search tree that contains at most UB noes (cf. ashe triangles of height 2 L in Figure 6b). Noe is a

24 D2.2: White-box methoologies, programming abstractions an libraries 24 (a) UB H = 2 L (b) T mo 2 k B 2 L W... 2 k 2 k B B T logn X k B Figure 6: (a) New concurrency-aware veb layout. (b) Search using concurrency-aware veb layout. fixe-size tree-container with the veb layout. level of etail k is a partition of the tree into recursive subtrees of height at most 2 k. Let L be the level of etail of Noe. Let H be the height of a Noe, we have H = 2 L. For simplicity, we assume H = log 2 (UB + 1). N, T : size an height of the whole tree in terms of basic noes (not in terms of Noes). Conventional van Eme Boas (veb) layout. The conventional van Eme Boas (veb) layout has been introuce in cache-oblivious ata structures [9, 10, 11, 13, 34]. Figure 7 illustrates the veb layout. Suppose we have a complete binary tree with height h. For simplicity, we assume h is a power of 2, i.e., h = 2 k, k N. The tree is recursively lai out in the memory as follows. The tree is conceptually split between noes of height h/2 an h/2 + 1, resulting in a top subtree T an m 1 = 2 h/2 bottom subtrees W 1, W 2,, W m1 of height h/2. The (m 1 +1) top an bottom subtrees are then locate in contiguous memory locations where T is locate before W 1, W 2,, W m1. Each of the subtrees of height h/2 is then lai out similarly to (m 2 + 1) subtrees of height h/4, where m 2 = 2 h/4. The process continues until each subtree contains only one noe, i.e., the finest level of etail, 0. The main feature of the veb layout is that the cost of any search in this layout is O(log B N) memory transfers, where N is the tree size an B is the unknown memory block size in the cache-oblivious moel [34]. Namely, its search is cache-oblivious. The search cost is the optimal an matches the search boun of B-trees that requires the memory block size B to be known in avance. Moreover, at any level of etail, each subtree in the veb layout is store in a contiguous block of memory.

25 D2.2: White-box methoologies, programming abstractions an libraries 25 T h/2 h W 1... W m h/2 Tree partition... T W 1 W m Memory allocation Figure 7: Static van Eme Boas (veb) layout: a tree of height h is recursively split at height h/2. The top subtree T of height h/2 an m = 2 h/2 bottom subtrees W 1 ; W 2 ;... ; W m of height h/2 are locate in contiguous memory locations where T is locate before W 1 ; W 2 ;... ; W m. Although the conventional veb layout is helpful for utilizing ata locality, it poorly supports concurrent upate operations. Inserting (or eleting) a noe at position i in the contiguous block storing the tree may restructure a large part of the tree. For example, inserting new noes in the full subtree W 1 (a leaf subtree) in Figure 7 will affect the other subtrees W 2, W 3,, W m by rebalancing existing noes between W 1 an the subtrees in orer to have space for new noes. Even worse, we will nee to allocate a new contiguous block of memory for the whole tree if the previously allocate block of memory for the tree runs out of space [13]. Note that we cannot use ynamic noe allocation via pointers since at any level of etail, each subtree in the veb layout must be store in a contiguous block of memory. Concurrency-aware veb layout. In orer to make the veb layout suitable for highly concurrent ata structures with upate operations, we introuce a novel concurrency-aware ynamic veb layout. Our key iea is that if we know an upper boun UB on the unknown memory block size B, we can support ynamic noe allocation via pointers while maintaining the optimal search cost of O(log B N) memory transfers without knowing B (cf. Lemma 3.1). The assumption on known upper boun UB is supporte by the fact that in practice it is unnecessary to keep the veb layout in a contiguous block of memory whose size is greater than some upper boun.

26 D2.2: White-box methoologies, programming abstractions an libraries 26 Figure 6a illustrates the new concurrency-aware veb layout base on the relaxe cache oblivious moel. Let L be the coarsest level of etail such that every recursive subtree contains at most UB noes. Namely, let H an S be the height an size of such a subtree then H = 2 L an S = 2 H 1 < UB. The tree is recursively partitione into level of etail L where each subtree represente by a triangle in Figure 6a, is store in a contiguous memory block of size UB. Unlike the conventional veb, the subtrees at level of etail L are linke to each other using pointers, namely each subtree at level of etail k > L is not store in a contiguous block of memory. Intuitively, since UB is an upper boun on the unknown memory block size B, storing a subtree at level of etail k > L in a contiguous memory block of size greater than UB, oes not reuce the number of memory transfers, provie there is perfect alignment. For example, in Figure 6a, traveling from a subtree W at level of etail L, which is store in a contiguous memory block of size UB, to its chil subtree X at the same level of etail will result in at least two memory transfers: one for W an one for X. Therefore, it is unnecessary to store both W an X in a contiguous memory block of size 2UB. As a result, the memory transfer cost for search operations in the new concurrency-aware veb layout is intuitively the same as that of the conventional veb layout (cf. Lemma 3.1) while the concurrency-aware veb supports high concurrency with upate operations. Lemma 3.1. For any upper boun UB of the unknown memory block size B, a search in a complete binary tree with the new concurrency-aware veb layout achieves the optimal memory transfer O(log B N), where N an B are the tree size an the unknown memory block size in the cache-oblivious moel [34], respectively. Proof. (Sketch) Figure 6b illustrates the proof. Let k be the coarsest level of etail such that every recursive subtree contains at most B noes. Since B UB, k L, where L is the coarsest level of etail at which every recursive subtree ( Noes) contains at most UB noes. That means there are at most 2 L k subtrees along the search path in a Noe an no subtree of epth 2 k is split ue to the bounary of Noes. Namely, triangles of height 2 k fit within a ashe triangle of height 2 L in Figure 6b. Because at any level of etail i L in the concurrency-aware veb layout, a recursive subtree of epth 2 i is store in a contiguous block of memory, each subtree of epth 2 k within a Noe is store in at most 2 memory blocks of size B (epening on the starting location of the subtree in memory). Since every subtree of epth 2 k fits in a Noe (i.e., no subtree is store across two Noes), every subtree of epth 2 k is store in at most 2 memory blocks of size B. Since the tree has height T, T/2 k subtrees of epth 2 k are traverse in a search an thereby at most 2 T/2 k memory blocks are transferre. Since a subtree of height 2 k+1 contains more than B noes, 2 k+1 log 2 (B + 1), or 2 k 1 log 2 2(B + 1). We have 2 T 1 N 2 T since the tree is a complete binary tree. This implies log 2 N T log 2 N + 1. Therefore, the number of memory blocks transferre in a search is 2 T/2 k 4 log 2 N+1 = log 2 (B+1) 4 log B+1 N + log B+1 2 = O(log B N), where N 2.

27 D2.2: White-box methoologies, programming abstractions an libraries 27 A library of novel locality-aware an energy efficient concurrent search trees base on the new concurrency-aware veb layout is presente in Section Power Moel for Computational Algorithms on Moviius Platform The objective of this stuy is to buil a power moel that can estimate the power consumption of an algorithm on Myria platform. In orer to o so, the moel consiers both algorithmic an platform properties. Our moel is inspire by Amahl law analysis an the Roofline moel of energy [19, 20]. Although the Roofline moel also connects algorithmic an platform properties, it oes not consier the number of cores nor memory contention when the number of cores varies. By estimating the power consume by a system running a specific number of cores, our moel can preict the number of cores require to achieve the least energy consumption. In this report, the moel has been evaluate by micro-benchmarks an application kernels such as sparse/ense linear algebra kernels an graph kernels on Moviius embee platform (Myria1) Energy Moel Description In Deliverable D2.1, we presente our initial ieas on an energy moel for Myria1 platform. In this Deliverable, we have improve the moel by consiering both computational an ata movement power. Another improvement is that we create the micro-benchmarks whose sizes are approximately 1 KB that fits to Myria1 cache buffer. The measurements are, therefore, more accurate an o not nee to inclue inter-operation cost like the power moel in Deliverable D2.1. The etails on how to buil the moel an its evaluation are escribe as follows Computational Power The computational power of a system is the require power to perform its computation using ata from closest memory components such as register files. The computational power inclues static power, active power an ynamic power of involve functional units. At first we aim to fin out the corresponing values of static, active an ynamic power of each SHAVE core an each SHAVE arithmetic unit. The experimental results have shown that the power consumption of Moviius Myria1 platform is rule by the following moel: P comp = P sta + #{SHV} (P act + P yn SHV ) (1) In that formula, the static power P sta is the neee power when the Myria1 processor is on. The P act is the power consume when a SHAVE core is on. Therefore, if benchmarks or programs use n SHAVE cores, the active power nees to be multiplie with the number of use SHAVE cores. The ynamic power P yn SHV of each SHAVE is the power consume by all working operation units in one SHAVE. Operation units inclue arithmetic units (e.g., IAU, VAU, SAU an

28 D2.2: White-box methoologies, programming abstractions an libraries 28 CMU) an loa store unit (e.g., LSU1, LSU2). The experimental results show that ifferent arithmetic operation units have ifferent P yn values. When running a benchmark with one more SHAVE, we can ientify the sum of SHAVE P act an P yn which is the power ifference of the two runs (the run with one SHAVE core an the run with two SHAVEs). Given the sum of P act an P yn, P sta is erive from the Equations 1. The experimental results show the average value of P sta from all micro-benchmarks: P sta = mw (2) Dynamic power P yn SHV formula. That means P yn SHV P yn (Sau) + P yn (Iau). of SHAVE running multiple units is compute by the following P yn SHV = i P yn i (op) (3) is the sum of P yn of all involve arithmetic units. E.g. P yn (SauIau) ) = an P act by using the actual For each operation unit, we obtain the two parameters Pop yn power consumption of the benchmark for iniviual units an multiple units. For example, P yn Iau, P yn Sau an P act are ientifie from Equations 4, 5, 6. P yn (Iau) = P sta + #{SHV} (P act + P yn (Iau) ) (4) P yn (Sau) = P sta + #{SHV} (P act + P yn (Sau) ) (5) P yn (SauIau) = P sta + #{SHV} (P act + P yn (Sau) + P yn (Iau) ) (6) Then, the average value of P act from all micro-benchmarks is erive as below. P act = mw (7) Given the values of P sta, P act an P yn SHV as Equations 2, 7 an 3 respectively, the computation power P comp for Moviius Myria1 can be estimate by applying the below formula: ( ) P comp = P sta + #{SHV} P act + i Table 1 lists the ynamic power of each operation unit Data Movement Power P yn i (op) Since Myria1 has two memory components within the chip (CMX an DDR), we moel the ata movement for both of the memory components. The ata is either move from CMX to registers or from DDR to registers before being processe by arithmetic units. Data movement is performe by two units of the SHAVE core namely Loa Store Unit (LSU). LSU loas an stores ata from the memory components such as DDR an CMX to the register memory of SHAVE processor. Therefore, the consume power of ata movement (8)

29 D2.2: White-box methoologies, programming abstractions an libraries 29 Operation Unit Pop yn SauXor SauMul VauXor VauMul IauXor IauMul CmuCpss CmuCpivr LsuLoa LsuStore Table 1: P yn op of SHAVE Operation Units also inclues the static power P sta, the active power P act, LSU unit power P yn LSU an the contention power P ctn when there are more cores accessing memory than the number of available memory ports. Below is the formula to estimate the power consumption of ata movement. P ata = P sta + min(m, n) (P act + P yn LSU ) + max(n m, 0) P ctn (9) In the formula, n is the number of active cores running the program; m is the number representing memory ports or banwith available in the platform; the contention power P ctn is the power overhea occurring when SHAVE cores actively wait for accessing ata because of the limite memory ports (or banwith) in the platform architecture. Myria1 has two separate memory components: CMX an DDR. As escribe in section 2.2.1, each SHAVE core has its own CMX memory slice meaning that the number of ata accessing ports m equals to the number of active cores n when transferring ata between CMX memory an registers. Therefore, ata movement for CMX oes not cause contention power an the power consumption of CMX ata movement PCMX ata is: P ata CMX = P sta + min(n, n) (P act + P yn LSU ) + max(0, 0) P ctn (10) P ata CMX = P sta + n (P act + P yn LSU ) (11) The ata movement between DDR an registers, however, has the contention power ue to the limite DDR ports/banwith share among all SHAVE cores. P ata DDR = P sta + min(m, n) (P act + P yn LSU ) + max(n m, 0) P ctn (12) A Power Moel for Computational Algorithms Typical applications require both computation an ata movement. The power consumption then nees to consier the power of both the computation an ata movement. In our moel,

30 D2.2: White-box methoologies, programming abstractions an libraries 30 we use the concept of operational intensity propose by Williams et.al. [86] to characterize algorithms. An algorithm can be characterize by the amount of computational work W an a number of ata accesses Q. W is the number of operations performe by a program. Q is the number of transferre bytes require uring a program execution. Both W an Q efine the operation intensity I of the algorithm. I = W Q (13) For the power moel, we assume that the core either perform computation or transfer ata uring its active state. The power consumption of a processor epens on the ratio of the computation to ata movement uring its execution. This portion can be calculate base on the time ratio of performing computation to transferring ata. As the time neee to perform one operation is ifferent from the time require to transfer one byte of ata, we introuce a parameter α to the moel. The parameter α is the property of the platform an its value epens on each platform architecture. The moel consiers both computation an ata movement cost is erive below. P = P comp W ( α Q + W ) + P ata α Q ( α Q + W ) (14) After converting W an Q to I by using the Equation 14, the final moel is simplifie as below: P = P comp I ( α + I ) + P ata α ( α + I ) (15) The complete moel with etails to compute the power consume by a given application/ algorithm is presente as below. P = P sta + n (P act + P yn SHV ) ( I α + I ) + (min(m, n) (P act + P LSU ) + max(n m, 0) P ctn α ) ( α + I ) Table 2 summarizes the parameter list of the propose moel. (16) Moel Valiation Experimental valiation with micro-benchmarks Regaring the assembly coe of micro-benchmarks, a fixe number of instructions is provie for all the experiments, meaning that each assembly coe file contains five instructions in an iteration an the iteration is infinitely repeate. This convention keeps the continuity an consistency of experiments while giving insights into measuring the power consumption with ifferent SHAVE units, enabling the comparisons among SHAVE units. As compare to micro-benchmarks use in Deliverable D2.1, the number of instructions in one iteration

31 D2.2: White-box methoologies, programming abstractions an libraries 31 Parameters P sta P act P yn SHV P LSU P ctn m n I α Explanation Static power of a SHAVE core Active power of a SHAVE core Dynamic power of a SHAVE core Operation power of Loa Store Unit Contention power of a SHAVE core Number of memory ports Number of running SHAVE cores Operation intensity of an algorithm Time ratio of ata transfer to computation Table 2: Moel Parameter List for each mico-benchmark is re-calculate so that the whole iteration fits to L1 cache buffer size (1KB). The assembly coe files use in the experimental evaluation contain coe that executes the instruction ecoe an instruction fetch. Majority of tests use pseuo-realistic ata. By using pseuo-realistic ata we have as many non-zero values as possible an avoiing ata value repetition at ifferent offsets. Analyses of experimental results are performe base on an ientifie set of micro-benchmarks. Each micro-benchmark is execute with ifferent numbers of SHAVE such as 1, 2, 4, 6 an 8 SHAVE cores. A sample coe of micro-benchmark performing SauXor operation is given in Figure 8. Computation micro-benchmarks Applying this moel to the computational microbenchmarks for each of operation units, the moel relative error is plotte in the Figure 9. Having all parameter ata from Equation 2, 7 an Table 1, the moel-preicte values are compute. Relative errors are the percentage ifference between the actual power consumption measure by evice an the preicte values from the moel. Uner this moel, the relative error which is the percentage ifference between measure an estimate values varies from -5% to 6% which is an 11% margin. This proves that the moel can be applie for micro-benchmarks running a single unit or any combination of two or three units in parallel. This moel shows the compositionality of power consumption not only for multiple SHAVE cores but also for multiple operation units within a SHAVE. Data movement micro-benchmarks Data movement micro-benchmarks are the micro-benchmarks running LSU units to transfer ata from CMX memory to core registers. The micro-benchmarks are esigne to execute loa/store operation only or with other functional units. When execute with other computation units, the instructions are in ifferent pipelines to make both LSU an computation units execute in parallel manner. Since each of the eight SHAVE cores in Myria1 can access CMX ata with its own port, this matches the Equation 11 as presente in Section

32 D2.2: White-box methoologies, programming abstractions an libraries 32 Figure 8: operations An example of micro-benchmark is written in asm coe to perform SauXor The experimental results of LSU also follow the power moel shown in Figure 10. The

33 D2.2: White-box methoologies, programming abstractions an libraries 33 Figure 9: Relative error of the moel for computational micro-benchmarks Figure 10: Relative error of the moel for ata movement micro-benchmarks relative error of ata movement micro-benchmarks varies from -4% to 6% which is a 10% margin. This proves that the moel can be applie to Myria1 platform accessing CMX ata. Intensity-base micro-benchmarks: the combination of computation an ata movement Since any application requires both computation an ata movement, we esign micro-benchmarks which execute both computation an ata movement units. Unlike the combination of multiple units in ifferent instruction pipelines as computational microbenchmarks, this set of micro-benchmarks contains sequential instructions to trigger functional units in sequential orer. With this way, the assembly coe oes not use the pipeline

34 D2.2: White-box methoologies, programming abstractions an libraries 34 optimization feature of Myria1 platform an therefore, provie raw performance for a more precise analysis. In orer to valiate the power moel for any algorithms, the micro-benchmarks use in experiments at this phase also inicate ifferent operation intensities. Operation intensity I is retrieve from the assembly coe by counting the number of arithmetic instructions such as Xor, Mul an the number of LSU instructions. The number of arithmetic instructions inicate the amount of work W an LSU instructions multiplie by four (each LSU instruction loa 32 bits which equals to four bytes) inicate the number of accesse ata bytes Q. By changing the ratio of arithmetic instructions to LSU instructions, we create micro-benchmarks with ifferent intensity values. A sample coe of micro-benchmark with intensity I = 0.25 is given in Figure 11. Intensity-base micro-benchmarks using CMX memory to store ata has experimental results shown in Figure 12. Since eight SHAVE cores have eight accessing ports to CMX memory, cores o not wait for memory fetching an there is no contention power. The power moel of Myria1 using CMX memory is the Equation 17. P = P sta + n (P act + P yn SHV ) ( I α + I ) + n (P act α (17) + P LSU ) ( α + I ) In the CMX power moel in Equation 17, all parameters are known except α. By using non-regression techniques in Matlab function lsqcurvefit to fin unknown parameters, α is ientifie as equal to one. This is reasonable since accessing four bytes of ata in CMX requires average four cycles [67] which means one cycle per byte an an operation use in micro-benchmarks (e.g. SauXor) also requires one cycle to be execute. We plot the moel eviation error of Myria using CMX in Figure 12. The relative error of intensity micro-benchmarks varies from -9% to 4% which is a 13% margin. This proves that the moel can be applie to estimate the power consumption of an algorithm on Myria1 platform accessing CMX ata. Intensity-base micro-benchmarks using DDR memory to store ata has the experimental results as shown in Figure 13. Since eight SHAVE cores share the same ata bus to DDR memory, contention power might happen when cores wait for memory fetching. The power moel of Myria1 using DDR memory is the Equation 18. P = P sta + n (P act + P yn SHV ) ( I α + I ) + (max(m, n) (P act + P LSU ) + max(n m, 0) P ctn α (18) ) ( α + I ) In the DDR power moel in Equation 18, there are unknown parameters such as α, m an P ctn. By using non-linear regression techniques, α an P ctn are ientifie for each intensity value. Figure 14 shows how well the preicte ata fits to measure ata with

35 D2.2: White-box methoologies, programming abstractions an libraries 35 Figure 11: An example of micro-benchmark is written in asm coe with intensity I = 0.25 the foun parameter values. We plot the moel eviation error of the moel for intensity micro-benchmarks using DDR memory in Figure 13. The relative error of intensity microbenchmarks varies from -16% to 14%. The moel has high accuracy at intensity lower than 1 an higher eviation with micro-benchmarks running two cores. This requires more investigation. However, for each intensity, the eviation is less than 24% margin. The foun tunable parameters such as α, m an P ctn are summarize in the Table 3. The parameter values are erive from experimental results on micro-benchmarks using Matlab function lsqcurvef it.

36 D2.2: White-box methoologies, programming abstractions an libraries 36 Figure 12: Relative error of the moel for intensity micro-benchmarks using CMX Figure 13: Relative error of intensity micro-benchmarks using DDR Experiment with application benchmarks We want to analyze application benchmarks that represent for typical applications. Therefore, we follow two metrics to choose the applications for our investigation. Since the energy consumption of an application epens on its operation intensity I, it is one main factor to choose application benchmarks. The energy consumption not only epens on the power but also program execution time. Therefore, the performance spee-up of an application is another main factor we consier to choose application benchmarks. The three benchmarks we choose are Matrix Multiplication, Sparse Matrix Vector Multiplication an Breath First Search inclue in Berkeley Dwarfs [6] as shown in Figure 15. Matrix Multiplication is prove to have high operation intensity an high performance spee-up [68] while Sparse Matrix Vector Multiplication has low intensity an high spee-up ue to its parallel scalability [85]. Breath First Search, on the other han, has low intensity an saturate low scalability [22]. Since applications with high intensity an low spee-up are typical sequen-

D2.2: White-box methoologies, programming abstractions an libraries 37 Intensity P ctn α m 0.25 0.1 0.91 1 0.5 0.1 1.72 1 0.75 0.1 2.49 1 1 1.97 3.87 1 2 5.25 10 1 4 0.1 10 1 8 0.

37 D2.2: White-box methoologies, programming abstractions an libraries 37 Intensity P ctn α m Table 3: Values of Moel Parameters erive from micro-benchmarks Figure 14: Parameter fitting over the sample of intensity micro-benchmarks on DDR tial applications, which are not our target applications, so we o not inclue them in our valiation. Matrix multiplication Matrix multiplication (matmul) has been implemente base on the optimize assembly coe for matmul using small blocks of ata. Each block is the small submatrix of size 4 4. The matmul algorithm computes matrix C base on two input matrices A an B. All three matrices in this benchmarks are store in DDR RAM. The operation intensity I of the matmul algorithm increase linearly with the matrix size. We store each matrix element with float type which means four bytes. The number of

38 D2.2: White-box methoologies, programming abstractions an libraries 38 Figure 15: Application Categories operations an accesse ata are calculate base on matrix size n as: W = 2 n 3 an Q = 16 n 2. Intensity of matmul is also varie with matrix size as: I = W = n [68]. In our Q 8 experiments, we measure power consumption of matmul with varie matrix size. Figure 16 is the preicte power consumption of matmul over intensity values using our power moel. As shown in the iagram, power consume by computation is the main contribution to the total power. With increase intensity, total power an computation power increase but power from accessing ata ecreases. Apply the moel by using the moel parameters erive from micro-benchmarks, the eviation error of matmul estimate power compare to measure ata on 8 cores is from 24% to 59% as shown in Figure 17. The highest eviation occurs at size because it is performe in short execution time an the multimeter evice can not capture the real consume power. This also happens for matmul with matrix size less than The eviation occurs because parameters have not consiere the ata-access patterns to access the same ata or ifferent ata. Therefore, the moel accuracy can be improve when we consier the ata-access patterns an fin tuning parameters such as α, P ctn an m base on actual measurements of the application. Sparse matrix vector multiplication Sparse matrix vector multiplication (SpMV) is another kernel benchmark that we implement on Myria1. All input matrix an vector of this benchmarks resie in DDR RAM. The operation intensity I of SpMV algorithm oes not epen on matrix size. The ata layout of matrix in this implementation is compresse sparse row (CSR) format. Each element of matrix an vector is also store with float type. From our implementation, the number of operations an accesse ata are calculate base on the size of one matrix imension n as: W = 10 n an Q = 13 4 n. The intensity of SpMV oes not epen on matrix size an is a fixe value: I = W = 5 = As Q 26

39 D2.2: White-box methoologies, programming abstractions an libraries 39 Figure 16: Power Analysis of Matrix Multiplication with 8 cores. Since the intensity of matmul is also varie with matrix size as: I = W = n, its consume power is also varie Q 8 by the intensity. The re an blue stacke lines are the estimate power consume by ata movement an computation which contribute to the total power respectively. Figure 17: Deviation of estimate power from measure power for Matrix Multiplication the result, preicte power consumption of SpMV with varie matrix sizes is the same over matrix size as shown in Figure 18. Each element of matrix an vector is also store with float type. From our implementation, the number of operations an accesse ata is calculate base on the matrix size n n as: W = 10 n an Q = 13 4 n. Intensity of SpMV is not epenent on matrix size an is a fixe value: I = W = 5 = Since the intensity value is small, power Q 26 consume by accessing ata contributes significantly to the total power. The eviation error of SpMV estimate power compare to measure power is from 15% to 23% which is an 8% margin shown in Figure 19. Since we want to preict the tren of power consumption, an 8% margin is acceptable for us to ientify when to use the race-to-hol (RTH) conition [54].

D2.2: White-box methoologies, programming abstractions an libraries 40 Figure 18: Power analysis of Sparse Matrix Vector Multiplication Figure 19: Deviation of estimate power from measure power of

40 D2.2: White-box methoologies, programming abstractions an libraries 40 Figure 18: Power analysis of Sparse Matrix Vector Multiplication Figure 19: Deviation of estimate power from measure power of Sparse Matrix Vector Multiplication Breath First Search. We also implement the Graph500 kernel, namely Breath First Search (BFS), on Myria1. BFS is the graph kernel to explore the vertices an eges of a irecte graph from a starting vertex. We use the algorithm in the current Graph500 benchmark an port it to Myria1. The graph is store in DDR RAM. There are two properties that efine the size of a graph: egree an scale. Scale ientifies the number of noes in the graph while egree is the average number of eges from a noe. In our experiments, we mostly use the efault scale of 16 from the Graph500 [8], which means the graph has 2 16 vertices. The egree is varie from 14 to 17. The operation intensity I of BFS algorithm oes not epen on egree or scale. Each vertex ata is store with float type. From our implementation, the number of operations an accesse ata are calculate base on the number of vertices m an the number of eges n: W = 2 m + 5 n an Q = 8 m + 16 n. Intensity of BFS is a fixe value: I = W = As the result, Q preicte power consumption of BFS with varie egrees is the same over matrix size as shown in Figure 20. Power consume by accessing ata contributes significantly to the total power. Figure 21 is the eviation error of the power moel for BFS compare to

D2.2: White-box methoologies, programming abstractions an libraries 41 Figure 20: Power analysis of Breath First Search Figure 21: Deviation of estimate power from measure power of Breath First

41 D2.2: White-box methoologies, programming abstractions an libraries 41 Figure 20: Power analysis of Breath First Search Figure 21: Deviation of estimate power from measure power of Breath First Search measure ata, which ranges from -14% to 17% Remarks an future works We have presente a power moel for Myria1 platform. In etails, we have escribe the process to buil the moel an evaluate it. We have first valiate the moel with three sets of micro-benchmarks. We have also implemente three application kernels on Myria1 platform, along with analyzing the amount of work W, the number of ata accesses Q an the intensity I for each algorithm. Then we have applie our moel to the implemente algorithms an evaluate its accuracy. In the future, we plan to use the moel with EXCESS execution framework escribe in Deliverable D3.1 an D3.2. In that context, the moel can use the feeback ata of consume power measure by the framework to fin the accurate value of α an P ctn an m for a given platform. We also plan to improve the accuracy of the moel by consiering memory-access patterns of the implementation an instruction-pipeline parallelism.

42 D2.2: White-box methoologies, programming abstractions an libraries Energy Moel for Lock-Free Queues on CPU Platform We consier the problem of moeling the energy behavior of lock-free concurrent queue ata structures, more especially, of lock-free queue implementations an parallel applications that use them. Focusing on steay state behavior we ecompose energy behavior into throughput an power issipation which can be moele separately an later recombine into several useful metrics, such as energy per operation. Base on our moels, instantiate from synthetic benchmark ata, an using only a small amount of aitional application specific information, energy an throughput preictions can be mae for parallel applications that use the respective ata structure implementation. To moel throughput we propose a generic moel for lock-free queue throughput behavior, base on a combination of the equeuers throughput an enqueuers throughput. To moel power issipation we commonly split the contributions from the various computer components into static, activation an ynamic parts, where only the ynamic part epens on the actual instructions being execute. To instantiate the moels a synthetic benchmark explores each queue implementation over the imensions of processor frequency an number of threas. Finally, we show how to make preictions of application throughput an power issipation for a parallel application using a lock-free queue requiring only a limite amount of information about the application work one between queue operations. Our case stuy on a Manelbrot application shows convincing preiction results. In this Section 3.3, we present a so-calle white-box moel for performance an power of lock-free queue implementations. This moel is white-box, in the sense that the phenomena that rive the evolution of the throughput an the power issipation are clearly ientifie, an their contribution to both metrics is transparently state. The Section 3.4 will then explain how to set the parameters of the moel with only a few runs of the synthetic benchmark Motivation an Preliminaries Lock-free implementations of ata structures are a scalable approach for esigning concurrent ata structures. Lock-free ata structures offer high concurrency an immunity to ealocks an convoying, in contrast to their blocking counterparts. Concurrent FIFO queue ata structures are funamental ata structures that are key components in applications, algorithms, run-time an operating systems. The proucer/consumer pattern, e.g., is a common approach to parallelizing applications where threas act as either proucers or consumers an synchronize an stream ata items between them using a share collection. A concurrent queue, a.k.a. share first-in, first-out or FIFO buffer, is a share collection of elements which supports at least the basic operations Enqueue (as an element) an Dequeue (removes the olest element). Dequeue returns the element remove or, if the queue is empty, Null. A large number of lock-free (an wait-free) queue implementations have appeare in the literature, e.g. [81, 65, 79, 66, 52, 37] being some of the most influential or most efficient results. Each implementation of a lock-free queue has obviously its strong

43 D2.2: White-box methoologies, programming abstractions an libraries 43 an weak points so the impact on performance an energy when choosing one particular implementation for any given situation may not be obvious. As the number of known implementations of lock-free concurrent queues is growing, it is of great interest to escribe a framework within which the ifferent implementations can be ranke, accoring to the parameters that characterize the situation. A brute force approach coul achieve this by running the implementations on han on the whole omain of stuy, gathering an comparing measurements. This woul yiel high accuracy, but at a tremenous cost, since the omain is likely to be large. Aitionally, it woul only bring a limite unerstaning on the phenomena that rive the behavior of the queue implementations. Therefore, we propose generic moels for preicting the behavior of lock-free queues uner steay state usage. The moels are instantiate for the queue implementations an machine on han using empirical ata from a limite number of points in the omain. The implementations can be ranke accoring to a plethora of metrics. Traitionally, performance in terms of throughput has been the main metric. Furthermore, the notion of energy efficiency has now extene into every nook an cranny of Information Technology, at any scale, from the Exascale machines that nee huge improvements in terms of power issipation to be feasible [31], to the small electronic evices where the battery lifetime is a critical issue. We ecompose the energy behavior of queues, an subsequently applications, into two components: (i) throughput an (ii) power issipation. We moel these components separately. The preicte throughput an power issipation can be recombine into the energyefficiency metric energy per queue operation, which is the ratio between power issipation an queue throughput. When moeling an application, this metric can be extene to energy per unit of application work. Further, plotting energy per operation or unit of work accoring to throughput allows exploration of the Pareto-optimal frontier of the energy performance bi-criteria optimization problem for the queues or the application. Lock-free queue ata structures generally offer isjoint-access parallelism: enqueuers an equeuers moify only their respective ens of the queue, an compete mostly with operations of the same kin. Nonetheless, when the queue is close to empty, both ens point to the same part of the queue, then enqueue an equeue operations have to be synchronize, an every operation impacts the behavior of any other. Concerning the queue as a whole, a successful event can be seen as the equeue of a non-null item, since this event implies that the item has been enqueue an equeue. Also, the throughput of the queue is naturally efine as the number of such events per unit of time, which is a meaningful performance criterion for queues. In this work, we focus on queues that are in a steay state, i.e. such that the rate of each operation attempt is constant. Then, the throughput T of the queue is the minimum between the throughput of all equeues T, even those returning Null, an throughput of enqueues T e. Inee, if T e > T, then the queue grows an the throughput is etermine by the equeuers, which cannot obtain any Null items; an if T e T, then the queue is mostly empty an Null items are equeue, but the throughput is etermine by the enqueuers.

44 D2.2: White-box methoologies, programming abstractions an libraries 44 Despite this ecomposition, enqueuers an equeuers throughput are still correlate when the queue is mostly empty. In aition, the interactions between them are rather asymmetric, as in broa terms, an enqueue can be elaye by any concurrent equeue, while for a equeue, concurrent enqueues will cease to isturb it as they move away from the equeue en. Base on these facts, we ecorrelate the throughput into several uncorrelate an basic throughputs, an reconstitute the main throughput by combining them. Among the avantages of this process, we earn a better unerstaning of the performance (as the basic throughputs are meaningful), an we reuce the number of measurements neee to instantiate the moel on the whole omain of stuy. The omain of stuy that we envision here can be viewe as the Cartesian prouct of four sets: (i) number of threas accessing the queue, (ii) CPU frequencies, (iii) a range of equeue access rates, (iv) a range of enqueue access rates. The carinality of the first two sets is at most a few tens, while the last two are continuous sets that are not even boune. Thanks to the removal of the epenencies between throughputs, we are able to instantiate the moel with only a few ata points, while the moel covers the whole intervals. Finally, this ecomposition also eases the stuy of power issipation, where we reuse the same ieas as in the throughput estimation part Framework Synthetic Benchmark proceure Enqueuer while!one o Parallel Work() Enqueue() Figure 22: Queue benchmark proceure Dequeuer while!one o res Dequeue() if res Null then Parallel Work() Skeleton We run the synthetic benchmark compose of the two functions escribe in Figure 22, starting with an empty queue. Half of the threas are assigne to be enqueuers while the remaining ones are equeuers. We isable logical cores (hyper-threaing) an map ifferent threas into ifferent cores, also the number of threas never excees the number of cores. In aition, the mapping is one in the following way: when aing an enqueuer/equeuer pair, they are both mappe on the most fille but non-full socket. The parallel sections (Parallel Work) shall be seen as a processing activity, pre-processing for the enqueuers before they enqueue an item, an post-processing on an item from the queue for the equeuers. We assume that memory accesses in the parallel sections are negligible, an represent the parallel sections as sequences of bunches of pause instructions in the benchmark; we note pw e (resp. pw ) the number of bunches of 90 pauses (which correspons to 1000 cycles) that compose the parallel work in the enqueuer (resp. equeuer).

45 D2.2: White-box methoologies, programming abstractions an libraries 45 From a high-level perspective, Enqueue an Dequeue operations follow a retry loop pattern: a threa reas an access point to the ata structure, works locally with this view of the ata structure, possibly performs memory management actions an prepares the new esire value as an access point of the ata structure. Finally, it atomically tries to perform the change through a call to the Compare-an-Swap primitive. If it succees, i.e. if the access point has not been change by another threa between the first rea an the Compare-an-Swap, then it goes to the next parallel section, otherwise it repeats the process. Queue Implementations We stuy some of the most well-known an stuie lockfree an linearizable queues in the literature, as implemente in NOBLE [77].These queue algorithms are escribe in some etail in Section The aim of this work is still to preict the behavior of any lock-free queue algorithm an not only the ones mentione above. These algorithms are use to valiate the moel that we present in the following sections. When we speak about implementations of the queues, we actually refer to the ifferent implementations of enqueuing an equeuing operations, along with their memory management schemes General Power Moel The power is split into three elements: the static part is the cost of turning the machine on, the activation part incorporates a fixe cost for each socket an each core in use, an the ynamic part is a supplementary cost that epens on the running application. In accorance with the RAPL energy counters [26, 17, 84], we further ecompose each part per-component, for memory, CPU, an uncore (enote by a superscript M, C an U, respectively): ( P = P (stat,x) + P (active,x) + P (yn,x)). X {M,C,U} We assume that we alreay know the platform characteristics, i.e. all static an active powers (they can be obtaine as explaine for instance in D1.2 [55]), an we try to fin the application-specific ynamic powers. In orer to keep the formulas reaable, in the following, we enote by P (X) the ynamic power P (yn,x) Notations an Setting We enote by n the number of running threas that call the same operation, an by f the clock frequency of the cores (we only consier the case where all cores share the same clock frequency). We recall that pw e (resp. pw ) is the amount of work in the parallel section of an enqueuer (resp. equeuer), as the number of bunches of 90 pauses. For a given queue implementation, we enote by cw e (resp. cw ) the amount of work in one try of the retry loop of the Enqueue (resp. Dequeue) operation. Associate with these amounts of work, we efine, for o {, e},

46 D2.2: White-box methoologies, programming abstractions an libraries 46 the average execution time of the parallel section (resp. the retry loop an a single try of the retry loop) relate to operation o as t (PS o ) (resp. t (RL o ) an t (SL o )). In the same way, for o {, e}, we enote by P o (C) (resp. P (C) (C) o,ps an P o,rl ) the ynamic CPU power issipate by component X in (resp. the parallel section relate to an the retry loop relate to) operation o. Finally, for o {, e}, we enote by r o the ratio of the time that a threa spens in the retry loop, while it is associate with operation o. In Sections an 3.3.4, in orer to keep expressions as simple as possible, we efine one unit of time as λ sec, where λ is the execution time of 90 f pauses (as the pause instructions are perfectly scalable with clock frequency, λ is constant). Throughput is expresse in number of operations per unit of time, i.e. per λ secs. Finally, we erive the power in Watts. All experiments an their unerlying preictions are one on S ystem A (see Section 2.1), i.e. a platform compose of a ual-socket Intel R Xeon R processor, with eight cores per socket. The sizes of L3, L2 an L1 caches are 25 MB, 256 kb an 32 kb, respectively. We run the implementations at the two extreme frequencies (excluing Turbo moe) 1.2 GHz an 3.4 GHz, for all possible even total numbers of threas, from 2 to 16, i.e. for n {1,..., 8} Throughput Estimation Throughput Decomposition Principles We recall that the throughput of the queue is efine as: T = min (T e, T ), where T e an T are the enqueuers an equeuers throughput, respectively. As we are in steay state, one operation o is performe every t (PS o ) + t (RL o ) unit of time by each threa, an n threas attempt to concurrently execute o, hence the general expression of the throughput T o : T o = n t (PS o ) + t (RL o ). We have seen that the parallel sections of the benchmark are full of pauses, thus the time t (PS o ) spent in a given parallel section is straightforwarly given by t (PS o ) = pw o /f. The execution time of equeue an enqueue operations is more problematic, for two main reasons. Primo, because of the lock-free nature of the implementations. As the number of retries is unknown, the time spent in the function call is not trivially computable. Secuno, when the activity on the queue is high, the threas compete for accessing a share ata, an they stall before actually being able to access the ata. We name this as the expansion, as it leas to an increase in the execution time of a single try of the retry loop. The contention on the queue is twofol. At any time, an even if it coul be negligible, threas that perform the same operation isturb each other, since they try to access the same share ata. In aition, when the queue is mostly empty, enqueuers an equeuers

47 D2.2: White-box methoologies, programming abstractions an libraries 47 try to access the same ata, then interference occurs; enqueuers make equeuers stall an vice versa. We call the former case intra-contention, an the latter one inter-contention. As expecte, we have notice a marke ifference between the execution time of a equeue operation returning Null an one that returns a queue item, i.e. whether the queue was empty or containe at least one item. That is why we ecompose T into throughput of equeue on empty queue T (+) (that returns a Null item), an equeue on non-empty queue T (-) (that oes not return Null). Further, the impact of inter-contention on equeue operations is negligible compare to the impact of the queue being empty; therefore we ignore inter-contention for equeues. In contrast, the queue being empty oes not notably change the execution time of the enqueue operation, while equeue operations can impact the behavior of concurrent enqueue operations greatly when the queue is close to empty. Hence, we split T e into the enqueue throughput T e (+) when the queue is not inter-contene, an the enqueue throughput T e (-) when the queue experiences the maximum possible inter-contention. These basic throughputs fulfill the two following inequalities: T (+) T (-) an T e (+) T (-) e. Thanks to this separation into the four basic throughput cases T (+), T (-), T e (+) an T e (-), we earn a better unerstaning of the factors that influence the general throughput, an we einterlace their epenencies, which ramatically ecreases the number of points in the parallel section sizes set where we nee to take measurements for our moeling. More precisely, by construction, T (+) an T (-) o not inee epen on pw e, while T e (+) an T e (-) o not epen on pw. Nonetheless T (resp. T e ) is efine as a barycenter between T (+) an T (-) (resp. T (-) e an T (+) e ), whose weights epen on both pw an pw e. In Section , we escribe the basic throughputs, we combine them in Section , then we explain how to instantiate the parameters of the moel in Section 3.4.1, an finally exhibit results in Section Basic Throughputs We aim in this section at estimating the throughput T o (b) of one of the basic operations escribe in the previous subsection, where o {e, } an b {+, -}. We assume that T o (b) epens only on pw o, in aition to the tacit epenencies on the clock frequency, number of threas an queue implementation. We enote by cw (b) o the amount of work in a single try of the retry loop relate to operation o in case b when the queue is not intra-contene. Low Intra-Contention We stuy in this section the low intra-contention case, i.e. when (i) the threas o not suffer from expansion ue to threas that perform the same operation, an (ii) a success is obtaine with a single try of the retry loop. As it appears in( Figure ) 23, we have a cyclic execution, an the length of the shortest cycle is t (PS o ) + t. Within each cycle, every threa performs exactly one successful operation, thus SL (b) o

48 D2.2: White-box methoologies, programming abstractions an libraries 48 Retry Loop Parallel Work Cycle Figure 23: Cyclic execution uner low intra-contention the throughput is straightforwar: T (b) o = n ( t (PS o ) + t SL (b) o ) = nf pw o + cw (b) o. (19) High Intra-Contention As explaine in Section , in this case, the irect evaluation of the execution time of a retry loop is more complex, but we have experimentally observe that the throughput is approximately linear with the expecte number of threas that are in the retry loop at a given time. In aition, this expecte number is almost proportional to the amount of work in the parallel section. As a result, a goo approximation of the throughput, in high intra-contention cases, is a function that is linear with the amount of work in the pw o. Figure 24: Intra-contention frontier Frontier We now have to estimate whether the queue is highly intra-contene. We recall that, generally speaking, a long parallel section leas to a low intra-contene queue since threas are most of the time processing some computations an are not trying to access the share ata. Reversely, when the parallel section is short, the ratio of time that threas spen in the retry loop is higher, an gets even higher because of both expansion an retries. That being sai, there exists a simple lower boun of the amount of work in the parallel section, such that there exists an execution where the threas are never failing in their retry

49 D2.2: White-box methoologies, programming abstractions an libraries 49 loop. ( We ) plot in Figure 24 an ieal execution with n = 3 threas an t (PS o ) = (n 1) t SL (b) o. In this execution, all threas always succee at their first try in the retry loop. Nevertheless, if we shorten the parallel section, then there is not enough parallel potential any more, an the threas will start to fail: the queue leaves ( the ) low intra-contention state. In practice, this lower boun (t (PS o ) = (n 1) t SL (b) o ) is actually a goo approximation for the critical point where the queue switches its state Combining Basic Throughputs We are given parallel sections sizes, an show how to link the throughput of the four basic operations, with the equeuers an enqueuers throughput. There are two possible states for the queue: either it is mostly empty (i.e. some Null items are equeue), or it gets larger an larger. In the first case, some of the equeues will occur on an empty queue. In 1 unit of time, T e items are enqueue. These items are equeue in T e /T (-) units of time (the queue is, where equeues, hence the following throughput formula: ( ) non-empty while they are equeue), which leas to a slack of 1 T e /T (-) of Null items can take place at a rate T (+) T = T e T (-) T (-) + 1 T e T (-) T (+). (20) Concerning the enqueuers, we use the same assumption on inter-contention as use on intra-contention in Section , saying that the throughput is linear with the expecte number of threas insie the retry loop. Here, the expecte number of threas insie the equeue operation is proportional to the ratio r of the time spent by one equeuer in its equeue operation. We o not know t (RL ), but we know that in average, to complete a successful operation, a threa nees t (PS ) + t (RL ) units of time, an among this time it will spen t (PS ) in the parallel section. Therefore r = 1 t (PS ) /(t (PS ) + t (RL )) = 1 T pw. n f The minimum intra-contention is reache when this ratio is 0, while the maximum is obtaine when it is 1, thus: T e = T pw ( T e (+) + 1 T pw ) T e (-). (21) n f n f In the secon case, enqueuers an equeuers o not access to the same part of the queue, thus inter-contention oes not take place, then T e = T e (+), an all equeues return a non- Null item, hence T = T (-). The iscrimination of these two cases is trivial when enqueuers an equeuers throughput are given: the queue is in the first state (mostly empty) if an only if T e T.

50 D2.2: White-box methoologies, programming abstractions an libraries 50 ( ) Reversely, if we know the four basic throughputs T e (+), T e (-), T (+), T (-), an aim at reconstituting the equeuers an enqueuers throughput (T, T e ), several solutions coul be consistent, i.e. either a growing queue, such that T e = T e (+) an T = T (-) an fulfilling the inequality T e > T, or a mostly empty queue, fulfilling Equations 20 an 21. ( ) Theorem 3.2. Given T e (+), T e (-), T (+), T (-), there exists a consistent solution (T, T e ) with a growing queue if an only if T e (+) > T (-). In aition, this solution is unique an is such that T e = T e (+) an T = T (-). Proof. ( ) If the queue is growing, then T e > T. Moreover, equeues never occur on an empty queue, hence T = T (-), an there is no inter-contention, thus T e = T e (+). ( ) Let us assume now that T e (+) > T (-). T e = T e (+) an T = T (-) is a vali solution, such that the queue is growing, since then T e > T. By construction, T e T e (+) ; if we ha another solution such that the queue grows an T e < T e (+), it woul mean that enqueues are inter-contene, which is possible only when the queue is mostly empty. This is absur, hence the uniqueness. ( ) T e (+), T e (-), T (+), T (-), there exists a consistent solution (T, T e ) with Theorem 3.3. Given a mostly empty queue if an only if T (-) e T (-) 1 pw ( ) T (+) e T e (-). (22) n f In aition, this solution is unique an is given by Equations 21 an 20. Proof. ( ) Let a solution (T, T e ) with a mostly empty queue. By construction, the throughputs follow Equations 21 an 20. As T e is an increasing function accoring to T (because T e (+) T e (-) ), we erive ( ) T e T (-) pw T e (+) + 1 T (-) pw T e (-). n f n f The queue is mostly empty, thus the equeues of non-null items have to be faster than the enqueues, which translates into T (-) T e. The two inequalities combine show the implication. ( ) Let us assume now that Inequality 22 is fulfille. Equation 20 can be rewritten into T e = T T (+). 1 T (+) T (-)

51 D2.2: White-box methoologies, programming abstractions an libraries 51 Let us consier now T e an T e two functions of T that fulfill the following system of equations: T e (T ) = T T (+) 1 T (+) T (-) ( ) T e (T ) = T pw T (+) n f e + 1 T pw T (-) n f e. ) ( ) ( We have T e ( that T e T (+) T (+) ) = 0 an T e T (-) T (+). In aition, T e = T (-). Accoring to Inequality 22, we know also is a linearly increasing function of T an T e a linearly ecreasing function of T. This shows that there exists a unique T such that T e (T ) = T e (T ), an if we efine T e as T e = T e (T ) = T e (T ), the pair (T, T e ) is such that T (-) T T (+) T e (-) T e T e (+). T e T This implies that (T, T e ) is a solution with an empty queue, an we have shown that this solution is unique. ( ) Corollary Given T e (+), T e (-), T (+), T (-), there exists at least one solution (T, T e ). Proof. We show that if the inequality of Theorem 3.2 is not fulfille, i.e. if T e (+) T (-), then the inequality of Theorem 3.3 is true. We have inee ( 1 pw ( ) ) ( ) ( ) T (+) e T e (-) T (-) T e (-) = 1 pw T e (+) T (-) 1 pw T (-) T e (-) n f n f n f ( 1 pw T e (+) n f T (-) T e (+) ( 1 pw ( ) ) T (+) e T e (-) T (-) T e (-) 0, n f which proves the Corollary. ) T (-) ( 1 pw T (-) n f One can notice that if T e (+) > T (-) an Inequality 22 are fulfille an the queue coul be either mostly empty or growing. In this case, we choose, for each operation, the mean of the two solutions, in orer to minimize the iscontinuities Power Estimation We recall that we are intereste only in the ynamic powers as we assume that static an activation powers are known. ) T (+) e

52 D2.2: White-box methoologies, programming abstractions an libraries CPU Power Firstly, as we map each threa on a eicate core, there is no interference between the CPU power of ifferent cores, so we can compute the ynamic power as P (C) = n P (C) e + n P (C). (23) Seconly, we assume that we can segment time an consier that, given a threa performing operation o {e, }, the power issipate in the retry loop an the power issipate in the parallel section are inepenent. There only remains to weight the previous powers by the time spent in each of these regions: P (C) o = r o P (C) o,rl + (1 r o) P (C) o,ps. (24) As shown in Section , the ratio can be obtaine through r o = 1 T o pw o. (25) n f Altogether, we obtain the final formula for ynamic CPU power ( ) P (C) = n T P (C) o,rl + o pw o P (C) o,ps P (C) o,rl (26) n f o {e,} Memory an Uncore Power We have notice in [55] that the ynamic memory power is proportional to the intensity (number of units of memory accesse per unit of time) of main memory accesses an remote accesses, when the threas rea separate places of the memory. Here, the ata structure oes not irectly involve the main memory since we keep its size reasonably boune (if the queue reaches the maximum size, we suspen the measurements, empty the queue, an resume), hence the power issipation in memory is only ue to remote accesses, which only appears as the threas are sprea across sockets (i.e. when n > 4). Moreover, as the parallel sections are full of pauses, communications can only take place in the retry loop, an there is no ynamic memory power issipate in the parallel sections. Concerning the retry loops, we make the following assumption: the amount of ata accesse per secon in a retry loop epens on the implementation, but given an implementation, once a threa is in the retry loop, it will always try to access the same amount of ata per secon. When the queue is highly intra-contene, if a threa fails then it will retry an will access the ata in the same way as in the previous try; an if there is expansion, then the threa will still try to access the ata for the whole time it is in the retry loop. In aition, the equeuers (an the same line of reasoning hols for the enqueuers) tries here to access the same ata. Therefore either memory requests are batche together when sent outsie the socket, or the Home Agent (cache coherency mechanism/memory controller)

53 D2.2: White-box methoologies, programming abstractions an libraries 53 keeps track of the previous requests. This implies that the number of threas attempting to access the ata oes not impact the ynamic memory power greatly when the rate of requests is high. All things consiere, as a threa working on operation o spens a fraction r o of its time insie its retry loop, we obtain that the ynamic memory power issipate in the retry loop is proportional to r o (times the amount of ata accesse per unit of time in the retry loop, which is a constant). Hence P (M) = r e ρ (M) e + r ρ (M), (27) where ρ (M) e an ρ (M) are constants. The ynamic uncore power is compute exactly in the same way as the ynamic memory power. 3.4 White-box Methoology for Instantiating the Energy Moel of Queues on CPU Platform In this section, we show how to obtain the parameters of the moel, so that preictions can further be mae. Those parameters epen on the architecture where the application is running, thus we nee some measurements on a few runs of the synthetic benchmark to iscover them. This methoology is white-box, since we know which runs to use, so that the calibration of the moel is achieve with a minimum possible set of runs Instantiating the Throughput Moel We recall that, for all o {e, } an b {+, -}, T o (b) epens only on pw o, while T e an T epen on both pw an pw e. We enote now by T (pw, pw e ) (resp. T e (pw, pw e )) the equeuers (resp. enqueuers ) throughput as the amount of work in the parallel section of the equeuers is pw an enqueuers one is pw e. The estimate of a value is enote by a hat on top, while the measure value oes not wear the hat. Let p sma = 1, p mi = 20 an p big = 1000 be three istinctive amounts of work, that correspons to ifferent states of the execution. If pw o = p big, we can neglect the impact of operation o on the queue, pw o = p mi is a low intra-contention case since the non-expane critical sections are experimentally less than 2 units of time, an pw o = p sma correspons to a highly inter- or intra-contention case. We note the we cannot use a 0 size as amount of work since it leas to unesirable results ue to the back-to-back effect (a threa oes not allow other threas to access the queue for several consecutive iterations) Low Intra-Contention The basic throughputs that are not intra-contene can be spawne from cw (b) o (critical section size of operation o in case b), where o {e, } an b {+, -}, which we try to estimate here. We pick four points where the basic throughputs are easy to approximate.

54 D2.2: White-box methoologies, programming abstractions an libraries 54 We have T (p mi, p sma ) < T e (p mi, p sma ), as the amounts of work in the retry loops are in practice less than 10. For the same reason, at this point, we are in low intra-contention from the equeuers point of view. Altogether, T (p mi, p sma ) = T (-) (p mi ) = cw (-) = Then, accoring to Equation 20, we have nf p mi + nf p mi + ĉw (+) ĉw (+) n f p mi + cw (-) n f T (p mi, p sma ) p mi. = T (+) (p mi ), hence = T (p mi, p big ) T e (p mi, p big ) ( ), 1 (-) p mi +ĉw n f T e (p mi,p big ) from which we can extract In the same way, we can compute High Intra-Contention (+) (-) ĉw since we know alreay ĉw. (+) (-) ĉw e then ĉw e, by using (p big, p mi ) an (p sma, p mi ). We aim here at estimating T o (b) on a high intra-contention point. p sma = 1 an p mi = 20 are such that T (p sma, p mi ) T e (p sma, p mi ). Accoring to Equation 20, we have T (p sma, p mi ) = T e (p sma, p mi ) + 1 T e(p sma, p mi ) T (-) (p sma ) In aition, if T (p sma, p sma ) T e (p sma, p sma ), then T (p sma, p sma ) = T e (p sma, p sma ) + 1 T e(p sma, p sma ) T (-) (p sma ) T (+) (p sma ). T (+) (p sma ), (-) otherwise, T (p sma, p sma ) = T (p sma ). In both cases, we can fin the two unknowns T (-) (+) (p sma ) an T (p sma ) thanks to the two equations. This last point is also use in the same way for enqueuers: if T (p sma, p sma ) T e (p sma, p sma ), then T e (p sma, p sma ) = T (p sma, p sma ) p sma (+) T e (p sma )+ n f ( 1 T (p sma, p sma ) p sma n f ) (-) T e (p sma ),

55 D2.2: White-box methoologies, programming abstractions an libraries 55 otherwise, T e (p sma, p sma ) = T (+) e (p sma ). Like previously, we have T (p mi, p sma ) < T e (p mi, p sma ), hence (+) T e (p sma ) = T e (p mi, p sma ). (+) T e (p sma ), but we o not have access to This implies that in any cases we can compute T e (-) (p sma ) if T (p sma, p sma ) < T e (p sma, p sma ). In this case, the bottleneck of the queue is likely (-) (+) to be the equeuers, hence we set the value T e (p sma ) = T e (p sma ) by efault. T (b) o (b) T o (p sma ) to the leftmost point of the low intra- All are then obtaine by joining contention part: T (b) o (pw o ) = f cw (b) o (b) (n 1)ĉw n f (b) pw o +ĉw o (b) T o (p sma) o p sma (pw o p sma ) + (b) (b) T o (p sma ) if pw o (n 1)ĉw o otherwise. Finally, equeuers an enqueuers throughput are reconstitute as explaine in Section : if Equation 22 is fullfille, then they are compute through Equations 20 an 21 that can be rewritten as: T (pw, pw e ) = T (+) (-) T (pw )+ T e (pw e ) 1 (+) (pw ) T (-) (pw ) ( )) 1 pw (+) (-) T nf T e (pw e ) T e (pw e 1 (+) T (-) (pw ) (pw ) ( T e (pw, pw e ) = T (pw,pw e ) pw (+) T n f e (pw e ) + 1 T (pw,pw e ) pw n f ) T (-) e (pw e ). Otherwise, T (pw, pw e ) = T (-) (pw ) an T e (pw, pw e ) = T (+) e (pw e ) Instantiating the Power Moel We use once again p sma = 1, p mi = 20 an p big = 1000 as three istinctive amounts of work, that allows easy approximations for the power issipation expressions. We have seen that if X {M, U}, then P (X) = r ρ (X) + r e ρ (X) e, which can be approximate at (pw, pw e ) = (p big, p sma ) by P (X) (p big, p sma ) = r e (p sma ) ρ (X) e, since r is then nearly 0. It implies that ρ (X) e = P (X) (p big, p sma ) 1 T e(p big,p sma) p sma. n f We obtain ρ (X) similarly at (pw, pw e ) = (p sma, p big ).

56 D2.2: White-box methoologies, programming abstractions an libraries 56 Concerning the ynamic CPU power, we firstly estimate the power issipate in the parallel sections. Accoring to the implementation, the CPU power issipate by the parallel section of enqueuers an equeuers is the same for both, an this power oes not epen on the amount of work. These restrictions are not a loss of generality, since the aim here is to stuy the queue implementations. It can then be estimate by using (p big, p big ), where the ratios r o can be consiere as 0, which leas to P (C) o,ps = P (C) (p big, p big ). 2n We reuse the point (p big, p sma ), where r is very close to 0, to erive that ( P (C) = n r e (p sma ) ) (C) P e,rl + (1 r (C) (C) e(p sma )) P e,ps + n P,PS, which is equivalent to ( P (C) e,rl = P (C) (p big, p sma ) 2 ( ) n 1 T e(p big,p sma)p sma 1 T e(p big,p sma)p sma n f n f ) 1 P (C) o,ps P (C) Once again, we obtain,rl with the same line of reasoning at (pw, pw e) = (p sma, p big ). Finally, P (M) an P (U) (resp. P (C) ) are compute by using Equation 27 (resp. Equations 23 an 24), an the estimates of the ratios that are issue from Section Summary r o = 1 T o pw o. n f To summarize, we have built a moel that nees to be calibrate (instantiate) before the execution of an application that woul use the queue. The calibration phase starts with the run of synthetic benchmarks on the following set of points an the measurements of the equeuers an enqueuers throughputs: (pw, pw e ) { (p mi, p sma ), (p mi, p big ), (p sma, p mi ), (p big, p mi ), (p sma, p sma ), (p big, p sma ), (p big, p big ), (p sma, p big ) }. The calibration phase continues with the extraction of the parameters that rule the four basic throughputs (hence the equeuers an enqueuers throughputs) on the whole omain. After this calibration phase, we are able, given any parallel section values (pw, pw e ), to estimate the throughput an the energy consumption of the application, through a few basic arithmetic operations, as explaine in the previous subsections.

57 D2.2: White-box methoologies, programming abstractions an libraries 57 4 Programming Abstractions an Libraries In this section, we escribe our stuies on programming abstractions an libraries such as concurrent search trees an concurrent lock-free queues. 4.1 Concurrent Search Trees In this section, we present libraries of concurrent search trees an their performance an energy analysis Energy-efficient Concurrent Search Trees DeltaTree ( Tree) Concurrent trees are funamental ata structures that are wiely use in ifferent contexts such as loa-balancing [27, 46, 73] an searching [2, 15, 16, 24, 30, 32]. Most of the existing highly-concurrent search trees are not consiering the fine-graine ata locality. The nonblocking concurrent search trees [16, 32] an Software Transactional Memory (STM) search trees [2, 15, 24, 30] have been regare as the state-of-the-art concurrent search trees. They have been proven to be scalable an highly-concurrent. However these trees are not esigne for fine-graine ata locality. Prominent concurrent search trees which are often inclue in several benchmark istributions such as the concurrent re-black tree [30] by Oracle Labs an the concurrent AVL tree evelope by Stanfor [15] are not esigne for ata locality either. It is challenging to evise search trees that are portable, highly concurrent an fine-graine locality-aware. A platform-customize locality-aware search trees [57, 72] are not portable while there are big interests of concurrent ata structures for unconventional platforms [47, 43]. Concurrency control techniques such as transactional memory [51, 48] an multi-wor synchronization [49, 42, 59] o not take into account fine-graine locality while fine-graine locality-aware techniques such as van Eme Boas layout [71, 83] poorly support concurrency. Base on the new concurrency-aware veb (cf. Section 3.1.5), we implement Tree [80], a portable locality-aware unbalance concurrent search tree. Figure 25 illustrates a Tree U which is compose by a group of subtrees ( Noes). A Noe s internal noes are put together in cache-oblivious fashion using the concurrency-aware veb layout (cf. Section 3.1.5). The search operation for Tree is wait-free. We are aware that the Tree has two major shortcomings, namely being an unbalance tree an having a poor memory utilization because of the inter-noe pointers usage. In orer to aress the Tree shortcomings, we implement the Balance Tree (b Tree). b Tree is evise by improving the structure an the algorithm of Tree to support the concurrent, Btree-like bottom-up insertions, which ensures balance tree. Another major change is that in b Tree, the fat-pointer Noes are replace with pointer-less Noes. The removal of Noes internal pointers mae way for 200% more noes to fit into the tree (in 64-bit x86 systems, a pointer costs 8 bytes of memory while an unsigne variable

58 D2.2: White-box methoologies, programming abstractions an libraries 58 requires only 4 bytes of memory). The concurrent search operations supporte by b Tree are still without locks an waits, but they are no longer wait-free. Base on our stuy, a more-compact tree results in less ata transfers. An less ata transfers leas to better performance an energy efficiency. Finally, we implement the Heterogeneous Tree (h Tree), which is a better performing, more-compact tree than b Tree. h Tree is evise by changing the leaf-oriente (external) tree layout of the leaf Noes into an internal tree layout. This improvement allows 100% more noes to fit insie the tree, which further improves the tree s overall operation performance. Base on experimental insights, our Trees are ifferent from previous theoretical esigns of concurrent cache-oblivious (CO) trees such as the concurrent packe-memory CO tree an concurrent exponential CO tree [11]. The concurrent packe-memory CO tree gives a goo amortize memory transfer cost of Θ(log B N + (log 2 N/B)) for tree upates, assuming that operations occur sequentially. However, the propose ata representation requires each noe to have the parent-chil pointers. Besies the complication in re-arranging those pointers, we have foun that eliminating pointers from the noe to minimize memory footprint is significantly beneficial for cache-oblivious tree in practice (cf. improvement from Tree to b Tree in Section ). In the Tree experimental evaluation (cf. Section 4.1.3), b Tree is 100% faster than Tree in searching, which is attaine by simply removing pointers from the tree noe. In the concurrent exponential CO tree by Bener et al. [11], expecte memory transfer cost for search an upate operations is O(log B N +(log α lgn)), assuming that all processors are synchronous. Cormen et al. [23, pp. 212], however, wrote that although the unerlying exponential tree algorithm [4] is an important theoretical breakthrough, it is complicate an unlikely to compete with similar sorting algorithms. In fact, noes in the exponential tree grow exponentially in size, which not only complicates maintaining inter-noe pointers but also exponentially increases the tree s memory footprint in practice. In contrast, the memory footprint of Tree with the fixe size Noes graually expans on-eman when the tree grows. Thanks to the fixe size Noes, Tree exploits further locality by utilizing a map an an efficient inter-noe connection (cf. Section ). The Tree consists of U Noes of fixe size UB. Each of the Noe contains a leaforiente binary search tree (BST) T i, i = 1,..., U. The Tree U provies the following operations: insertnoe(v, U), which as value v to the set U, eletenoe(v, U) for removing a value v from the set, an searchnoe(v, U), for etermining whether value v exists in the set. We use the term upate operation for either insert or elete operation. We assume that uplicate values are not allowe insie the set an a special value, for example 0, is reserve as an inicator of an Empty value. Data structures. The implementation of Tree utilizes the ata structure in Figure 26. The topmost level of Tree is represente by a struct universe (line 19) that contains a pointer (root) to the root noe of the first Noe (T 1.n 1 ). A Tree is forme by a group of Noes.

D2.2: White-box methoologies, programming abstractions an libraries 59 T 1 U T 3 Figure 25:

1: Struct noe n: 2: member fiels: 3: ti N, if > 0 inicates the noe is root of a Noe with an

true inicates a logically elete noe 6: left, right N, left an right chil pointers 7: isleaf

9: member fiels: 10: noes, a group of pre-allocate noe n {n 1, n 2,.

.., b #threas ) 12: meta, Noe s metaata (single NoeMeta) 13: Struct NoeMeta M: 14: member

upate operations 17: root, pointer to the root noe of the Noe (T x.

59 D2.2: White-box methoologies, programming abstractions an libraries 59 T 1 U T 3 Figure 25: Depiction of a Tree U. Triangles T x represent the Noes. 1: Struct noe n: 2: member fiels: 3: ti N, if > 0 inicates the noe is root of a Noe with an i of ti (T ti ) 4: value N, the noe value, efault is empty 5: mark {true, false}, a value of true inicates a logically elete noe 6: left, right N, left an right chil pointers 7: isleaf true, false, inicates whether the noe is a leaf of a Noe, efault is true 8: Struct Noe T : 9: member fiels: 10: noes, a group of pre-allocate noe n {n 1, n 2,..., n UB } 11: buffer, Noe s buffer (pre-allocate array of b 1, b 2,..., b #threas ) 12: meta, Noe s metaata (single NoeMeta) 13: Struct NoeMeta M: 14: member fiels: 15: locke, inicates whether a Noe is locke 16: opcount, a counter for the active upate operations 17: root, pointer to the root noe of the Noe (T x.n 1 ) 18: mirror, pointer to the root noe of the Noe s mirror (T x.n 1 ) 19: Struct universe U: 20: member fiels: 21: root, pointer to the root of the topmost Noe (T 1.root) Figure 26: Tree s ata structures.

60 D2.2: White-box methoologies, programming abstractions an libraries 60 1: function searchnoe(v, U) 2: p U.root 3: while (TRUE) o 4: lastnoe p atomic assignment 5: if (p.value < v) then 6: p p.left 7: else 8: p p.right 9: if (!p or lastnoe.isleaf = TRUE) then 10: break 11: if (lastnoe.value = v) then 12: if (lastnoe.mark = FALSE) then lastnoe is not elete 13: return TRUE 14: else 15: return FALSE 16: else 17: {Search the last visite Noe s buffer for v} 18: if ({v is foun}) then 19: return TRUE 20: else 21: return FALSE Figure 27: Tree s wait-free search algorithm. Tree s Noes are represente by the struct Noe (line 8). A Noe consists of a collection of UB noes an a buffer array. Noe s buffer array length is equal to the number of operating threas. Each Noe is accompanie by a metaata NoeMeta that hols lock an counters variable. Struct NoeMeta (line 13) acts as metaata for every Noe. This structure consists of a fiel opcount, which is a counter that inicates the number of insert/elete threas that are currently operating within that Noe; an fiel locke that inicates whether a Noe is currently locke by a maintenance operation. When locke is set as true, no insert/elete threas are allowe to get into a Noe. Lastly, NoeMeta contains a root pointer that points to the first noe of NoeMeta s accompanying Noe, an the pointer mirror that points to the root of the Noe s mirror (cf. Section ). Each noe structure (line 1) contains fiel value, which hols a value for guiing the search, or a ata value if it resies in a leaf-noe. Fiel mark inicates a logically elete value, if set to true. A true value of isleaf inicates the noe is a leaf noe, an false otherwise. Fiel ti is a unique ientifier of a corresponing Noe an it is use to let a threa know whether itself has move between Noes. A ti is only efine in the root noe of a Noe.

61 D2.2: White-box methoologies, programming abstractions an libraries 61 Tree functions. Tree provies basic functions such as search, insert an elete functions. We refer insert an elete operations as the upate operations. Function searchnoe(v, U) (cf. Figure 27), is going to walk over the Tree to fin whether the value v exists in U (Figure 27, lines 4 10). The searchnoe(v, U) function returns true whenever v has been foun, or false otherwise (Figure 27, line 12). This operation is guarantee to be wait-free (cf. Lemma 4.1). Lemma 4.1. Tree search operation is wait-free. Proof. (Sketch) The proof can be serve base on these observations on Figure 27: 1. SearchNoe an invoke SearchBuffer (line 17) o not wait for any locks. 2. The number of iterations in the while loop (line 3) is boune by the height T of the tree, or O(T ). The loop is always ene whenever a leaf noe or an empty noe is foun in line SearchBuffer time complexity is boune by the buffer size, which is a constant. Therefore the SearchNoe time is boune by O(T ), where T is the height of the tree.

62 D2.2: White-box methoologies, programming abstractions an libraries 62 1: function insertnoe(v, p) 2: p U.root 3: BEGIN : 4: if ({Entering new Noe T x }) then 5: Decrement previous Noe s opcount (T x.opcount) 6: Wait if Noe is currently maintaine 7: if Not at the tree s last level noe then 8: if (v < p.value) then 9: if Is at leaf (p.isleaf = TRUE) then 10: if (CAS(p.lef t.value, empty, v) = empty) then 11: Do insert to the left 12: Decrement T x.opcount 13: else 14: Re-try insert from p 15: else 16: Go to left chil (p p.left) ) an restart from BEGIN 17: else if (v > p.value) then 18: if (Is at leaf (p.isleaf = TRUE) then 19: if (CAS(p.lef t.value, empty, p.value) = empty) then 20: Do insert to the right 21: Decrement T x.opcount 22: else 23: Re-try insert from p 24: else 25: Go to right chil (p p.right) an restart from BEGIN 26: else if (v = p.value) then 27: if Is at leaf (p.isleaf = TRUE) then 28: if v is not elete (p.mark = FALSE) then 29: Decrement T x.opcount v alreay exist 30: else 31: Do insert right 32: else 33: Go to right chil (p p.right) an restart from BEGIN 34: else 35: if v is alreay in T x.buffer then 36: Decrement T x.opcount 37: else 38: Put v insie T x.buffer buffere insert 39: if (TAS(T x.locke)) then Acquire maintenance lock 40: Decrement T x.opcount 41: Wait for all upates to finish (T x.opcount=0) 42: o rebalance(t x ) or expan(p)

63 D2.2: White-box methoologies, programming abstractions an libraries 63 43: function eletenoe(v, p) 44: p U.root 45: BEGIN : 46: if ({Entering new Noe T x }) then 47: Decrement previous Noe s opcount (T x.opcount) 48: Wait if Noe is currently maintaine 49: if Is at leaf (p.isleaf = TRUE)) or at the tree s last level noe then 50: if (p.value = v) then 51: if (CAS(p.mark, FALSE, TRUE)!= FALSE)) then 52: Decrement T x.opcount v is alreay elete 53: else 54: if p is still the leaf noe then 55: if (TAS(T x.locke)) then Acquire the maintenance lock 56: Decrement T x.opcount 57: Wait for all upates to finish (T x.opcount=0) 58: Do noe merge if neee 59: else 60: Re-try elete from current noe p 61: else 62: Search (T x.buffer) for v 63: if v is foun in T x.buffer.ix then 64: Remove v from T x.buffer.ix buffere elete 65: Decrement T x.opcount 66: else 67: if (v < p.value) then 68: Go to left chil (p p.left) 69: else 70: Go to right chil (p p.right) 71: Restart from BEGIN Figure 28: Tree s concurrent upate (insert an elete) algorithms. Function insertnoe(v, U) (cf. Figure 28, line 1) inserts value v at a leaf of Tree, provie v oes not exist in the tree (Figure 28, line 29). Following the nature of a leaforiente tree, a successful insert operation replaces a leaf with a subtree of three noes [32] (cf. Figure 29a an in Figure 28, lines 10 an 19). Because the Tree structure is preallocate, an insert operation nees only to o a logical replacement, or only replacing the value of a leaf an its chilren. The function eletenoe(v, U) (cf. Figure 28, line 43) marks the leaf that contains value v as elete. eletenoe(v, U) fails if v oes not exist in the tree or the leaf containing v is alreay mark as elete. To avoi conflicting insert an elete operations, insertnoe an eletenoe operations are using the single wor CAS (Compare an Swap) an leaf-checking to coorinate between the operations (cf. Figure 28, lines 51 an 54 for elete, an 10 an 19 for insert).

D2.2: White-box methoologies, programming abstractions an libraries 64 REBALANCING (a) T 1 T 1 T 1 a x InsertNoe(v,U)

Figure 29: (a)rebalancing, (b)expaning, an (c)merging operations on Tree. Maintenance functions.

marke noes an as a safeguar to maintain Tree s small height. Function rebalance(t v.root) (cf.

Figure 29a illustrates the rebalance operation.

All leaves heights are then set to log N + 1, where N is the number of leaves (e.g., a, v, x, z in Figure 29a).

64 D2.2: White-box methoologies, programming abstractions an libraries 64 REBALANCING (a) T 1 T 1 T 1 a x InsertNoe(v,U) Rebalance(T 1 ) a z x' z a v x z v x (b) EXPANDING InsertNoe(x,U) v x (c) T 1 MERGING eletenoe(x,u) a b v x a b T 3 v Figure 29: (a)rebalancing, (b)expaning, an (c)merging operations on Tree. Maintenance functions. Other than the basic functions, Tree has tree maintenance functions that are invoke in certain cases of inserting an eleting a noe from the tree. These functions, namely rebalance, expan an merge, are the unique feature of Tree that acts as a garbage collectors for marke noes an as a safeguar to maintain Tree s small height. Function rebalance(t v.root) (cf. Figure 28 line 42) is responsible for rebalancing a Noe after an insertion. Figure 29a illustrates the rebalance operation. If a new noe v is to be inserte at the last level H of Noe T 1, the Noe is rebalance to a complete BST. All leaves heights are then set to log N + 1, where N is the number of leaves (e.g., a, v, x, z in Figure 29a). In the same Figure 29a, after rebalancing, tree T 1 height reuces from 4 to 3. We also efine the expan(l) function (cf. Figure 28 line 42) that is responsible for creating a new Noe an connecting it to the parent of a leaf noe l (cf. Figure 29b). Expaning is triggere only if 1) after insertnoe(v, U), leaf that contains v will be at the last level of a Noe; an 2) rebalancing will no longer reuce the current height of Noe

GreenBST: Energy-Efficient Concurrent Search Tree

GreenBST: Energy-Efficient Concurrent Search Tree Ibrahim Umar, Otto J. Anshus, Phuong H. Ha Arctic Green Computing Lab Department of Computer Science UiT The Arctic University of Norway Outline of the