Disk Cache-Aware Task Scheduling


Disk Cache-Aware Task Scheduling for Data-Intensive and Many-Task Workflow
Masahiro Tanaka and Osamu Tatebe, University of Tsukuba / JST CREST, Japan Science and Technology Agency
IEEE Cluster 2014, 2014-09-24

Outline
Introduction
Workflow Scheduling for Data-Intensive and Many-Task Computing
Disk Cache-Aware Task Scheduling
Proposed Methods: (1) LIFO + HRF, (2) Rank Equalization + HRF
Evaluation: (1) I/O-only workflow, (2) Montage astronomy workflow
Related Work
Conclusion

Background
Many-Task Computing (MTC): Raicu et al. (MTAGS 2008); task throughput for 10^3 to 10^6 tasks.
Pwrake: Parallel Workflow extension to Rake (Rake = Ruby version of make). Target: data-intensive and many-task workflows.
Gfarm distributed file system: scalable I/O performance; uses the local storage of compute nodes.

Objective: I/O-Aware Task Scheduling
Issues: file locality (our previous work); the disk cache (buffer/page cache), addressed in this work.
[Bar chart: read performance of a Gfarm file (HDD, GbE). Reading a local file from the disk cache reaches 592 MB/s; the other paths (local disk, remote cache, remote disk) range from 39 to 71 MB/s.]

Architecture of Pwrake
[Diagram: the Pwrake process on the master node holds the task graph and a task queue; worker threads dequeue tasks and dispatch them via SSH to processes on worker nodes, which read and write files through Gfarm.]

Task Queue of Pwrake
[Diagram: the TaskQueue is composed of per-node NodeQueues (Node 1, Node 2, Node 3, ...) plus a queue for other nodes; tasks are enqueued to the NodeQueues of their candidate nodes and dequeued by worker threads.]

Methods to Define Candidate Nodes
1. Based on the location of the input file: 47% local access.
2. MCGP (Multi-Constraint Graph Partitioning), our previous work (CCGrid 2012), using the METIS library: 88% local access.

Disk Cache-Aware Task Scheduling
A file saved later has a higher probability of still being cached, so the order of task execution affects the disk cache hit rate. The following slides show the behavior of FIFO and LIFO queues.

FIFO Behavior
[Animation: a workflow DAG in which tasks A1..A6 each write a file (file1..file6) that is read by the corresponding task B1..B6, executed on a node with two cores. With a FIFO queue, A1..A6 all run before B1..B6; by the time B1 and B2 start, file1 and file2 have already been evicted from the disk cache, so the B tasks read their inputs from disk.]
With FIFO order, the average distance between task Ai and task Bi is approximately (# of A tasks) / (# of cores) tasks, so the files written by A tasks are likely evicted before the corresponding B tasks read them.

LIFO Behavior
[Animation: the same workflow DAG and two-core node with a LIFO queue. A6 and A5 run first; B6 and B5 are pushed onto the queue as they become ready and run next, reading file5 and file6 while they are still in the disk cache; the pattern repeats down to A1 and B1.]
With LIFO order, the average distance between task Ai and task Bi is approximately 1 task, so B tasks read their inputs from the disk cache.

Trailing Task Problem
Armstrong et al. (MTAGS 2010): idle cores in the final phase of a workflow.
With FIFO, the trailing work is at most one Task B; with LIFO, it can be a Task A plus a Task B, so FIFO has the advantage for the trailing task problem.
[Diagram: FIFO ends with ... Bn-1, Bn and a short idle period; LIFO ends with An-1, Bn-1 / An, Bn and a longer idle period.]

FIFO for the Trailing Task Problem?
FIFO is better than LIFO for the trailing task problem, but FIFO has a weak point: C4 and D4 can become trailing tasks, so C4 and B4 should have higher priority than C1-C3 and B1-B3.
[Workflow DAG: tasks B1-B4, with an extended chain B4 -> C4 -> D4 leading to the final task E.]

Highest Rank First (HRF)
Rank: distance from the last task; the same as the upward rank in the HEFT algorithm, except with uniform task cost.
[Diagram: tasks A1..An sit at a higher rank than B1..Bn, with the final task at rank 0; priority decreases with rank. FIFO corresponds to a downward rank.]
HRF gives priority to higher-rank tasks: it solves the trailing task problem, but is bad for the disk cache.
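As a sketch (not Pwrake's actual code), the uniform-cost upward rank can be computed by walking the DAG from each task to the exit task; the task names and the final task E below are illustrative:

```ruby
# Sketch of the rank used by HRF: the upward rank of HEFT with uniform
# task cost, i.e. the longest distance (in tasks) to the final task.
# The DAG is a Hash from task name to the list of successor tasks.
def upward_rank(dag)
  rank = {}
  compute = lambda do |task|
    rank[task] ||=
      if dag.fetch(task, []).empty?
        0                                    # the last task has rank 0
      else
        1 + dag[task].map { |s| compute.call(s) }.max
      end
  end
  dag.each_key { |t| compute.call(t) }
  rank
end

# Two-stage example as in the slides: A tasks feed B tasks, which feed
# a single final task E (E is illustrative, not from the paper).
dag = {
  "A1" => ["B1"], "A2" => ["B2"],
  "B1" => ["E"],  "B2" => ["E"],
  "E"  => []
}
ranks = upward_rank(dag)
# HRF dequeues the highest rank first: A tasks (2) before B tasks (1).
```
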

Proposed Methods
(1) LIFO + HRF
(2) Rank Equalization + HRF

Proposed Method (1): LIFO+HRF
Nc: # of cores per node; Nr: # of tasks in the highest rank.
Algorithm: use LIFO while Nr > Nc; switch to HRF when Nr <= Nc.
[Diagram: the bulk of the workflow is executed in LIFO order; the final tasks An-2..An, Bn-1, Bn are scheduled by HRF.]
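A minimal sketch of the switching rule, using assumed data structures rather than Pwrake's real queue API:

```ruby
# LIFO+HRF dequeue rule: while the highest rank still holds more tasks
# than the cores of a node (Nr > Nc), pop in LIFO order; once Nr <= Nc,
# pick a highest-rank task (HRF) to avoid trailing tasks.
Task = Struct.new(:name, :rank)

def deq_lifo_hrf(queue, n_cores)
  return nil if queue.empty?
  top_rank = queue.map(&:rank).max
  n_r = queue.count { |t| t.rank == top_rank }
  if n_r > n_cores
    queue.pop                      # LIFO: most recently enqueued task
  else
    i = queue.rindex { |t| t.rank == top_rank }
    queue.delete_at(i)             # HRF: a task from the highest rank
  end
end

q = [Task.new("A1", 1), Task.new("B1", 0), Task.new("B2", 0)]
t = deq_lifo_hrf(q, 2)   # Nr = 1 <= Nc = 2, so HRF picks A1 over B2
```
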

Proposed Method (2): Rank Equalization+HRF
Purpose: overlap compute and I/O. Task A is compute-intensive; Task B is I/O-intensive.
[Diagram: Rank Equalization interleaves A and B tasks through most of the workflow; HRF handles the final tasks.]

Rank Equalization
While Nr > Nc: enqueue tasks into a RankQueue per rank; dequeue by selecting a RankQueue with frequency w[r] = 1 / (average execution time of rank r), then dequeue a task from it with LIFO policy.
When Nr <= Nc: HRF.
[Diagram: a NodeQueue composed of RankQueues Rank1 and Rank2 with weights w[1] and w[2]; timeline showing Task B (length t1) and Task A (length t2) invoked at 1/t1 and 1/t2 tasks/sec respectively.]
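A sketch of the weighted selection (hypothetical structure; the real Pwrake implementation differs): each rank is chosen with probability proportional to w[r] = 1/(average task time), so ranks with long tasks are started less often and the ranks drain at matched rates:

```ruby
# Rank Equalization sketch: weight each RankQueue by the reciprocal of
# its average task execution time, then pick a rank with probability
# proportional to its weight (invocation frequency ~ 1/t per rank).
def rank_weights(avg_time)
  avg_time.transform_values { |t| 1.0 / t }
end

def pick_rank(weights)
  total = weights.values.sum
  x = rand * total
  weights.each do |r, w|
    x -= w
    return r if x <= 0
  end
  weights.keys.last
end

# Rank 1 tasks take 2 s on average, rank 2 tasks 1 s, so rank 2 is
# dequeued about twice as often as rank 1.
w = rank_weights(1 => 2.0, 2 => 1.0)
chosen = pick_rank(w)
```
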

Summary of Scheduling Methods
[Table comparing FIFO, HRF, LIFO, Rank Eq. (w/ LIFO), LIFO+HRF, and Rank Eq.+HRF on three criteria: disk cache utilization, the trailing task problem, and compute/I/O rank overlap.]

Performance Evaluation

Evaluation Environment
Cluster: InTrigger, Tohoku site
CPU: Intel Xeon E5410 2.33 GHz
Main memory: 32 GiB
# of cores/node: 8
Max # of compute nodes: 12
Network: 1 Gb Ethernet
OS: Debian 5.0.4
Gfarm 2.5.8.6, Ruby 2.1.1, Pwrake 0.9.9.1

Evaluation 1: Copyfile Workload
Copyfile: an I/O-intensive workload that loads an input file into main memory and writes it to an output file.
[Workflow DAG: A1..An and B1..Bn, where every task is the Copyfile program.]
Scheduling: FIFO and LIFO (no +HRF, because of the 1-core experiment).
Input files: n = 100; each file is 3 GiB (< 32 GiB memory); total size 300 GiB (> 32 GiB).
Used nodes: 10 nodes, 1 core/node.

Elapsed Time of the Copyfile Workflow
[Bar chart: measured and estimated elapsed times for FIFO and LIFO; LIFO gives a ~30% speedup over FIFO.]
Estimated time = I/R + O/W, where I and O are the input and output file sizes, and R and W are the read and write bandwidths (which depend on the access target).
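The cost model can be sketched as follows. The 592 MB/s and 71 MB/s read bandwidths come from the earlier measurement slide; the 70 MB/s write bandwidth is an assumed placeholder, not a number from the paper:

```ruby
# The slide's cost model: estimated time = I/R + O/W, where I and O are
# the input/output sizes and R and W the read/write bandwidths.
MB  = 1000.0**2
GiB = 1024.0**3

def estimated_time(input_size, output_size, read_bw, write_bw)
  input_size / read_bw + output_size / write_bw
end

size = 3 * GiB                       # one Copyfile input/output file
# Read from the disk cache (592 MB/s) vs. from local disk (71 MB/s);
# the write bandwidth (70 MB/s) is an assumed illustrative value.
t_cached = estimated_time(size, size, 592 * MB, 70 * MB)
t_disk   = estimated_time(size, size,  71 * MB, 70 * MB)
# Cache hits on the read side shrink the I/R term substantially.
```
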

Evaluation 2: Montage Workflow
Astronomy image processing. [DAG: mprojectpp -> mdiff+mfitplane -> mbgmodel -> mbackground -> madd -> mshrink -> madd -> mjpeg.]
Number of cores: 96 (12 nodes x 8 cores).
Input: SDSS DR7; 421 input files, 2.52 MB each, 1061 MB total.
Intermediate files: 4720, totaling 63.5 GB.
# of tasks: 2707.
Output file size of mprojectpp: 1.6 GB.

Measurement of Strong Scaling (1-12 nodes, logarithmic)
[Log-scale plot: elapsed time vs. number of cores, with a 1/n reference line. At 1 node (8 cores), LIFO-based scheduling is 1.9x faster than FIFO. The 4-12-node range (32-96 cores) is shown on the next slide.]

Measurement of Strong Scaling (4-12 nodes, linear)
[Linear-scale plot: 4 nodes (32 cores) to 12 nodes (96 cores), with a 1/n reference line. At 12 nodes, the proposed methods achieve a 12% speedup over LIFO by avoiding the trailing task problem.]

FIFO-Order Task Scheduling
[Gantt chart: mprojectpp and mdiff+mfitplane run as sequential phases.]

LIFO-Order Task Scheduling
[Gantt chart: trailing tasks appear at the end of the schedule.]

LIFO+HRF
[Gantt chart: no trailing tasks.]

LIFO+Rank Equalization
[Gantt chart.]

Result of the 12-Node Experiment
LIFO is the worst due to the trailing task problem; the other methods are comparable.
Why is FIFO good here? The data size per node is smaller than the cache size, so even FIFO benefits from the cache.
Why is Rank Equalization not better? In the LIFO+HRF case, mdiff stays below 50% core utilization, so mdiff always overlaps with mprojectpp anyway.

LIFO+HRF (non-MCGP)
[Gantt chart: mdiff utilization exceeds 50%; no overlap with mprojectpp.]

LIFO+Rank Equalization (non-MCGP)
[Gantt chart: mdiff overlaps mprojectpp.] Elapsed time: 134.4 s -> 128.7 s (4.4% speedup).

Related Work 1: Workflow Systems
Swift + Falkon + data diffusion: Swift (Wilde et al. 2011) is a workflow language; Falkon (Raicu et al. 2007) provides high task throughput (1500 tasks/sec); data diffusion (Raicu et al. 2008) combines staging file management with task scheduling.
GXP make (Taura et al. 2013): a workflow system based on GNU make; it dispatches the tasks invoked by GNU make.

Related Work 2: Workflow Scheduling
Shankar and DeWitt (HPDC 2007) studied DAG-based, data-aware workflow scheduling for the Condor system, focusing on cached data on a local disk in order to avoid data movement.
Armstrong et al. (MTAGS 2010) discussed the trailing task problem and proposed a tail-chopping approach.

Conclusion
Objective: disk cache-aware task scheduling for data-intensive and many-task workflows.
Proposed methods: LIFO+HRF and Rank Equalization+HRF.
Evaluation: Copyfile workflow: LIFO gives a ~30% speedup. Montage workflow: LIFO gives a 1.9x speedup; HRF a ~12% speedup; Rank Equalization a ~4% speedup.

Backup slides

Cache Behavior of FIFO and LIFO
[Diagram: core allocation over time for the two-stage workflow A1..An -> B1..Bn on two cores, FIFO vs. LIFO.]
Average Ai-Bi difference: FIFO: n/2 tasks; LIFO: 1 task. LIFO is good for cache utilization.
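The n/2-vs-1 claim can be checked with a small script. This is a sketch: the LIFO order below is one plausible two-core schedule, not one traced from Pwrake:

```ruby
# Average index distance between producer Ai and consumer Bi, measured
# in time slots on n_cores cores (n_cores tasks share each slot).
def avg_distance(order, n_cores)
  slot = {}
  order.each_with_index { |t, i| slot[t] = i / n_cores }
  n = order.count { |t| t.start_with?("A") }
  (1..n).map { |i| slot["B#{i}"] - slot["A#{i}"] }.sum / n.to_f
end

n = 6
fifo = (1..n).map { |i| "A#{i}" } + (1..n).map { |i| "B#{i}" }
# One plausible LIFO schedule on 2 cores: each pair of A tasks is
# immediately followed by its pair of B tasks.
lifo = %w[A6 A5 B6 B5 A4 A3 B4 B3 A2 A1 B2 B1]

avg_distance(fifo, 2)   # => 3.0, i.e. n / n_cores slots
avg_distance(lifo, 2)   # => 1.0, independent of n
```
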

Elapsed Time and Core Utilization (12 nodes)
t_elap = elapsed time of the workflow; t_cum = summed execution time of all tasks; core utilization = t_cum / (t_elap x n_cores).
[Bar charts for the four scheduling methods: elapsed times of 94.3, 104.3, 93.3, and 93.6 s; core utilizations of 87.1%, 84.0%, 87.6%, and 87.8%.]
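The utilization definition on this slide can be applied directly; the t_cum value below is back-derived from the slide's 94.3 s / 87.1% pair rather than measured:

```ruby
# Core utilization = t_cum / (t_elap * n_cores), as defined on the slide.
def core_utilization(t_cum, t_elap, n_cores)
  t_cum / (t_elap * n_cores)
end

n_cores = 96                          # 12 nodes x 8 cores
t_elap  = 94.3                        # one elapsed time from the slide
t_cum   = 0.871 * t_elap * n_cores    # implied by the 87.1% utilization
u = core_utilization(t_cum, t_elap, n_cores)   # recovers ~0.871
```
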