EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

1 EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda
{shi.876, lu.932,
The Ohio State University
Bench 2018

2 Outline
- Introduction
- EC-Bench
- Evaluation
- Conclusion and Future Work

3 Fault Tolerance & Replication
- Fault tolerance is a critical requirement for distributed storage systems
  - Availability: data is still accessible under failures
  - Durability: no data corruption under failures
- The traditional storage scheme to achieve fault tolerance is replication; a common replication factor is 3
- [Figure: replication factor = 3; one original data block D plus two replica blocks maintained for fault tolerance]

4 Booming of Data
- [Chart: data stored in exabytes, growing at a 36% CAGR to 1,327 EB. Source: Cisco Global Cloud Index]
- The booming of data makes the storage overhead of replication considerable

5 Erasure Coding
- Erasure coding is a promising redundant storage scheme
  - Minimizes storage overhead by applying erasure encoding on data
  - Delivers higher reliability than replication at the same storage overhead
- Employed by Google, Microsoft, Facebook, Baidu, etc.
  - Microsoft Azure reduces storage overhead from 3x (3-way replication) to 1.33x (erasure coding)
- [Figure: k = 4 data chunks D1-D4 plus m = 2 parity chunks P1, P2; parity is the extra data maintained for fault tolerance]

6 Replication vs. Erasure Coding
- Replication (replication factor = 3): storage overhead = 2/1 = 2
- Erasure coding (k = 4, m = 2): storage overhead = 2/4 = 0.5
- [Figure: original block D with two replicas vs. data chunks D1-D4 with parity chunks P1, P2]

7 Erasure Coding: Encoding
- Takes in k data chunks and generates m parity chunks
- Distributes the (k + m) chunks to (k + m) independent nodes
- [Figure: k = 6 data chunks D1-D6 encoded into m = 3 parity chunks P1-P3]
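
To make the encoding step concrete, here is a minimal Reed-Solomon encode sketch using Intel ISA-L, one of the onload coders evaluated later. The RS(6, 3) layout, chunk size, and Cauchy matrix choice are illustrative assumptions, not EC-Bench's exact harness.

    /* RS(6, 3) encode with Intel ISA-L (illustrative sketch).
       Build with: gcc encode.c -lisal */
    #include <isa-l/erasure_code.h>
    #include <stdlib.h>
    #include <string.h>

    #define K 6               /* data chunks */
    #define M 3               /* parity chunks */
    #define CHUNK (64 * 1024) /* chunk size c */

    int main(void) {
        unsigned char *frag[K + M];
        for (int i = 0; i < K + M; i++)
            frag[i] = malloc(CHUNK);
        for (int i = 0; i < K; i++)
            memset(frag[i], i + 1, CHUNK); /* dummy data chunks D1..D6 */

        /* Generate a (K+M) x K Cauchy generator matrix and expand its
           parity rows into the GF(2^8) tables ec_encode_data expects. */
        unsigned char matrix[(K + M) * K], gftbls[K * M * 32];
        gf_gen_cauchy1_matrix(matrix, K + M, K);
        ec_init_tables(K, M, &matrix[K * K], gftbls);

        /* Encode: read the K data fragments, fill the M parity fragments. */
        ec_encode_data(CHUNK, K, M, gftbls, frag, &frag[K]);

        for (int i = 0; i < K + M; i++)
            free(frag[i]);
        return 0;
    }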

8 Erasure Coding: Decoding
- Any k of the (k + m) chunks are sufficient to recover the original k data chunks
- [Figure: k = 6, m = 3; read any 6 surviving chunks of D1-D6, P1-P3 and decode back to D1-D6]
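
Decoding in such libraries is just another matrix multiplication: invert the generator-matrix rows of the k surviving chunks, then "encode" with the inverted coefficients. Below is a minimal sketch continuing the ISA-L example above, restricted for brevity to the case where only data chunks are lost; the function name and argument layout are my assumptions, and error handling is omitted.

    /* Recover lost data chunks from any K survivors (ISA-L, sketch).
       matrix[] and frag[] are laid out as in the encode sketch above. */
    #include <isa-l/erasure_code.h>
    #include <string.h>

    #define K 6
    #define M 3

    /* src[]: indices of the K surviving fragments; err[]: indices of the
       nerrs (<= M) lost data fragments, recovered in place in frag[]. */
    void recover(unsigned char *matrix, unsigned char **frag,
                 const int *src, const int *err, int nerrs, int chunk_len) {
        unsigned char b[K * K], inv[K * K], dec[M * K], gftbls[K * M * 32];
        unsigned char *survivors[K], *recovered[M];

        /* Collect the generator-matrix rows of the K survivors... */
        for (int i = 0; i < K; i++) {
            memcpy(&b[K * i], &matrix[K * src[i]], K);
            survivors[i] = frag[src[i]];
        }
        /* ...invert them, then pick out the rows for the lost data chunks. */
        gf_invert_matrix(b, inv, K);
        for (int r = 0; r < nerrs; r++) {
            memcpy(&dec[K * r], &inv[K * err[r]], K);
            recovered[r] = frag[err[r]];
        }
        /* Decoding is an encode with the inverted coefficients. */
        ec_init_tables(K, nerrs, dec, gftbls);
        ec_encode_data(chunk_len, K, nerrs, gftbls, survivors, recovered);
    }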

9 Erasure Coding
- Pros: low storage overhead with high fault tolerance
- Cons: high computation overhead introduced by encode and decode
- High-performance hardware-optimized erasure coding libraries alleviate the computation overhead:
  - Intel CPUs -> Intel ISA-L
  - Nvidia GPUs -> Gibraltar
  - Mellanox InfiniBand -> Mellanox-EC

10 Onload and Offload Erasure Coders
- Onload erasure coders: erasure coding operations are performed by host processors (CPU)
  - Jerasure and Intel ISA-L
- Offload erasure coders: erasure coding operations are conducted by accelerators such as GPUs and Mellanox InfiniBand HCAs
  - Gibraltar (GPU) and Mellanox-EC (IB HCA)

11 Our Contributions
- EC-Bench: a unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders
  - Performance: throughput and normalized throughput
  - Resource utilization: CPU and cache
  - Heterogeneity: onload vs. offload
- Evaluations of four popular open-source erasure coders with EC-Bench
  - Onload: Jerasure and Intel ISA-L
  - Offload: Gibraltar and Mellanox-EC

12 Outline
- Introduction
- EC-Bench
- Evaluation
- Conclusion and Future Work

13 EC-Bench: Encoding Benchmark
- Step 1, Read: load data chunks D1-D6 from a big file into the data buffer (k = 6)
- Step 2, Encode: generate parity chunks P1-P3 into the coding buffer (m = 3)

14 EC-Bench: Decoding Benchmark
- Step 1, Read: load a big file into the data buffer
- Step 2, Encode: produce chunks D1-D6 and P1-P3 (k = 6, m = 3) in the coding buffer
- Step 3, Decode: recover the original data chunks from k surviving chunks

15 EC-Bench: Parameter Space
- k: the number of data chunks (D1, D2, ..., Dk)
- m: the number of parity chunks (P1, ..., Pm)
- c: the size of each chunk
- A sweep over (k, m, c) drives the benchmarks, as sketched below
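
Sweeping this (k, m, c) space is the heart of the benchmark driver. The sketch below shows one plausible shape for such a loop; encode_once is a hypothetical stand-in (a plain XOR coder) for whichever library is under test, and the configurations and chunk sizes mirror the ones evaluated later.

    /* Parameter sweep sketch: time encodes over (k, m, c), report MB/sec. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Hypothetical stand-in for the coder under test: XOR all k data chunks
       into each parity chunk. Replace with Jerasure/ISA-L/... calls. */
    static void encode_once(int k, int m, size_t c, unsigned char **frag) {
        for (int p = 0; p < m; p++) {
            memset(frag[k + p], 0, c);
            for (int d = 0; d < k; d++)
                for (size_t i = 0; i < c; i++)
                    frag[k + p][i] ^= frag[d][i];
        }
    }

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        int ks[] = {3, 6, 10, 17}, ms[] = {2, 3, 4, 3}; /* RS(k, m) configs */
        for (int i = 0; i < 4; i++)
            for (size_t c = 512; c <= (32u << 20); c <<= 4) { /* 512 B..32 MB */
                int n = ks[i] + ms[i], iters = 10;
                unsigned char *frag[17 + 4];
                for (int j = 0; j < n; j++) {
                    frag[j] = malloc(c);
                    memset(frag[j], j, c);
                }
                double t0 = now_sec();
                for (int it = 0; it < iters; it++)
                    encode_once(ks[i], ms[i], c, frag);
                double t = (now_sec() - t0) / iters;
                printf("RS(%d,%d) c=%zu: %.1f MB/sec\n", ks[i], ms[i], c,
                       n * c / t / (1 << 20)); /* Thr = (k + m) * c / t */
                for (int j = 0; j < n; j++)
                    free(frag[j]);
            }
        return 0;
    }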

16 EC-Bench: Metrics
- Throughput: Thr = (k + m) * c / t
- Normalized throughput: enables performance comparisons across different configurations
  - Previous studies have demonstrated that optimal erasure codes take (k - 1) XOR operations to generate one parity byte
  - Thr_norm = (k - 1) * m * c / t = ((k - 1) * m / (k + m)) * Thr
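
For reference, the two metrics written out as plain helper functions (the names are mine; c is in bytes, t in seconds, results in bytes/sec):

    #include <stddef.h>

    /* Thr = (k + m) * c / t: raw bytes processed per second. */
    double throughput(int k, int m, size_t c, double t) {
        return (double)(k + m) * c / t;
    }

    /* Thr_norm = (k - 1) * m * c / t: optimal codes need (k - 1) XORs per
       parity byte, so normalize by useful coding work, not bytes moved. */
    double normalized_throughput(int k, int m, size_t c, double t) {
        return (double)(k - 1) * m * c / t;
    }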

17 EC-Bench: Metrics
- CPU Utilization = CPU cycles / t
- Cache Pressure = L1 cache misses / t
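
On Linux, both counters can be collected with perf_event_open around the coding loop. A minimal sketch of that approach follows; it is one way to reproduce these metrics, not necessarily how EC-Bench itself collects them, and error handling is omitted.

    /* Count CPU cycles and L1D read misses around a workload (Linux). */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    static int open_counter(uint32_t type, uint64_t config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = type;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int cyc = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
        int l1m = open_counter(PERF_TYPE_HW_CACHE,
                               PERF_COUNT_HW_CACHE_L1D |
                               (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                               (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(l1m, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the encode/decode loop for t seconds here ... */

        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(l1m, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t cycles = 0, misses = 0;
        read(cyc, &cycles, sizeof(cycles));
        read(l1m, &misses, sizeof(misses));
        /* Divide each by elapsed time t for cycles/sec and misses/sec. */
        printf("cycles=%lu l1d_misses=%lu\n",
               (unsigned long)cycles, (unsigned long)misses);
        return 0;
    }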

18 Outline
- Introduction
- EC-Bench
- Evaluation
- Conclusion and Future Work

19 Evaluation: Open-Source Libraries
- Jerasure (CPU): a CPU-based library released in 2007 that supports a wide variety of erasure codes. Compiled without SSE support.
- Intel ISA-L (CPU with SSE/AVX): the Intel Intelligent Storage Acceleration Library, a collection of optimized low-level functions including erasure coding. The erasure coding functions are optimized for Intel instructions, such as Intel SSE, vector, and encryption instructions.
- Gibraltar (GPU): a GPU-based library for Reed-Solomon coding.
- Mellanox-EC (IB NIC with EC offload): an HCA-based library for Reed-Solomon coding, proposed by Mellanox.

20 Evaluation: Experimental Setup
- 2.4 GHz Intel(R) Xeon(R) CPU E5-2680 v4
  - 28 cores, 32 KB L1 cache, 256 KB L2 cache, and 35 MB L3 cache
- 128 GB DRAM
- Nvidia K80 GPU
- Mellanox ConnectX-5 IB-EDR (100 Gbps) NIC
- Explored configurations RS(k, m): RS(3,2), RS(6,3), RS(10,4), and RS(17,3)

21 Evaluation: Experimental Results
[Chart: throughput and normalized throughput (MB/sec) vs. chunk size (512 B to 32 MB) for Jerasure, ISA-L, Mellanox-EC, and Gibraltar. Caption: Throughput Performance with Varied Chunk Sizes for RS(3, 2)]

22 Evaluation: Experimental Results
[Chart: same throughput-vs.-chunk-size results for RS(3, 2) as slide 21]
- For small chunk sizes (< 32 B), Jerasure performs better than Intel ISA-L

23 Evaluation: Experimental Results
[Chart: same throughput-vs.-chunk-size results for RS(3, 2), annotated at chunk size = 2 KB]
- For both onload coders, the best chunk size is 2 KB

24 Evaluation: Experimental Results
[Chart: same throughput-vs.-chunk-size results for RS(3, 2), annotated at chunk size = 512 KB]
- For both offload coders, the best chunk size is 512 KB, because of the overhead of transferring data to the accelerator

25 Evaluation: Experimental Results
[Chart: normalized throughput (MB/sec) of onload coders at 2 KB chunks and offload coders at 512 KB chunks, for RS(3, 2), RS(6, 3), RS(10, 4), and RS(17, 3). Caption: Normalized Throughput Performance of Onload and Offload Coders]
- RS(10, 4) is the best configuration for Intel ISA-L and Gibraltar
- Mellanox-EC performs best with configuration RS(17, 3)
- Jerasure with RS(3, 2) slightly outperforms its other configurations

26 Evaluation: Experimental Results
[Chart: CPU utilization (million cycles/sec) vs. chunk size (1 KB to 64 MB) for Jerasure, ISA-L, Mellanox-EC, and Gibraltar. Caption: CPU Utilization with Varied Chunk Sizes for RS(6, 3)]

27 Evaluation: Experimental Results
[Chart: same CPU-utilization results for RS(6, 3), annotated at chunk size = 32 B]
- Intel ISA-L uses different approaches for chunks < 32 bytes and >= 32 bytes (details in the function ec_encode_data_avx2)

28 Evaluation: Experimental Results
[Chart: same CPU-utilization results for RS(6, 3)]
- Offload coders make better use of CPU cycles and thus have less impact on applications' computation
  - With chunk size = 64 MB, Mellanox-EC consumes 0.41 million cycles/sec, far fewer than ISA-L

29 Evaluation: Experimental Results
[Chart: L1 cache misses (millions/sec) vs. chunk size (1 KB to 64 MB) for Jerasure, ISA-L, Mellanox-EC, and Gibraltar. Caption: Cache Pressure with Varied Chunk Sizes for RS(10, 4)]

30 Evaluation: Experimental Results
[Chart: same cache-pressure results for RS(10, 4)]
- Among the onload coders, Intel ISA-L makes better use of the L1 cache than Jerasure

31 Evaluation: Experimental Results
[Chart: same cache-pressure results for RS(10, 4)]
- In general, offload coders put less pressure on the L1 cache and thus have less impact on applications' cache usage

32 Outline
- Introduction
- EC-Bench
- Evaluation
- Conclusion and Future Work

33 Conclusion and Future Work
- EC-Bench: a unified benchmark suite to benchmark, measure, and characterize onload and offload erasure coders
- Evaluations of four popular open-source erasure coders with EC-Bench
  - Advanced onload coders outperform offload coders
  - Offload coders make better use of CPU cycles and cache
- Future work: support more hardware-optimized erasure coders, e.g., FPGA-optimized erasure coders

34 Thank You!
{shi.876, lu.932,
Network-Based Computing Laboratory
The High-Performance Big Data Project
