Affordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors

Mitchell A Cox, Robert Reed and Bruce Mellado
School of Physics, University of the Witwatersrand, 1 Jan Smuts Avenue, Braamfontein, Johannesburg, South Africa, 2000
E-mail: mitchell.cox@students.wits.ac.za

Abstract. Projects such as the Large Hadron Collider at CERN generate enormous amounts of raw data which presents a serious computing challenge. After planned upgrades in 2022, the data output from the ATLAS Tile Calorimeter will increase by 200 times to over 40 Tb/s. ARM System on Chips are common in mobile devices due to their low cost, low energy consumption and high performance, and may be an affordable alternative to standard x86-based servers where massive parallelism is required. The High Performance Linpack and CoreMark benchmark applications are used to test the CPU performance of ARM Cortex-A7, A9 and A15 System on Chips while their power consumption is measured. In addition to synthetic benchmarking, the FFTW library is used to test the single precision Fast Fourier Transform (FFT) performance of the ARM processors, and the results obtained are converted to theoretical data throughputs for a range of FFT lengths. These results can be used to assist with specifying ARM rather than x86-based compute farms for upgrades and upcoming scientific projects.

1. Introduction
Projects such as the Large Hadron Collider (LHC) generate enormous amounts of raw data which presents a serious computing challenge. After planned upgrades in 2022, the data output from the ATLAS Tile Calorimeter will increase by 200 times to over 41 Tb/s (Terabits/s) [1]. It is not feasible to store this data for offline computation. A paradigm shift is necessary to deal with these future workloads, and the cost, energy efficiency, processing performance and I/O throughput of the computing system used to achieve this task are vitally important to the success of future big science projects. Current x86-based microprocessors, such as those commonly found in personal computers and servers, are biased towards processing performance rather than I/O throughput and are therefore less suitable for high data throughput applications, otherwise known as Data Stream Computing [2].
ARM System on Chips (SoCs) are found in almost all mobile devices due to their low energy consumption, high performance and low cost [3]. The modern ARM Cortex-A range of processors have 32- and 64-bit cores and clock speeds of up to 2.5 GHz, making them potential alternatives to common x86 CPUs for scientific computing. This paper presents benchmarks of three common ARM Cortex CPUs, namely the Cortex-A7, A9 and A15, with more information on these platforms in Table 1. The benchmark results are useful for specifying an ARM-based system in new scientific projects such as ATLAS read-out and trigger system upgrades.

A brief discussion of the ATLAS Triggering and Data Acquisition System (TDAQ) and where ARM processors may potentially be used is presented in Section 2. CPU benchmark results are given in Section 3. Fast Fourier Transform (FFT) benchmarks are in Section 4. Section 5 concludes with a brief discussion of future work.

Table 1: Specifications and other details of the ARM platforms used.

                        Cortex-A7         Cortex-A9         Cortex-A15
Platform                Cubieboard A20    Wandboard Quad    ODROID-XU+E
SoC                     Allwinner A20     Freescale i.MX6Q  Samsung Exynos 5410
Cores                   2                 4                 4 (+4 Cortex-A7)
Max. CPU Clock (MHz)    1008              996               1600
L2 Cache (kB)           256               1024              2048
Floating Point Unit     VFPv4 + NEONv2    VFPv3 + NEON      VFPv4 + NEONv2
RAM (MB)                1024              2048              2048
RAM Type                432 MHz 32 bit    528 MHz 64 bit    800 MHz 64 bit
2014 Retail (USD)       65                129               169
Linux Kernel            3.4.61            3.10.17           3.4.5
GCC                     4.7.1             4.7.3             4.7.3

2. ATLAS Triggering and Data Acquisition System
The ATLAS experiment is composed of several sub-detectors, each of which has separate data processing requirements. The massive amount of raw data is reduced by a process called triggering. In the Tile Calorimeter, there are currently three main levels of triggering, shown in Figure 1. The read-out system is based on FPGAs (Field Programmable Gate Arrays) and DSPs (Digital Signal Processors) which form the level one trigger, serving to reduce the data rate (event rate) from 40 MHz to about 100 kHz. Each ATLAS event consists of about 1.5 MB of data, of which a portion is made up from TileCal data. Some components of the read-out subsystems will be replaced by the Super Read Out Driver (superROD or sROD) in the 2022 upgrade [1]. An ARM-based Processing Unit (PU) is in development at the University of the Witwatersrand, Johannesburg, to complement the sROD with higher level processing tasks on the raw data, before it has been reduced by the triggering system.
The level two and three triggers (LVL2 and Event Filter) are implemented with compute clusters. Data from the level one trigger system is fed to the Read Out System (ROS) which passes the data to the Event Builder and LVL2 trigger at a rate of about 120 GB/s [4].

2.1. Level Two Trigger, Event Builder and Event Filter
The LVL2 filter only works on a portion of the data to determine whether it is interesting; if the portion is interesting then all of the associated data is let through to the Event Builder. The LVL2 cluster is built from approximately 500 machines, each of which has two 2.5 GHz quad-core Intel Harpertown CPUs, two 1 Gb/s Ethernet interfaces and 16 GB RAM [5]. This CPU achieves about 59 000 CoreMarks. The Event Builder consists of about 100 rack-mounted servers, each of which has two 2.6 GHz AMD Opteron 252 CPUs with 1 Gb/s Ethernet. According to the CoreMark online database, the AMD Opteron 254 CPU, which is a dual core variant of the Opteron 252 used in the Event Builder, achieves about 13700 CoreMarks [6].
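The quoted rates follow directly from the event size and trigger rates above. As a rough cross-check (illustrative arithmetic, not taken from the paper), the naive product 100 kHz x 1.5 MB is about 150 GB/s, the same order of magnitude as the roughly 120 GB/s quoted for the ROS output:

    /* rates.c - illustrative arithmetic only, reproducing the order of
       magnitude of the data rates quoted in Section 2 (numbers from the
       text; the ~120 GB/s ROS figure is somewhat below the naive product
       because not all event data moves at that stage). */
    #include <stdio.h>

    int main(void) {
        const double event_size_mb = 1.5;    /* average ATLAS event size, MB */
        const double bunch_rate_hz = 40e6;   /* event rate before LVL1, 40 MHz */
        const double lvl1_rate_hz  = 100e3;  /* LVL1 accept rate, ~100 kHz */

        /* Hypothetical full-readout rate before LVL1 (MB/s -> GB/s). */
        printf("Raw rate before LVL1: %.0f GB/s\n",
               bunch_rate_hz * event_size_mb / 1e3);
        /* Rate after LVL1, to be compared with the ~120 GB/s into the ROS. */
        printf("Rate after LVL1:      %.0f GB/s\n",
               lvl1_rate_hz * event_size_mb / 1e3);
        return 0;
    }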

Figure 1: ATLAS TileCal Trigger and Data Acquisition System flow diagram before (above) and after the 2022 upgrade (below) [1].

The Event Filter, the last level of triggering, is a much larger cluster of about 1900 machines. The same machines as for the LVL2 filter are used. Data is output from the Event Filter at about 300 MB/s for storage [5].

3. General CPU Benchmarks
High Performance Linpack (HPL) is an industry standard benchmark application for measuring floating point performance [7]. HPL is the primary benchmark used when rating supercomputers on the Green500 and Top500 lists [8]. Both single and double precision floating point were tested. CoreMark is another industry standard benchmark for measuring general CPU performance with a balance of integer, floating point and common algorithms [6]. ARM has recommended that CoreMark be used to replace the older Dhrystone benchmark for integer MIPS performance [9].
All three CPUs were manually forced to 1 GHz for a fair comparison. The peak power consumption of the test platforms was measured. The HPL and CoreMark results, as well as the power consumption, are presented in Table 2.

Table 2: General CPU benchmarking results and power consumption.

                                   Cortex-A7   Cortex-A9   Cortex-A15
CPU Clock (MHz)                    1008        996         1000
CPU Cores                          2           4           4
HPL (Single Precision GFLOPS)      1.76        5.12        10.56
HPL (Double Precision GFLOPS)      0.70        2.40        6.04
CoreMark                           4858        11327       14994
Peak Power (W)                     2.85        5.03        7.48
Double Precision GFLOPS/Watt       0.25        0.48        0.81
Single Precision GFLOPS/Watt       0.62        1.02        1.41

A cluster of eight Wandboards has been built at the University of the Witwatersrand in order to test scientific algorithms on ARM. In total, 32 1 GHz Cortex-A9 cores are available, with 16 GB RAM, interconnected with 1 Gb/s Ethernet. A photo of the cluster mounted in a rack is shown in Figure 2. No in-depth testing is available at present.

Figure 2: Photo of the 32 core Cortex-A9 cluster at the University of the Witwatersrand, Johannesburg.

4. Fast Fourier Transform Benchmarks
FFTs have numerous uses in science, mathematics and engineering. FFTs are computationally intensive, O(N log N) algorithms which stress the CPU and memory subsystems. FFTW is an open source, high performance FFT library which has a benchmark facility that reports the time, t, (in microseconds) and the estimated MFLOPS of a run [10]. The length, N, and type of FFT are specified for each run. One dimensional, single-precision complex FFTs (8 bytes per point) were tested. The CPUs were manually set to their maximum frequencies.
Figure 3 a) shows the MFLOPS results for a wide range of FFT lengths. The maximum results of multi-core and summed multi-process runs are reported. This methodology was used to ensure 100% utilisation of the CPU for the tests.
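The paper does not state how the CPU clocks were fixed for Sections 3 and 4. On boards of this generation a common mechanism is the Linux cpufreq sysfs interface with the userspace governor; the following is a minimal sketch under that assumption (governor availability and paths are kernel dependent, and the writes require root):

    /* pin_freq.c - a minimal sketch of pinning the CPU clock, assuming the
       legacy cpufreq "userspace" governor is available on the test kernel. */
    #include <stdio.h>

    static int write_sysfs(const char *path, const char *value) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", value);
        fclose(f);
        return 0;
    }

    int main(void) {
        /* Select the userspace governor on CPU0, then request 1 GHz
           (scaling_setspeed takes a frequency in kHz). Repeat per core. */
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                    "userspace");
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                    "1000000");
        return 0;
    }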

Figure 3: a) The maximum score of multi-core and multi-process runs, chosen to always utilise all processors (performance in GFLOPS versus FFT length N). b) Theoretical maximum calculated FFT throughput (MB/s versus FFT length N). Series: Cortex-A7, Cortex-A9 and Cortex-A15.

Figure 3 b) shows the calculated theoretical FFT throughput based on the FFT size in bytes and the run time: Throughput = 8N/t [MB/s].
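As an illustration of this figure of merit, the sketch below times repeated single precision complex transforms with FFTW's C interface and applies both the throughput formula above and FFTW's conventional MFLOPS estimate for complex transforms, 5 N log2(N) / t. It is a stand-in for the library's own benchmark facility, not the exact harness used for these measurements:

    /* fft_bench.c - minimal sketch of the measurement style described above.
       Build: gcc fft_bench.c -lfftw3f -lm */
    #include <complex.h>
    #include <fftw3.h>
    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const int N = 1 << 16;   /* FFT length */
        const int reps = 100;    /* average over repeated executions */

        fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * N);
        fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * N);

        /* Plan first; FFTW_MEASURE overwrites the arrays while planning,
           so initialise the input afterwards. */
        fftwf_plan p = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);
        for (int i = 0; i < N; i++) in[i] = (float)i / N + 0.0f * I;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++) fftwf_execute(p);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Average time per transform, in microseconds. */
        double t_us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                       (t1.tv_nsec - t0.tv_nsec)) / 1e3 / reps;

        /* FFTW's conventional MFLOPS estimate and the paper's throughput
           figure of merit (8 bytes per point; bytes/us == MB/s). */
        double mflops = 5.0 * N * log2((double)N) / t_us;
        double mbps   = 8.0 * N / t_us;

        printf("N=%d  t=%.1f us  %.0f MFLOPS  %.0f MB/s\n", N, t_us, mflops, mbps);

        fftwf_destroy_plan(p);
        fftwf_free(in);
        fftwf_free(out);
        return 0;
    }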

5. Discussion, Conclusions and Future Work
The processing performance of the Cortex-A7, A9 and A15 CPUs is comparable to an older x86 CPU, but the power efficiency is excellent. The Cortex-A15 achieves 0.81 GFLOPS/W with double precision floating point operations. The ARM processors that have been tested are optimised for single precision floating point and so are likely to be used in applications where double precision floating point is not necessary. The single precision power efficiency is above 1 GFLOPS/W for the Cortex-A9 and A15 and is 0.62 GFLOPS/W for the Cortex-A7. The compute performance of the Cortex-A15 is significantly higher when it is run at its maximum clock speed of 1.6 GHz, but this would be true of any processor at a higher clock speed.
The CoreMark results confirm that the Cortex-A15 offers significantly higher performance than the Cortex-A7 and A9. It should be noted that, based on CoreMark results and specifications, the Cortex-A9 and A15 SoCs that were tested could be used as substitutes for the AMD CPU used in the Event Builder section of the ATLAS TDAQ system. If ARM SoCs were used, power consumption would decrease by an order of magnitude, with the ARM SoC consuming approximately 5 W and the current AMD-based system consuming over 68 W [11].
The FFT performance also indicates that the Cortex-A15 is superior to the Cortex-A9 and A7. The theoretical FFT throughputs are over 300 MB/s for most FFT lengths, with the Cortex-A15 sustaining over 300 MB/s up to the largest FFT tested, at 2097152 points.
Based on the results presented in this paper, specifically the low cost, high performance and good power efficiency, it is clear that ARM SoCs should be considered for future upgrades and new computing systems in big science projects.

Acknowledgements
The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. We would also like to acknowledge the School of Physics, the Faculty of Science and the Research Office at the University of the Witwatersrand, Johannesburg.

References
[1] Carrió F et al. 2014 Journal of Instrumentation 9 C02019 ISSN 1748-0221 URL http://stacks.iop.org/1748-0221/9/i=02/a=c02019
[2] Cox M A, Reed R, Wrigley T, Harmsen G and Mellado B 2014 In Review
[3] Krazit T 2006 ARMed for the living room URL http://news.cnet.com/armed-for-the-living-room/2100-1006_3-6056729.html
[4] Beck H P et al. 2008 IEEE Transactions on Nuclear Science 55 176-181
[5] Winklmeier F 2009 The ATLAS High Level Trigger infrastructure, performance and future developments 2009 16th IEEE-NPSS Real Time Conference (IEEE) pp 183-188 URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5321918
[6] EEMBC 2014 CoreMark Online Database URL http://www.eembc.org/coremark/
[7] Petitet A, Whaley R C, Dongarra J and Cleary A 2008 HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers URL http://www.netlib.org/benchmark/hpl/
[8] Feng W c and Cameron K 2007 Computer 40 50-55 ISSN 0018-9162
[9] ARM 2011 Application Note 273: Dhrystone Benchmarking for ARM Cortex Processors URL http://infocenter.arm.com/help/topic/com.arm.doc.dai0273a/dai0273a_dhrystone_benchmarking.pdf
[10] Frigo M and Johnson S 2005 Proceedings of the IEEE 93 216-231 URL http://www.fftw.org
[11] CPU World 2014 AMD Opteron 252 Specifications URL http://www.cpu-world.com/cpus/k8/AMD-Opteron252-OSP252FAA5BL.html