IWES 2016: 1st Italian Workshop on Embedded Systems, Pisa, September 2016


IWES 2016, 1st Italian Workshop on Embedded Systems
Pisa, 19 September 2016
Research Group Overview
Roberto Giorgi, University of Siena, Italy
http://www.dii.unisi.it/~giorgi

Siena on Earth

Engineering Faculty in Siena (map labels: ~2 km, our venue)

Computer Architecture Lab

Research Group @ UNISI
  1 Associate Prof.: Roberto Giorgi
  1 Postdoc, 1 PhD student, 1 kernel hacker
  HR throughput during the last 7 years: 2 Full Prof., 2 Researchers, 9 Postdocs, 6 PhD students
Courses by me
  Bachelor (L1): Computer Architecture (6 credits), 3rd year (Italian)
  Master (L2): High Performance Computer Architecture (9 credits), 1st and 2nd year (English)
Lab Resources
  64-core (x86) CC-NUMA w/ 1024 GiB RAM
  48-core + 256 GiB, about 15 simulation servers (8-core + 32 GiB)
  12 different FPGA boards ranging from Virtex-6 to Zynq UltraScale+ (6-core, 64 bits)
  Xeon Phi, Maxeler Dataflow computer, GPUs, 50+ embedded boards (20+ workstations)

AXIOM: Agile, eXtensible, fast I/O Module for the cyber-physical era
2015-2017, 4 M euro funding
Brief introduction to the AXIOM project
Roberto Giorgi, University of Siena, Italy
University of Siena (Coordinator Partner)

TERAFLUX: Exploiting Dataflow Parallelism in Teradevice Computing
FUTURE AND EMERGING TECHNOLOGIES, 6.13 M euro funding, 2010-13
http://teraflux.eu
Partners:
  University of Siena (Coordinator)
  Barcelona Supercomputing Center
  University of Augsburg
  University of Manchester
  University of Cyprus
  INRIA
  University of Delaware (USA)
Contact: Prof. Roberto Giorgi (project coordinator), http://www.dii.unisi.it/~giorgi

ERA: Embedded Reconfigurable Architectures
Project number: 249059
FP7, 2010-2013, about 3 M euro funding
Xilinx Virtex6

Cooperation with SantaChiaraLab (100 m from us), Dept. of Cognitive Science (in-building) and SECO (Arezzo)
http://santachiaralab.unisi.it/
SECO/UNISI achievements:
  2014: UDOO-ARM (99 $ PC+Arduino), 600k$ on Kickstarter
  2016: UDOO-x86 (PC+Arduino, 10x faster than Raspberry-3), 800k$ on Kickstarter

ACM International Conference on Computing Frontiers '17
15-17 May 2017, Siena, Italy
www.computingfrontiers.org

Siena, the city

Siena surroundings

TERAFLUX

(Nearer) Future Scenarios: 3D stacking, 8nm, 3D transistors, graphene
G. Hendry, K. Bergman, Hybrid On-chip Data Networks, HotChips-22, Stanford, CA, Aug. 2010
Fab D1X (OR), 42 (AZ), 24 (Ireland) starting the 14nm node in 2013
Pawloski, May 2011, Exascale Seminar, Ghent
TERAFLUX Project, Roberto Giorgi, http://teraflux.eu

DATAFLOW
A scheme of computation in which an activity is initiated by presence of the data it needs to perform its function (Jack Dennis)
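The definition above can be illustrated with a minimal sketch, which is not taken from the slides: each node counts its missing operands and fires only once all of them have arrived. The names `Node`, `send` and `run`, and the example graph (a + b) * (c - d), are assumptions made for illustration.

```python
from collections import deque
import operator


class Node:
    """A dataflow activity: it may fire only when all of its inputs are present."""

    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.inputs = [None] * n_inputs
        self.missing = n_inputs      # operands still to arrive
        self.consumers = []          # (node, slot) pairs that receive our result

    def connect(self, consumer, slot):
        self.consumers.append((consumer, slot))


def send(node, slot, value, ready):
    """Deliver one operand; the node becomes ready when nothing is missing."""
    node.inputs[slot] = value
    node.missing -= 1
    if node.missing == 0:
        ready.append(node)


def run(initial_tokens):
    """Fire nodes purely in response to data arrival, with no program counter."""
    ready = deque()
    results = {}
    for node, slot, value in initial_tokens:
        send(node, slot, value, ready)
    while ready:
        node = ready.popleft()
        value = node.func(*node.inputs)
        results[node.name] = value
        for consumer, slot in node.consumers:
            send(consumer, slot, value, ready)
    return results


# Example graph: (a + b) * (c - d)
add = Node("add", operator.add, 2)
sub = Node("sub", operator.sub, 2)
mul = Node("mul", operator.mul, 2)
add.connect(mul, 0)
sub.connect(mul, 1)

# Tokens arrive in an arbitrary order; execution order follows the data.
results = run([(sub, 1, 1), (add, 0, 2), (sub, 0, 7), (add, 1, 3)])
print(results["mul"])
```

Note that `sub` fires before `add` here only because its operands happen to arrive first; the order of the statements in the program plays no role.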

[Figure: TERAFLUX layers and work packages (WP2-WP9): applications, programming model, compilation tools, abstraction layer and reliability layer, teradevice hardware (simulated). Labels in the figure: source code, data dependencies, extract TLP, threads (T1, T2), virtual CPUs (VCPU), PCPUs (possibly 1,000-10,000 cores), transactional memory, locality optimizations.]

Working Hypothesis
1000-billion-device (1 TERAdevice) computing platforms pose new challenges: (at least) programmability, complexity of design, reliability.
TERAFLUX context: high performance computing and applications (not necessarily embedded).
TERAFLUX scope: exploiting a less exploited path (DATAFLOW) at each level of abstraction.

UD's Involvement in Exascale Computing
Traleika Glacier: System Architecture (X-STACK project)
[Figure labels, in slide order:]
  Control Engine (CE): optimized for execution model and resiliency
  Execution Engine (XE): optimized for apps
  Large local stores for data locality
  Node
  Processor chip: large local stores (locality), sensors for self-awareness, fine-grain energy management
  Processor board with DRAM
  Cluster: hierarchical, heterogeneous interconnect
  Hardware system
  E Efficiency, scalability, programmability, perf portability, resiliency
Courtesy of W. Pinfold and S. Borkar, Intel Corp

DF-Threads (DataFlow Threads)
We built a demonstrator for dataflow-based execution on top of the HP Labs COTSon simulator.
It runs 220 instances of 32-core Linux (full-system) virtual machines (7000+ cores).
We seamlessly run millions of DF-threads (x86 binaries compiled with GCC) with almost linear speed-up on a 1024-core (x86) system, without need of global cache coherence.
General computations are supported.
Transactional Memory is joined to Dataflow to support nondeterministic computations with shared state.
R. Giorgi, P. Faraboschi, "An Introduction to DF-Threads and their Execution Model", IEEE MPP, Paris, France, Oct. 2014, pp. 60-65.
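As a loose, single-process illustration of the idea (not the demonstrator's code and not the API described in the cited paper), the sketch below assumes that a DF-thread owns a private frame of input slots plus a synchronization count, and becomes runnable only when producers have filled every slot. The names `Runtime`, `schedule`, `write` and `parallel_sum` are hypothetical.

```python
from collections import deque


class DFThread:
    """Assumed model: a body, a private input frame, and a synchronization count."""

    def __init__(self, body, sync_count):
        self.body = body
        self.frame = [None] * sync_count
        self.sync_count = sync_count


class Runtime:
    """Hypothetical scheduler, simulated sequentially for clarity."""

    def __init__(self):
        self.ready = deque()
        self.executed = 0

    def schedule(self, body, sync_count):
        thread = DFThread(body, sync_count)
        if sync_count == 0:
            self.ready.append(thread)
        return thread

    def write(self, target, slot, value):
        # Producers write straight into the consumer's frame, so the threads
        # in this sketch exchange data without any shared mutable state.
        target.frame[slot] = value
        target.sync_count -= 1
        if target.sync_count == 0:
            self.ready.append(target)

    def run(self):
        while self.ready:
            thread = self.ready.popleft()
            thread.body(self, thread.frame)
            self.executed += 1


def parallel_sum(data, chunks):
    rt = Runtime()
    result = []
    # The reducer waits for one partial sum per chunk.
    reducer = rt.schedule(lambda rt, frame: result.append(sum(frame)), chunks)
    step = (len(data) + chunks - 1) // chunks

    def make_worker(slot):
        def worker(rt, frame):
            rt.write(reducer, slot, sum(frame[0]))
        return worker

    for i in range(chunks):
        worker = rt.schedule(make_worker(i), 1)
        rt.write(worker, 0, data[i * step:(i + 1) * step])
    rt.run()
    return result[0], rt.executed


total, executed = parallel_sum(list(range(100)), 4)
print(total, executed)
```

Because each worker communicates only by writing into the reducer's frame, the workers could in principle run on different cores in any order; that property is what this sketch is meant to suggest, under the assumptions stated above.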

ERA

ERA Target System
Smartphone with FPGA-based SoCs (e.g., Zynq)
Exploring the energy efficiency of reconfigurable hardware

Benchmark analysis from the energy+delay viewpoint
DYNAMIC ENERGY consumption and DELAY while varying L2 cache size, issue width, frequency
Delay significantly decreases with L2 cache size, frequency, issue width (total energy as in previous slide)
These behaviors have been confirmed across all the EBS applications
[Figure: Avg Energy Dissipation (ac3 across configurations), AC3 benchmark, 10M instruction intervals. Left axis: Processor Energy [mJ] (*); right axis: Delay [ms]. Series: 2-wide, 4-wide and 8-wide processor energy; delay for 2-wide, 4-wide and 8-wide. X axis: L2 cache size (64kB, 128kB, 256kB, 512kB, 1024kB) grouped by frequency (800, 1100, 1400, 1700, 2000 MHz).]
* Total energy (static+dynamic)
Roberto Giorgi giorgi@unisi.it
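One common way to weigh energy against delay in a design-space sweep like this is the energy-delay product (EDP); the slides do not say whether ERA used it, so the sketch below is only an assumption about how such a comparison could be done. The `Config` type and all numeric values are hypothetical placeholders, not measurements from the AC3 plot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    l2_kib: int        # L2 cache size in KiB
    issue_width: int   # instructions issued per cycle
    freq_mhz: int      # clock frequency in MHz
    energy_mj: float   # total processor energy in mJ
    delay_ms: float    # execution time in ms

    @property
    def edp(self):
        """Energy-delay product in mJ*ms (lower is better)."""
        return self.energy_mj * self.delay_ms


# HYPOTHETICAL values for illustration only; not read from the slide's figure.
samples = [
    Config(64, 2, 800, 500.0, 3000.0),
    Config(256, 4, 1400, 700.0, 1500.0),
    Config(1024, 8, 2000, 1200.0, 1000.0),
]

best = min(samples, key=lambda c: c.edp)
for c in samples:
    print(c.l2_kib, c.issue_width, c.freq_mhz, c.edp)
print("lowest EDP:", best)
```

With these placeholder numbers the fastest configuration is not the one with the lowest EDP, which is the kind of trade-off an energy+delay analysis is meant to expose.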

AXIOM

AXIOM main goal
Building a ready-to-market Single Board Computer (SBC) able to address high performance computations in scenarios like:
  Smart Video-surveillance
  Smart Living & Smart Home

CAN WE DO THAT?
SECO/UNISI achievements:
  2014: UDOO-ARM (99 $ PC+Arduino), 600k$ on Kickstarter
  2016: UDOO-x86 (PC+Arduino, 10x faster than Raspberry-3), 800k$ on Kickstarter
Roberto Giorgi www.dii.unisi.it/~giorgi

FIRST PROTOTYPE UNDER REVIEW