Compilation of Parametric Dataflow Applications for Software-Defined-Radio-Dedicated MPSoCs (DREAM seminar)

DREAM seminar, January 27th, 2015
Mickaël Dardaillon, Research Intern with NOKIA Technologies

What we know about 5G demands: higher capacity, lowest latency and more consistent experience.
[Figure: evolution of telecommunication protocols (1G, 2G, 3G, 4G and beyond) against the services they carry (voice, text, mail, multimedia, next-gen media, monitoring & sensing, M2M/MTC, tactile and real-time control) and their latency needs (audio: 100 ms, visual: 10 ms, tactile real-time control: 1 ms). Push & pull of technology; flexibility for what is unknown today.]

4G LTE-Advanced: Downlink
One frame (10 ms) contains 10 sub-frames (numbered 0 to 9) of 1 ms each. Each sub-frame spans 14 OFDM symbols over 2048 subcarriers (20 MHz), divided into a control region and a data region shared between users (User 1, User 2, User 3). MIMO: 4×2 antennas.
LTE throughput: 1.4 Gbps; LTE-Advanced: 7 Gbps. Latency: 2 ms. Power budget: 500 mW.

Magali SDR platform: LTE demonstrator [Clermidy et al., 09]. Power consumption: 231 mW.
Units: an ARM controller (arm), an 8051 microcontroller (8051), dsp1 to dsp5, four OFDM units (ofdm1 to ofdm4), five DMA units (dma1 to dma5), and the MOD (mod), DEMOD (demod), LDPC (ldpc), TURBO (turbo) and WIFLEX (wiflex) units.

Problem statement
How should we program a Cell processor? "Any way you want!"
How can a telecommunication protocol be programmed and compiled for a heterogeneous MPSoC?

Outline
Context
Programming Model for SDR
  Dataflow Model of Computation
Dataflow Refinement and Buffer Verification
  Mapping and Scheduling
  Micro-Scheduling
Experiments on Magali
  Code Generation
  Experimental Results
Perspectives

State of the Art in SDR Programming
Imperative, concurrent:
  ExoCHI [Wang et al., 07]: OpenMP + C
  BEAR [Derudder et al., 09]: Matlab + C
Dataflow:
  Platforms: Simulink, LabView, GNU Radio, RVC-CAL [Lucarz et al., 08], DiplodocusDF [Gonzalez-Pina et al., 12], MAPS [Castrillon et al., 13]
  Languages: Python + C, XML + C, UML, C-like

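Static Dataflow (SDF) [Lee et al., 87]
[Figure: SDF graph Src_1 → Decod_1 → Ctrl. Src_1 produces 10 tokens, consumed 10 at a time by Decod_1; Decod_1 produces 1 token, consumed by Ctrl.]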

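Phase Approach with Static Dataflow
[Figure: the first phase is the SDF graph Src_1 → Decod_1 → Ctrl (rates 10/10 and 1/1). It is followed by one SDF graph per parameter value: Src_2 → Decod_2 → Sink, where Src_2 produces 100 tokens consumed 10 at a time by Decod_2, and Decod_2 produces 1, 2, 3, ... tokens consumed 10 at a time by Sink.]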

Dynamic Dataflow (DDF) [Buck, 93]
[Figure: models ranked from analysable to expressive: SDF at the analysable end; MCDF, SADF, PiMM, SPDF and BPDF in between; KPN and DDF at the expressive end.]
Scenario Aware DataFlow (SADF) [Theelen et al., 06]
Mode Controlled DataFlow (MCDF) [Moreira et al., 12]
Schedulable Parametric DataFlow (SPDF) [Fradet et al., 12]
Parameterized and Interfaced dataflow Meta-Model (PiMM) [Desnos et al., 13]
Boolean Parametric DataFlow (BPDF) [Bebelis et al., 13]
Kahn Process Network (KPN) [Kahn, 74]

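Schedulable Parametric DataFlow (SPDF) [Fradet et al., 12]: model of computation, analysis, quasi-static scheduling.
[Figure: SPDF graph. Src produces 10 tokens for Decod_1 (consumed 10 at a time) and 100 tokens for Decod_2 (consumed 10 at a time). Decod_1 sends 1 token to Ctrl, which sets the parameter p (annotation set p[1]). Decod_2 produces p tokens per firing, consumed 10 at a time by Sink.]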

Front End Implementation
Flow: PaDaF (C++) → C++ front end (Clang) → LLVM IR → graph construction → graph + LLVM IR.
SDR programming model: SPDF is proposed for SDR, with a C++ input format.
Front end: based on the LLVM framework; derived from SystemC analysis [Marquet et al., 10]; static graph structure.

SPDF Mapping
The SPDF graph is mapped onto Magali units: Src on DMA dma1, Decod_1 and Decod_2 on DEMOD demod, Ctrl (set p[1]) on ARM arm, and Sink on DMA dma2.

SPDF Quasi-Static Scheduling [Fradet et al., 12]
Per-processor schedules for the mapped graph:
S(dma1) = (Src)
S(arm) = (Ctrl; set(p))
S(demod) = (Decod_1; get(p); (Decod_2)^10)
S(dma2) = (get(p); (Sink)^p)

SPDF Symbolic Execution
[Figure: timeline of one iteration under the schedules S(dma1), S(arm), S(demod) and S(dma2): Src on dma1, Decod_1 then (Decod_2)^10 on demod, Ctrl on arm, and (Sink)^p on dma2.]

SPDF Buffer Sizing
Buffer sizes derived from the schedules: Src → Decod_1: [10]; Src → Decod_2: [100]; Decod_1 → Ctrl: [1]; Decod_2 → Sink: [10·p_max].
Problem: this overestimates buffer sizes. For example, on Magali the FFT size is 2048 while the buffer size is 16.

SPDF Model Refinement
Refined buffer sizes: Src → Decod_1: [10]; Src → Decod_2: [10]; Decod_1 → Ctrl: [1]; Decod_2 → Sink: [p_max].
Idea: model each individual data communication (micro-scheduling). For instance, Src performs one control push followed by ten separate data pushes:
Src::compute() { [...] out[1].push(ctrl, 10); for (int i = 0; i < 10; i++) out[2].push(data[i], 10); }

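Micro-Scheduling: an Example
[Figure: timeline of the same iteration (Src on dma1, D1 and (D2)^10 on demod, Ctrl on arm, (Sink)^p on dma2) with communications interleaved at the granularity of individual pushes and pops.]
µs(src) = (push_{Src,D1}(10); (push_{Src,D2}(10))^10)
µs(d2) = (pop_{Src,D2}(10); push_{D2,Sink}(p))
µs(sink) = (pop_{D2,Sink}(1))^10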

Buffer Sizing Verification
How can buffer sizes be verified using micro-schedules?
Proposed verification method: based on model checking, derived from buffer minimization [Geilen et al., 05]. The model checker (SPIN) takes the schedule, the buffer sizes, the micro-schedules and the parameter values as input, and checks for deadlocks.

Micro-Scheduling Implementation
Front end: PaDaF (C++) → C++ front end (Clang) → LLVM IR → graph construction → graph + LLVM IR.
Back end: mapping, scheduling, micro-scheduling, buffer verification (SPIN).
Micro-scheduling: SPDF model refinement, sequential communications.
Buffer verification: model checking.

Code Generation
Input: graph + LLVM IR. Three generators: code generation for the processing units (OFDM, DEMOD, TURBO, DMA, ARM), communication code generation, and control code generation. Control code is emitted in C and goes through ARM code generation; the platform units receive Magali code (ASM).
[Figure: generated code mapped onto the Magali units.]

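Benchmarks using LTE
OFDM (tests compilation): Src (dma1) → FFT → Defram (ofdm1) → Sink (dma3). Src produces 7168 samples, consumed 1024 at a time by the FFT (1024 in, 1024 out); Defram produces 600 tokens per firing, consumed 4200 at a time by Sink.
Demodulation (tests communications): [Figure: chain of Src, Demap, Word Deinter, Bit Deinter, Depunct, Turbo Decod and Sink actors mapped on dma1 to dma4, demod and turbo, with rates of 1200, 900, 300, 1353 and 57 tokens.]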

Benchmarks using LTE
Parametric demodulation (tests parameters): [Figure: parametric version of the demodulation graph, with Split actors, a Control actor on arm setting the parameter p (set p[1]), and p-dependent rates such as 300p on the Word Deinter and Bit Deinter channels.]

Results: Estimated Development Time
Compiler development: front end 4 man-months, back end 8 man-months.

Application     Native C / ASM (#lines)   Native (hours)   PaDaF C++ (#lines)   PaDaF (hours)
OFDM            150 / 200                 40               60                   1
Demodulation    300 / 600                 160              160                  4
Param. Demod.   500 / 800                 480              260                  8

Takeaway message: reduces development time.

Results: Buffer Verification Time
Evaluation framework: 2.4 GHz Intel Core i5, 8 GB RAM, OS X 10.9.2, SPIN model checker.

Application     States       Transitions   Exec. time (s)
OFDM            1.28×10^4    2.56×10^4     0.1
Demodulation    2.12×10^6    1.07×10^7     9
Param. Demod.   6.07×10^7    2.22×10^8     199

Takeaway message: reduces development time, improves verification.

Execution Model
OFDM benchmark: Src on dma1, FFT and Defram on ofdm1, Sink on dma3.
[Figure: execution timelines of arm, dma1, ofdm1 and dma3 under the two execution models.]

Unit    Phase approach   Distributed
arm     25 µs            25 µs
dma1    37 µs            74 µs
ofdm1   16 µs            23 µs
dma3    21 µs            25 µs

Results: Execution Time
Evaluation framework: SystemC TLM model based on a 65 nm CMOS implementation; ARM code run on the QEMU virtual machine.

Application     Native (µs)   Generated (µs)   Optimized (µs)
OFDM            149           168 (+13%)       149 (+0%)
Demodulation    180           283 (+57%)       180 (+0%)
Param. Demod.   419           558 (+33%)       288 (-31%)

Takeaway message: reduces development time, improves verification, maintains performance.

Back End Implementation
Flow: PaDaF (C++) → C++ front end (Clang) → LLVM IR → graph construction → graph + LLVM IR → back end (mapping, scheduling, buffer verification with SPIN, code generation) → MPSoC code (ASM).
Magali support: computation, communication, control.
LTE experimentation: performance close to native; buffer verification; central controller.

Perspectives
On dataflow programming: the compiler (front end: PaDaF (C++) → C++ front end (Clang) → LLVM IR → graph construction → graph + LLVM IR; back end: mapping, scheduling, buffer verification (SPIN), code generation → MPSoC code (ASM)) and the runtime.
On heterogeneous MPSoCs: the future of dedicated platforms and development on such platforms, given what we know about 5G demands (higher capacity, lowest latency and more consistent experience) and the need for flexibility for what is unknown today.

Publications
Survey: [Dardaillon et al., IWCMC 12]
Compilation flow: [Dardaillon et al., CASES 14]

INSA-Lyon, CITI-Inria: Tanguy Risset, Kevin Marquet
CEA Grenoble: Jérôme Martin, Henri-Pierre Charles