Notos: Efficient Emulation of Wireless Sensor Networks with Binary-to-Source Translation

Similar documents
Hardware Support for a Wireless Sensor Network Virtual Machine

Hardware Emulation and Virtual Machines

The CPU and Memory. How does a computer work? How does a computer interact with data? How are instructions performed? Recall schematic diagram:

Towards a Resilient Operating System for Wireless Sensor Networks

Automation Engineers AB Pvt Ltd, NOIDA Job-Oriented Course on Embedded Microcontrollers & Related Software Stack

Practical Malware Analysis

Microcontrollers. Microcontroller

INTEL Architectures GOPALAKRISHNAN IYER FALL 2009 ELEC : Computer Architecture and Design

ECE 471 Embedded Systems Lecture 2

COMPUTER ORGANIZATION & ARCHITECTURE

A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler

Martin Kruliš, v

CS399 New Beginnings. Jonathan Walpole

Memory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

System Architecture Directions for Networked Sensors[1]

Embedded programming, AVR intro

COMPLEX EMBEDDED SYSTEMS

Virtual Memory COMPSCI 386

FIFTH SEMESTER DIPLOMA EXAMINATION IN ENGINEERING/ TECHNOLOGY-MARCH 2014 EMBEDDED SYSTEMS (Common for CT,CM) [Time: 3 hours] (Maximum marks : 100)

System Energy Efficiency Lab seelab.ucsd.edu

EE 308: Microcontrollers

Lecture1: introduction. Outline: History overview Central processing unite Register set Special purpose address registers Datapath Control unit

CHAPTER 1 Introduction to Computers and Java

Kampala August, Agner Fog

The Instruction Set. Chapter 5

Basic Concepts COE 205. Computer Organization and Assembly Language Dr. Aiman El-Maleh

Assembly Language. Lecture 2 x86 Processor Architecture

CSc 453 Interpreters & Interpretation

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured

1.Explain with the diagram IVT of 80X86. Ans-

Concurrent Programming

Computer Memory. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Concurrent Programming. Implementation Alternatives. Content. Real-Time Systems, Lecture 2. Historical Implementation Alternatives.

Computer Architecture

Real instruction set architectures. Part 2: a representative sample

Chapter 13 Reduced Instruction Set Computers

Lecture 9 Dynamic Compilation

Binghamton University. CS-220 Spring Cached Memory. Computer Systems Chapter

Four Components of a Computer System

Computer System Architecture

Digital System Design Using Verilog. - Processing Unit Design

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler

CPE 323 MSP430 INSTRUCTION SET ARCHITECTURE (ISA)

ECE Lab 8. Logic Design for a Direct-Mapped Cache. To understand the function and design of a direct-mapped memory cache.

Capriccio : Scalable Threads for Internet Services

Hardware and Software Architecture. Chapter 2

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Basic operators, Arithmetic, Relational, Bitwise, Logical, Assignment, Conditional operators. JAVA Standard Edition

ARM Cortex A9. ARM Cortex A9

ECE 471 Embedded Systems Lecture 2

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Interaction of JVM with x86, Sparc and MIPS

Chapter 7: Processor and Memory

Lecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor

Department of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.

BASIC COMPUTER ORGANIZATION. Operating System Concepts 8 th Edition

Chapter 5. Introduction ARM Cortex series

Memory - Paging. Copyright : University of Illinois CS 241 Staff 1

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Implementing Secure Software Systems on ARMv8-M Microcontrollers

Memory Management. Dr. Yingwu Zhu

Free Development Environment for Bus Coupling Units (BCUs)

Intentionally Blank 0

Assembly Language: Overview!

Topics Power tends to corrupt; absolute power corrupts absolutely. Computer Organization CS Data Representation

Main Memory and the CPU Cache

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

Computers and Microprocessors. Lecture 34 PHYS3360/AEP3630

Registers. Registers

When an instruction is initially read from memory it goes to the Instruction register.

Chapter 1: Introduction to Computers and Java

8/23/2014. Chapter Topics. Introduction. Java History. Why Program? Java Applications and Applets. Chapter 1: Introduction to Computers and Java

Energy-aware Reconfiguration of Sensor Nodes

Contiki a Lightweight and Flexible Operating System for Tiny Networked Sensors

Implementation Alternatives. Lecture 2: Implementation Alternatives & Concurrent Programming. Current Implementation Alternatives

CN310 Microprocessor Systems Design

Java language. Part 1. Java fundamentals. Yevhen Berkunskyi, NUoS

Binary Translation 2

Overhead Evaluation about Kprobes and Djprobe (Direct Jump Probe)

Older geometric based addressing is called CHS for cylinder-head-sector. This triple value uniquely identifies every sector.

Lecture 03 Bits, Bytes and Data Types

What s An OS? Cyclic Executive. Interrupts. Advantages Simple implementation Low overhead Very predictable

Syllabus for Computer Science General Part I

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

The Slide does not contain all the information and cannot be treated as a study material for Operating System. Please refer the text book for exams.

Tizen/Artik IoT Lecture Chapter 3. JerryScript Parser & VM

Chapter 8: Main Memory

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

CSE P 501 Compilers. Java Implementation JVMs, JITs &c Hal Perkins Winter /11/ Hal Perkins & UW CSE V-1

Lecture 1 Introduction to Android. App Development for Mobile Devices. App Development for Mobile Devices. Announcement.

Jackson Marusarz Intel Corporation

Transport layer issues

Readings and References. Virtual Memory. Virtual Memory. Virtual Memory VPN. Reading. CSE Computer Systems December 5, 2001.

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition

Question Bank. 10CS63:Compiler Design

AVR MICROCONTROLLER ARCHITECTURTE

Wind River. All Rights Reserved.

Memory management. Knut Omang Ifi/Oracle 10 Oct, 2012

Transcription:

Schützenbahn 70 45127 Essen, Germany Notos: Efficient Emulation of Wireless Sensor Networks with Binary-to-Source Translation Robert Sauter, Sascha Jungen, Richard Figura, and Pedro José Marrón, Germany

Motivation (Short Version) Experts Agree In Simulation, Faster is Better 2

Sensor Network Nodes 16 Bit CPU, 4-8 MHz 10 KB RAM, 48 KB ROM Low-Power Low-Bandwidth Wireless Communication Battery Powered 3

WSN Applications 4

Motivation (2) Software development for WSN is challenging Severe resource constraints à optimized code necessary Very dynamic;; unreliable nodes;; unreliable links Large-scale distributed system à Simulation to the Rescue! 5

Motivation (2) Software development for WSN is challenging Severe resource constraints à optimized code necessary Very dynamic;; unreliable nodes;; unreliable links Large-scale distributed system à Simulation to the Rescue! But Application code, protocols, and operating system tightly coupled Cross-layer optimizations Large impact of environment on protocol/system behavior Switch among simulation, testbeds and exploratory deployments à Cycle-accurate emulation to the Rescue! 6

Cycle-Accurate Emulation Code written and compiled for target platform Emulator simulates each cycle of the target platform Emulator contains model of the micro-controller Core CPU: supported instructions, flags and registers RAM and program memory (flash) Timers, ADC, DAC, I/O ports and the relevant periphery Radio External Flash, sensors High fidelity and usability, fully deterministic, but computationally expensive 7

Core Engine Core engine uses discrete-event simulation Events simulate hardware interrupts E.g., by timers, radio, Main loop Read current instruction Modify simulated state Check event queue public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 8

Avrora Main Loop ef Instructions r1: Register r2: Register... ADC accept(v:visitorinterface): void v.visit(this) WDR accept(v:visitorinterface): void v.visit(this) Interpreter... LegacyInstr i = shared_instr[nextpc]; i.accept(this); cycles = cyclesconsumed; Commit();... Visitor Interface public void visit(legacyinstr.adc i) { nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2;... } 9

Binary-to-Source Translation Adapt emulator to program Transform AVR instructions to corresponding simulator code and compile for host platform (here: Java) Replace main loop and integrate with event handling and simulation of components Java limitations à Partition code into methods with switch statements à Higher level constructed binary search tree while(innerloop && nextpc >= 0 && nextpc <= 690) { switch (nextpc) { case 0: // jmp 436 nextpc = 436 * 2; pc = nextpc; eventqueue.advance(3); inst_692_to_1338(); break;... case 140: // adc r3, r9 nextpc = 140 + 2; tmpint_0 = getregisterunsigned(r3); tmpint_1 = getregisterunsigned(r9); tmpint_2 = bit(c);... C = (tmpint_3 & 256)!= 0; tmpbyte_0 = low(tmpint_3); writeregisterbyte(r3, tmpbyte_0); pc = nextpc; eventqueue.advance(1); if(!innerloop) { break; }... } } 10

Binary-to-Source Translation (2) Reduce method calling overhead Best case: next instruction begins with next statement Usually: within switch statement without use of the BST Foundation for static analysis and optimizations Individual code per instruction (vs. per instruction type) Use additional knowledge, e.g., addresses Consider instruction sequences Lazy timer evaluation Calculate timer value on demand and only translate simulated interrupts to events instead of every tick Significantly increase occurrences of sequence optimizations 11

Optimization for Individual Instructions Optimized utility functions, for example: Hardware maps I/O ports to RAM: accessing RAM can involve multiple simulated components If address is known, runtime checks can be removed (done by dedicated streamlined utility functions) Small savings but very frequently used à evaluation shows highest impact Constant folding and constant propagation of immediates Allows pre-calculation and removal of intermediate expressions Iterative optimization algorithm Simple example: instruction pointer Peephole optimizations For some instructions, result is already known (e.g., XOR R1, R1) 12

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 13

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 14

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 15

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 16

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 17

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags Often: significant overhead for computation of flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 18

Conditional Execution of Code Emulation of arithmetic and logical instructions 1. Bookkeeping (instruction pointer, event queue, ) 2. Fetch input operands (e.g., registers) 3. Calculate intermediate expressions 4. Store result and update flags Often: significant overhead for computation of flags public void visit(legacyinstr.adc i){ nextpc = pc + 2; int tmp_0 = getregisterunsigned(i.r1); int tmp_1 = getregisterunsigned(i.r2); int tmp_2 = bit(c); int tmp_3 = tmp_0 + tmp_1 + tmp_2; int tmp_4 = tmp_0 & 0x0000000F; int tmp_5 = tmp_1 & 0x0000000F; boolean tmp_6 = (tmp_0 & 128)!= 0; boolean tmp_7 = (tmp_1 & 128)!= 0; boolean tmp_8 = (tmp_3 & 128)!= 0; H = (tmp_4 + tmp_5 + tmp_2 & 16)!= 0; C = (tmp_3 & 256)!= 0; N = (tmp_3 & 128)!= 0; Z = low(tmp_3) == 0; V = tmp_6 && tmp_7 &&!tmp_8!tmp_6 &&!tmp_7 && tmp_8; S = N!= V; byte tmp_9 = low(tmp_3); writeregisterbyte(i.r1, tmp_9); cyclesconsumed++; } 19

Conditional Execution of Code (2) Flags may be overwritten by subsequent instruction à Only calculate flags (and necessary intermediate values) that are read or unmodified by subsequent instruction But: event may be scheduled at any time à Move calculation to conditional block and only execute if emulation continues uninterrupted At generation time Build graph with flags and intermediates Identify flags for which computation may be unnecessary Identify intermediates that are not used by necessary computations 20

Optimized Paths for Instruction Sequences Extend optimization to sequences May not have branches (amount of cycles must be constant) No events may be scheduled for duration of sequence Checking this has significant overhead Rare occurrence at runtime 21

Manually Constructed Alternative Code Paths Idea: replace common and often used instruction sequences with optimized code variants OS functions, compiler artifacts and standard library functions Framework to specify patterns (e.g., to handle varying register allocations) Advantage: savings can be huge Example: integer division of the avr-gcc standard library (193-209 cycles) 99% reduction from ~2800 statements to ~20 Disadvantage: significant manual overhead Calculate result, flags, and number of cycles 22

Evaluation Setup Hosts: Desktop PCs with quad-core (+SMT) CPUs @ 2.8 GHz, 8 GB RAM 64-Bit JVM (2 GB RAM) on 64 Bit Linux Applications (based on TinyOS examples) MViz: canonical WSN app: periodic sensing + collection routing MVizComplex: enable low power listening radio stack and innetwork processing with computation load Grid topologies with 16, and 100 nodes 23

Evaluation Methodology Challenge: evaluate performance of Java software Significant impact of just in-time compilation Performance varies over time especially compared to startup Goal: evaluate so-called steady-state performance Better indication for long-running simulations Approach: within one JVM session Run simulation twice as warm-up Run simulation 5 times for 300 simulated seconds Plots show average, minimum, and maximum time Generation overhead ~20 seconds for a complex application Caching alleviates overhead for multiple simulations (e.g., varying network sizes) 24

Network Emulation Performance (1) 250 200 better Wall time in seconds 150 100 50 0 Avrora noopt optutility optcond noshortcuts allopt Mviz, 16 nodes 25

Network Emulation Performance (2) 1000 900 800 better Wall time in seconds 700 600 500 400 300 200 100 0 Avrora noopt optutility optcond noshortcuts allopt Mviz, 100 nodes 26

Network Emulation Performance (3) 2500 2000 better Wall time in seconds 1500 1000 500 0 Avrora noopt optutility optcond noshortcuts allopt MvizComplex, 100 nodes 27

Conclusions Binary-to-source translation for the emulation of WSN Analysis and optimization framework Instruction and instruction sequence optimizations Most advantageous when simulating computation Small networks Complex applications with in-network processing smaller benefit when simulating waiting Large networks Translation overhead must be offset by number of simulations or simulation time 28

Thank you for your attention! 29