Design and Implementation of an Asymmetric Multiprocessor Environment Within an SMP System. Roberto Mijat, ARM Santa Clara, October 2, 2007



Agenda
- Design principle
- ARM11 MPCore(TM) overview
- System considerations / optimizations
- Implementation
  - Work plane
  - Data plane
  - Communication module
- Usage cases

Design principle
- Primary CPU: single-processor OS + application + driver for the off-loaded API
- Secondary CPUs: wait for and/or process background work-tasks
- The primary CPU driver abstracts the shared RAM and provides a message-based API to the data plane processor farm

[Figure: a single multi-core processor (ARM11 MPCore). Control plane: single-processor OS/RTOS on CPU 0. Data plane: SMP µkernel on CPU 1, CPU 2 and CPU 3. The two planes exchange message-based communications. Physical memory is split into three regions: CPU 0 only, shared memory, SMP only.]

ARM11 MPCore Processor

[Block diagram labels: Interrupts; 64-bit AMBA 3 AXI x2]

PPA* (90nm process)          Speed Opt       Area Opt
Standard cells               Advantage-HS    Metro
Memories                     Advantage       Metro
Frequency (MHz)              620             320
Area with cache (mm2)        2.54            1.46
Area without cache (mm2)     1.80            0.90
Cache size                   16K/16K         16K/16K
Power with cache (mW/MHz)    0.43            0.23
Power w/o cache (mW/MHz)     0.37            0.18

* Either quoted from fully floor-planned layout/synthesis trials or scaled with respect to process and library performance

ARM11 MPCore Usage Models
- SMP: Symmetric Multi-Processing
  [Figure: one operating system spans CPU0 and CPU1; the CPUs use shared resources and each CPU also has private resources]
- AMP: Asymmetric Multiprocessing
  [Figure: AMP SW 0 runs on CPU0 and AMP SW 1 runs on CPU1; the CPUs use shared resources and each CPU also has private resources]
- ARM11 MPCore supports both models

ARM11 MPCore Processor
- Supports both AMP and SMP
- Each CPU:
  - Boots independently
  - Shares the physical memory map
  - Elects whether to participate in the coherency domain
  - This provides software portability and flexibility
- Provides an advanced, optimised coherence protocol
  - Advanced Snoop Control Unit
  - Optimized MESI implementation
- Provides an integrated GIC
  - Highly configurable
  - Interrupt Distributor/Interfaces
  - Inter-Processor Communication

The MESI coherency protocol
- MESI is a write-invalidate cache protocol
  - When writing to a shared location, the related cache line is invalidated in all other caches in the L1 memory system
- Coherent cache line attributes: every cache line is marked with one of the following four states (coded in two bits)
  - Modified: coherent cache line is not up to date with main memory
  - Exclusive: up to date and no other copies exist
  - Shared: coherent cache line which is up to date with main memory
  - Invalid: this coherent cache line is not present in the cache
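The write-invalidate behaviour described above can be illustrated with a toy model of a single cache line. This is only a sketch of textbook MESI under stated assumptions: the names `mesi_t`, `line_t`, `mesi_read` and `mesi_write` are hypothetical, four CPUs are assumed, and none of the MPCore-specific optimizations are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* The four MESI states named on the slide. */
typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_t;

#define NUM_CPUS 4  /* assumption: four cores */

/* State of ONE cache line in each CPU's L1 cache (toy model). */
typedef struct { mesi_t state[NUM_CPUS]; } line_t;

/* CPU `cpu` reads the line. */
void mesi_read(line_t *l, int cpu)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    if (l->state[cpu] != MESI_INVALID)
        return;                         /* read hit: state unchanged */

    int others = 0;
    for (int i = 0; i < NUM_CPUS; i++) {
        if (i == cpu || l->state[i] == MESI_INVALID)
            continue;
        others = 1;
        /* A Modified copy would be written back here; all copies become Shared. */
        l->state[i] = MESI_SHARED;
    }
    l->state[cpu] = others ? MESI_SHARED : MESI_EXCLUSIVE;
}

/* CPU `cpu` writes the line: every other copy is invalidated. */
void mesi_write(line_t *l, int cpu)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    for (int i = 0; i < NUM_CPUS; i++)
        if (i != cpu)
            l->state[i] = MESI_INVALID;
    l->state[cpu] = MESI_MODIFIED;      /* now newer than main memory */
}
```

For example, starting from an all-Invalid line, a read by CPU 0 leaves it Exclusive, a later read by CPU 1 makes both copies Shared, and a write by CPU 1 leaves CPU 1 Modified and CPU 0 Invalid.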

MPCore MESI optimizations
- The key is to avoid main memory accesses
- Duplicated Tag RAMs
  - Stored in the Snoop Control Unit
  - Check whether requested data is in other MP11 caches without accessing them
- Direct Data Intervention (cache-to-cache transfer)
  - Copy clean data from one MP11 cache to another
- Migratory Lines
  - Move dirty data from one MP11 to another and skip the MESI Shared state
  - Avoids writing to L2 and reading the data back from external memory

Generic Interrupt Controller (GIC)
- Distributor
  - Detects and prioritizes interrupts
  - Used for:
    - configuring interrupt inputs
    - routing interrupts
- CPU interfaces
  - One per CPU
  - Used for:
    - acknowledging interrupts
    - masking interrupts
    - clearing interrupts

Inter-Processor communication
- Performed by means of Inter-Processor Interrupts (IPIs)
  - Using reserved interrupts 0 to 15
  - Software triggered by writing to a register

Bits      Field
[31:26]   SBZ
[25:24]   Target list filter
[23:16]   CPU Target List
[15:10]   SBZ
[9:0]     Interrupt ID

- Different broadcast modes
  - Send interrupt to specific CPUs
  - Send to all but self
  - Send to self
- Treated like normal (external) interrupts by the Distributor
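The register write that triggers an IPI can be sketched as follows. This is an illustration, not code from the talk: the helper names `ipi_encode` and `ipi_send` are hypothetical, the field positions follow the layout on the slide, and the target list filter values (0 = use target list, 1 = all but self, 2 = self) are an assumption that should be checked against the ARM11 MPCore Technical Reference Manual.

```c
#include <assert.h>
#include <stdint.h>

/* Target list filter, bits [25:24]. Assumed encoding; verify against the TRM. */
enum ipi_filter {
    IPI_TO_TARGET_LIST = 0,  /* send to the CPUs named in the CPU target list */
    IPI_ALL_BUT_SELF   = 1,  /* send to every CPU except the sender */
    IPI_SELF_ONLY      = 2   /* send to the sender only */
};

/* Build the value to write: filter in [25:24], CPU mask in [23:16], ID in [9:0]. */
uint32_t ipi_encode(enum ipi_filter filter, uint8_t cpu_mask, unsigned irq_id)
{
    assert(irq_id <= 15);    /* IPIs use the reserved interrupt IDs 0 to 15 */
    return ((uint32_t)filter << 24) |
           ((uint32_t)cpu_mask << 16) |
           (irq_id & 0x3FFu);
}

/* `sgi_reg` is a hypothetical pointer to the memory-mapped software
 * interrupt register; its address is platform specific. */
void ipi_send(volatile uint32_t *sgi_reg, enum ipi_filter filter,
              uint8_t cpu_mask, unsigned irq_id)
{
    *sgi_reg = ipi_encode(filter, cpu_mask, irq_id);
}
```

With these assumptions, `ipi_encode(IPI_TO_TARGET_LIST, 1u << 2, 5)` targets CPU 2 with interrupt ID 5, while `ipi_encode(IPI_ALL_BUT_SELF, 0, 1)` corresponds to the "all but self" broadcast mode.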

The Development Environment
- Hardware
  - RealView Versatile Emulation Baseboard
  - CT11MPCore Core Tile
  - RealView ICE 3.1
- Software
  - RealView Development Suite 3.1
  - Linux 2.6.17
  - GCC 4.2.0

Implementation design

[Figure: a single multi-core processor (ARM11 MPCore). Control plane: Linux on CPU 0 with the /dev device driver. Data plane: SMP µkernel on CPU 1, CPU 2 and CPU 3. Physical memory map (not to scale): 0 MB to 126 MB Linux (CPU0 only); 126 MB to 128 MB shared buffer; 128 MB to 256 MB SMP ukernel only.]

The Control Plane (Linux)
- Linux kernel v2.6.17 from kernel.org
  - Applied MPCore support patches (now included in 2.6.21)
- Kernel built for 1 CPU but with SMP support
  - We need the SCU and shared memory for the communication buffer
- SMP Linux boots on CPU0
  - Unaware of any other CPUs
  - Only aware of 0 to 126 MB of physical RAM (out of 256 MB)
  - Using mem=126m in bootargs

The Data Plane (SMP ukernel)
- Light-weight bare-metal scheduler and threading library
  - NOT AN OS!
- Initialises the CPUs partaking in the SMP subsystem
- Implements a subset of POSIX Threads
  - Provides a well-known programming abstraction
- Simple round-robin pre-emptive scheduling algorithm
  - No optimization (created for demonstrative purposes only)
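A round-robin pick of the next runnable thread might look like the sketch below. The ukernel's actual scheduler is not shown in the slides, so every name here (`rr_sched_t`, `rr_next`, `MAX_THREADS`, the thread states) is hypothetical; the function would be called from whatever timer tick or scheduling IPI pre-empts the running thread.

```c
#include <assert.h>

#define MAX_THREADS 8   /* hypothetical limit */

typedef enum { T_UNUSED, T_READY, T_BLOCKED } tstate_t;

typedef struct {
    tstate_t state[MAX_THREADS];
    int current;        /* index of the running thread, -1 if none */
} rr_sched_t;

/* Pick the next READY thread after `current`, wrapping around.
 * The current thread is considered last, so it keeps running only
 * when nothing else is ready. Returns -1 if no thread is ready. */
int rr_next(rr_sched_t *s)
{
    assert(s->current >= -1 && s->current < MAX_THREADS);
    for (int step = 1; step <= MAX_THREADS; step++) {
        int cand = (s->current + step + MAX_THREADS) % MAX_THREADS;
        if (s->state[cand] == T_READY) {
            s->current = cand;
            return cand;
        }
    }
    return -1;
}
```

Each call hands the CPU to the next ready slot in circular order, which gives every ready thread an equal turn without any priority bookkeeping.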

SMP System initialization (boot)

[Figure: boot flow. Columns: primary CPU and secondary CPUs (CPU1, CPU2, CPU3). All CPUs pass through the common reset handler and a per-CPU core init. Remaining stages, in the order shown: C/C++ library, thread library, timer/interrupts (annotated "Operating system initializes here"); holding pen; enable IPIs for scheduling; SMP scheduler ready; create threads; main().]

Memory Layout
- RealView Versatile EB: 256 MB RAM
- Linux loaded at 0x00008000 (mem=126M)
- ukernel loaded at 0x08008000
- Communication buffers (126M to 128M): CTL, INPUT CHANNEL, CTL, OUTPUT CHANNEL

Address map (not to scale): 0M to 126M Linux | 126M to 128M communication buffers | 128M to 256M ukernel | address space ends at 4096M

    typedef volatile struct {
        int lock;
        int num_packets;
        char *base;
        unsigned int head;
        unsigned int tail;
        unsigned int size;
        unsigned int available;
        /* more */
    } control_data_buffer_t;
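The fields of `control_data_buffer_t` suggest a circular buffer. The following single-threaded sketch shows one way such a channel could work; the meanings given to `head`, `tail` and `available` are assumptions, the helpers `channel_init`, `channel_put` and `channel_get` are hypothetical, and the `lock` field is left unused (a real cross-CPU channel would have to take it around every update).

```c
#include <assert.h>
#include <stddef.h>

typedef volatile struct {
    int lock;                /* unused in this single-threaded sketch */
    int num_packets;
    char *base;              /* start of the data area */
    unsigned int head;       /* assumed: next index to write */
    unsigned int tail;       /* assumed: next index to read */
    unsigned int size;       /* capacity of the data area in bytes */
    unsigned int available;  /* assumed: bytes stored and not yet read */
} control_data_buffer_t;

void channel_init(control_data_buffer_t *c, char *storage, unsigned int size)
{
    assert(storage != NULL && size > 0);
    c->lock = 0;
    c->num_packets = 0;
    c->base = storage;
    c->head = 0;
    c->tail = 0;
    c->size = size;
    c->available = 0;
}

/* Copy up to `count` bytes in; returns how many fitted. */
size_t channel_put(control_data_buffer_t *c, const char *src, size_t count)
{
    size_t n = 0;
    while (n < count && c->available < c->size) {
        c->base[c->head] = src[n++];
        c->head = (c->head + 1) % c->size;
        c->available = c->available + 1;
    }
    return n;
}

/* Copy up to `count` bytes out; returns how many were read. */
size_t channel_get(control_data_buffer_t *c, char *dst, size_t count)
{
    size_t n = 0;
    while (n < count && c->available > 0) {
        dst[n++] = c->base[c->tail];
        c->tail = (c->tail + 1) % c->size;
        c->available = c->available - 1;
    }
    return n;
}
```

Both helpers return short counts rather than blocking, which matches the "up to count bytes" wording used for the driver's read() and write() operations.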

The communication driver
- Linux kernel module
  - /dev/ukernel abstracts the communication buffer
  - Implements open(), close(), write(), read() and ioctl()s
- Loading the module
  - Boots the ukernel on the data plane (via the secondary boot register)
  - Maps in the shared communication buffer (126 MB to 128 MB)
  - We benefit from DDI and cache line migration
- Unloading the module
  - Exits the ukernel on the data plane, sending the CPUs to WFI power saving
  - Resets and un-maps the communication buffer

Loading the module (1)

    COMMS_FILE=/dev/ukernel
    PRIMARY_NUM=42
    SECONDARY_NUM=0
    MODULE_NAME=ukernel.ko

    # Clean up
    if [ -e $COMMS_FILE ]; then
        rm $COMMS_FILE
        if [ "$?" -ne "0" ]; then
            echo -e "ERROR: Failed to remove $COMMS_FILE!\n"
            exit 1
        fi
    fi

    # Create /dev character driver file for the communication module
    mknod -m a+rw $COMMS_FILE c $PRIMARY_NUM $SECONDARY_NUM
    if [ "$?" -ne "0" ]; then
        echo -e "ERROR: Failed to create $COMMS_FILE!\n"
        exit 1
    fi

    # Install the module
    insmod $MODULE_NAME
    if [ "$?" -ne "0" ]; then
        echo -e "ERROR: Failed to install the $MODULE_NAME module!\n"
        exit 1
    fi

    exit 0

Loading the module (2)

    static int __init ukernel_mod_init(void)
    {
        /* Write the ukernel entry point to the secondary boot register */
        writel((ukernel_base_pa + entrypoint_offset),
               (IO_ADDRESS(VERSATILE_HDR_BASE) + VERSATILE_HDR_FLAGSS_OFFSET));

        /* Map the shared communication buffer as cacheable, bufferable and shared */
        comms_base_va = __ioremap_pfn(comms_base_pa >> PAGE_SHIFT, 0, shared_size,
                                      L_PTE_CACHEABLE | L_PTE_BUFFERABLE | L_PTE_SHARED);
        memset_io(comms_base_va, 0, shared_size);

        [...] /* Initialise control data structures etc... */

        ipi_broadcast_all_but_self(wake_work_plane_ipi);

        /* Register the device driver - only after we initialised everything */
        result = register_chrdev(major_number, DRIVER_NAME, &device_fops);
        [...] /* Handle failure to register */

        return 0;
    }

Unloading the module (1)

    COMMS_FILE=/dev/ukernel
    MODULE_NAME=ukernel.ko

    # Uninstall the module
    rmmod $MODULE_NAME
    if [ "$?" -ne "0" ]; then
        echo -e "ERROR: Failed to uninstall the $MODULE_NAME module!\n"
        exit 1
    fi

    # Clean up
    if [ -e $COMMS_FILE ]; then
        rm $COMMS_FILE
        if [ "$?" -ne "0" ]; then
            echo -e "ERROR: Failed to remove $COMMS_FILE!\n"
            exit 1
        fi
    fi

    echo "Success."
    exit 0

Unloading the module (2)

    static void __exit ukernel_mod_exit(void)
    {
        /* Reset the other CPUs - Send them to _wfi() */
        ipi_broadcast_all_but_self(quit_work_plane_ipi);

        /* Unmap the shared region */
        iounmap(comms_base_va);

        /* Unregister the device driver */
        result = unregister_chrdev(major_number, DRIVER_NAME);
        if (result < 0) {
            printk(KERN_ERR "Failed to unregister driver (result = %d).\n", result);
            return;
        }

        printk(KERN_INFO "Exiting module.\n");
    }

write() file operation

    static ssize_t device_write(struct file *file, char *buf, size_t count, loff_t *offset)

- Called when a user program calls write()
- Writes up to count bytes to the communications buffer
- Returns the number of bytes successfully written

    [...]
    ret = write(dev_fd, buffer, PACKET_SIZE);
    [...]

OR

    cat data_file.dat > /dev/ukernel
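Because write() may accept fewer than count bytes, a caller that must deliver a whole packet needs a retry loop. The helper below is a user-space sketch, not code from the talk: `write_all` is a hypothetical name, and backing off with sleep(1) when the driver accepts zero bytes is an assumption that mirrors the polling style of the read() example in these slides.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling write() until every byte has been accepted.
 * Returns 0 on success, -1 on error (errno is set by write()). */
int write_all(int fd, const char *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    while (count > 0) {
        ssize_t n = write(fd, buf, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0) {
            sleep(1);            /* assumed: buffer full, wait for the consumer */
            continue;
        }
        buf += n;                /* skip what was accepted */
        count -= (size_t)n;
    }
    return 0;
}
```

A caller would then replace a bare `write(dev_fd, buffer, PACKET_SIZE)` with `write_all(dev_fd, buffer, PACKET_SIZE)` and check for a return value of -1.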

read() file operation

    static ssize_t device_read(struct file *file, char *buf, size_t count, loff_t *offset)

- Called when a user program calls read()
- Retrieves up to count bytes from the communications buffer
- Returns the number of bytes successfully read

    [...]
    /* Try to read the packet. If buffer empty, keep trying */
    while ((ret = read(dev_fd, buffer, PACKET_SIZE)) == 0) {
        printf(".");
        sleep(1);
    }
    [...]

OR

    cat /dev/ukernel > data_file.dat

Control operations

    static int device_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg)

- Used to send control command signals to the work plane

CONTROL COMMAND             TYPE     DATA                   SEQ NUM   IPIs
UKERNEL_SET_NUM_THREADS     _IOW     int                    1         no
UKERNEL_SET_CONSUMER_ID     _IOW     int                    2         no
UKERNEL_GET_CONS_DESCR      _IOWR    consumer_thread_t *    3         no
UKERNEL_LAUNCH_CONSUMERS    _IO      N/A                    4         yes
UKERNEL_COMMS_FINISHED      _IO      N/A                    5         yes
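The command table maps naturally onto the standard Linux ioctl encoding macros. The definitions below are a guess at what ukernel_mod.h might contain, not its actual contents: the magic character `'u'` is hypothetical, and `ukernel_cmd_sends_ipi` is an illustrative helper that simply restates the IPIs column.

```c
#include <assert.h>
#include <sys/ioctl.h>

typedef struct {
    int id;
    void *(*function)(void *);
    char description[32];
} consumer_thread_t;

#define UKERNEL_IOC_MAGIC 'u'   /* hypothetical magic number */

/* Type and sequence number taken from the table above. */
#define UKERNEL_SET_NUM_THREADS   _IOW(UKERNEL_IOC_MAGIC, 1, int)
#define UKERNEL_SET_CONSUMER_ID   _IOW(UKERNEL_IOC_MAGIC, 2, int)
#define UKERNEL_GET_CONS_DESCR    _IOWR(UKERNEL_IOC_MAGIC, 3, consumer_thread_t *)
#define UKERNEL_LAUNCH_CONSUMERS  _IO(UKERNEL_IOC_MAGIC, 4)
#define UKERNEL_COMMS_FINISHED    _IO(UKERNEL_IOC_MAGIC, 5)

/* Returns non-zero for the commands the table marks as raising an IPI. */
int ukernel_cmd_sends_ipi(unsigned long cmd)
{
    assert(_IOC_TYPE(cmd) == UKERNEL_IOC_MAGIC);
    return cmd == UKERNEL_LAUNCH_CONSUMERS || cmd == UKERNEL_COMMS_FINISHED;
}
```

Encoding the direction and argument size into each command number lets the driver reject a mismatched call before it touches the shared buffer.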

Opening the communications device

    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct {
        int id;
        void *(*function)(void *);
        char description[32];
    } consumer_thread_t;

    #include "ukernel_mod.h"

    #define DEVICE_NAME "/dev/ukernel"

    int main(int argc, char **argv)
    {
        int dev_fd;
        int ret;
        int num_threads, i;
        consumer_thread_t consumer_threads[MAX_CONSUMER_THREADS];

        dev_fd = open(DEVICE_NAME, O_RDWR);
        if (dev_fd == -1) {
            printf("\nNot able to open device file \"%s\".\n", DEVICE_NAME);
            return -1;
        }

List available consumers

    printf("\nThe list of available consumers is:\n");

    ret = ioctl(dev_fd, UKERNEL_GET_CONS_DESCR, (consumer_thread_t *) &consumer_threads[0]);
    if (ret != 0)
        perror("ioctl");

    for (i = 0; i < MAX_CONSUMER_THREADS; i++)
        if (consumer_threads[i].function != NULL)
            printf("\n%d -> %s", i, consumer_threads[i].description);

Launch consumers

    printf("\nLaunching %d threads of ID %d.\n", num_threads, consumer_id);

    /* Setting the consumer ID */
    ret = ioctl(dev_fd, UKERNEL_SET_CONSUMER_ID, &consumer_id);
    if (ret != 0)
        perror("ioctl");

    /* Setting the number of threads */
    ret = ioctl(dev_fd, UKERNEL_SET_NUM_THREADS, &num_threads);
    if (ret != 0)
        perror("ioctl");

    /* Launch them */
    ret = ioctl(dev_fd, UKERNEL_LAUNCH_CONSUMERS);
    if (ret != 0)
        perror("ioctl");

Interacting with the data plane
- Once consumer(s) have been launched:
  - The data plane polls for data
  - Power saving mode (WFI while no data is available)
- An ioctl() signals that the control plane has no more data to send
  - The data plane quits once the buffer has been emptied
- What to do with this data is implementation defined
  - Depends on application requirements
- Passing data to the data plane is as simple as:

    $ cat data_file.dat > /dev/ukernel

Usage cases: Data engines
- Inside the work plane (the ukernel)
- Register consumer threads

    void register_consumer_function(int id, void *(*function)(void *), char *description)

- Provide the implementation of the consumer threads
  - POSIX Threads thread-function format
- Data consumption and returning are implementation defined:
  - Blowfish encrypt/decrypt
  - Work out the CRC of a data packet
  - etc.
- Input stream packetized or not
- Output stream order dependent
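A consumer that works out the CRC of a data packet could be registered as sketched below. Only the signature of `register_consumer_function` comes from the slide; its body here is a stand-in for the ukernel's own implementation, and `MAX_CONSUMER_THREADS`, the `consumers` table, `packet_t` and `crc_consumer` are hypothetical. The checksum is the common reflected CRC-32 (polynomial 0xEDB88320), chosen for illustration since the slides do not say which CRC was used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONSUMER_THREADS 8  /* hypothetical; the real value is in ukernel_mod.h */

typedef struct {
    int id;
    void *(*function)(void *);
    char description[32];
} consumer_thread_t;

static consumer_thread_t consumers[MAX_CONSUMER_THREADS];

/* Stand-in for the ukernel's registration routine. */
void register_consumer_function(int id, void *(*function)(void *), char *description)
{
    assert(id >= 0 && id < MAX_CONSUMER_THREADS);
    size_t i = 0;
    consumers[id].id = id;
    consumers[id].function = function;
    for (; description[i] != '\0' && i < sizeof consumers[id].description - 1; i++)
        consumers[id].description[i] = description[i];
    consumers[id].description[i] = '\0';
}

/* Hypothetical packet handed to a consumer thread. */
typedef struct { const uint8_t *data; size_t len; uint32_t crc; } packet_t;

/* Consumer in POSIX thread-function format: bitwise CRC-32 of the packet. */
void *crc_consumer(void *arg)
{
    packet_t *p = arg;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < p->len; i++) {
        crc ^= p->data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    p->crc = ~crc;
    return p;
}
```

Registration would then be a single call such as `register_consumer_function(1, crc_consumer, "CRC-32")`, after which the consumer shows up in the list returned by UKERNEL_GET_CONS_DESCR.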

Usage cases: Video decoder
- Additional Linux patch: secondary frame buffer
  - Linux writes to the primary buffer (original RGB out)
  - The work plane writes to the secondary buffer
  - Swapping between the two is done with an ioctl command
- Installed XviD decoder consumer engine
  - Displays color-converted frames to the secondary frame buffer

    $ cat video_stream.mp4 > /dev/ukernel

- Data is copied to internal temporary storage inside the work plane by a concurrent data-retriever thread
- No CPU utilisation on CPU0 while the video is processed!

Conclusions
ARM11 MPCore processor designed for:
- Flexibility
  - All CPUs are configurable as AMP or as part of the SMP subsystem
  - SCU designed to optimise migration of data within L1 boundaries
  - IPIs designed to allow CPUs to communicate without going through memory
- Scalability
  - Add processing power when required
  - Offload computation to separate cores/data engines
- Power saving
  - Limit power utilization when a resource is not needed
  - Switch the work plane off (WFI) when data is not available