KeyStone Training: KeyStone Device Tooling


KeyStone Training: KeyStone Device Tooling

Agenda
- Code Composer Studio v4
- KeyStone Architecture Simulator
- Multicore Application Deployment
- OpenMP Initiative

Code Composer Studio v4

Code Composer Studio v4 Summary
What is it?
- A major upgrade to CCS with major architectural changes
- Based on the Eclipse open-source software framework
- New registration/licensing/updating mechanism and model
Why Eclipse?
- Quickly becoming a standard for IDEs
- Excellent software architecture
- Ability to leverage the work of others
- Cross-platform support (i.e., Windows & Linux)
- Wide selection of 3rd-party plug-ins available
When?
- Now: the RTM release can be downloaded from http://tiexpressdsp.com
How?
- Restructuring of the debug stack
- Porting of existing features to Eclipse
- Taking the time to make sure migration will be as smooth as possible

CCSv4 Environment
- Customize toolbars & menus
- Perspectives contain separate window arrangements depending on what you are doing
- Tabbed editor windows
- Tab data displays together to save space
- Fast view windows don't display until you click on them

CCSv4 Multicore Environment
- Global run / halt / step operations
- Supports multiple projects, each of which can be launched on a different core
- Use the Debug view to select the context
- Memory and Cache views can be pinned to a specific CPU core
- Displays show content for the current debug context
- Integrated scripting console
- Memory Analysis tooltips show memory hierarchy details
- If desired, you can open a top-level IDE for any core

Multicore Tooling Elements
- Instrumentation
  - Correlated multicore event views
  - Multicore trace streams correlated with s/w instrumentation and bus events
  - MIPI.org-compliant System Trace
  - Sync Point Events enable correlation with a global timestamp without adding overhead to each event
- Bus analysis
  - EMIF performance monitoring
  - Bus performance monitoring (throughput, bus contention, event timing)
- Just-in-time instrumentation
  - Low-overhead installable benchmarking events
  - Uses silicon-based Advanced Event Triggering to hook in calls to event-logging software
- Monitor-based real-time instrumentation control
  - Target-side filtering and event triggering used to control what data is logged
  - Extensible Eclipse DSDP Target Communications Framework infrastructure
- IPC event monitoring
  - Annotated multicore transaction view
- Context-aware hardware and simulation trace
  - Injecting information into the trace stream to provide thread context, overlay context, and correlation of trace events with a global timebase (UIA logs)
- Context-aware, trace-based and simulator-based multicore profiling
(Legend in the original slide: items marked as enabled by the C66x architecture vs. the C64x+ architecture; the slide also references the Multicore Transaction Viewer and Multicore Event Correlation.)

Multicore Debugging
- Tools to debug memory corruption problems
  - Memory access outside of a spinlock
  - Memory corruption / configuration problems caused by DMA or peripherals
- Cross triggering
  - Ability to turn hardware instrumentation on/off from an event on a CorePac
  - Ability to enable a trigger on a CorePac in response to a bus event
- Real-time multicore debugging
  - Monitor-based task-level debugging with target-side global trigger generation and response
  - Remote real-time multicore debugging (Debug Control Interface + DSP Monitor)
- Multicore scripting support
  - Scriptable loading, testing, debugging
- Application-level debugging
  - Supports both AMP and SMP applications
  - JTAG-based Linux task-level debugging
(DVT: Data Visualization Technology)

CCSv4: Eclipse-based IDE Tooling Support
[Block diagram in the original slide: CCStudio v4 (Eclipse-based IDE) hosts DVT (Data Visualization Technology, used to build tools like the SoC Analyzer and Trace Analyzer), data visualization, a scripting console, and DSS (Debug Server Scripting), all on Eclipse 3.2 RCP. The Debug Server includes the HW trace sub-system (triggers and decodes) and emulation drivers, and connects through a simulator or an emulator (XDS560v2 / XDS560-Trace, trace receiver, STM receiver, JTAG emulation) to the multicore target device (JTAG, C64x+ HW trace, STM).]

Developer's Desktop: JTAG and STM Transports
[Block diagram in the original slide: the instrumentation client host runs DSS scripts, CCS4, DVT, and scriptable Java classes, with XML metadata describing endpoints and the system memory map. It connects over JTAG and an XDS560 Trace pod (with TCF for back-channel communications) to the target device. On the target, each CPU core (C66x with AET and a trace library) runs a DCI monitor, TCF agent, and transport adaptor; the application calls LogWrite through the OSAL & HAL / ILogger into an OST-compliant STM library. Trace and event data are captured in per-core ETBs, the System Trace Module (STM), and CP_Tracer modules, then decoded on the host (STM receiver, large memory buffer, local Rx timestamps, event data). The OST event header carries master ID, version, channel, sequence ID, timestamp count, entity ID, protocol ID, 8-bit length / 64-bit extended length, event code, and 4-8 event parameters. Legend: control & status path, data path, interface of one or more device pins. STM = System Trace Module, OST = Open System Trace, UIA = Unified Instrumentation Architecture.]

Multicore System Optimization
- Bus analysis provides visibility into system bus bottlenecks:
  - Bus performance monitoring using CP_Tracer modules (throughput, bus contention, event timing)
  - EMIF performance monitoring
- Multicore event monitoring and correlated CPU trace provide visibility into the real-time performance of the application:
  - Monitoring can show when a real-time deadline is missed on any CorePac, along with the bus activity and application events that occurred prior to and following the missed deadline.
- Multicore CorePac trace streams correlated with software instrumentation and bus events:
  - Capture traces for all CorePacs leading up to the missed deadline.
  - A function execution graph provides visibility into the amount of time spent in each function / thread leading up to the missed deadline.
- Context-aware, trace-based and simulation-based multicore profiling:
  - The profiling view shows how much time was spent processing each thread, and each function within each thread.

Context-Aware Profiling
- Context-aware profiling is thread-aware, overlay-aware, and application-aware.
- Basic purpose:
  - Store a software event log that contains information about the target context when the context changes (e.g., by instrumenting a task-switch hook function, such as the OSEck swap hook).
  - Inject a reference to this information into the trace stream / simulation event log when the event log is generated.
  - Events that occur after that point in the trace stream / simulation event log are known to have occurred in that context.
- Mechanism:
  - The C66x OVERLAY register allows 30 bits of information to be injected into the trace stream.
  - Sync Point Events are logged that contain the context information as well as the local CPU timestamp and global timestamp.
  - The sequence number that identifies this software event is written into the overlay register.
- Application-level profiling:
  - Whenever the application creates a new thread/task (on any core), it logs a sync point event that stores the application ID and the thread ID.
  - DVT can collect these events and identify all of the thread IDs associated with a particular application. It can then filter the trace data so that only entries that executed within the context of the specified application are included.
- Multicore application-level profiling:
  - As above, but for multiple cores; shows, for example, a function profile for all threads of an application across multiple cores.

DVT Overview
DVT provides a component framework for rapid creation of advanced analysis and visualization solutions.
- Data sources (retrieve data from a transport or file): Text File Reader, TCP/IP, Trace & STM
- Data processors: Decoders, Correlation Analysis, Time Correlator, Count Analyzer, Profile Analyzer, Time Base Analysis, State Machine
- Storage (store processed data): Unlimited Buffer, Circular Buffer, File Buffer
- Viewers (visualize processed data): Line Graph, State Graph, Discrete Graph, Table

Solution Creation & Run Time
- Graphical Solution Builder (solution editor with component properties) for wiring up components to create data analysis and visualization solutions
- Feature-rich solution runtime platform
- SDK for easy component creation
- Data correlation from multiple sources
- Eclipse plug-in; scriptable; runs standalone and integrates with CCStudio
- Available components include a Control Panel to control and configure solutions: Find, Filter, Zoom, Measurement Markers, View Correlation, Alignment, Export
- Visualization features: zooming, filtering, measurement markers, synchronous scrolling, and a color representing each core

Profiling Use Case / Profiling Spec
[Architecture diagram in the original slide: control clients (DVT visualization tools, standalone clients such as gprof, compiler/linker, third-party tools using standard formats) sit above the application layer (trace, breakpoint-based, simulator, compiler instrumentation), which feeds the transport layer (trace pod/cable, JTAG control, JTAG printf, simulator, raw formats) and post-processing on the target or host.]
- Function-level profiling: gprof equivalent; cycles per function, inclusive and (more importantly) exclusive; dynamic call graph
- Task-level profiling: cycles spent in each task; context-switching overhead
- Path profiling: taken / not taken (code coverage); frequency counts; misses vs. hits
- Event profiling: cache (CacheTune); stalls, internal/external; user defined
- Pipeline behavior: CPU stalls by address

Trace Analyzer
- Trace Analyzer 1.1 is integrated with CCSv4
- Supports ETB trace, which enables tracing of multiple cores simultaneously; each core has its own ETB (4K)
- Problems with the ETB approach: memory access time to the ETB is slow, and it adds load to system bus throughput

Breakpoint Manager
- Provides a UI for configuring target-specific breakpoint and trace features, e.g., AET (Advanced Event Triggering)
- Supports conditional breakpoints (stop-mode evaluation)
- Supports executing scripts in response to a breakpoint hit

Annotated Multicore Transaction View
Challenges:
- Difficult to view interactions (e.g., message-based communications) between cores
- Difficult to understand what DMA is doing or to correlate it with other events
- Difficult to correlate hardware C66x trace from each core with trace from other cores, software, or system events
Solutions:
- Annotated transition points and frame markers, such as tooltips (or, if zoomed in, text labels), show the associated event text description right next to the transition
- A top-to-bottom UML-style timeline makes it easier to read text labels
- STM events correlated with C66x trace: events logged to HW trace act as bookmarks; if logged to the ETB, trace collected from multiple cores can be correlated with each other and with STM events (hardware & software)
- Clicking on an event in the timeline causes the HW trace display to jump to that event
- A frame-based "onion skin" view allows you to view many frames at once and see how transaction timing varies from frame to frame, making it easier to spot timing anomalies and potential race conditions
[Timeline figure in the original slide: a Multicore Transaction View showing Frames 10 and 11 across ARM Cortex-A9 #1 (Fork/Join), ARM Cortex-A9 #2 (DataXfer/Ack, AsyncCompress/Complete), ARM Cortex-M3 #1 (Spinlock/Synchronize/Ack), and TI C66x, against a time axis.]

Internal Bus Monitoring
Counters logged:
- Initial access latency: total cycles between a new transfer request and the first data received
- Average throughput per master ID, with min and max markers
- Logical access latency: total cycles between a new transfer request and the last data received
- Throughput plot (4 counters): accumulates the byte count presented at the initiation of a new transfer; analysis per master ID
- Sliding time window: specifies the measurement interval for all the statistic counters
- Filter modes: except for the idle counter, counters can be filtered on a master ID or a group of master IDs

KeyStone Device Simulator

Tunneling SRIO Messages Over Ethernet / Ethernet Packet Processing
[Diagram in the original slide: an application running on a DSP CorePac (with NETCP and PKTDMA) inside the KeyStone Device Functional Simulator exchanges Ethernet packets (ETH header, layer 3-7 headers, payload) with the network through the Windows network drivers and protocol stack via WinPcap drivers. All Ethernet packets addressed to the DSP are forwarded to the simulator.]

Multicore Application Deployment

Multi-Application Programs
- A program may consist of multiple applications.
- All applications are linked into the executable and loaded into device memory at boot time.
- A main routine is able to branch into each application.
- A device may run multiple applications at a time, but a core can only run one application at a time.
- A core may dynamically switch to another application. The switch is controlled externally and only happens when the application is idle (no active connections).

Application Overlay
- Segments of different applications may be overlaid in the virtual address space; memory (MPAX/MPPA/MAR) registers must be reconfigured when switching applications.
- Segments of different applications may be overlaid in the physical address space; segments must be loaded and memory (MPAX/MPPA/MAR) registers reconfigured when switching applications.
- Parking unloaded segments in (external MSMC) memory speeds up the transition between applications.
- The overlay manager takes care of loading overlay segments and reconfiguring memory registers at run time.

Tooling Overview
- Applications are linked separately. The output file consists of code and data segments, and segments are bound to virtual addresses.
- Map tool: the input is a set of applications and a physical memory map. The tool partitions physical memory and assigns each segment a physical address.
- Run tool: the input is an application binary and the map tool output. For each segment, it copies the segment to the assigned physical address and programs the address translation hardware to map its virtual address to its physical address. Can run on the target or the host.

Tooling Illustrated
[Flow diagram in the original slide: 1. Static link creates ELF files (.obj files linked into green.exe, blue.exe, lib.so in the virtual address space). 2a. Prelink binds virtual addresses; 2b. the map tool allocates physical addresses (a physical map with a shared code partition and per-CorePac data partitions). 3. Create the load image in physical memory (directory + map). 4. Activate; 5. Activate a different application, with each CorePac (CorePac 0, CorePac 1) having its own virtual address space.]

Link Tool Processing
- The link tool is an evolution of existing link tools. Ideally, it needs no modifications to support multicore application deployment.
- The link tool generates an application image for each application.
- Input:
  - The link command file describes the segments and specifies their attributes.
  - The programmer does not specify virtual or physical addresses.
  - The programmer has to enforce the size and alignment constraints: the MPPA architecture imposes constraints on segments residing in PMC/DMC/UMC memory, and the MPAX architecture imposes constraints on segments residing in MSMC internal/external memory.
  - The relocatable ELF files provide the content for the segment images.
- Output:
  - The link tool stores the application image in an executable ELF file.

Map Tool Processing
- The map tool generates a map image. For each application, the map image specifies where the loaded image will reside (load address) and where each running segment will reside (run address).
- Input:
  - The deployment template file defines the memory layout of the device and, for each application, points to an executable ELF file, controls the memory allocation of the loaded image, and controls the memory allocation of each running segment.
  - The executable ELF files.
- Output:
  - The deployment load file stores the map image, followed by the application images.

Load and Run Tools
- The load tool stores the map image and the application images at the load address.
- The run tool starts an application on a given core: it may need to copy a segment from the load address to the run address, and it needs to configure the MPPA/MPAX registers.
- The run tool can also stop an application; it waits until the application decides that it is convenient to stop.

Flow Debug: Preemption
- Preemption: replacing one definition of a symbol with a definition in a separately linked module.
- Goals: do this at runtime, and without re-linking the original image.
- Issues:
  - The compiler must be aware of the possibility of preemption: avoid inlining, avoid inferences based on analyzing the behavior of the called function, and generate preemptable addressing.
  - If the call is in shared code, it may be preempted for some applications and not for others.
  - Obvious runtime and debug issues.
- Usual solution:
  - Functions marked as exported are candidates for preemption.
  - Keep the address in (private) data, the GOT, and reference it indirectly.
  - Preemption happens at dynamic link (load) time: replace the GOT entry.
  - Requires dynamic symbol tables and relocation information.
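To make the GOT-style indirection concrete, here is a small self-contained C sketch (hypothetical names, my own illustration rather than the TI toolchain's actual mechanism): calls to an exported function go through a private pointer table, so "preempting" the function at load time only requires overwriting the table entry, not re-linking the callers.

    #include <stdio.h>

    /* Original definition of an exported function. */
    static int log_event_default(int id) {
        printf("default logger: event %d\n", id);
        return 0;
    }

    /* Replacement definition, e.g. from a separately linked module. */
    static int log_event_traced(int id) {
        printf("traced logger: event %d\n", id);
        return 1;
    }

    /* A miniature "GOT": private data holding the addresses of exported
     * functions. Callers reference functions only through this table. */
    static int (*got_log_event)(int) = log_event_default;

    /* Application code calls through the GOT entry, never the symbol
     * directly, so the compiler cannot inline or specialize the callee. */
    static void app_run(void) {
        got_log_event(42);
    }

    int main(void) {
        app_run();                        /* uses the original definition   */
        got_log_event = log_event_traced; /* "dynamic link time": preempt   */
        app_run();                        /* now uses the replacement       */
        return 0;
    }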

OpenMP Initiative

OpenMP for Parallel Programming
Rationale:
- De facto industry standard for shared-memory parallel programming
- Supported on most major compiler/ISA platforms: GCC, Intel, ARM, PGI, Sun, IBM/Cell, etc.
- The language is evolving to support tasking models, heterogeneous systems, and streaming programming models
- Easy migration for an existing code base: C/C++-based directives (#pragma) are used to express parallelism

What is OpenMP?
- Open specifications for Multi-Processing (OpenMP)
- An API for specifying shared-memory parallelism in C, C++, and Fortran
- Consists of compiler directives, library routines, and environment variables
- Portable across shared-memory architectures
- Jointly defined and endorsed by a group of interested parties from the hardware and software industries, government, and academia
- Website: http://www.openmp.org

OpenMP Parallel Computing Solution Stack
- User layer: application, end user
- Programming layer (OpenMP API): directives and compiler, OpenMP library, environment variables
- System layer: runtime library, OS/system support for shared memory

OpenMP Features
Provides the means to:
- Create and destroy threads
- Assign / distribute work (a task) to threads
- Specify which data is shared and which is private to a thread
- Coordinate the actions of threads on shared data
Syntax:
- Most of the constructs in OpenMP are compiler directives or pragmas. For C and C++, the pragmas take the form:
  #pragma omp construct [clause [clause] ...]
- Include file for the OpenMP library routines:
  #include <omp.h>

OpenMP Execution Model
- The program begins as a single thread of execution.
- When a thread encounters a parallel region, it forks a team consisting of itself (the master) and zero or more other (slave) threads.
- Parallel tasks defined by OpenMP directives are assigned to the OpenMP threads. A task is a specific instance of executable code and its data; an OpenMP thread is an execution entity managed by the OpenMP runtime, with its own stack and static memory.
- There is an implicit barrier at the end of the region, after which only the master thread resumes execution.
(Diagram in the original slide: a master thread forking parallel regions, including a nested parallel region.)
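As a minimal illustration of this fork-join model (my example, not from the original slides), the following C program forks a team inside a parallel region and prints each thread's ID; the implicit barrier at the closing brace rejoins the team before the final printf runs on the master thread alone.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Fork: the master thread creates a team of threads. */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();        /* this thread's ID  */
            int nthreads = omp_get_num_threads(); /* size of the team  */
            printf("Hello from thread %d of %d\n", id, nthreads);
        }   /* Join: implicit barrier; only the master continues. */

        printf("Back to a single (master) thread\n");
        return 0;
    }

Built with an OpenMP-capable compiler, the team size can be set with the OMP_NUM_THREADS environment variable or with omp_set_num_threads(), as described in the runtime-library slide below.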

OpenMP Memory Model
- Threads have access to a shared memory (for shared data). Each thread can have a temporary view of the shared memory (e.g., registers, cache) between synchronization barriers.
- Threads have private memory (for private data). Each thread has a stack for data local to each task it executes, and access to a static memory area for threadprivate data.

- Thread creation: parallel
- Worksharing directives: for, sections, single, master, task
- Data-scoping clauses: shared, private, firstprivate, lastprivate, reduction, threadprivate
- Synchronization constructs: critical, barrier, atomic, flush, taskwait
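To make the data-scoping clauses concrete, here is a short illustrative C example (mine, not from the slides): the array x is shared by all threads, each thread gets its own copy of t initialized from the original value via firstprivate, and the per-thread partial sums are combined with a reduction clause.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void)
    {
        int x[N];       /* shared: one array seen by every thread      */
        int t = 10;     /* firstprivate: each thread gets a copy == 10 */
        long sum = 0;   /* reduction: per-thread partials, combined    */
        int i;

        for (i = 0; i < N; i++)
            x[i] = i;

        #pragma omp parallel for shared(x) firstprivate(t) private(i) \
                                 reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += x[i] + t;  /* each thread works on its own chunk    */

        printf("sum = %ld\n", sum);  /* N*t + N*(N-1)/2 = 509500        */
        return 0;
    }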

Run-Time Library and Environment
- Function-based locking: omp_init_lock, omp_destroy_lock, omp_set_lock, omp_unset_lock, omp_test_lock
- Thread execution and control: omp_get_num_threads, omp_get_thread_num, omp_in_parallel, omp_get_max_threads, omp_get_num_procs, omp_get_dynamic, omp_get_nested, omp_get_wtime, omp_set_num_threads, omp_set_dynamic, omp_set_nested
- Environment variables: OMP_NUM_THREADS, OMP_SCHEDULE

Work Sharing Constructs
Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (manual work decomposition):

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region with a worksharing for construct:

    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
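The lock routines listed above provide explicit mutual exclusion when a critical construct is not convenient. A small illustrative sketch (not from the slides): each thread adds its ID to a shared total, with an OpenMP lock protecting the update.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_lock_t lock;
        int total = 0;

        omp_init_lock(&lock);        /* create the lock before the region  */

        #pragma omp parallel
        {
            int id = omp_get_thread_num();

            omp_set_lock(&lock);     /* acquire: one thread at a time      */
            total += id;             /* update shared data                 */
            omp_unset_lock(&lock);   /* release                            */
        }

        omp_destroy_lock(&lock);     /* free lock resources                */
        printf("total = %d\n", total);
        return 0;
    }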

Summary
- Parallel programming model:
  - Data parallelism (omp parallel for)
  - Task parallelism (omp task); a small task sketch follows the references below
  - Productivity and flexibility (run-time load balance)
- Runtime requirements:
  - Thread create/destroy on multiple cores
  - Barriers and locks (semaphores, atomics, mutexes, ...)
  - Shared/private memory management
  - Coherency

For More Information
- Code Composer Studio 4 (CCSv4): http://processors.wiki.ti.com/index.php/category:code_composer_studio_v4
- Code Composer Studio 5 (CCSv5): http://processors.wiki.ti.com/index.php/category:code_composer_studio_v5
- Using OpenMP to Maximize Performance: http://learningmedia.ti.com/public/asp wtbu/tech_day/using OpenMP to Maximize Performance from Multicore DSP.wmv
- For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.
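As the task sketch referenced in the summary (my own example, not from the training material), the following C program computes a Fibonacci number with omp task and taskwait: one thread creates the root task inside a single construct, and the whole team executes the generated tasks.

    #include <stdio.h>
    #include <omp.h>

    /* Naive Fibonacci with task parallelism: each call spawns child tasks. */
    static int fib(int n)
    {
        int x, y;
        if (n < 2)
            return n;

        #pragma omp task shared(x)
        x = fib(n - 1);

        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait          /* wait for both child tasks */
        return x + y;
    }

    int main(void)
    {
        int result;

        #pragma omp parallel
        {
            #pragma omp single        /* one thread creates the root task */
            result = fib(20);
        }

        printf("fib(20) = %d\n", result);
        return 0;
    }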