Towards an automatic co-generator for manycores' architecture and runtime: STHORM case-study

Procedia Computer Science, Volume 51, 2015, Pages 2809-2813
ICCS 2015 International Conference On Computational Science

Charly Bechara, Karim Ben Chehida and Farhat Thabet
CEA, LIST, 91191 Gif-sur-Yvette CEDEX, FRANCE
charly.bechara@cea.fr, karim.ben-chehida@cea.fr, farhat.thabet@cea.fr

Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2015. © The Authors. Published by Elsevier B.V. doi:10.1016/j.procs.2015.05.439

Keywords: Runtime, Manycore, IP-XACT, Automatic generator, STHORM, SESAM

Introduction

The increasing design complexity of manycore architectures at the hardware (HW) and software (SW) levels requires powerful tools capable of validating every functional and non-functional property of the architecture. At the design phase, the chip architect needs to explore several parameters from the design space, and iterate on different instances of the architecture, in order to meet the defined requirements. Each new architectural instance requires the configuration and generation of a new hardware model/simulator, its runtime, and the applications that will run on the platform, which is a long and error-prone task. In this context, the IP-XACT [3] standard has become widely used in the semiconductor industry to package IPs and provide a low-level SW stack that eases their integration. In this work, we present preliminary work on a methodology for automatically configuring and assembling an IP-XACT golden model and generating the corresponding manycore architecture HW model, low-level software runtime, and applications. We use the STHORM [1] manycore architecture as a case study.

Automatic generator methodology

The idea is to work on a single IP-XACT model with the different abstractions (mainly at the interface level) commonly used in the design space exploration (DSE) and implementation phases, in order to guarantee the coherency of the TLM (Transaction Level Modeling) and RTL (Register Transfer Level) architecture models. The DSE phase relies on fast TLM simulations, analysis of the results against the target optimization criteria (performance, power, and reliability), and modification of the global parameters of the IP-XACT model, which closes the loop and guides convergence across iterations. The IP-XACT design flow methodology, shown in Figure 1, is composed of four main steps:

1. IP-XACT platform model: an IP-XACT model of the manycore architecture is assembled from the IP-XACT IP (Intellectual Property) library, taking the different IP parameters into account. From the IP-XACT platform model, which is in XML format, two design configurations can be derived, targeting TLM-level and RTL-level interconnect abstractions.

2. Platform Generators: in order to build a platform simulator corresponding to the design parameters of the current DSE iteration, it is important to automate the generation of the corresponding TLM or RTL simulators, the software runtime and the application (using, for example, the IP-XACT standardized Tight Generator Interface (TGI)) and to adapt them to the set of parameters of the DSE iteration (such as the number of processors/clusters, degree of parallelism, custom IPs used, etc.).

a. TLM/RTL simulator: starting from the TLM/RTL models, the IP libraries and the configuration parameters, a custom generator can produce the corresponding TLM or RTL simulator.

b. SW runtime: the low-level hardware dependent software (HDS) layer (corresponding mainly to simple register accesses and the system memory map) can be generated by aggregating the IP-level HDS information. The SW runtime used in this study [4] is a set of libraries (communication, execution engines, synchronization, resource management) in which the resource management library is built on top of the HDS layer. A custom generator can build a new runtime for each design iteration.

c. Application: a custom generator can exploit the new configuration parameters to restructure the application accordingly. For instance, OpenMP pragmas can be inserted.

3. Manycore architecture simulator: the fast simulation phase is based on SESAM [2], a timed TLM simulator designed in our laboratory, which delivers reports and statistics on functional and non-functional criteria such as performance, power and reliability. The SESAM simulator takes as input the generated TLM top netlist, the TLM IP library, the generated SW runtime, and the compiled application to launch a global simulation. SESAM also supports the integration of RTL models for co-simulation. After convergence of the DSE loop, the final step is to generate the RTL netlist for the overall manycore architecture from the IP-XACT model, and then to follow the traditional hardware simulation and emulation flow with the corresponding EDA (Electronic Design Automation) tools.

4. Design analysis & optimization: the design analysis tool compares the resulting metrics against the initial system requirements. Based on the comparison results, the design optimization engine uses heuristics to modify the initial IP-XACT model parameters, and even its specifications.

Figure 1: The unified IP-XACT based design flow for fast design space exploration

STHORM case-study

In this work, we use the STHORM [1] manycore architecture and the HBDC (Human Body Detection Counter) application as a case study. In order to model the STHORM architecture in SESAM (Figure 2), we extract the following information from the architectural description: the modules that do the actual computation or processing (such as the STxP70 processor, the Hardware Synchronizer HWS [5], the Fabric Controller, and other elements), the memories and caches, the interconnection networks, and the latencies of the different modules (measured using special counters in the emulated HW design, or on the real chip). Each component is a SystemC model with TLM interfaces. From the IP-XACT model of the whole architecture, the toolchain generates the top-level netlist for SESAM, the low-level runtime software, and the system map of the architecture. This corresponds to steps 1, 2.a and part of step 2.b of our methodology.
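As an illustration of the HDS generation in step 2.b, the following minimal sketch derives C address macros from a register description. It is not the toolchain used in this work: the XML is a simplified, namespace-free fragment only loosely modelled on IP-XACT (a real IP-XACT description is namespaced and considerably richer), and the component name, register names, addresses and the `generate_hds_header` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment loosely modelled on an IP-XACT
# component description. The component, register names and addresses
# are hypothetical and are not taken from the STHORM design.
COMPONENT_XML = """
<component>
  <name>hws</name>
  <memoryMap>
    <addressBlock>
      <name>ctrl</name>
      <baseAddress>0x10000000</baseAddress>
      <register><name>event_set</name><addressOffset>0x0</addressOffset></register>
      <register><name>event_wait</name><addressOffset>0x4</addressOffset></register>
    </addressBlock>
  </memoryMap>
</component>
"""


def generate_hds_header(xml_text):
    """Return C #define lines giving the absolute address of each register."""
    root = ET.fromstring(xml_text)
    ip_name = root.findtext("name").upper()
    lines = []
    for block in root.iter("addressBlock"):
        base = int(block.findtext("baseAddress"), 16)
        for reg in block.iter("register"):
            offset = int(reg.findtext("addressOffset"), 16)
            macro = f"{ip_name}_{reg.findtext('name').upper()}_ADDR"
            # Absolute address = block base address + register offset.
            lines.append(f"#define {macro} 0x{base + offset:08X}u")
    return "\n".join(lines)


header = generate_hds_header(COMPONENT_XML)
print(header)
```

Aggregating such per-IP fragments over every component instantiated in the platform model would yield the system memory map on which a resource management library could be built.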

Figure 2: STHORM model in SESAM

The HBDC application runs in an airport security context and counts the number of passengers that pass in front of a camera or a multi-camera configuration. In our case, the real-time requirements are: 4 cameras with HD resolution, 30 fps, and 10 detected humans per image. The overall computation power needed is around 50 GOPS. Profiling of the application showed that 90% of the execution time is spent in the human extraction part. This part is dynamic and highly parallelizable by sub-images, and thus can run on multiple processors. This is a promising property for the DSE.

Conclusion and Future work

In this preliminary study, we have introduced the problem of system model coherency in the design space exploration flow for digital systems. The current work consists of building the automation system of the generator for the configurable SW runtime and the applications. In addition, we are currently working on the fourth and last phase of the methodology (design analysis & optimization) in order to obtain a closed-loop automated DSE flow.
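The closed-loop flow targeted by this work can be sketched, under strong simplifying assumptions, as a loop over candidate configurations. In the sketch below a toy analytic model based on Amdahl's law stands in for a SESAM simulation run, and the 90% parallel fraction reported by the HBDC profiling is used while assuming that the remaining 10% is strictly serial. The function names, the candidate processor counts and the target speedups are hypothetical.

```python
def estimated_speedup(num_pes, parallel_fraction=0.9):
    """Amdahl's-law speedup over one processor; a stand-in for a simulator run."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_pes)


def explore(target_speedup, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Return (num_pes, speedup) for the first candidate meeting the target, else None."""
    for num_pes in candidates:
        # Steps 1-2 (configure the model, regenerate simulator, runtime and
        # application) and step 3 (simulate) are collapsed into one model call.
        speedup = estimated_speedup(num_pes)
        # Step 4: compare the resulting metric with the requirement.
        if speedup >= target_speedup:
            return num_pes, speedup
    return None


result = explore(target_speedup=5.0)
unreachable = explore(target_speedup=10.0)
print(result, unreachable)
```

With these assumptions a target speedup of 5 is first met with 16 processors, whereas a target of 10 is never met, since a 10% serial part bounds the speedup strictly below 10. In an actual flow, the model call would be replaced by the generation and simulation steps, and the fixed candidate list by the heuristics of the design optimization engine.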

References

[1] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, and D. Dutoit. Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications. In Proceedings of the 49th Annual Design Automation Conference (DAC '12), 2012.

[2] N. Ventroux, A. Guerre, T. Sassolas, L. Moutaoukil, G. Blanc, C. Bechara, and R. David. SESAM: An MPSoC Simulation Environment for Dynamic Application Processing. In 10th IEEE International Conference on Computer and Information Technology, June 2010.

[3] IEEE Standard for IP-XACT, Standard Structure for Packaging, Integrating, and Reusing IP within Tool Flows. IEEE Computer Society and the IEEE Standards Association Corporate Advisory Group. IEEE Std 1685-2009, 18 Feb. 2010.

[4] Y. Lhuillier, M. Ojail, A. Guerre, J.M. Philippe, K. Ben Chehida, F. Thabet, C. Andriamisaina, C. Jaber, and R. David. HARS: A hardware-assisted runtime software for embedded many-core architectures. ACM Trans. Embed. Comput. Syst., March 2014.

[5] F. Thabet, Y. Lhuillier, C. Andriamisaina, J.M. Philippe, and R. David. An efficient and flexible hardware support for accelerating synchronization operations on the STHORM many-core architecture. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 531-534, 18-22 March 2013.