Matisse: A system-on-chip design methodology emphasizing dynamic memory management

Diederik Verkest, Julio Leao da Silva Jr., Chantal Ykman, Kris Croes, Miguel Miranda, Sven Wuytack, Gjalt de Jong, Francky Catthoor, Hugo De Man

Abstract

MATISSE is a design environment intended for developing systems characterized by a tight interaction between control and data-flow behavior, intensive data storage and transfer, and stringent real-time requirements. MATISSE bridges the gap from a system specification, written in a concurrent object-oriented language, to an optimized embedded single-chip hardware/software implementation. MATISSE supports stepwise exploration and refinement of dynamic memory management, memory architecture exploration, and gradual incorporation of timing constraints before going to traditional tools for hardware synthesis, software compilation, and inter-processor communication synthesis. With this approach, specifications of embedded systems can be written in a high-level programming language using data abstraction. Application of MATISSE to telecom protocol processing systems in the ATM area shows significant improvements in area usage and power consumption.

1 Introduction

The complexity of modern telecommunication systems is rapidly increasing. A wide variety of services has to be transported and elaborate network management is needed. Such complex systems require a combination of hardware and software components to implement the required functionality at the desired performance level. For applications in this domain, the desired behavior is often characterized by complex algorithms that operate on large, dynamically allocated, stored data structures (e.g. linked lists, trees, ...), resulting in intensive data transfers and data storage.
Ideally the specification should reflect the conceptual partitioning of the problem, which typically corresponds to abstract data types (ADTs) along with services provided on the ADTs, and algorithms for the different processing tasks. As these conceptual entities can be readily specified in an object-oriented programming model using data abstraction and class inheritance features, the MATISSE system uses the C++ programming language as the basis for the behavioral specification. The MATISSE language extends the standard C++ language with features for expressing concurrent tasks and synchronization.

(Author affiliation footnote: Alcatel Telecom, F. Wellesplein 1, B-2018 Antwerp, Belgium.)

Behavioral hardware synthesis has been an active area of research for more than a decade (see e.g. [10]), but commercial behavioral synthesis tools offer only very limited support for complex data structures: usually only statically declared arrays and records are supported. All these synthesis environments provide scheduling and resource allocation capabilities that permit the designer to abstract from timing and hardware partitioning details. However, the designer is still largely responsible for specifying the memory hierarchy and organization. Also, manipulation and modification of stored data structures must be specified in terms of explicit memory I/O operations. For signal processing oriented applications, significant progress has been made towards high-level memory management and synthesis capabilities [5, 7, 8]. However, these techniques typically rely on the stream-based nature of the applications to minimize the size of large intermediate arrays.

The applications targeted by the MATISSE system require the support of irregular data structures, such as heaps, hash tables, trees, and linked lists, that are dynamically created and destroyed at run-time. In a traditional software run-time environment, the underlying operating system is responsible for all the background memory and storage related tasks.
In addition, the memory hierarchy is usually fixed. However, for embedded system solutions, relying on software run-time support may be expensive in terms of area, performance, and power. In addition, dedicated distributed memory architectures may be used. Hence, dynamic memory management behavior must be synthesized in the embedded system implementation itself.

In this paper we discuss MATISSE, a design environment that takes care of the background memory management problem for dynamic data structure intensive applications by bridging the gap between the conceptual design entry specification, based on C++, and traditional behavioral synthesis. The MATISSE environment addresses all

the aforementioned tasks to synthesize a custom distributed memory architecture. It allows the designer to explore different architectures so that an optimal choice can be made, which is crucial because memory bandwidth is often the main performance bottleneck in this type of application. Another important benefit of the MATISSE environment is that the specification is lifted to a higher level than is currently used for behavioral synthesis. The design entry point can be a high-level program using data abstraction, where the designer is not burdened unnecessarily by all the details of the implementation of data structures in a memory architecture.

In the next section we present the MATISSE specification model and design flow. In subsequent sections, we elaborate on those steps in the design flow directly related to the dynamic memory management. The extensive exploration feasible with the MATISSE environment will be illustrated on an industrial ATM application.

2 Matisse System

The system design flow starts from a concurrent object-oriented system specification using the MATISSE model [6], and targets an optimized embedded single-chip hardware/software implementation. We first introduce the MATISSE model, and then discuss the MATISSE design flow.

2.1 Model

Protocol processing applications are conceptually seen as sets of concurrent processes that access data (defined as sets of records). Although the target implementation of protocol processing applications is often a mixture of hardware and software components, they are best conceived at the top level from a software perspective. Concurrent object-oriented models play a central role in large scale hardware/software system design, since they allow system specification and fast system-level simulation. In [6], the MATISSE language is presented in detail. It is a concurrent object-oriented specification language, extended from the widely used programming language C++.
Minimal syntactic extensions to C++ are introduced to allow the specification of concurrent processes, inter-process communication and synchronization. We summarize the main features of the underlying MATISSE model below.

Processes and concurrency - It is possible to specify processes, called active objects, and data, called passive objects. Processes have their own local virtual memory space and default thread of control. They are only created at compile-time. Concurrency is specified at the process level, by the concurrent execution of the default threads of control of all created processes. Data may be created and destroyed in the local virtual memory space of the processes, either at compile-time or at run-time.

Communication - Within one process, communication is specified using C++ pointers. Between processes, communication is specified using global pointers. Except for their potentially higher cost of use, global pointers are used just like C++ pointers.

Synchronization - Due to concurrent computations, simultaneous accesses to data should be synchronized by using atomic functions. Whenever several threads call an atomic function, the function is executed the required number of times in a sequential order. The execution of an atomic function never interleaves with the execution of another atomic function within one process.

2.2 Design flow

The MATISSE design flow is depicted in Figure 1. The input to the design flow is a system together with its environment specified at the algorithmic level, using the MATISSE language.

[Figure 1. Matisse design flow. Boxes: Matisse specification; Abstract machine generation; Dynamic memory management; Process concurrency management; Physical memory management; HW synthesis; HW/SW i/f synthesis; SW synthesis.]

Abstract Machine (AM) generation creates an executable specification, suitable for simulation, exploration, and refinement of the system specification.
The AM consists of a set of communicating concurrent processes, an ultra-light operating system to manage the execution of these processes, and a user interface to allow the designer to make refinements of the MATISSE specification. The AM allows profiling of record accesses, inter-process communication, and virtual memory accesses. These profiling data are used to select an optimized implementation for the records, perform process concurrency management, and physical memory management, respectively.

Dynamic memory management - Protocol processing applications are often characterized by algorithms that operate on large data structures, which are dynamically allocated. The MATISSE language allows the designer to define these data structures using Abstract Data Types (ADTs), without

low-level specification details. When implementing these applications on a chip, efficient organization and implementation of the ADTs is crucial [2, 9, 13], and dynamic memory allocation must be handled efficiently both in terms of time and number of memory accesses. Therefore, refinement of the specification of the ADTs (ADT refinement) and of the memory management (Virtual Memory Management) is required before proceeding with synthesis.

Process concurrency management - The goal of process concurrency management is to meet the overall real-time requirements imposed on the application. This step involves process concurrency extraction, thread scheduling, processor allocation, process to processor assignment, and inter-process communication refinement.

Physical memory management - Typically, protocol processing applications require large storage capacities and very high I/O bandwidth to achieve the real-time requirements. This step aims to synthesize area and power efficient distributed memory architectures and memory management units, meeting the real-time requirements.

Finally, software compilation proceeds using traditional software compilers, hardware synthesis proceeds using high-level synthesis tools, and interface synthesis generates software device drivers for each software processor and VHDL specifications of the necessary hardware blocks allowing communication between hardware and software processors. The interface synthesis is performed using the hardware/software co-design environment COWARE [4, 1].

In the next three sections, we elaborate on the three steps that are relevant for the dynamic memory management: ADT refinement (section 3), virtual memory management (section 4), and physical memory management (section 5).

3 ADT refinement

In an implementation-independent specification, complex data structures are typically specified by means of ADTs that represent a certain functionality without imposing implementation decisions. A dictionary type, i.e.
a set of records indexed by means of keys, is a typical example of an ADT occurring in transport layer network interface applications. The ADT provides a number of services (e.g., inserting, locating, or removing a record from a set) which can be used to specify the functionality of an application without knowing their implementation. A set of records accessible through one or more keys can be represented by many different data structures. All of these have different characteristics in terms of memory occupation, number of memory accesses to locate a certain record, power dissipation, etc. To allow the designer to make a motivated choice, all possible data structures have to be represented in a model such that the best solutions for a given application can be searched for.

3.1 A hierarchical ADT model

In our model there are four primitive data structures (linked lists, trees, arrays, and pointer arrays) that can be combined to create more complex data structures. A complex ADT is represented as a tree composed of primitive data structures. Every key corresponds to a layer in the tree. The bottom layer is the record layer, which has no key associated with it. The top layer (i.e., the root of the tree) represents the entire set of records. Each layer below represents a partitioning of the whole set into a number of subsets. Specifying a value for the key corresponding to a layer selects the subset of records for which the key has the specified value. This process can be applied hierarchically from the top layer until the records are selected. Each node in the tree (except for the bottom layer) has to associate values of the corresponding key with a node on the next layer. This functionality can be implemented with a single primitive data structure.

Up to this point, we have assumed that every key corresponds to one layer in the hierarchy. This is not necessary, however. Keys can also be split into sub-keys, or several keys can be combined into one super key.
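The layered model described above can be illustrated with a small sketch. It is not MATISSE code: the class names `ArrayLayer` and `ListLayer` and the helpers `insert` and `locate` are hypothetical, only two of the four primitive data structures are modelled, and the access count is a simplified stand-in for the memory-access cost discussed in the text (one access per array lookup, one access per list entry visited).

```python
# Hypothetical sketch of a layered set-of-records ADT; all names are
# illustrative and not part of MATISSE.

class ArrayLayer:
    """Primitive structure indexed directly by key value: one access per lookup."""

    def __init__(self, size):
        self.slots = [None] * size

    def get(self, key):
        return self.slots[key], 1              # (child, memory accesses)

    def put(self, key, child):
        self.slots[key] = child


class ListLayer:
    """Primitive structure searched sequentially: one access per entry visited."""

    def __init__(self):
        self.entries = []                      # list of (key, child) pairs

    def get(self, key):
        for accesses, (k, child) in enumerate(self.entries, start=1):
            if k == key:
                return child, accesses
        return None, len(self.entries)

    def put(self, key, child):
        for i, (k, _) in enumerate(self.entries):
            if k == key:
                self.entries[i] = (key, child)
                return
        self.entries.append((key, child))


def insert(root, keys, record, make_layer):
    """Insert a record under a sequence of keys, one key per layer."""
    node = root
    for depth, key in enumerate(keys[:-1]):
        child, _ = node.get(key)
        if child is None:
            child = make_layer(depth + 1)      # create the next layer on demand
            node.put(key, child)
        node = child
    node.put(keys[-1], record)


def locate(root, keys):
    """Return (record, total accesses), selecting one subset per layer."""
    node, total = root, 0
    for key in keys:
        if node is None:
            return None, total
        node, accesses = node.get(key)
        total += accesses
    return node, total


# Two-layer structure: the first key indexes an array, the second a list.
root = ArrayLayer(4)
insert(root, (2, 7), "rec-a", lambda depth: ListLayer())
insert(root, (2, 9), "rec-b", lambda depth: ListLayer())
print(locate(root, (2, 9)))                    # ('rec-b', 3)
```

Swapping the primitive structure used on a layer, or the order in which the keys are applied, changes the access count returned by `locate`, which is the kind of trade-off the exploration has to evaluate.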
Splitting or combining keys may heavily impact the implementation cost. Also, the order in which the keys are used to access the data structures heavily impacts the required memory size, the average number of memory accesses to locate a record, and the power cost. Therefore, it is important to find the optimal key ordering for the given application as well as the optimal number of layers.

When the keys are not uniformly distributed, hashing can be used to improve the results (hashing applies a permutation function to a key or combination of keys). Note that hashing can be combined with any of the primitive data structures, thereby providing an orthogonal axis of freedom in the search space. Hashing is especially useful in combination with key splitting, because it reduces the (average) size of the primitive data structures associated with the sub-keys after splitting.

Many possible data structures within the model can realize a given set of records. Each one can be seen as a combination of different major options which are relatively orthogonal (Figure 2). Within each option, more detailed choices can still be made. Finding the best combination for a given application is not trivial, since it depends on the parameters in the model. Moreover, the full search space is too large to scan exhaustively. To determine the optimal data structure we have to define the number of layers in the hierarchy, the key ordering, the hashing function for each key, and the primitive data structure for each layer in the hierarchy. Experiments showed that some decisions are much more important than others, and the heuristic decision ordering indicated in Figure 2 leads to near optimal solutions

without exhaustively exploring all combinations. For a detailed description of the full optimization methodology we refer to [16].

[Figure 2. ADT refinement search space. Decisions in heuristic order: 1 Hashing (no / yes, with a hashing function); 2 Key splitting; 3 Key ordering; 4 Primitive data structure (array, pointer array, binary tree, linked list).]

3.2 Experiments

The set-of-records ADT in the ATM application was optimized for power using two realistic parameter sets. The first one assumed a storage of records in a memory built from 1 Mbit SRAMs, the second a memory built from 4 Mbit SRAMs. The optimal ADT data structures differ between the two cases. Both are two-layer structures with two keys. The first key indexes a pointer array, whereas the primitive data structure (DS) on the second layer is a pointer array and an array of records, for the first and second solution respectively. Applying the optimal DS for one set of parameters in the context for which the other DS was optimized results in a power consumption that is more than 2.5 times above that of the optimal DS. Moreover, the entire search space spans a power range of four orders of magnitude, clearly substantiating the importance of a thorough exploration before deciding on a solution.

4 Virtual memory management

The VMM step reserves storage space for each concrete data type obtained during the ADT refinement step, by defining a virtual memory segment for each concrete data type. Subsequently, it determines a custom virtual memory manager (VMM) for each data type that is dynamically allocated in the application. A VMM takes care of allocating and recycling blocks from the virtual memory segments. Allocation is the mechanism that searches the pool of free blocks and returns a free block large enough to satisfy a request of a given application. Recycling is the mechanism that returns a block which is not used anymore to the pool of free blocks for later reuse. Much literature is available about possible implementation choices for allocation mechanisms [3, 14], but none of the earlier work provides a complete search space useful for a systematic exploration.

4.1 VMM search space

Similar to the ADT refinement problem, a systematic exploration is only feasible in practice by identifying the orthogonal decision trees in the available search space (see footnote 1). Below we present the decision trees for allocation and recycling mechanisms.

Footnote 1: We do not consider implicit recycling mechanisms, known as garbage collectors, in our search space.

[Figure 3. Search space for VMM mechanisms. Panels: (a) free blocks tracking; (b) free pool; (c) free blocks reusage; (d) block splitting; (e) block merging (when, how much).]

Keeping track of free blocks - The allocator keeps track of free blocks using either link fields within free blocks or lookup tables (Figure 3.a). Using link fields within free blocks does not introduce overhead in terms of memory usage as long as a minimum block size is respected, while lookup tables always imply an overhead in terms of memory usage. The allocators are differentiated based on the indexing mechanism (by size, by address, ...).

Choosing a free block - Different mechanisms exist for choosing a free block from the pool (Figure 3.b). The pool may be partitioned in sectors per size or type. The chosen block may be an exact match or an approximate match for the requested size. The allocator will try to satisfy an allocation request by returning either the first free block that is large enough (first fit) or the free block that is closest in size to the requested one (best fit).

Freeing used blocks - A block that is freed by the application has to be returned to the pool of free blocks (Figure 3.c). Obvious mechanisms which provide good performance are LIFO or FIFO schemes. A scheme that respects an index order (e.g. size) may avoid wasting memory when combined with the splitting or merging techniques described below, at the cost of a performance penalty.

Splitting block being allocated - When the free block chosen to satisfy a request is larger than the required one, a policy for splitting the block can be implemented (Figure 3.d). The remainder of the split block is returned to the pool of free blocks. The splitting mechanisms are differentiated based on which part of the free block is used and on whether or not splitting respects an index order (e.g. size).

Merging free blocks - When adjacent blocks are free, the allocator may decide to merge the blocks in order to have more opportunities to accommodate a subsequent larger allocation request (Figure 3.e). In general it is advantageous to defer the merging in order to avoid subsequent splitting operations. Deferred merging may be implemented in different ways: wait for a fixed or variable number of allocation requests before merging, or wait for an unsatisfied allocation request before merging. The number of blocks to be merged can vary between merging all blocks and merging only enough blocks to satisfy the last request.

4.2 Experiments

The three data types in the ATM application that contribute most to the background memory are the Internal Packet Identifier (IPI), the Routing Record (RR), and the ATM cell. The virtual memory segments for these data types range in size from 3K to 12K words. For each virtual memory segment, a VMM mechanism has to be selected. Different choices result in power figures differing up to a factor of 5 for the IPI, 11 for the RR, and 25 for the ATM cell. In this application, the VMM with the minimal power figure is the same one for each data type. However, power is not the only parameter in the trade-off.
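To make the decision trees of Section 4.1 concrete, the following minimal sketch fixes one point in that search space: free blocks tracked in a table, first-fit or best-fit selection, optional splitting that uses the first part of the chosen block, LIFO recycling, and no merging. The class `SegmentAllocator` and its parameters are hypothetical names chosen for illustration; this is not the VMM generated by MATISSE, and it models addresses only, not power or access counts.

```python
# Hypothetical sketch of one point in the VMM search space; not MATISSE code.

class SegmentAllocator:
    """Manages one virtual memory segment of `size` words."""

    def __init__(self, size, fit="first", split=True):
        self.free = [(0, size)]        # pool of (address, length) free blocks
        self.fit = fit                 # "first" or "best"
        self.split = split             # split oversized blocks or not
        self.used = {}                 # address -> length of allocated blocks

    def allocate(self, length):
        """Return the address of a block of at least `length` words, or None."""
        candidates = [i for i, (_, n) in enumerate(self.free) if n >= length]
        if not candidates:
            return None
        if self.fit == "best":
            # Free block closest in size to the request.
            index = min(candidates, key=lambda i: self.free[i][1])
        else:
            # First free block that is large enough.
            index = candidates[0]
        address, n = self.free.pop(index)
        if self.split and n > length:
            # Use the first part of the block; the remainder returns to the pool.
            self.free.insert(index, (address + length, n - length))
            n = length
        self.used[address] = n
        return address

    def recycle(self, address):
        """Return a block to the pool in LIFO order (no merging)."""
        n = self.used.pop(address)
        self.free.insert(0, (address, n))


vmm = SegmentAllocator(16)
a = vmm.allocate(4)                    # address 0
b = vmm.allocate(8)                    # address 4
c = vmm.allocate(4)                    # address 12
vmm.recycle(a)
vmm.recycle(c)
print(vmm.allocate(2))                 # 12: the most recently freed block is reused
print(vmm.allocate(8))                 # None: free space is fragmented, nothing is merged
```

The last call shows why the merging decision matters: the free words are split over non-adjacent blocks, so the request fails even though a different policy mix might have left a large enough block.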
When the amount of storage in use for two data types reaches a maximum at different moments during the lifetime of the application, it is possible to combine their virtual memory segments, at least if the VMM mechanism allows for this possibility. A second VMM mechanism, which has only a slightly higher power figure for the IPI and RR data types, offers this possibility. It might therefore be possible to save area by combining the virtual memory segments for the IPI and RR data types, without affecting the power consumption. Unfortunately, in this application both the IPI and RR data types reach their maximal use in an overlapping period of time.

5 Physical Memory Management

Usually, for data-intensive algorithms the cycle budget available is insufficient to perform all the memory accesses sequentially. Hence a number of accesses have to be done in parallel. Distributed memory architectures allow this parallelism to be exploited, thus alleviating memory access bottlenecks. However, as the required memory bandwidth increases, the cycle budget available for each individual access becomes smaller, since the number of addresses that have to be generated in parallel per processed data item becomes higher, thus leading to an addressing overhead.

5.1 PMM methodology

The signals accessed in parallel have to be assigned to different memories or they have to be accessed through different ports of a multi-port memory. Many different orders of the memory accesses are possible for the given cycle budget. Manually exploring all different ordering possibilities and memory configurations for area and power efficiency is a very tedious task. Therefore, an automated methodology [12, 15] has been developed.

Basic groups - The virtual memory segments are split into smaller groups of data which are called basic groups. Every data item belongs to exactly one basic group, so that basic groups can be assigned to physical memories independently from each other.
Basic groups are kept as small as possible, to increase the freedom of assigning basic groups to physical memories and to increase the parallel accessibility of the data in a virtual memory segment.

Access ordering - The access ordering step optimizes the memory cost for the required storage bandwidth, by determining which basic groups should be made simultaneously accessible in the memory architecture such that the imposed timing constraints can be met. For this purpose, the data accesses are ordered within a given cycle budget. Whenever two accesses to two basic groups occur in the same cycle, there is an access conflict because the basic groups cannot share the same memory port. These conflicts have to be resolved during the subsequent memory allocation and assignment step by assigning conflicting basic groups either to different memories or to a multi-port memory such that they are simultaneously accessible.

Memory allocation and assignment - Memory allocation and assignment determines the number and type of the memories, the number and type of their ports, and an assignment of basic groups to the allocated memories in a power and/or area optimized memory architecture. The conflict relations between the basic groups are used to restrict the search space to memory architectures that provide enough memory bandwidth to meet the timing constraints.

Address optimization - Address manipulation forms a crucial component of any architecture which deals with data transfer intensive algorithms. The efficient access to the memories within real-time constraints requires an optimized mapping of the address expressions in the algorithm onto address arithmetic optimized for both area and

power. A methodology [11] has been developed to reduce the cost overhead for address generation for both custom and instruction-set processors. This methodology includes address expression splitting/clustering, induction variable analysis, target architecture selection, and global scope algebraic optimizations. In addition, high-level controller synthesis and optimal partitioning of the arithmetic unit are incorporated for the synthesis of custom memory management units.

5.2 Experiments

Several experiments were performed on the ATM application with varying cycle budgets for the memory accesses. The virtual memory segments from the ATM application can be split into 14 basic groups. This reduces the critical path from 15 cycles to 9 cycles. The access ordering showed 13 conflicts between the basic groups. Several memory architectures were generated that satisfy the cycle budget constraints derived from the previous steps. The best solution is a trade-off between area and power. The best solution for power is a configuration with 6 memories. A configuration with 3 memories consumes a factor of 1.98 more power, and a configuration with a single memory 6.85 times more.

To show the impact of the high-level address optimization, the resulting solutions were compared to those obtained by traditional synthesis tools. The RT-VHDL description that generates a hardwired solution (every address expression mapped on a separate unit) results in an area of 1.7 mm^2 after synthesis with Synopsys DC. A behavioral VHDL description synthesized with high-level synthesis (Synopsys BC) results in an area of 1.48 mm^2, subject to the constraint of generating one address expression in every clock cycle. When using the high-level address optimization of MATISSE before using high-level synthesis, an area of 0.42 mm^2 is obtained.
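The role of the conflict relations in memory allocation and assignment (Section 5.1) can be sketched as a graph colouring problem: basic groups are nodes, access conflicts are edges, and each colour is a single-port memory. The code below is only a greedy illustration under that assumption. It is not the cost-driven methodology of [12, 15], it ignores multi-port memories, area, and power, and the function name `assign_memories` and the example group names are hypothetical.

```python
# Hypothetical greedy sketch: conflicting basic groups get different memories.

def assign_memories(groups, conflicts):
    """Assign basic groups to single-port memories.

    groups: iterable of basic group names.
    conflicts: iterable of (a, b) pairs of groups accessed in the same cycle.
    Returns a dict mapping each group to a memory index.
    """
    neighbours = {g: set() for g in groups}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    assignment = {}
    # Place the groups with the most conflicts first; break ties by name.
    for g in sorted(neighbours, key=lambda g: (-len(neighbours[g]), g)):
        taken = {assignment[n] for n in neighbours[g] if n in assignment}
        memory = 0
        while memory in taken:         # lowest memory index without a conflict
            memory += 1
        assignment[g] = memory
    return assignment


groups = ["A", "B", "C", "D"]
conflicts = [("A", "B"), ("B", "C"), ("A", "C")]
print(assign_memories(groups, conflicts))
# {'A': 0, 'B': 1, 'C': 2, 'D': 0}
```

Three mutually conflicting groups force three memories here, while the unconstrained group `D` can share a memory with any of them. A real exploration would additionally weigh the area and power of each candidate configuration, as in the experiments above.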
6 Conclusions

In this paper, we have addressed the support for system design exploration for applications that require manipulation of a large amount of dynamically allocated stored data, as found in e.g. protocol processing applications used in telecom networks. Using the MATISSE language the designer is able to write a system specification which abstracts low-level details and is easily retargetable to different embedded hardware/software realizations. The MATISSE design flow assists the designer in exploring the design space at system level for different ADT implementations and memory managers, and in exploring different memory architectures for mixed hardware/software realizations. We demonstrated the results of the system design exploration using an industrial ATM application. We demonstrated that despite the higher level of abstraction of our input with respect to e.g. high-level synthesis (HLS), we achieve more efficient implementations.

Acknowledgments

This work is partly funded by the Flemish IWT in the HASTEC project and the European commission in the MEDIA project. Julio Leao da Silva Junior is supported by a Brazilian Government Fellowship - CAPES. We would further like to thank Bill Lin (University of California, San Diego) and Mark Genoe (Alcatel Telecom) for many insightful discussions.

References

[1] Coware.
[2] A. Alles. ATM in private networking, a tutorial. Proc. INTEROP 93.
[3] G. Attardi et al. A customisable memory management framework. Proc. USENIX C++ Conf., Cambridge, MA.
[4] I. Bolsens et al. Hardware-software codesign of telecommunication systems. Proceedings of the IEEE, 85(3), Mar.
[5] J. T. Buck et al. PTOLEMY: A framework for simulating and prototyping heterogeneous systems. Int'l Journal on Computer Simulation, Jan.
[6] J. da Silva et al. Matisse: A concurrent and object-oriented system specification language. Int. Conf. on VLSI.
[7] H. De Man et al. Architecture-driven synthesis techniques for VLSI implementation of DSP algorithms. Proceedings of the IEEE, 72(2), Feb.
[8] R. Lauwereins et al. GRAPE-II: A system level prototyping environment for DSP applications. IEEE Computer, Feb.
[9] J.-Y. Le Boudec. The Asynchronous Transfer Mode: A tutorial. Computer Networks and ISDN Systems, 24.
[10] P. Lippens et al. Allocation of multiport memories for hierarchical data streams. Proc. of ICCAD, Santa Clara, CA, Nov.
[11] M. Miranda et al. ADOPT: Efficient hardware address generation in distributed memory architectures. Proc. of the Int'l Symposium on System Level Synthesis.
[12] P. Slock et al. Fast and extensive system-level memory exploration for ATM applications. Proc. of the Int'l Symposium on System Synthesis, Sep.
[13] Y. Therasse et al. VLSI architecture of a SDMS/ATM router. Annales des Telecommunications, 48(3-4).
[14] P. R. Wilson et al. Dynamic storage allocation: A survey and critical review. Proc. Int'l Workshop on Memory Management, Kinross, Scotland, UK, Sep.
[15] S. Wuytack et al. Flow graph balancing for minimizing the required memory bandwidth. Proc. of the Int'l Symposium on System Synthesis, Nov.
[16] S. Wuytack et al. Transforming set data types to power optimal data structures. IEEE Transactions on Computer-aided Design, CAD-15(6), June 1996.
