2D-VLIW: An Architecture Based on the Geometry of Computation


Ricardo Santos (1,2), Rodolfo Azevedo (2), Guido Araujo (2)
(1) Computer Engineering Department, Dom Bosco Catholic University, Campo Grande, MS, Brazil
(2) Institute of Computing, State University of Campinas, Campinas, SP, Brazil
{ricrs, rodolfo, guido}@ic.unicamp.br

Abstract

This work proposes a new architecture and execution model called 2D-VLIW. The architecture adopts an execution model based on large pieces of computation running over a matrix of functional units connected through a set of local registers spread across the matrix. Experiments with the MediaBench and SPECint2000 programs and the Trimaran compiler show performance gains ranging from 5% to 63% when comparing our proposal to an EPIC architecture with the same number of registers and functional units. We also show that the g721_enc program running on a 2D-VLIW 3x3 matrix had a speedup of 3.7 over a 2x2 matrix, while the same program on an EPIC processor with 9 functional units had a speedup of 2 over an EPIC processor with 4 functional units. For some internal procedures of the MediaBench and SPECint2000 programs, the average 2D-VLIW OPC (operations per cycle) was up to 10 times greater than that of the equivalent EPIC processor.

1 Introduction

For many years the constant increase in processor performance came from advances in VLSI process technology. However, recent announcements of technology limits imposed by the thermal barrier have motivated research into innovative ways to sustain the increase in performance. One approach used by modern computer architectures to achieve performance, independent of the implementation technology, is the replication of hardware resources to exploit the parallelism available inside programs. Two architectural design styles commonly used to exploit parallelism are VLIW (Very Long Instruction Word) [4] and superscalar [12]. Despite all their advantages and disadvantages, the major problem with VLIW and superscalar architectures is related to their inability to
capture and scale parallelism. The performance study in [7] suggests that a compiler can locate more parallelism by looking at larger pieces of the application code, though it is essential that the hardware provide a suitable execution model. Moreover, previous work [5, 7, 14] as well as state-of-the-art architectures [2, 8, 13] have pointed toward architectures that adopt new strategies for code optimization and new execution models. Advanced compiler optimizations could look at large code regions to find greater levels of parallelism, and the execution models could provide new resource arrangements to execute those large code regions, thus increasing performance.

In this paper, we propose a new architecture and execution model whose main characteristic is the mapping of large pieces of computation onto a matrix of functional units. This mapping brings attractive performance to applications with high ILP, such as embedded and some general-purpose applications. Our architecture is named 2D-VLIW [9, 10] because large instructions, comprised of single operations, are fetched from memory and executed on a (two-dimensional) matrix of functional units through a pipeline. In order to evaluate the real performance gains of the 2D-VLIW architecture, we compared it to an EPIC-based processor [11] using the MediaBench and SPECint2000 suites compiled with the Trimaran compiler. We assume that an EPIC processor has more flexibility than a pure VLIW processor, in terms of both the operations (dependent and parallel) allowed within an instruction and hardware complexity. The experiments revealed speedups ranging from 5% to 63% over the EPIC processor and, for some programs, an average 2D-VLIW OPC (operations per cycle) up to 10 times greater than that of the EPIC processor.

This paper is structured as follows. Section 2 presents the 2D-VLIW architecture and execution model. Section 3 shows the performance results obtained by comparing 2D-VLIW to an EPIC processor. Section 4 discusses current architectures related to our approach. Conclusions and future work are presented in Section 5.

Application-specific Systems, Architectures and Processors (ASAP'06)

2 The 2D-VLIW Architecture and Execution Model

Like traditional pipelined architectures, the 2D-VLIW datapath has stages to fetch, decode and execute instructions. In 2D-VLIW, execution is performed by a matrix of functional units, where each column represents one execution sub-stage of the pipeline. For example, a 6x6 matrix has 6 sub-stages: EX1, EX2, ..., EX6. A detailed view of the 2D-VLIW datapath is presented next.

2.1 The 2D-VLIW Architecture

The 2D-VLIW architecture exploits instruction-level parallelism through a pipelined datapath and a two-dimensional VLIW execution model. The architecture fetches large instructions from memory, just as VLIW and EPIC architectures do. A 2D-VLIW instruction is comprised of single operations executed by a set of functional units (FUs) organized as a matrix, and the maximum number of operations in one instruction equals the number of functional units in the matrix. Figure 1 shows a simplified overview of a 4x4 2D-VLIW datapath.

Figure 1. A simplified overview of the 2D-VLIW datapath: IF stage, IF/ID pipeline register, control unit, register file bank, ID stage, ID/EX pipeline register, EX pipeline registers and the functional unit matrix.

In Figure 1, four operations are executed by 4 FUs at each execution stage EX1, EX2, EX3, EX4 of the pipeline. Notice that, through the interconnection network, each FU in column n, n < 4, can provide operands to two FUs in the following column (n+1). Figure 2 shows all logic blocks and signals inside an FU. Results from an FU may be written either into the Temp Register File (TRF) or into the Global Register File (GRF). The TRF is a small register bank with 2 local registers dedicated to each FU. The result of an operation is always written into an internal register called the FU Register. By using the TRF and the FU Register, a result from
an operation in cycle i will be available to three other operations in the next cycle (i+1): two operations can use the result through the Temp Register File, while a third uses it from the FU Register. Input operands come from three possible sources: the global register file, the temporary register files (through the interconnection network), or the FU itself (through the FU Register). The SelOpnd1, SelOpnd2 and SelOperation input selection signals come from the pipeline registers.

Figure 2. The 2D-VLIW functional unit: operand inputs, the SelOpnd1/SelOpnd2/SelOperation selection signals, and the connections to the Global Register File and the Temp Register File.

The TRFs in each FU are a viable alternative to minimize the pressure on the global register file. This is achieved by storing temporary values (of the DAGs) in the TRFs: the DAG leaves read values from the global register file, the operations inside a DAG read and write values from/to the temporary registers, and the DAG root nodes write values into the global register file. Figure 3 illustrates this strategy with a DAG of 4 operations. The load operation uses one value from the global register file and needs to store its result in a register that will be used by two other operations. The results of these two operations are the inputs of an add, which stores the result of the addition into the global register R4. The DAG in Figure 3(a) requires two read ports and two write ports to access GRF registers. Figure 3(b) shows how this DAG can be allocated in the 2D-VLIW architecture: the results of the two intermediate operations are stored in TRFs. Figure 3(c) shows the mapping of this DAG onto the FU matrix.

In general, the area upper bound of a standard register file is O(n^2) [1], whereas the latency upper bound is O(log m), where n is the total number of read and write ports and m is the number of read ports. Notice that the number of read ports has an impact on the area and latency of the register

file.

Figure 3. Example of a DAG using TRFs: (a) one DAG using GRF registers; (b) the same DAG using TRFs; (c) the DAG in (b) mapped onto the FU matrix.

Multiple-issue architectures have the number of read ports bounded by O(k) (usually 2k), where k is the number of functional units in the architecture. By adopting TRFs, and assuming that only the leaves/roots of the DAGs need to read/write from/to the global register file, the number of GRF read ports, and hence the latency upper bound, can be asymptotically reduced: the new bound is O(√k) read ports (usually 2√k). For example, a 16-FU EPIC architecture has 16 x 2 = 32 global register file read ports, while the equivalent 2D-VLIW (a 4x4 matrix) has only 8 read ports, which leads to an area 4 times smaller and a latency reduction of 40%, given that the same number of write ports is available in both architectures.

2.2 The 2D-VLIW Execution Model

As mentioned before, program execution over the 2D-VLIW matrix is pipelined. At each clock cycle, one 2D-VLIW instruction is fetched from memory and pushed into the pipeline stages. In the execution stages, the operations of this instruction are executed according to the number of FUs in each column. Assuming the architecture has 16 functional units organized as a 4x4 matrix, a 2D-VLIW instruction can also be represented as a 4x4 operation matrix comprised of 16 operations.

Take for example the DAGs in Figure 4, extracted from the 181.mcf program. This program is part of the SPECint2000 benchmark and was compiled by Trimaran with the hyperblock option on. The operations in these DAGs correspond to a subset of the HPL-PD [6] ISA. Figure 4(a) shows several DAGs from 181.mcf (taken from an unrolled inner loop) and Figure 4(b) shows the organization of these DAGs into two 2D-VLIW instructions, A and B. These instructions are represented as matrices of operations in which operations with Read-After-Write (RAW) dependencies are placed in two different
rows, while independent operations are placed in the same row.

Figure 4. DAGs from the 181.mcf program and their equivalent 2D-VLIW instructions.

Figure 5 shows the execution of instructions A and B from Figure 4(b) on the 2D-VLIW datapath. Since the instruction fetch and decode stages work as in a standard processor, we start at the EX1 execution stage. After the ID/EX pipeline register has been filled in, execution starts over the matrix. Figure 5(a) depicts the first execution cycle on the FU matrix: the first column receives data from the ID/EX pipeline register, and the four functional units in the first column execute the operations of row A1 (instruction A). The dashed arrows indicate which FUs receive the results of these operations; naturally, the consumer FUs are limited by the interconnection network. At the second execution cycle, Figure 5(b), the operations of A2 are executed on the second column while the operations of B1 start on the first column. At the third execution cycle, Figure 5(c), the EX2/EX3 pipeline register carries the information to execute the operations of A3, the FUs in the second column execute the operations of B2, and the first row (C1) of a third 2D-VLIW instruction C starts its execution on the matrix. At the fourth execution cycle, Figure 5(d), the operations of A4 are executed on the fourth column, the operations of B3 on the third column, the operations of C2 on the second column, and a fourth instruction D starts its execution. The last cycle is shown in Figure 5(e), where the four operations of B4 are executed and instruction A has already finished. Following the pipelined execution, at the fourth execution cycle the matrix is completely filled with operations from four different 2D-VLIW instructions, as indicated by the first, second, third and fourth columns (highlighted differently) in Figure 5(d). After the fourth execution cycle, one 2D-VLIW instruction finishes at every cycle.
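The staggered flow just described can be captured in a small illustrative model (a sketch written for this presentation, not the authors' simulator): row r of instruction i executes on column r at cycle i + r, so instruction i completes at cycle i + (columns - 1), and once the matrix fills, one instruction finishes per cycle.

```python
# Illustrative model of the 2D-VLIW execution stages (hypothetical sketch,
# not the authors' simulator). Row r (0-based) of instruction i executes
# on column r of the FU matrix at cycle i + r.

def occupancy(cycle, n_cols, n_instr):
    """Map each busy column to the (instruction, row) it executes at `cycle`."""
    busy = {}
    for col in range(n_cols):
        instr = cycle - col           # instruction whose turn it is on this column
        if 0 <= instr < n_instr:
            busy[col] = (instr, col)  # row index equals the column index
    return busy

N_COLS, N_INSTR = 4, 5                # 4x4 matrix, instructions A, B, C, D, E
for cycle in range(N_COLS + N_INSTR - 1):
    labels = [f"{chr(ord('A') + i)}{r + 1}"
              for i, r in occupancy(cycle, N_COLS, N_INSTR).values()]
    print(f"cycle {cycle}: {labels}")
# cycle 3 prints ['D1', 'C2', 'B3', 'A4']: the matrix is full, matching
# Figure 5(d), and from then on one instruction completes every cycle.
```

In this model the completion cycle of instruction i is i + 3, so consecutive instructions finish on consecutive cycles after the four-cycle fill phase, which is exactly the steady-state throughput claimed above.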

Figure 5. Execution stages on the 2D-VLIW architecture.

Notice also that the operations executed in Figures 5(a) through 5(e) are exactly all the operations of the 2D-VLIW instructions A and B in Figure 4(b).

3 Experiments and Results

In this section we present performance results from a comparison between 2D-VLIW and the HPL-PD [6] processor, which is based on the EPIC architecture. The results were obtained by simulating program execution with the Trimaran simulation tool. We simulated MediaBench and SPECint2000 programs on 2D-VLIW and HPL-PD models described in the HMDES language [3]. All programs were compiled by Trimaran with the hyperblock, pre-pass and post-pass scheduling options on. In the experiments, we adopted 4, 9, 16 and 30 functional units, the same instruction set, the same operation latencies (including load and store operations) and the same number of registers for both architectures. The functional units were designed to execute all operations of the instruction set. The GRF has 64 registers, and each TRF has 2 local registers. The 2D-VLIW matrix was organized as a square matrix when using 4, 9 and 16 FUs; for 30 FUs it was organized as a 3x10 matrix, where the first dimension is the number of rows and the second the number of columns. The speedup was obtained by dividing the number of clock cycles of a base machine by the clock cycles taken by the current configuration.

Figure 6 shows the comparison between 2D-VLIW and HPL-PD using 10 programs: 4 MediaBench programs (epic, g721_dec, g721_enc, gsm_dec) and 6 SPECint2000 programs (175.vpr, 181.mcf, 197.parser, 255.vortex, 256.bzip2, 300.twolf). Each bar represents the speedup of a 2D-VLIW configuration (2x2, 3x3, 4x4 or 3x10) over an HPL-PD machine with the same number of FUs. Notice that all 2D-VLIW configurations attained speedups over their equivalent HPL-PD.

Figure 6. Performance comparison between
2D-VLIW and HPL-PD.

After an analysis of the code generated for both architectures, we found that the 2D-VLIW speedup was mainly due to the arrangement of the architectural elements. The adoption of local registers made it possible to handle Write-After-Read (WAR) dependencies through the matrix. This situation is illustrated in Figure 7. The code fragment follows the ELCOR-IR [3] syntax and was taken from the 255.vortex program after the scheduling and register allocation phases. Notice that the operation s_w in Figure 7(a) is executed before the operation add_w (s_time(7) and s_time(8), respectively). In the 2D-VLIW case, Figure 7(b), these operations are executed simultaneously (s_time(7)): the 2D-VLIW allocator assigns a local register, instead of a global register, to the destination operand of the add_w operation. When

this operation is executed, it does not write to the global register; it therefore executes in parallel with s_w, eliminating one extra execution cycle.

Figure 7. Trimaran's scheduling for a code fragment on HPL-PD and 2D-VLIW:

(a) HPL-PD scheduling:
    s_w   []        [<gpr 1> <gpr 55>]  s_time(7)
    add_w [<gpr 1>] [<sp> < 2520>]      s_time(8)

(b) 2D-VLIW scheduling:
    s_w   []        [<gpr 1> <gpr 55>]  s_time(7)
    add_w [<trf 1>] [<sp> < 2520>]      s_time(7)

The next experiment examines the scalability of 2D-VLIW and HPL-PD. Figure 8(a) depicts the speedup obtained by three 2D-VLIW configurations over a basic model consisting of 4 functional units; Figure 8(b) shows the same information for the HPL-PD configurations. In Figure 8(a), although 2D-VLIW did not achieve any speedup for 175.vpr, an incremental speedup (scalability) was obtained for g721_dec, g721_enc, gsm_dec, 255.vortex, 256.bzip2 and 300.twolf. The results in Figure 8(b) show that HPL-PD achieved scalability for the g721_dec, g721_enc, gsm_dec and 255.vortex programs, but two programs, 175.vpr and 300.twolf, had a performance decrease.

Figure 8. Scalability of the 2D-VLIW and HPL-PD architectures: (a) 2D-VLIW scalability; (b) HPL-PD scalability.

Table 1 presents another result from our experiments. Each row shows one program, one of its procedures, and the OPC for each 2D-VLIW and HPL-PD configuration. These values indicate how well 2D-VLIW scales as the number of functional units increases. Numbers in parentheses are the results obtained by HPL-PD for the same programs and procedures. Looking at the 181.mcf program, we can notice that 2D-VLIW achieved an OPC 10 times greater than HPL-PD.

Table 1. 2D-VLIW and HPL-PD OPC (HPL-PD values in parentheses).

Program (procedure)               3x3 (9 FUs)    4x4 (16 FUs)   3x10 (30 FUs)
181.mcf (primal_artificial)       2.5  (0.3)     5.2  (0.4)     4.23 (0.4)
256.bzip2 (main)                  2.05 (0.73)    2.99 (0.73)    3.04 (0.63)
gsm-decode (Autocorrelation)      1.27 (0.99)    2.22 (1.20)    2.35 (1.2)

4 Related Work

The RAW architecture [13] is a set of interconnected tiles organized as a 4x4 matrix. Each tile contains an 8-stage MIPS-style pipeline, a 4-stage pipelined FPU, a 32 KB data cache, and a switch that supports static and dynamic inter-tile communication. One remarkable difference between 2D-VLIW and RAW concerns the granularity of computation: RAW can map a whole kernel onto one processing tile, whereas 2D-VLIW maps DAGs onto the FU matrix.

The TRIPS architecture [2] is composed of a 4x4 array of functional units (FUs). There are four register file banks along the top of the array, and four instruction and data cache banks along its right side. The compiler builds 128-instruction blocks organized into groups of eight instructions per FU. Experiments showed IPCs of about 1 and 2 for integer and floating-point programs, respectively. Conversely to 2D-VLIW, the TRIPS execution model is based on dataflow ordering, with each operation executed as soon as its operands are available. Moreover, TRIPS uses block-atomic execution, in which all operations of a block are either entirely committed or rolled back.
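As a side note to the results above, the two figures of merit used throughout Section 3 reduce to one-line formulas. The sketch below restates them in Python; the numbers are made-up placeholders for illustration, not measurements from this paper.

```python
# Metrics from Section 3, restated (placeholder numbers, not paper data):
#   speedup = cycles on the base machine / cycles on the current configuration
#   OPC     = operations executed / cycles taken

def speedup(base_cycles, config_cycles):
    return base_cycles / config_cycles

def opc(operations, cycles):
    return operations / cycles

# Example: a program taking 1_000_000 cycles on the 4-FU base machine and
# 400_000 cycles on a hypothetical 4x4 configuration.
print(speedup(1_000_000, 400_000))   # 2.5
print(opc(2_000_000, 400_000))       # 5.0 operations per cycle
```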

ADRES [8] is a coarse-grained, dynamically reconfigurable architecture comprised of a tightly coupled VLIW processor and a matrix of reconfigurable functional units. These functional units have register files to keep intermediate computation values. The compiler identifies computation-intensive kernels inside the program and maps these kernels onto the reconfigurable matrix. Experiments on this architecture achieved IPCs from 1.9 to 4.2. ADRES and 2D-VLIW have different execution models: in ADRES, an instruction is executed either by the VLIW processor or by the reconfigurable matrix, whereas 2D-VLIW has a unified execution model in which every instruction crosses all stages. This unified view improves resource sharing and increases ILP.

5 Conclusions and Future Work

A new multiple-issue architecture called 2D-VLIW was presented in this paper. The architecture aims at providing architectural resources that match the topology of high-performance applications in domains like multimedia, communications and some general-purpose integer computing. 2D-VLIW has a matrix of functional units with local registers spread across the matrix. The execution model consists of large instructions brought from memory and executed on the matrix. The local registers are used to keep temporary values and to reduce the pressure on the global register file. Data transport inside the FU matrix is performed through a narrow interconnection network which provides a fast communication mechanism between functional units in consecutive stages.

The performance results in Section 3 show that 2D-VLIW outperformed the EPIC-based processor (HPL-PD) on all MediaBench and SPECint2000 programs. We defined four FU configurations (4, 9, 16 and 30 FUs) and measured 2D-VLIW and HPL-PD performance in all of them. Our experimental results show that the combination of architectural arrangement, execution model and compiler assistance can achieve appealing performance for embedded applications. The compiler plays an important role in the
2D-VLIW architecture: it is responsible for recognizing large pieces of computation, mapping their operations onto the matrix topology, and using local registers to keep temporary values. We have currently started the implementation of two new scheduling and register allocation algorithms for 2D-VLIW.

References

[1] A. Abnous and N. Bagherzadeh. Architectural Design and Analysis of a VLIW Processor. International Journal of Computers and Electrical Engineering, 21(2), 1995.
[2] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer, 37(7):44-55, 2004.
[3] L. N. Chakrapani, J. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism. Lecture Notes in Computer Science, 3602:32-41, 2004.
[4] J. A. Fisher. Very Long Instruction Word Architectures and the ELI-512. In Proceedings of the 10th International Symposium on Computer Architecture (ISCA). IEEE Computer Society, 1983.
[5] N. P. Jouppi. Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM Press, 1989.
[6] V. Kathail, M. S. Schlansker, and B. R. Rau. HPL-PD Architecture Specification. Technical Report HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, 2000.
[7] M. S. Lam and R. P. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th ISCA. ACM Press, 1992.
[8] B. Mei, A. Lambrechts, J.-Y. Mignolet, D. Verkest, and R. Lauwereins. Architecture Exploration for a Reconfigurable Architecture Template. IEEE Design & Test of Computers, 22(2):90-101, 2005.
[9] R. Santos, R. Azevedo, and G. Araujo. Exploiting Dynamic Reconfiguration Techniques: The 2D-VLIW Approach. In Proceedings of the 13th IEEE International Workshop on Reconfigurable Architectures, Rhodes Island, Greece. IEEE Computer Society, 2006.
[10] R. Santos, R. Azevedo, and G. Araujo. The 2D-VLIW Architecture. Technical Report IC, Institute of Computing, 2006.
[11] M. S. Schlansker and B. R. Rau. EPIC: Explicitly Parallel Instruction Computing. IEEE Computer, 33(2):37-45, 2000.
[12] M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on Multiple Instruction Issue. In Proceedings of the 3rd ASPLOS. ACM Press, 1989.
[13] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. IEEE Computer, 30(9):86-93, 1997.
[14] D. W. Wall. Limits of Instruction-Level Parallelism. In Proceedings of the 4th ASPLOS. ACM Press, 1991.


More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Computer Systems Architecture Spring 2016

Computer Systems Architecture Spring 2016 Computer Systems Architecture Spring 2016 Lecture 01: Introduction Shuai Wang Department of Computer Science and Technology Nanjing University [Adapted from Computer Architecture: A Quantitative Approach,

More information

Software-Only Value Speculation Scheduling

Software-Only Value Speculation Scheduling Software-Only Value Speculation Scheduling Chao-ying Fu Matthew D. Jennings Sergei Y. Larin Thomas M. Conte Abstract Department of Electrical and Computer Engineering North Carolina State University Raleigh,

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design Lecture Objectives Background Need for Accelerator Accelerators and different type of parallelizm

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

Register-Sensitive Software Pipelining

Register-Sensitive Software Pipelining Regier-Sensitive Software Pipelining Amod K. Dani V. Janaki Ramanan R. Govindarajan Veritas Software India Pvt. Ltd. Supercomputer Edn. & Res. Centre Supercomputer Edn. & Res. Centre 1179/3, Shivajinagar

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Beyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL

Beyond Pipelining. CP-226: Computer Architecture. Lecture 23 (19 April 2013) CADSL Beyond Pipelining Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

More information

CS 351 Final Exam Solutions

CS 351 Final Exam Solutions CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP

CS 152 Computer Architecture and Engineering. Lecture 13 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Creating a Scalable Microprocessor:

Creating a Scalable Microprocessor: Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Lecture 13 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

VLIW/EPIC: Statically Scheduled ILP

VLIW/EPIC: Statically Scheduled ILP 6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4 12 1 CMPE110 Fall 2006 A. Di Blas 110 Fall 2006 CMPE pipeline concepts Advanced ffl ILP ffl Deep pipeline ffl Static multiple issue ffl Loop unrolling ffl VLIW ffl Dynamic multiple issue Textbook Edition:

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Processor (II) - pipelining. Hwansoo Han

Processor (II) - pipelining. Hwansoo Han Processor (II) - pipelining Hwansoo Han Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 =2.3 Non-stop: 2n/0.5n + 1.5 4 = number

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals

Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals Mapping MPEG Video Decoders on the ADRES Reconfigurable Array Processor for Next Generation Multi-Mode Mobile Terminals Mladen Berekovic IMEC Kapeldreef 75 B-301 Leuven, Belgium 0032-16-28-8162 Mladen.Berekovic@imec.be

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information

Historical Perspective and Further Reading 3.10

Historical Perspective and Further Reading 3.10 3.10 6.13 Historical Perspective and Further Reading 3.10 This section discusses the history of the first pipelined processors, the earliest superscalars, the development of out-of-order and speculative

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan

More information

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

Architecture. Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R.

Architecture. Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore The

More information

Energy Efficient Asymmetrically Ported Register Files

Energy Efficient Asymmetrically Ported Register Files Energy Efficient Asymmetrically Ported Register Files Aneesh Aggarwal ECE Department University of Maryland College Park, MD 20742 aneesh@eng.umd.edu Manoj Franklin ECE Department and UMIACS University

More information

LECTURE 3: THE PROCESSOR

LECTURE 3: THE PROCESSOR LECTURE 3: THE PROCESSOR Abridged version of Patterson & Hennessy (2013):Ch.4 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU

More information

ECEC 355: Pipelining

ECEC 355: Pipelining ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly

More information

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Abstract: The coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures with the advantages of

More information

E0-243: Computer Architecture

E0-243: Computer Architecture E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation

More information

Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures

Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures Compiler Controlled Speculation for Power Aware ILP Extraction in Dataflow Architectures Muhammad Umar Farooq, Lizy John, and Margarida F. Jacome Department of Electrical and Computer Engineering he University

More information

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines ECE 252 / CPS 220 Advanced Computer Architecture I Lecture 14 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

Improve performance by increasing instruction throughput

Improve performance by increasing instruction throughput Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access

More information

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Mary D. Brown Yale N. Patt Electrical and Computer Engineering The University of Texas at Austin {mbrown,patt}@ece.utexas.edu

More information

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Pipeline Hazards Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hazards What are hazards? Situations that prevent starting the next instruction

More information

Chapter 4 The Processor (Part 4)

Chapter 4 The Processor (Part 4) Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline

More information

Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors

Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors Superscalarity 8 Jan 2018 Carsten Trinitis (LRR) Superscalarity Parallel Execution & ILP at at Instruction Level Parallelism Superscalarity:

More information

Exploiting Virtual Registers to Reduce Pressure on Real Registers

Exploiting Virtual Registers to Reduce Pressure on Real Registers Exploiting Virtual Registers to Reduce Pressure on Real Registers JUN YAN and WEI ZHANG Southern Illinois University Carbondale It is well known that a large fraction of variables are short-lived. This

More information

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units 6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd

More information

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors Jason Fritts Assistant Professor Department of Computer Science Co-Author: Prof. Wayne Wolf Overview Why Programmable Media Processors?

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011

15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 Review of Last Lecture More ISA Tradeoffs Programmer vs. microarchitect Transactional

More information