Efficiency Aspects for Advanced Fluid Finite Element Formulations

Proceedings of the 5th International Conference on Computation of Shell and Spatial Structures
June 1-4, 2005, Salzburg, Austria
E. Ramm, W. A. Wall, K.-U. Bletzinger, M. Bischoff (eds.)

Malte Neumann*, Sunil R. Tiyyagura, Wolfgang A. Wall, Ekkehard Ramm
*Institute of Structural Mechanics, University of Stuttgart
Pfaffenwaldring 7, Stuttgart, Germany
neumann@statik.uni-stuttgart.de

Abstract

For the numerical simulation of large scale CFD and fluid-structure interaction (FSI) problems, efficiency and robustness of the algorithms are two key requirements. In this paper we describe a very simple concept to significantly increase the performance of the element calculation of an arbitrary unstructured finite element mesh on vector computers. By grouping computationally similar elements together, the length of the innermost loops and the vector length can be controlled. In addition, the effect of different programming languages and different array management techniques is investigated. A numerical CFD simulation shows the improvement in the overall time-to-solution on vector computers as well as on other architectures.

1 Introduction

For the numerical simulation of large scale CFD and FSI problems, computing time is still a limiting factor for the size and complexity of the problem. Besides the solution of the set of linear equations, the element evaluation and assembly for stabilized, highly complex elements on unstructured grids is often a major time-consuming part of the calculation. Whereas a lot of research is done in the area of solvers and their efficient implementation, there is hardly any literature on the efficient implementation of advanced finite element formulations. Still, a large amount of computing time can be saved by an expert implementation of the element routines.
We would like to propose a straightforward concept to significantly improve the performance of the integration of element matrices of an arbitrary unstructured finite element mesh on vector computers.

Very often algorithms in scientific codes use only a small fraction of the available computer power [1]. It is therefore highly advisable to take a closer look at the efficiency of algorithms and to improve them to make the best of the available computer power.

Several criteria are available to evaluate the performance of a numerical method. For computational scientists who attempt to solve a given problem, the most relevant is most probably the time-to-solution. This criterion takes into account many different factors, for example the efficiency of the algorithm, the use of a particular hardware platform at a percentage of its peak speed, and the effort to include additional capabilities in the numerical code. However, the multitude of quantities included in this benchmark makes it difficult to use for comparisons. A more universal performance benchmark is the raw computational speed, typically expressed in FLoating-point OPerations per Second (FLOPS). Even though the significance of such an isolated performance figure is limited, it still gives an approximate measurement of the capability of a given algorithm-architecture combination [3]. FLOPS is also the basis for evaluating the efficiency that an application or algorithm reaches on a given architecture: the efficiency is usually given as the ratio of the achieved sustained FLOPS of the application to the peak FLOPS of the architecture.

2 Computational Efficiency

For the numerical simulation of large scale CFD and fluid-structure interaction (FSI) problems, computing time is still a limiting factor for the size and complexity of the problem. Waiting for more powerful computers will not solve this problem, as the demand for larger and more complex simulations usually grows as fast as the available computer power. It is instead highly advisable to use the full power that computers already offer today.

Especially on superscalar processors, the gap between sustained and peak performance is growing for scientific applications. Very often the sustained performance is below 5 percent of peak. The efficiency on vector computers, on the other hand, is usually much higher. For vectorizable programs it is possible to achieve a sustained performance of 30 to 60 percent of peak performance, or above [1, 4].

Starting from a very low level of serial efficiency, e.g. on a superscalar computer, it is a reasonable assumption that the overall level of efficiency of the code will drop even further when run in parallel. Especially if only a moderate number of processors is used, it is essential to use them as efficiently as possible. Therefore, in this paper we only look at the serial efficiency as one key ingredient of a highly efficient parallel code [1].

3 Performance Optimization

To achieve a high efficiency on a specific system it is in general advantageous to write hardware-specific code, i.e. the code has to make use of system-specific features like vector registers or the cache hierarchy. As our main target architecture is a NEC SX-6 parallel vector computer, we address some aspects of vector optimization in this paper. As we will show later, this kind of performance optimization also has a positive effect on the performance of the code on other architectures.

3.1 Vector Processors

Vector processors like the NEC SX-6 processor use a very different architectural approach from conventional scalar processors. Vectorization exploits regularities in the computational structure to accelerate uniform operations on independent data sets. Vector arithmetic instructions involve identical operations on the elements of vector operands located in the vector registers. Many scientific codes, such as FE programs, allow vectorization, since they are characterized by predictable fine-grain data-parallelism [4].

The SX-6 processor contains an 8-way replicated vector pipe capable of issuing a MADD each cycle and 72 vector registers, each holding 256 64-bit words. For non-vectorizable instructions the SX-6 also contains a cache-based superscalar unit. Since the vector unit is significantly more powerful than this scalar processor, it is critical to achieve high vector operation ratios, either via compiler discovery or explicitly through code and data (re-)organization.

3.2 Vector Optimization

To achieve high performance on a vector architecture there are three main variants of vectorization tuning: compiler flags, compiler directives, and code modifications.

In most cases optimal performance on a vector architecture can only be achieved with code that was especially designed for this kind of processor. Here the data management as well as the structure of the algorithms are important. But for an existing code it is often also very effective to concentrate the vectorization efforts on performance-critical parts and to use more or less extensive code modifications to achieve a better performance. The reordering or fusion of loops to increase the vector length, or the use of temporary variables to break data dependencies in loops, can be simple measures to improve the vector performance.

We would like to put forward a very simple concept that requires only small changes to an existing FE code and significantly improves the vector performance of the integration of element matrices of an arbitrary unstructured finite element mesh.

Old structure:
  element calculation
    loop all elements
      loop gauss points
        shape functions, derivatives, etc.
        calculate stiffness contributions
      assemble element matrix

New structure:
  element calculation
    group similar elements into sets
    loop all sets
      loop gauss points
        shape functions, derivatives, etc.
        loop elements in set
          calculate stiffness contributions
      assemble all element matrices

Figure 1: Old and new structure of the algorithm to evaluate element matrices.

4 Vectorization Concept for FE

The main idea of this concept is to group computationally similar elements into sets and then perform all calculations necessary to build the element matrices simultaneously for all elements in one set. Computationally similar in this context means that all elements in one set require exactly the same operations to integrate the element matrix, i.e. they have, for example, the same topology and the same number of nodes and integration points.

The changes necessary to implement this concept are visualized in the structure charts in figure 1. Instead of looping over all elements and calculating each element matrix individually, all sets of elements are now processed. For every set the usual procedure to integrate the matrices is carried out, except that on the lowest level, i.e. as the innermost loop, a new loop over all elements in the current set is introduced.
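
The regrouping described above can be sketched in a few lines. The following Python sketch only illustrates the loop order; it is not the authors' implementation, which is written in C and Fortran. The element records, the set size limit `MAX_SET_SIZE`, and the placeholder "stiffness contribution" are hypothetical, and an interpreted language will not show the vector speed-up itself.

```python
from collections import defaultdict

# Hypothetical element records: (topology, number of nodes, number of Gauss points, size h).
elements = [
    ("hex", 8, 8, 1.0),
    ("tet", 4, 4, 0.5),
    ("hex", 8, 8, 2.0),
    ("hex", 8, 8, 0.5),
    ("tet", 4, 4, 1.5),
]

MAX_SET_SIZE = 256  # assumed upper bound on the set size, not a value from the paper


def group_into_sets(elements, max_set_size=MAX_SET_SIZE):
    """Group indices of computationally similar elements into bounded sets."""
    by_kind = defaultdict(list)
    for idx, (topology, n_nodes, n_gauss, _h) in enumerate(elements):
        by_kind[(topology, n_nodes, n_gauss)].append(idx)
    sets = []
    for kind, indices in by_kind.items():
        # Split large groups so that intermediate storage per set stays limited.
        for start in range(0, len(indices), max_set_size):
            sets.append((kind, indices[start:start + max_set_size]))
    return sets


def integrate_original(elements):
    """Old structure: loop over elements, then over Gauss points."""
    result = [0.0] * len(elements)
    for e, (_topology, _n_nodes, n_gauss, h) in enumerate(elements):
        for _g in range(n_gauss):
            weight = 1.0 / n_gauss      # stands in for shape functions, derivatives, etc.
            result[e] += weight * h     # stands in for a stiffness contribution
    return result


def integrate_grouped(elements, sets):
    """New structure: loop over sets, then Gauss points, then elements in the set."""
    result = [0.0] * len(elements)
    for (_topology, _n_nodes, n_gauss), indices in sets:
        for _g in range(n_gauss):
            weight = 1.0 / n_gauss      # evaluated once per set and Gauss point
            for e in indices:           # innermost loop runs over the elements of the set
                result[e] += weight * elements[e][3]
    return result
```

Both routines produce the same per-element values; only the loop nesting differs, so the innermost loop length is set by the number of elements in a set rather than by the small, fixed size of a single element.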
As some intermediate results now have to be stored for all elements in one set, the size of these sets is limited. The optimal size also depends strongly on the hardware architecture.

5 Further Influences on the Efficiency

It is well known that the programming language can have a large impact on the performance of a scientific code. Fortran is often considered the best choice for highly efficient code [5], whereas some features of modern programming languages, like pointers in C or objects in C++, make vectorization more complicated or even impossible [4]. The very general pointer concept in C in particular makes it difficult for the compiler to identify data-parallel loops, as different pointers might alias each other. There are a few remedies for this problem, such as compiler flags or the restrict keyword. The latter is quite new in the C standard and it seems that it is not yet fully implemented in every compiler.

We have implemented the proposed concept for the calculation of the element matrices in five different variants. The first four are implemented in C, the last one in Fortran. Further differences are the array management and the use of the restrict keyword. For a detailed description of the variants see table 1. Multi-dimensional arrays denote the use of 3- or 4-dimensional arrays to store intermediate results, whereas one-dimensional arrays imply manual indexing.
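
To illustrate what manual indexing of a one-dimensional array means, the sketch below stores the same intermediate results once in a nested 3-dimensional structure and once in a flat array addressed through a hand-written offset. The sizes, the helper names `flat_index` and `entry`, and the choice to make the element index the fastest-varying one are assumptions for illustration; the paper does not state the index order used in its variants.

```python
N_NODES = 3  # hypothetical matrix dimension per element
N_ELEMS = 4  # hypothetical number of elements in one set


def flat_index(i, j, e, n_j=N_NODES, n_e=N_ELEMS):
    """Row-major offset of entry (i, j, e); the element index e varies fastest."""
    return (i * n_j + j) * n_e + e


def entry(i, j, e):
    """Placeholder intermediate result for matrix entry (i, j) of element e."""
    return 100 * i + 10 * j + e


# Multi-dimensional storage: the language resolves multi[i][j][e].
multi = [[[entry(i, j, e) for e in range(N_ELEMS)]
          for j in range(N_NODES)]
         for i in range(N_NODES)]

# One-dimensional storage: the offset is computed by hand.
flat = [0] * (N_NODES * N_NODES * N_ELEMS)
for i in range(N_NODES):
    for j in range(N_NODES):
        for e in range(N_ELEMS):
            flat[flat_index(i, j, e)] = entry(i, j, e)
```

With this layout, consecutive elements of a set sit next to each other in memory for a fixed matrix entry, which is the access pattern an innermost loop over elements would use.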

                    orig    var1    var2    var3    var4    var5
language            C       C       C       C       C       Fortran
array dimensions    multi   multi   multi   one     one     multi
restrict keyword    (used in two of the C variants; column positions not recoverable)
SX                  (relative times not recoverable)
Itanium             (relative times not recoverable)
Pentium             (relative times not recoverable)

Table 1: Influences on the performance. Properties of the five different variants and their relative time for calculation of stiffness contributions.

The results in table 1 give the CPU time spent on the calculation of some representative element matrix contributions, normalized by the time of the original code. The positive effect of grouping the elements can be clearly seen for the vector processor: the calculation time is reduced to less than 3% for all variants. On the other two processors the grouping of elements does not result in a better performance in all cases. The Itanium architecture shows an improved performance only for one-dimensional array management and for the variant implemented in Fortran, and the Pentium processor in general performs worse with the new structure of the code. Only for the last variant is the calculation time cut in half.

It can be clearly seen that the effect of the restrict keyword varies between the different compilers and processors and also between one-dimensional and multi-dimensional arrays. Using restrict on the SX-6 results only in small improvements for one-dimensional arrays, whereas on the Itanium architecture the speed-up for this array management is considerable. In contrast, on the Pentium architecture the restrict keyword has a positive effect on the performance of multi-dimensional arrays and a negative effect on one-dimensional ones.

The most important result of this analysis is the superior performance of Fortran. The last variant is the fastest on all platforms. This is the reason we favor Fortran for performance-critical scientific code and use the last variant for our further examples.

6 Results

In conclusion, we would like to demonstrate the positive effect of the proposed concept for the calculation of element matrices on a full CFD simulation.
The flow is the Beltrami flow (for details see [6]) and the unit cube was discretized by stabilized 8-noded hexahedral elements [2].

Figure 2: Split-up of total calculation time for 32 time steps of the Beltrami Flow on the SX-6. (Bar chart of calculation time [sec] for the original code and variant 5, split into element calculation, solver, and other.)

                element calc.           stiffness contr.
                original    var5        original    var5
SX              (values not recoverable)
Itanium         (values not recoverable)
Pentium         (values not recoverable)

Table 2: Efficiency of original and new code in percent of peak performance.

1 NEC SX-6, 565 MHz; NEC C++/SX Compiler, Version 1.0 Rev. 063; NEC FORTRAN/SX Compiler, Version 2.0 Rev
2 Hewlett Packard Itanium2, 1.3 GHz; HP aC++/ANSI C Compiler, Rev. C.05.50; HP F90 Compiler, v
3 Intel Pentium4, 2.6 GHz; Intel C++ Compiler, Version 8.0; Intel Fortran Compiler, Version
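
The efficiency values reported in percent of peak performance are ratios of sustained FLOPS to peak FLOPS. As a rough worked example, the sketch below derives a theoretical peak for the SX-6 from the figures given in the text (8 vector pipes, one MADD per cycle, 565 MHz) under the assumption that one MADD counts as two floating-point operations. The sustained value is hypothetical and is not a measurement from the paper; the function names are illustrative.

```python
def peak_flops(n_pipes, flops_per_cycle_per_pipe, clock_hz):
    """Theoretical peak: pipes x floating-point operations per cycle x clock rate."""
    return n_pipes * flops_per_cycle_per_pipe * clock_hz


def efficiency(sustained_flops, peak):
    """Ratio of achieved sustained FLOPS to peak FLOPS."""
    return sustained_flops / peak


# Assumption: a multiply-add (MADD) is counted as 2 floating-point operations.
sx6_peak = peak_flops(n_pipes=8, flops_per_cycle_per_pipe=2, clock_hz=565e6)

# Hypothetical sustained rate, chosen only to illustrate the ratio.
sustained = 2.712e9

print(f"peak:       {sx6_peak / 1e9:.2f} GFLOPS")
print(f"efficiency: {100 * efficiency(sustained, sx6_peak):.1f} % of peak")
```

Under these assumptions the peak is 9.04 GFLOPS, so an efficiency of about 30 percent corresponds to a sustained rate of roughly 2.7 GFLOPS.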

In figure 2 the total calculation time for 32 time steps of this example, and the fractions for the element calculation and the solver on the SX-6, are given for the original code and the full implementation of variant 5. The time spent on the element calculation, formerly the major part of the total time, was reduced by a factor of 24.

This considerable improvement can also be seen in the sustained performance given in table 2 as a percentage of peak performance. The original code, which was not written for any specific architecture, performs poorly on the SX-6 and moderately on the other platforms. The new code, designed for a vector processor, achieves an acceptable efficiency of around 30 percent for the complete element calculation, and for several subroutines, like the calculation of some stiffness contributions, even a superior efficiency of above 70 percent. It has to be noted that these high performance values come along with a vector length of almost 256 and a vector operations ratio of above 99.5 percent.

The performance was also improved significantly on the Itanium2 and Pentium4 processors, which were not the main target architectures, and on the Itanium2 the new code reaches around the same efficiency as on the vector architecture.

References

[1] Behr, M., Pressel, D.M., Sturek, W.B.: Comments on CFD Code Performance on Scalable Architectures. Computer Methods in Applied Mechanics and Engineering 2000; 190:
[2] Wall, W.A.: Fluid-Struktur-Interaktion mit stabilisierten Finiten Elementen. PhD thesis, Institut für Baustatik, Universität Stuttgart,
[3] Tezduyar, T., Aliabadi, S., Behr, M., Johnson, A., Kalro, V., Litke, M.: Flow Simulation and High Performance Computing. Computational Mechanics 1996; 18:
[4] Oliker, L., Canning, A., Carter, J., Shalf, J., Skinner, D., Ethier, S., Biswas, R., Djomehri, J., van der Wijngaart, R.: Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations. In: Proceedings of the ACM/IEEE Supercomputing Conference 2003, Phoenix, Arizona, USA
[5] Pohl, T., Deserno, F., Thürey, N., Rüde, U., Lammers, P., Wellein, G., Zeiser, T.: Performance Evaluation of Parallel Large-scale Lattice Boltzmann Applications on Three Supercomputing Architectures. In: Proceedings of the ACM/IEEE Supercomputing Conference 2004, Pittsburgh, USA
[6] Ethier, C.R., Steinman, D.A.: Exact Fully 3D Navier Stokes Solution for Benchmarking. International Journal for Numerical Methods in Fluids 1994; 19:


More information

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

On the Single Processor Performance of Simple Lattice Boltzmann Kernels

On the Single Processor Performance of Simple Lattice Boltzmann Kernels On the Single Processor Performance of Simple Lattice Boltzmann Kernels G. Wellein 1, T. Zeiser, G. Hager, S. Donath Regionales Rechenzentrum Erlangen, Martensstr. 1, 91058 Erlangen, Germany Abstract This

More information

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 Application of the Computer Capacity to the Analysis of Processors Evolution BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 arxiv:1705.07730v1 [cs.pf] 14 May 2017 Abstract The notion of computer capacity

More information

EVAPORATION: A TECHNIQUE FOR VISUALIZING MESH QUALITY

EVAPORATION: A TECHNIQUE FOR VISUALIZING MESH QUALITY EVAPORATION: A TECHNIQUE FOR VISUALIZING MESH QUALITY Lisa Durbeck University of Utah, Salt Lake City, UT, U.S.A. ldurbeck@cs.utah.edu ABSTRACT The work described here addresses information generated during

More information

Exploring unstructured Poisson solvers for FDS

Exploring unstructured Poisson solvers for FDS Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests

More information

Optimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides

Optimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

Function call overhead benchmarks with MATLAB, Octave, Python, Cython and C

Function call overhead benchmarks with MATLAB, Octave, Python, Cython and C Function call overhead benchmarks with MATLAB, Octave, Python, Cython and C André Gaul September 23, 2018 arxiv:1202.2736v1 [cs.pl] 13 Feb 2012 1 Background In many applications a function has to be called

More information

Performance Prediction for Parallel Local Weather Forecast Programs

Performance Prediction for Parallel Local Weather Forecast Programs Performance Prediction for Parallel Local Weather Forecast Programs W. Joppich and H. Mierendorff GMD German National Research Center for Information Technology Institute for Algorithms and Scientific

More information

(LSS Erlangen, Simon Bogner, Ulrich Rüde, Thomas Pohl, Nils Thürey in collaboration with many more

(LSS Erlangen, Simon Bogner, Ulrich Rüde, Thomas Pohl, Nils Thürey in collaboration with many more Parallel Free-Surface Extension of the Lattice-Boltzmann Method A Lattice-Boltzmann Approach for Simulation of Two-Phase Flows Stefan Donath (LSS Erlangen, stefan.donath@informatik.uni-erlangen.de) Simon

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

PATCH TEST OF HEXAHEDRAL ELEMENT

PATCH TEST OF HEXAHEDRAL ELEMENT Annual Report of ADVENTURE Project ADV-99- (999) PATCH TEST OF HEXAHEDRAL ELEMENT Yoshikazu ISHIHARA * and Hirohisa NOGUCHI * * Mitsubishi Research Institute, Inc. e-mail: y-ishi@mri.co.jp * Department

More information

ARCHITECTURES FOR PARALLEL COMPUTATION

ARCHITECTURES FOR PARALLEL COMPUTATION Datorarkitektur Fö 11/12-1 Datorarkitektur Fö 11/12-2 Why Parallel Computation? ARCHITECTURES FOR PARALLEL COMTATION 1. Why Parallel Computation 2. Parallel Programs 3. A Classification of Computer Architectures

More information

Multiple Issue ILP Processors. Summary of discussions

Multiple Issue ILP Processors. Summary of discussions Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware

More information

Network Bandwidth & Minimum Efficient Problem Size

Network Bandwidth & Minimum Efficient Problem Size Network Bandwidth & Minimum Efficient Problem Size Paul R. Woodward Laboratory for Computational Science & Engineering (LCSE), University of Minnesota April 21, 2004 Build 3 virtual computers with Intel

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

Application of Finite Volume Method for Structural Analysis

Application of Finite Volume Method for Structural Analysis Application of Finite Volume Method for Structural Analysis Saeed-Reza Sabbagh-Yazdi and Milad Bayatlou Associate Professor, Civil Engineering Department of KNToosi University of Technology, PostGraduate

More information

Corrected/Updated References

Corrected/Updated References K. Kashiyama, H. Ito, M. Behr and T. Tezduyar, "Massively Parallel Finite Element Strategies for Large-Scale Computation of Shallow Water Flows and Contaminant Transport", Extended Abstracts of the Second

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

A High-Order Accurate Unstructured GMRES Solver for Poisson s Equation

A High-Order Accurate Unstructured GMRES Solver for Poisson s Equation A High-Order Accurate Unstructured GMRES Solver for Poisson s Equation Amir Nejat * and Carl Ollivier-Gooch Department of Mechanical Engineering, The University of British Columbia, BC V6T 1Z4, Canada

More information

Improving Performance of Sparse Matrix-Vector Multiplication

Improving Performance of Sparse Matrix-Vector Multiplication Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign

More information

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs

Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Presented at the 2014 ANSYS Regional Conference- Detroit, June 5, 2014 Maximize automotive simulation productivity with ANSYS HPC and NVIDIA GPUs Bhushan Desam, Ph.D. NVIDIA Corporation 1 NVIDIA Enterprise

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction

FOR P3: A monolithic multigrid FEM solver for fluid structure interaction FOR 493 - P3: A monolithic multigrid FEM solver for fluid structure interaction Stefan Turek 1 Jaroslav Hron 1,2 Hilmar Wobker 1 Mudassar Razzaq 1 1 Institute of Applied Mathematics, TU Dortmund, Germany

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

A Study of Workstation Computational Performance for Real-Time Flight Simulation

A Study of Workstation Computational Performance for Real-Time Flight Simulation A Study of Workstation Computational Performance for Real-Time Flight Simulation Summary Jeffrey M. Maddalon Jeff I. Cleveland II This paper presents the results of a computational benchmark, based on

More information

Vectorized Search for Single Clusters

Vectorized Search for Single Clusters Vectorized Search for Single Clusters Hans Gerd Evertz Supercomputer Computations Research Institute, Florida State University, Tallahassee, FL 32306 evertz@scri.fsu.edu Aug. 1, 1992; Published in in J.

More information

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

PART I - Fundamentals of Parallel Computing

PART I - Fundamentals of Parallel Computing PART I - Fundamentals of Parallel Computing Objectives What is scientific computing? The need for more computing power The need for parallel computing and parallel programs 1 What is scientific computing?

More information

Analyzing Cache Bandwidth on the Intel Core 2 Architecture

Analyzing Cache Bandwidth on the Intel Core 2 Architecture John von Neumann Institute for Computing Analyzing Cache Bandwidth on the Intel Core 2 Architecture Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger published in Parallel Computing: Architectures, Algorithms

More information

Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models. C. Aberle, A. Hakim, and U. Shumlak

Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models. C. Aberle, A. Hakim, and U. Shumlak Development of a Maxwell Equation Solver for Application to Two Fluid Plasma Models C. Aberle, A. Hakim, and U. Shumlak Aerospace and Astronautics University of Washington, Seattle American Physical Society

More information

Fluid-Structure-Interaction Using SPH and GPGPU Technology

Fluid-Structure-Interaction Using SPH and GPGPU Technology IMPETUS AFEA SOLVER Fluid-Structure-Interaction Using SPH and GPGPU Technology Jérôme Limido Jean Luc Lacome Wayne L. Mindle GTC May 2012 IMPETUS AFEA SOLVER 1 2D Sloshing Water in Tank IMPETUS AFEA SOLVER

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Automating the Modeling and Optimization of the Performance of Signal Transforms

Automating the Modeling and Optimization of the Performance of Signal Transforms IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 8, AUGUST 2002 2003 Automating the Modeling and Optimization of the Performance of Signal Transforms Bryan Singer and Manuela M. Veloso Abstract Fast

More information

Flow simulation. Frank Lohmeyer, Oliver Vornberger. University of Osnabruck, D Osnabruck.

Flow simulation. Frank Lohmeyer, Oliver Vornberger. University of Osnabruck, D Osnabruck. To be published in: Notes on Numerical Fluid Mechanics, Vieweg 1994 Flow simulation with FEM on massively parallel systems Frank Lohmeyer, Oliver Vornberger Department of Mathematics and Computer Science

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

Techniques for Optimizing FEM/MoM Codes

Techniques for Optimizing FEM/MoM Codes Techniques for Optimizing FEM/MoM Codes Y. Ji, T. H. Hubing, and H. Wang Electromagnetic Compatibility Laboratory Department of Electrical & Computer Engineering University of Missouri-Rolla Rolla, MO

More information

Cache Justification for Digital Signal Processors

Cache Justification for Digital Signal Processors Cache Justification for Digital Signal Processors by Michael J. Lee December 3, 1999 Cache Justification for Digital Signal Processors By Michael J. Lee Abstract Caches are commonly used on general-purpose

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Dense linear algebra, LAPACK, MMM optimizations in ATLAS Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Today Linear algebra software: history,

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Statement of Research

Statement of Research On Exploring Algorithm Performance Between Von-Neumann and VLSI Custom-Logic Computing Architectures Tiffany M. Mintz James P. Davis, Ph.D. South Carolina Alliance for Minority Participation University

More information

MESHLESS SOLUTION OF INCOMPRESSIBLE FLOW OVER BACKWARD-FACING STEP

MESHLESS SOLUTION OF INCOMPRESSIBLE FLOW OVER BACKWARD-FACING STEP Vol. 12, Issue 1/2016, 63-68 DOI: 10.1515/cee-2016-0009 MESHLESS SOLUTION OF INCOMPRESSIBLE FLOW OVER BACKWARD-FACING STEP Juraj MUŽÍK 1,* 1 Department of Geotechnics, Faculty of Civil Engineering, University

More information