BenchIT Performance Measurement and Comparison for Scientific Applications
Guido Juckeland a, Stefan Börner a, Michael Kluge a, Sebastian Kölling a, Wolfgang E. Nagel a, Stefan Pflüger a, Heike Röding a, Stephan Seidl a, Thomas William a, Robert Wloch a
a Center for High Performance Computing, Dresden University of Technology, Dresden, Germany

Introduction

Contrary to common belief, performance evaluation is an art. [1]

With an increasing variety of operation fields, from office applications to data-massive, high-performance computing with very different user demands, the programmer's know-how of program optimization, the choice of the compiler version, and the usage of the compiler options have an important influence on the runtime. Current and future microprocessors offer a variety of different levels of parallel processing in combination with an increasing number of intelligently organized functional units and a deeply staged memory hierarchy.

Traditional benchmarks (e.g. [2,3]) highlight only a few aspects of the performance behavior. Often computer architects, system designers, software developers, and decision makers want to have more detailed information about the performance of the whole system than only one or a few values of a performance metric. This paper introduces BenchIT, a tool created by the Center for High Performance Computing Dresden to accompany the performance evaluator.

This art of performance evaluation actually contains two steps: performance measurement as well as data validation and comparison. BenchIT's modular design, therefore, consists of three layers (as shown in figure 1): the measuring kernels, a main program for the measurements, and a web based graphing engine to plot and compare the gathered data.

Figure 1: Components of the BenchIT-Project (labels: Webserver for displaying & comparing results, reads Resultfile; Main program runs the measurement, writes Resultfile, provides interface.h; Kernel provides the Algorithm, fulfills interface.h)
The unique step in this project is the concept of splitting the evaluation into exactly the two steps mentioned above, which makes it flexible enough to be used for any kind of performance measurement. The Center for High Performance Computing Dresden presents the established infrastructure for this project, which is designed to give the HPC community easy access to a variety of performance measurements and which can easily be extended by own measurements and, especially, by own measuring kernels.

1. Measuring Environment

The BenchIT measuring environment is especially designed for the hazardous conditions on all kinds of measuring platforms. After reducing all varying factors on different machines, only two utilities are certain: a shell and a compiler. The BenchIT measuring environment deliberately restricts itself to those two to allow the highest compatibility. The environment on a certain operating system is set up by a number of cascading shell scripts that compile the measuring kernel, link it to a main program, and execute the measuring run.

Some common definitions are placed in one small file named COMMONDEFS. This script provides the base name of the directory, the nodename, and the hostname of the machine as environment variables used by the main program. The next file used by each kernel is the file ARCHDEFS, which provides a basic set of system variables depending on the operating system on the machine. They look like the following:

    if [ "${uname_minus_s}" = "Linux" ]; then
      HAVE_CC=1
      HAVE_F77=1
      HAVE_F90=0
      HAVE_MPI=1
      CC="cc"
      CC_C_FLAGS="$CC_C_FLAGS -Wall -Werror -Waggregate-return -Wcast-align"
      CC_C_FLAGS_STD="-O2"
      CC_C_FLAGS_HIGH="-O3"
      LIB_PTHREAD="-lpthread"
    fi

These default values enable BenchIT to run on a normal installation of the operating systems included. Nevertheless, each user might want to set machine specific variables. This is possible by defining a set of LOCALDEFS. The LOCALDEFS-file is named after the nodename of the machine it runs on and holds exactly the same variables as already defined in the ARCHDEFS-file, thus allowing an easy customization. Additionally, the LOCALDEFS-directory accommodates the two input-files for each node. They are named <nodename>_input_architecture and <nodename>_input_display and allow the user to fill in large sections of the output-file (see 2.1), since they are just copied into the output-files.

The last part of the environment is made up of the variables used in the shell-script of the kernel itself. This script usually sets some kernel specific values or overwrites already existing variables (from the ARCHDEFS or LOCALDEFS).

2. Module Interfaces

In between the three BenchIT program layers stand two interface files. They ensure that the modules have a common basis to work together. The result-file - also called output-file - is, after it has been created on the local machine, transferred to the BenchIT webserver. The file interface.h is used as a common basis in the compilation and linking of one measurement run.
The following will provide a more detailed view of the two necessary and important interfaces.

2.1. The Output-File

A possible way to explain the results of a measuring kernel is to collect all the relevant data in a structured output file. This idea was realized in the BenchIT output-files, which are saved in the subdirectory output. They are coded in ASCII format for easy viewing and editing. The different parts of the structure are bounded by the keywords beginofxxxxx and endofxxxxx and are introduced in the following.

Measurement Information

This part of the output-file includes a kernel-string as a short description of the measuring kernel, for example "Fortran dot product", a timestamp, a comment, the programming language, the compiler used and its compiler flags, and minima and maxima for the x- and y-values. Additionally, the string code-sequence, for example "do i=1,n# sum=sum+x(i)*y(i)#enddo", shows the characteristic feature of this measuring program.
Architecture

Important architectural statements are the node-name and the host-name. Output-files will not be accepted on the project homepage ([6]) without this information. A collection of architectural information was designed as a guideline for this part of the output-file, first to explain the measurement results and further to identify the machine the measurement ran on. The following characteristics are included (selection): mainboard manufacturer, mainboard type, mainboard chipset, processor name and clock rate, processor serial number, processor version, instruction set architecture and its level, several instruction set architecture extensions, instruction length, processor word length, and the number of integer, floating point, and load/store units. The cache hierarchy is described by the sizes, organization, and location of the caches. To characterize the memory system, information about the memory chip type, the memory bus type, and the clock rate is necessary.

Display

This section holds all information needed to set up the plotting engine to display the results contained in the output file. This includes axis texts and labels for all measured functions, the axis setup (linear or logarithmic), and the boundaries for the plotting range. Additionally, information from the sections Measurement Information and Architecture can be placed in the graph.

Identifier-Strings

This section is used to relate easily readable strings prepared for the web menu to all identifier-strings in the output-file, for example "ISA Extension" to the identifier-string processorisaextension2.

Data

The measured physical values are stored in the data section in a 2-dimensional ordering: the first value per row is the x value, followed by as many y values as there are measuring functions inside the kernel. Each new x value generates a new row. All values (integers or floating point numbers) are represented as ASCII coded decimal strings.

The design of the output-files is not static.
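As a rough illustration of the row layout of the data section, the following C sketch formats one row consisting of an x value and its y values. The function name format_data_row, the tab separator, and the "%.6e" number format are assumptions chosen for this example; the source only states that values are ASCII coded decimal strings with the x value first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Formats one row of the data section: the x value followed by ny y values,
 * one y value per measuring function of the kernel.
 * ASSUMPTIONS: tab-separated columns and the "%.6e" number format are
 * illustrative choices; the paper fixes neither.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int format_data_row(char *buf, size_t cap, double x, const double *y, size_t ny)
{
    assert(buf != NULL && (y != NULL || ny == 0));

    int n = snprintf(buf, cap, "%.6e", x);
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    size_t used = (size_t)n;

    for (size_t i = 0; i < ny; i++) {
        n = snprintf(buf + used, cap - used, "\t%.6e", y[i]);
        if (n < 0 || (size_t)n >= cap - used) {
            return -1;
        }
        used += (size_t)n;
    }

    n = snprintf(buf + used, cap - used, "\n");
    if (n < 0 || (size_t)n >= cap - used) {
        return -1;
    }
    used += (size_t)n;
    return (int)used;
}
```

A kernel with two measuring functions would thus yield rows of three columns, and each new problem size adds one more row.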
It is possible that additional parts will be inserted during the further development of the BenchIT project.

2.2. The File interface.h

The two data acquisition layers of the BenchIT project are linked through the C header file interface.h. It defines an info structure, in which a kernel provides information about itself. Furthermore, it specifies the functions called by the main program and the service functions to be used by the kernels.

Info structure

Some elements are used to fill out the output file, such as kernelstring, kernellibraries (e.g. PThread, MPI, BLAS), codesequence, axis texts and properties, and legend texts. The main program itself needs a few more details about the kernel, e.g. maxproblemsize, numfunctions, outlier_direction_upwards for the error correction by the main program, and kernel_execs_XXX, which allow an adaptation to the kind of parallelism the kernel wants to execute.

Interface functions

The main program uses the functions bi_getinfo, bi_init, bi_entry, and bi_cleanup - first to inform itself about the kernel to run, then to initialize the kernel, then to run the measurements for various problem sizes, and finally to clean up files and memory used by the kernel. Furthermore, the main program provides two tool functions - bi_gettime and bi_strdup.

3. Module Components

Having introduced the BenchIT module layer interfaces, the paper will now turn the focus to the BenchIT modules themselves. BenchIT consists of three module layers: the kernels, the main program, and the website. Each layer offers different services, which will be presented together with the modules themselves in the following.

3.1. The Kernels

Within this project a kernel is referred to as an algorithm or measuring program. Typical examples are a matrix multiplication or the Jacobi algorithm. Programming a kernel demands a certain discipline from the kernel author. Since BenchIT is to run on a variety of computation platforms, the kernel code has to be compatible with all of them. This can best be accomplished by using only basic program structures, avoiding system calls and system specific operations (footnote 1), and utilizing the functions provided by the main program.

The professed goal of the BenchIT-Team is to have every kernel distributed with BenchIT executable on every platform. Nevertheless, it is possible, and not valued less, to write a problem specific kernel. A typical use for this strategy might be the optimization of a certain algorithm on a specific target architecture.

As of today the following kernels are included in the BenchIT package: MPI-performance measurement (Roundtrip-Message and Binary-Tree-Broadcast programmed in C), performance measurement for the Jacobi algorithm (sequential in C and Java; parallel in Java using Java-Threads and in C using PThreads), matrix multiplication (sequential in C, Fortran 77, and Java; parallel in Fortran 77 using MPI), performance measurement for calculating the dot product for large vectors (sequential in Fortran 77; parallel in C using PThreads), performance measurement for the mathematical operations sine, cosine, and square root (sequential in C, Java, and Fortran 77), memory bandwidth (sequential in C), and IO-performance such as write rate and read rate for small and large files (parallel in C using PThreads).

Every BenchIT-User is also able and asked to act as an author of a kernel. A custom kernel can then be sent to the BenchIT-Team and will be taken into the kernel set if it is considered useful and complies with the kernel rules.
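To make the division of work between a kernel and the main program more concrete, here is a minimal C sketch of what a dot product kernel might look like. It is not the real interface.h: the function names bi_getinfo, bi_init, bi_entry, bi_cleanup, and bi_gettime and the field names of the info structure are taken from the paper, but every signature, every type, the struct dot_data, the constant ELEMENTS_PER_STEP, and the clock()-based stand-in for bi_gettime are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* ASSUMED scaling from the internal problem size to the vector length. */
#define ELEMENTS_PER_STEP 1000

/* HYPOTHETICAL stand-in for the info structure of interface.h. The field
 * names follow the paper; the types and the selection are assumptions. */
typedef struct {
    const char *kernelstring;      /* short description of the kernel */
    const char *codesequence;      /* characteristic code of the kernel */
    int maxproblemsize;            /* largest internal problem size */
    int numfunctions;              /* number of measured functions */
    int outlier_direction_upwards; /* 1: outliers are expected upwards */
} bi_info;

/* Kernel-private state created by bi_init (hypothetical). */
typedef struct {
    double *x;
    double *y;
    int maxproblemsize;
    double last_sum; /* kept so that the loop result is actually used */
} dot_data;

/* Stand-in for the tool function the main program would provide. */
static double bi_gettime(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Tells the main program what this kernel is (assumed signature). */
void bi_getinfo(bi_info *info)
{
    info->kernelstring = "C dot product";
    info->codesequence = "for(i=0;i<n;i++)# sum+=x[i]*y[i];";
    info->maxproblemsize = 100;
    info->numfunctions = 1;
    info->outlier_direction_upwards = 1; /* a run time can only get worse */
}

/* Allocates and fills the vectors once; returns NULL on failure. */
void *bi_init(int maxproblemsize)
{
    size_t len = (size_t)maxproblemsize * ELEMENTS_PER_STEP;
    dot_data *d = malloc(sizeof *d);
    if (d == NULL) {
        return NULL;
    }
    d->x = malloc(len * sizeof *d->x);
    d->y = malloc(len * sizeof *d->y);
    if (d->x == NULL || d->y == NULL) {
        free(d->x);
        free(d->y);
        free(d);
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        d->x[i] = 1.0;
        d->y[i] = 2.0;
    }
    d->maxproblemsize = maxproblemsize;
    d->last_sum = 0.0;
    return d;
}

/* Measures one problem size; results[0] receives the run time in seconds. */
int bi_entry(void *data, int problemsize, double *results)
{
    dot_data *d = data;
    assert(d != NULL && results != NULL);
    assert(problemsize >= 1 && problemsize <= d->maxproblemsize);

    /* The kernel translates the internal problem size itself. */
    size_t len = (size_t)problemsize * ELEMENTS_PER_STEP;
    double sum = 0.0;
    double start = bi_gettime();
    for (size_t i = 0; i < len; i++) {
        sum += d->x[i] * d->y[i];
    }
    results[0] = bi_gettime() - start;
    d->last_sum = sum;
    return 0;
}

/* Releases everything bi_init allocated. */
void bi_cleanup(void *data)
{
    dot_data *d = data;
    if (d != NULL) {
        free(d->x);
        free(d->y);
        free(d);
    }
}
```

A caller would invoke bi_getinfo once, pass the reported maxproblemsize to bi_init, call bi_entry for each problem size, and finish with bi_cleanup. The sketch sticks to basic program structures and the standard C library, in line with the portability rules given above.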
3.2. The Main Program

The first service module within the BenchIT layers is the main program for the measurement. It controls the generation of measurement data by the kernels, offers them service routines (see 2.2), and writes the resultfile (see 2.1).

The main program has to operate (just as the kernels) under a wide variety of system environments. However, the environment of the operating system is just one part of this variety. Another issue is the runtime environment. Since BenchIT supports, among others, MPI as a parallel environment, the main program has to adapt itself to that as well (footnote 2). One might argue that it would also be feasible to have different main programs for each runtime environment, yet the BenchIT designers considered this an unnecessary code redundancy, especially since using just one main file has been practicable so far.

One measurement run follows the scheme shown in figure 2. During the measurement the main program calls the kernel with a certain problem size. This is just an internal value and need not have anything to do with the actual measurement (footnote 3). The translation is done by the kernel.

Figure 2: Schematic view of one measurement run (Initialize Program & Kernel; Measure one Problemsize; still time left? yes: measure the next problem size, no: Analyze Data; Write Result- & Quickview-File)

The main program also contains an error correction for the kernels, since performance differences during a measurement run for one problem size due to other system processes running on the CPU are inevitable. BenchIT thus uses the following approach: measure one problem size n times (footnote 4). Each kernel informs the main program in the init routine whether the outliers of each function have to be expected upwards or downwards. BenchIT then uses the best value of the n runs (the smallest one if outliers are expected upwards, the largest one otherwise).

After measuring, the main program will analyze the gathered data. In this step minima and maxima are gathered and useful display boundaries are calculated. Furthermore, some environment variables (see 1) are gathered and the two computer specific input files are opened. With all this done, the main program will then write the output file (see 2.1) as well as a gnuplot-file used by the local QUICKVIEW.

Footnote 1: If system calls become necessary, they will have to conform to the POSIX ([4]) standard.
Footnote 2: In the case of MPI this is done by compiling the main program with the -DUSE_MPI option.
Footnote 3: The internal problem size might be the same as the external one in the case of a matrix multiply, but it could also be scaled by a certain factor.

3.3. The Webserver

The BenchIT web interface ([6]) complements the BenchIT project by making it possible to plot the results of the measuring kernels and to compare them directly. It is the unique step in the project and allows access to all measurement data with just an internet browser.

3.3.1. Specification

The webserver manages the output-files (see 2.1) uploaded by the registered users. They are held as ASCII-files as well as entries in a PostgreSQL-Database. The PHP-Webpages use the database to assemble a plot and then write instructions for gnuplot ([5]), which produces an eps-file that can be downloaded directly (as done in figure 3). Additionally, a JPEG-image is created and displayed on the website.

It is specified that all kinds of measuring data can be displayed in one graph. The only limitation is that all data in one graph has to be expressed in one of at most two units (e.g. FLOPS, seconds, or a number of hits or misses), since gnuplot can display at most two different y-axes.

Another important question to be answered is how the plots will be assembled and how the user can customize the plots. The BenchIT-Team has so far implemented two strategies:

Selection by architectural characteristics: The first possibility is to compare different values of one architectural feature.
It is possible to show the sensitivity of the results of the measuring kernels to the physical size of one architectural feature. This way it is possible to look for specific performance data for a given architectural feature and compare it to other architectures.

Selection by the measuring kernel: The second possibility compares different characteristics of the architecture, which are all calculated by just one measuring kernel. It can be considered the expressway in the adaptation of the plot result, since it is possible to customize a plot result in just three steps.

Figure 3: The graph for a matrix multiplication (title: Matrix Multiply; x-axis: Matrix Size; y-axis: Flops; curves: ijk, ikj, jik, jki, kij, kji)

3.3.2. The construction of the BenchIT web interface

The BenchIT web interface consists of two parts: an open and a restricted section. The measurement data is only accessible after registering on the website. This is also a security question, since it is therefore traceable who uploaded which output-file. At the moment only registered users can download the measurement program, because BenchIT is still under development. New accounts are first locked automatically and then unlocked by the web interface administrators. All output-files uploaded to the webserver are backed up on a daily basis, which ensures the availability of the data. Additional security measures are implemented, so that data classified as non-disclosure can be uploaded and only be viewed by one user or a group of users.

Footnote 4: The n is set by the compiler option -DERROR_CORRECTION=n 1.

4. First Results of the Project

The project has been running for one year now and most of the immediate goals have been achieved. The measurement (as shown in figure 4) is so flexible that an adaptation to a new platform is a matter of filling out one configuration file. The kernels run on all platforms that provide the necessary compilers and libraries. The webserver is well capable of administering and plotting the files. It has been especially designed to work without Java-Script to allow the greatest browser compatibility. After first attempts without a database to support the server in managing the resultfiles for plotting, it has been decided that a database for the arrangement of the plots is necessary to achieve acceptable response times on the website.

    Guido@bluerabbit ~/benchit/src/kernel/matmul_c
    $ ./SUBDIREXEC.SH
    No definitions for your operating system found in ARCHDEFS.
    You will have to set them manually in your LOCALDEFS.
    BenchIT will not run without at least one set of definitions
    No definitions for your operating system found in ARCHDEFS.
    You will have to set them manually in your LOCALDEFS.
    BenchIT will not run without at least one set of definitions
    Warning: the variable 'ENVIRONMENT' is not set using NOTHING as default
    BenchIT: Getting info about kernel [ OK ]
    BenchIT: Getting starting time [ OK ]
    BenchIT: Selected kernel: Matrix Multiply
    BenchIT: Initializing kernel [ OK ]
    BenchIT: Allocating memory for results [ OK ]
    BenchIT: Measuring..
    BenchIT: Total time limit reached. Stopping measurement.
    BenchIT: Analyzing results [ OK ]
    BenchIT: Writing resultfile [ OK ]
    BenchIT: Wrote output to "matmul_c0_amk7_1g33_2003_08_15 15_55.bit"
    BenchIT: Writing quickview file [ OK ]
    BenchIT: Finishing [ OK ]
    rm: cannot unlink `matmul_c': No such file or directory
    Guido@bluerabbit ~/benchit/src/kernel/matmul_c
    $

Figure 4: Output of one measurement run

5. Summary and Outlook

The BenchIT kernels generate a large amount of measurement results, depending on the number of functional arguments. Using the web interface, the user is given the chance to show the selected results of different measuring programs in only one coordinate system. Often there are different reasons that can cause characteristic minima, maxima, or a special shape in a graph. It is necessary to collect additional information about the tested system to explain such effects on the basis of well-known system properties and physical values of the realization. The BenchIT-Project wants to provide such an evaluation platform by offering a variety of measurement kernels as well as an easily accessible plotting engine, thus enabling an easy way to measure performance on a specific system and to compare the result, which is a full graph instead of just a number, to results contributed by other users.

The further development of the BenchIT-project will take place on all module layers. A GUI for the configuration of the measurements is under development. It will provide an easier way to handle the measurements by partially substituting the shell scripts that have been running the measurements up to this point. The power of the PCL will be utilized to access more measurement data. Furthermore, an additional way to plot the data on the website by using Java-Applets and Java graphing tools is planned. The BenchIT-project will not merely be just another tool in the art of performance analysis, yet it will have to prove to be a very powerful one.

REFERENCES

[1] Raj Jain: The Art of Computer Systems Performance Analysis. John Wiley, Chichester
[2] Standard Performance Evaluation Corporation (SPEC)
[3] LINPACK
[4] IEEE POSIX
[5] Gnuplot
[6] The BenchIT Webserver
An MPI failure detector over PMPI 1 Donghoon Kim Department of Computer Science, North Carolina State University Raleigh, NC, USA Email : {dkim2}@ncsu.edu Abstract Fault Detectors are valuable services
More informationHigh-Performance Scientific Computing
High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis