BenchIT Performance Measurement and Comparison for Scientific Applications

Guido Juckeland a, Stefan Börner a, Michael Kluge a, Sebastian Kölling a, Wolfgang E. Nagel a, Stefan Pflüger a, Heike Röding a, Stephan Seidl a, Thomas William a, Robert Wloch a

a Center for High Performance Computing, Dresden University of Technology, Dresden, Germany

Introduction

Contrary to common belief, performance evaluation is an art. [1] With an increasing variety of operation fields, from office applications to data-massive, high-performance computing with very different user demands, the programmer's know-how of program optimization, the choice of the compiler version, and the usage of the compiler options have an important influence on the runtime. Current and future microprocessors offer a variety of different levels of parallel processing in combination with an increasing number of intelligently organized functional units and a deeply staged memory hierarchy.

Traditional benchmarks (e.g. [2,3]) highlight only a few aspects of the performance behavior. Often computer architects, system designers, software developers, and decision makers want more detailed information about the performance of the whole system than just one or a few values of a performance metric. This paper introduces BenchIT, a tool created by the Center for High Performance Computing Dresden to accompany the performance evaluator.

This art of performance evaluation actually contains two steps: performance measurement as well as data validation and comparison. BenchIT's modular design therefore consists of three layers (as shown in figure 1): the measuring kernels, a main program for the measurements, and a web-based graphing engine to plot and compare the gathered data.

Figure 1: Components of the BenchIT Project. (Diagram: the kernel provides the algorithm and fulfills interface.h; the main program provides interface.h, runs the measurement, and writes the resultfile; the webserver reads the resultfile for displaying and comparing results.)
The unique step in this project is the concept of splitting the evaluation into exactly the two steps mentioned above, which makes it flexible enough to be used for any kind of performance measurement. The Center for High Performance Computing Dresden presents the established infrastructure for this project, which is designed to give the HPC community easy access to a variety of performance measurements and is easily extendable by own measurements and even, indeed especially, own measuring kernels.

1. Measuring Environment

The BenchIT measuring environment is especially designed for the hazardous conditions on all kinds of measuring platforms. Once all factors that vary between machines are set aside, only two utilities are certain: a shell and a compiler. The BenchIT measuring environment deliberately restricts itself to those two to allow the highest compatibility. The environment on a certain operating system is set up by a number of cascading shell scripts that compile the measuring kernel, link it to a main program, and execute the measuring run.

Some common definitions are placed in one small file named COMMONDEFS. This script provides the base name of the directory, the nodename, and the hostname of the machine as environment variables used by the main program. The next file used by each kernel is the file ARCHDEFS, which provides a basic set of system variables depending on the operating system of the machine. They look like the following:

    if [ "${uname_minus_s}" = "Linux" ]; then
      HAVE_CC=1
      HAVE_F77=1
      HAVE_F90=0
      HAVE_MPI=1
      CC="cc"
      CC_C_FLAGS="$CC_C_FLAGS -Wall -Werror -Waggregate-return -Wcast-align"
      CC_C_FLAGS_STD="-O2"
      CC_C_FLAGS_HIGH="-O3"
      LIB_PTHREAD="-lpthread"
    fi

These default values enable BenchIT to run on a normal installation of the operating systems included. Nevertheless, each user might want to set machine-specific variables. This is possible by defining a set of LOCALDEFS. The LOCALDEFS file is named after the nodename of the machine it runs on and holds exactly the same variables as already defined in the ARCHDEFS file, thereby allowing easy customization. Additionally, the LOCALDEFS directory accommodates the two input files for each node. They are named <nodename>_input_architecture and <nodename>_input_display and allow large sections of the output-file (see 2.1) to be filled in, since they are simply copied into the output-files. The last part of the environment consists of the variables used in the shell script of the kernel itself, which usually sets some kernel-specific values or overwrites already existing variables (from the ARCHDEFS or LOCALDEFS).

2. Module Interfaces

In between the three BenchIT program layers stand two interface files. They ensure that the modules have a common basis to work together. The result-file, also called output-file, is created on the local machine and then transferred to the BenchIT webserver. The file interface.h is used as a common basis in the compilation and linking of one measurement run.
The following provides a more detailed view of these two interfaces.

2.1. The Output-File

A possible way to explain the results of a measuring kernel is to collect all the relevant data in a structured output file. This idea was realized in the BenchIT output-files, which are saved in the subdirectory output. They are coded in ASCII format for easy viewing and editing. The different parts of the structure are bounded by the keywords beginofxxxxx and endofxxxxx and are introduced in the following.

Measurement Information

This part of the output-file includes a kernel-string as a short description of the measuring kernel (for example "Fortran dot product"), a timestamp, a comment, the programming language, the compiler used and its compiler flags, and minima and maxima for the x- and y-values. Additionally, the string code-sequence, for example "do i=1,n# sum=sum+x(i)*y(i)#enddo", shows the characteristic feature of this measuring program.

Architecture

Important architectural statements are the node-name and the host-name. Output-files will not be accepted on the project homepage [6] without this information. A collection of architectural information was designed as a guideline for this part of the output-file, first to explain the measurement results and second to identify the machine the measurement ran on. The following characteristics are included (selection): mainboard manufacturer, mainboard type, mainboard chipset, processor name and clock rate, processor serial number, processor version, instruction set architecture and its level, several instruction set architecture extensions, instruction length, processor word length, and the number of integer, floating point, and load-store units. The cache hierarchy is described by its sizes, organization, and location. To characterize the memory system, information about the memory chip type used, the memory bus type, and the clock rate is necessary.

Display

This section holds all information needed to set up the plotting engine to display the results contained in the output file. This includes axis texts and labels for all measured functions, the axis setup (linear or logarithmic), and the boundaries for the plotting range. Additionally, information from the sections Measurement Information and Architecture can be placed in the graph.

Identifier-Strings

This section relates easily readable strings prepared for the web menu to all identifier-strings in the output-file, for example "ISA Extension" to the identifier-string processorisaextension2.

Data

The measured physical values are stored in the data section in a two-dimensional ordering: the first value per row is the x value, followed by the y values, whose number depends on the number of measuring functions inside the kernel. Each new x value generates a new row. All values (integers or floating point numbers) are represented as ASCII-coded decimal strings.

The design of the output-files is not static.
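The row layout of the data section can be illustrated with a short C sketch. It is not taken from the BenchIT sources: the helper names format_data_row and write_data_section, the tab separator, the %g number format, and the concrete keywords beginofdata and endofdata (one possible instance of the beginofxxxxx/endofxxxxx pattern) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats one row of the data section: the x value first, then one y value
 * per measuring function. The tab separator and "%g" are assumptions of this
 * sketch. Returns the row length, or -1 if the buffer is too small. */
int format_data_row(char *buf, size_t size, double x,
                    const double *y, size_t num_functions)
{
    assert(buf != NULL && size > 0);
    assert(y != NULL || num_functions == 0);

    int n = snprintf(buf, size, "%g", x);
    if (n < 0 || (size_t)n >= size)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < num_functions; i++) {
        n = snprintf(buf + used, size - used, "\t%g", y[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}

/* Writes a whole data section. rows holds num_rows records of
 * (1 + num_functions) doubles each, stored row by row. The keywords
 * "beginofdata" and "endofdata" are hypothetical instances of the
 * beginofxxxxx/endofxxxxx pattern. Returns 0 on success, -1 on error. */
int write_data_section(FILE *out, const double *rows,
                       size_t num_rows, size_t num_functions)
{
    char line[512];
    size_t cols = num_functions + 1;

    assert(out != NULL && rows != NULL);
    if (fputs("beginofdata\n", out) == EOF)
        return -1;
    for (size_t r = 0; r < num_rows; r++) {
        const double *rec = rows + r * cols;
        if (format_data_row(line, sizeof line, rec[0], rec + 1,
                            num_functions) < 0)
            return -1;
        if (fprintf(out, "%s\n", line) < 0)
            return -1;
    }
    return fputs("endofdata\n", out) == EOF ? -1 : 0;
}
```

With two measuring functions, a record for x = 64 with y values 1.5 and 0.25 would be written as the three tab-separated fields 64, 1.5, and 0.25 on one row.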
The output-file format may gain additional parts during the further development of the BenchIT project.

2.2. The File interface.h

The two data acquisition layers of the BenchIT project are linked through the C header file interface.h. It defines an info structure in which a kernel provides information about itself. Furthermore, it specifies the functions called by the main program and the service functions to be used by the kernels.

Info structure: Some elements are used to fill out the output file, such as kernelstring, kernellibraries (e.g. PThread, MPI, BLAS), codesequence, axis texts and properties, and legend texts. The main program itself needs a few more details about the kernel, e.g. maxproblemsize, numfunctions, outlier_direction_upwards for the error correction by the main program, and kernel_execs_XXX, which allow an adaptation to the kind of parallelism the kernel wants to execute.

Interface functions: The main program uses the functions bi_getinfo, bi_init, bi_entry, and bi_cleanup: first to inform itself about the kernel to run, then to initialize the kernel, then to run the measurements for various problem sizes, and finally to clean up files and memory used by the kernel. Furthermore, the main program provides two tool functions, bi_gettime and bi_strdup.

3. Module Components

Having introduced the BenchIT module layer interfaces, the paper now turns to the BenchIT modules themselves. BenchIT consists of three module layers: the kernels, the main program, and the website. Each layer offers different services, which are presented together with the modules in the following.

3.1. The Kernels

Within this project a kernel is an algorithm or measuring program. Typical examples are a matrix multiplication or the Jacobi algorithm. Programming a kernel demands a certain discipline from the kernel author. Since BenchIT is to run on a variety of computation platforms, the kernel code has to be compatible with all of them. This can best be accomplished by:

- using only basic program structures,
- avoiding system calls and system-specific operations¹, and
- utilizing the functions provided by the main program.

The professed goal of the BenchIT team is for every kernel distributed with BenchIT to be executable on every platform. Nevertheless, it is possible, and valued no less, to write a problem-specific kernel. A typical use for this strategy might be the optimization of a certain algorithm on a specific target architecture.

As of today, the following kernels are included in the BenchIT package:

- MPI performance measurement (Roundtrip-Message and Binary-Tree-Broadcast, programmed in C),
- performance measurement for the Jacobi algorithm (sequential in C and Java; parallel in Java using Java-Threads and in C using PThreads),
- matrix multiplication (sequential in C, Fortran 77, and Java; parallel in Fortran 77 using MPI),
- performance measurement for calculating the dot product of large vectors (sequential in Fortran 77; parallel in C using PThreads),
- performance measurement for the mathematical operations sine, cosine, and square root (sequential in C, Java, and Fortran 77),
- memory bandwidth (sequential in C), and
- IO performance such as write rate and read rate for small and large files (parallel in C using PThreads).

Every BenchIT user is also able and asked to act as an author of a kernel. A custom kernel can then be sent to the BenchIT team and will be added to the kernel set if it is considered useful and complies with the kernel rules.
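To make the division of labour between kernel and main program more concrete, the following C sketch shows what a small sequential dot product kernel might look like. Only the function names bi_getinfo, bi_init, bi_entry, bi_cleanup, and bi_gettime and the info structure field names come from the paper (written here with underscores). The signatures, the types kernel_info and dot_state, the clock()-based stand-in for bi_gettime, and the layout of the results array are assumptions of this sketch, so the real interface.h may well differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical subset of the info structure. The field names follow the
 * paper; the types are assumptions of this sketch. */
typedef struct {
    const char *kernelstring;
    const char *codesequence;
    int maxproblemsize;
    int numfunctions;
} kernel_info;

/* Private kernel state handed back to the main program by bi_init. */
typedef struct {
    double *x;
    double *y;
    int n;
    double last_sum; /* keeps the result alive so the loop is not removed */
} dot_state;

/* Stand-in for the timer service of the main program (CPU seconds). */
static double bi_gettime(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Tells the main program what this kernel is and how far it scales. */
void bi_getinfo(kernel_info *info)
{
    info->kernelstring = "C dot product";
    info->codesequence = "for(i=0;i<n;i++)# sum+=x[i]*y[i];";
    info->maxproblemsize = 1000;
    info->numfunctions = 1;
}

/* Allocates and fills the vectors once, for the largest problem size. */
void *bi_init(int maxproblemsize)
{
    assert(maxproblemsize > 0);
    dot_state *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->n = maxproblemsize;
    s->last_sum = 0.0;
    s->x = malloc((size_t)s->n * sizeof *s->x);
    s->y = malloc((size_t)s->n * sizeof *s->y);
    if (s->x == NULL || s->y == NULL) {
        free(s->x);
        free(s->y);
        free(s);
        return NULL;
    }
    for (int i = 0; i < s->n; i++) {
        s->x[i] = 1.0;
        s->y[i] = 2.0;
    }
    return s;
}

/* Measures one problem size. results[0] receives the x value, results[1]
 * the single y value (floating point operations per second). */
int bi_entry(void *state, int problemsize, double *results)
{
    dot_state *s = state;
    assert(s != NULL && results != NULL);
    if (problemsize < 1 || problemsize > s->n)
        return -1;

    double start = bi_gettime();
    double sum = 0.0;
    for (int i = 0; i < problemsize; i++)
        sum += s->x[i] * s->y[i];
    double elapsed = bi_gettime() - start;

    s->last_sum = sum;
    results[0] = (double)problemsize;
    /* One multiply and one add per element; 0 if the timer is too coarse. */
    results[1] = elapsed > 0.0 ? 2.0 * problemsize / elapsed : 0.0;
    return 0;
}

/* Releases everything bi_init allocated. */
void bi_cleanup(void *state)
{
    dot_state *s = state;
    if (s == NULL)
        return;
    free(s->x);
    free(s->y);
    free(s);
}
```

In this scheme a main program would call bi_getinfo and bi_init once, bi_entry repeatedly for the problem sizes it wants to measure, and bi_cleanup at the end. The sketch uses only standard C library calls, in line with the kernel rules listed above.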
3.2. The Main Program

The first service module within the BenchIT layers is the main program for the measurement. It controls the generation of measurement data by the kernels, offers them service routines (see 2.2), and writes the resultfile (see 2.1). The main program has to operate (just as the kernels) under a wide variety of system environments. However, the environment of the operating system is just one part of this variety. Another issue is the runtime environment. Since BenchIT supports, among others, MPI as a parallel environment, the main program has to adapt itself to that as well.² One might argue that it would also be feasible to have a different main program for each runtime environment, yet the BenchIT designers considered that an unnecessary code redundancy, especially since using just one main file has been practicable so far.

One measurement run follows the scheme shown in figure 2. During the measurement the main program calls the kernel with a certain problem size. This is just an internal value and need not have anything to do with the actual measurement.³ The translation is done by the kernel.

Figure 2: Schematic view of one measurement run. (Flow chart: Initialize Program & Kernel; Measure one Problemsize; still time left? If yes, measure again; if no, Analyze Data; Write Result- & Quickview-File.)

The main program also contains an error correction for the kernels, since performance differences during a measurement run for one problem size, caused by other system processes running on the CPU, are inevitable. BenchIT thus uses the following approach: measure one problem size n times.⁴ Each kernel informs the main program in the init routine whether the outliers of each function have to be expected upwards or downwards. BenchIT then uses the best value of the n runs.

After measuring, the main program analyzes the gathered data. In this step minima and maxima are gathered and useful display boundaries are calculated. Furthermore, some environment variables (see 1) are gathered and the two computer-specific input files are opened. With all this done, the main program then writes the output file (see 2.1) as well as a gnuplot file used by the local QUICKVIEW.

¹ If system calls become necessary, they have to conform to the POSIX [4] standard.
² In the case of MPI this is done by compiling the main program with the -DUSE_MPI option.
³ The internal problem size might be the same as the external one in the case of a matrix multiply, but it could also be scaled by a certain factor.

3.3. The Webserver

The BenchIT web interface [6] complements the BenchIT project by making it possible to plot the results of the measuring kernels and compare them directly. It is the unique step in the project and allows access to all measurement data with just an internet browser.

Specification

The webserver manages the output-files (see 2.1) uploaded by the registered users. They are held as ASCII files as well as entries in a PostgreSQL database. The PHP web pages use the database to assemble a plot and then write instructions for gnuplot [5], which produces an EPS file that can be downloaded directly (as done in figure 3). Additionally, a JPEG image is created and displayed on the website. It is specified that all kinds of measuring data can be displayed in one graph. The only limitation is that the data in one graph may use at most two different units (e.g. FLOPS, seconds, or a number of hits or misses), since gnuplot can display at most two different y-axes.

Another important question is how the plots are assembled and how the user can customize them. The BenchIT team has so far implemented two strategies:

Selection by architectural characteristics: The first possibility is to compare different values of one architectural feature.
It is possible to show the sensitivity of the results of the measuring kernels to the physical size of one architectural feature. This way it is possible to look for specific performance data for a given architectural feature and compare it to other architectures.

Selection by the measuring kernel: The second possibility compares different architectural characteristics, all measured by just one measuring kernel. It can be considered the expressway to adapting the plot, since a plot can be customized in just three steps.

The construction of the BenchIT web interface

The BenchIT web interface consists of two parts: an open and a restricted section. The measurement data is only accessible after registering on the website. This is also a security matter, since it thereby remains traceable who uploaded which output-file. At the moment only registered users can download the measurement program, because BenchIT is still under development. New accounts are first locked automatically and then unlocked by the web interface administrators. All output-files uploaded to the webserver are backed up daily, ensuring the availability of the data. Additional security measures are implemented, so that data classified as non-disclosure can be uploaded and viewed only by one user or a group of users.

Figure 3: The graph for a matrix multiplication. (Plot "Matrix Multiply": Flops, up to 4.5e+08, over matrix size for the loop orders ijk, ikj, jik, jki, kij, and kji.)

⁴ The n is set by the compiler option -DERROR_CORRECTION=n.

4. First Results of the Project

The project has been running for one year now and most of the immediate goals have been achieved. The measurement (as shown in figure 4) is so flexible that adapting it to a new platform is a matter of filling out one configuration file. The kernels run on all platforms that provide the necessary compilers and libraries. The webserver is well capable of administering and plotting the files. It has been deliberately designed to work without JavaScript to allow the greatest browser compatibility. After first attempts without a database to support the server in managing the resultfiles for plotting, it was decided that a database for the arrangement of the plots is necessary to achieve acceptable response times on the website.

    Guido@bluerabbit ~/benchit/src/kernel/matmul_c
    $ ./SUBDIREXEC.SH
    No definitions for your operating system found in ARCHDEFS.
    You will have to set them manually in your LOCALDEFS.
    BenchIT will not run without at least one set of definitions
    No definitions for your operating system found in ARCHDEFS.
    You will have to set them manually in your LOCALDEFS.
    BenchIT will not run without at least one set of definitions
    Warning: the variable 'ENVIRONMENT' is not set using NOTHING as default
    BenchIT: Getting info about kernel [ OK ]
    BenchIT: Getting starting time [ OK ]
    BenchIT: Selected kernel: Matrix Multiply
    BenchIT: Initializing kernel [ OK ]
    BenchIT: Allocating memory for results [ OK ]
    BenchIT: Measuring..
    BenchIT: Total time limit reached. Stopping measurement.
    BenchIT: Analyzing results [ OK ]
    BenchIT: Writing resultfile [ OK ]
    BenchIT: Wrote output to "matmul_c0_amk7_1g33_2003_08_15 15_55.bit"
    BenchIT: Writing quickview file [ OK ]
    BenchIT: Finishing [ OK ]
    rm: cannot unlink `matmul_c': No such file or directory
    Guido@bluerabbit ~/benchit/src/kernel/matmul_c
    $

Figure 4: Output of one measurement run

5. Summary and Outlook

The BenchIT kernels generate a large amount of measurement results, depending on the number of functional arguments. Using the web interface, the user can show selected results of different measuring programs in a single coordinate system. Often there are different causes for characteristic minima, maxima, or a special shape in a graph. It is necessary to collect additional information about the tested system to explain such effects on the basis of well-known system properties and physical values of the implementation. The BenchIT project wants to provide such an evaluation platform by offering a variety of measurement kernels as well as an easily accessible plotting engine, thus enabling an easy way to measure performance on a specific system and compare the result, which is a full graph instead of just a number, to results contributed by other users.

The further development of the BenchIT project will take place on all module layers. A GUI for the configuration of the measurements is under development; it will provide an easier way to handle the measurements by partially replacing the shell scripts that have run the measurements up to this point. The power of the PCL will be utilized to access more measurement data. Furthermore, an additional way to plot the data on the website using Java applets and Java graphing tools is planned. The BenchIT project will not merely be just another tool in the art of performance analysis, yet it will have to prove to be a very powerful one.

REFERENCES

[1] Raj Jain: The Art of Computer Systems Performance Analysis. John Wiley, Chichester
[2] Standard Performance Evaluation Corporation (SPEC):
[3] LINPACK:
[4] IEEE POSIX:
[5] Gnuplot:
[6] The BenchIT Webserver:


More information

Basic Shell Commands. Bok, Jong Soon

Basic Shell Commands. Bok, Jong Soon Basic Shell Commands Bok, Jong Soon javaexpert@nate.com www.javaexpert.co.kr Focusing on Linux Commands These days, many important tasks in Linux can be done from both graphical interfaces and from commands.

More information

Hardware-Efficient Parallelized Optimization with COMSOL Multiphysics and MATLAB

Hardware-Efficient Parallelized Optimization with COMSOL Multiphysics and MATLAB Hardware-Efficient Parallelized Optimization with COMSOL Multiphysics and MATLAB Frommelt Thomas* and Gutser Raphael SGL Carbon GmbH *Corresponding author: Werner-von-Siemens Straße 18, 86405 Meitingen,

More information

Introduction to Computer Systems: Semester 1 Computer Architecture

Introduction to Computer Systems: Semester 1 Computer Architecture Introduction to Computer Systems: Semester 1 Computer Architecture Fall 2003 William J. Taffe using modified lecture slides of Randal E. Bryant Topics: Theme Five great realities of computer systems How

More information

EXPERIMENT 1. FAMILIARITY WITH DEBUG, x86 REGISTERS and MACHINE INSTRUCTIONS

EXPERIMENT 1. FAMILIARITY WITH DEBUG, x86 REGISTERS and MACHINE INSTRUCTIONS EXPERIMENT 1 FAMILIARITY WITH DEBUG, x86 REGISTERS and MACHINE INSTRUCTIONS Pre-lab: This lab introduces you to a software tool known as DEBUG. Before the lab session, read the first two sections of chapter

More information

ELEC4042 Signal Processing 2 MATLAB Review (prepared by A/Prof Ambikairajah)

ELEC4042 Signal Processing 2 MATLAB Review (prepared by A/Prof Ambikairajah) Introduction ELEC4042 Signal Processing 2 MATLAB Review (prepared by A/Prof Ambikairajah) MATLAB is a powerful mathematical language that is used in most engineering companies today. Its strength lies

More information

Introduction to Computing Systems - Scientific Computing's Perspective. Le Yan LSU

Introduction to Computing Systems - Scientific Computing's Perspective. Le Yan LSU Introduction to Computing Systems - Scientific Computing's Perspective Le Yan HPC @ LSU 5/28/2017 LONI Scientific Computing Boot Camp 2018 Why We Are Here For researchers, understand how your instrument

More information

Introduction to Computer Systems

Introduction to Computer Systems CSCE 230J Computer Organization Introduction to Computer Systems Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due Most of slides for

More information

Introduction to Computer Systems

Introduction to Computer Systems CSCE 230J Computer Organization Introduction to Computer Systems Dr. Steve Goddard goddard@cse.unl.edu Giving credit where credit is due Most of slides for this lecture are based on slides created by Drs.

More information

INTRODUCTION TO COMPUTERS KANNAN TUITION CENTER. CHAPTER: 2 NUMBER SYSTEMS

INTRODUCTION TO COMPUTERS KANNAN TUITION CENTER.  CHAPTER: 2 NUMBER SYSTEMS CHAPTER: 1 TWO MARKS QUESTIONS. 1. What are peripheral devices? 2. What do you mean by an algorithm? 3. What is a word processor software? 4. What is analog computing system? 5. What is laptop computer?

More information

An MPI failure detector over PMPI 1

An MPI failure detector over PMPI 1 An MPI failure detector over PMPI 1 Donghoon Kim Department of Computer Science, North Carolina State University Raleigh, NC, USA Email : {dkim2}@ncsu.edu Abstract Fault Detectors are valuable services

More information

High-Performance Scientific Computing

High-Performance Scientific Computing High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

COS 318: Operating Systems. File Systems. Topics. Evolved Data Center Storage Hierarchy. Traditional Data Center Storage Hierarchy

COS 318: Operating Systems. File Systems. Topics. Evolved Data Center Storage Hierarchy. Traditional Data Center Storage Hierarchy Topics COS 318: Operating Systems File Systems hierarchy File system abstraction File system operations File system protection 2 Traditional Data Center Hierarchy Evolved Data Center Hierarchy Clients

More information

DC57 COMPUTER ORGANIZATION JUNE 2013

DC57 COMPUTER ORGANIZATION JUNE 2013 Q2 (a) How do various factors like Hardware design, Instruction set, Compiler related to the performance of a computer? The most important measure of a computer is how quickly it can execute programs.

More information

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Cache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010

Cache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010 Cache Memories EL21 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 21 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of

More information

Performance Analysis of Parallel Scientific Applications In Eclipse

Performance Analysis of Parallel Scientific Applications In Eclipse Performance Analysis of Parallel Scientific Applications In Eclipse EclipseCon 2015 Wyatt Spear, University of Oregon wspear@cs.uoregon.edu Supercomputing Big systems solving big problems Performance gains

More information

Computer Principles and Components 1

Computer Principles and Components 1 Computer Principles and Components 1 Course Map This module provides an overview of the hardware and software environment being used throughout the course. Introduction Computer Principles and Components

More information

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

Programming Assignment 1

Programming Assignment 1 CMSC 417 Computer Networks Spring 2017 Programming Assignment 1 Assigned: February 3 Due: February 10, 11:59:59 PM. 1 Description In this assignment, you will write a UDP client and server to run a simplified

More information

The Role of Performance

The Role of Performance Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based

More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

HPCC Results. Nathan Wichmann Benchmark Engineer

HPCC Results. Nathan Wichmann Benchmark Engineer HPCC Results Nathan Wichmann Benchmark Engineer Outline What is HPCC? Results Comparing current machines Conclusions May 04 2 HPCChallenge Project Goals To examine the performance of HPC architectures

More information

Systems I. Optimizing for the Memory Hierarchy. Topics Impact of caches on performance Memory hierarchy considerations

Systems I. Optimizing for the Memory Hierarchy. Topics Impact of caches on performance Memory hierarchy considerations Systems I Optimizing for the Memory Hierarchy Topics Impact of caches on performance Memory hierarchy considerations Cache Performance Metrics Miss Rate Fraction of memory references not found in cache

More information

Cache Memories /18-213/15-513: Introduction to Computer Systems 12 th Lecture, October 5, Today s Instructor: Phil Gibbons

Cache Memories /18-213/15-513: Introduction to Computer Systems 12 th Lecture, October 5, Today s Instructor: Phil Gibbons Cache Memories 15-213/18-213/15-513: Introduction to Computer Systems 12 th Lecture, October 5, 2017 Today s Instructor: Phil Gibbons 1 Today Cache memory organization and operation Performance impact

More information

Storage and File System

Storage and File System COS 318: Operating Systems Storage and File System Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics Storage hierarchy File

More information

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation Why are threads useful? How does one use POSIX pthreads? Michael Swift 1 2 What s in a process? Organizing a Process A process

More information

CS342 - Spring 2019 Project #3 Synchronization and Deadlocks

CS342 - Spring 2019 Project #3 Synchronization and Deadlocks CS342 - Spring 2019 Project #3 Synchronization and Deadlocks Assigned: April 2, 2019. Due date: April 21, 2019, 23:55. Objectives Practice multi-threaded programming. Practice synchronization: mutex and

More information

Concurrency, Thread. Dongkun Shin, SKKU

Concurrency, Thread. Dongkun Shin, SKKU Concurrency, Thread 1 Thread Classic view a single point of execution within a program a single PC where instructions are being fetched from and executed), Multi-threaded program Has more than one point

More information

Performance Analysis of KDD Applications using Hardware Event Counters. CAP Theme 2.

Performance Analysis of KDD Applications using Hardware Event Counters. CAP Theme 2. Performance Analysis of KDD Applications using Hardware Event Counters CAP Theme 2 http://cap.anu.edu.au/cap/projects/kddmemperf/ Peter Christen and Adam Czezowski Peter.Christen@anu.edu.au Adam.Czezowski@anu.edu.au

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 2

ECE 571 Advanced Microprocessor-Based Design Lecture 2 ECE 571 Advanced Microprocessor-Based Design Lecture 2 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 January 2016 Announcements HW#1 will be posted tomorrow I am handing out

More information

CSc 10200! Introduction to Computing. Lecture 1 Edgardo Molina Fall 2013 City College of New York

CSc 10200! Introduction to Computing. Lecture 1 Edgardo Molina Fall 2013 City College of New York CSc 10200! Introduction to Computing Lecture 1 Edgardo Molina Fall 2013 City College of New York 1 Introduction to Computing Lectures: Tuesday and Thursday s (2-2:50 pm) Location: NAC 1/202 Recitation:

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

Performance and Energy Efficiency of the 14 th Generation Dell PowerEdge Servers

Performance and Energy Efficiency of the 14 th Generation Dell PowerEdge Servers Performance and Energy Efficiency of the 14 th Generation Dell PowerEdge Servers This white paper details the performance improvements of Dell PowerEdge servers with the Intel Xeon Processor Scalable CPU

More information

This lecture is covered in Section 4.1 of the textbook.

This lecture is covered in Section 4.1 of the textbook. This lecture is covered in Section 4.1 of the textbook. A Unix process s address space appears to be three regions of memory: a read-only text region (containing executable code); a read-write region consisting

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

A Fast Review of C Essentials Part I

A Fast Review of C Essentials Part I A Fast Review of C Essentials Part I Structural Programming by Z. Cihan TAYSI Outline Program development C Essentials Functions Variables & constants Names Formatting Comments Preprocessor Data types

More information

Introduction to Computer Programming in Python Dr. William C. Bulko. Data Types

Introduction to Computer Programming in Python Dr. William C. Bulko. Data Types Introduction to Computer Programming in Python Dr William C Bulko Data Types 2017 What is a data type? A data type is the kind of value represented by a constant or stored by a variable So far, you have

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Operating Systems 2 nd semester 2016/2017. Chapter 4: Threads

Operating Systems 2 nd semester 2016/2017. Chapter 4: Threads Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition

More information

Abstract 1. Introduction

Abstract 1. Introduction Jaguar: A Distributed Computing Environment Based on Java Sheng-De Wang and Wei-Shen Wang Department of Electrical Engineering National Taiwan University Taipei, Taiwan Abstract As the development of network

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Mar 12 Parallelism and Shared Memory Hierarchy I Rutgers University Review: Classical Three-pass Compiler Front End IR Middle End IR

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

An Overview of the BLITZ System

An Overview of the BLITZ System An Overview of the BLITZ System Harry H. Porter III Department of Computer Science Portland State University Introduction The BLITZ System is a collection of software designed to support a university-level

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Cache Memories. Cache Memories Oct. 10, Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory

Cache Memories. Cache Memories Oct. 10, Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory 5-23 The course that gies CMU its Zip! Topics Cache Memories Oct., 22! Generic cache memory organization! Direct mapped caches! Set associatie caches! Impact of caches on performance Cache Memories Cache

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

A Brief Description of the NMP ISA and Benchmarks

A Brief Description of the NMP ISA and Benchmarks Report No. UIUCDCS-R-2005-2633 UILU-ENG-2005-1823 A Brief Description of the NMP ISA and Benchmarks by Mingliang Wei, Marc Snir, Josep Torrellas, and R. Brett Tremaine February 2005 A Brief Description

More information

ν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines.

ν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines. Topics CISC 36 Cache Memories Dec, 29 ν Generic cache memory organization ν Direct mapped caches ν Set associatie caches ν Impact of caches on performance Cache Memories Cache memories are small, fast

More information

CS 326: Operating Systems. Process Execution. Lecture 5

CS 326: Operating Systems. Process Execution. Lecture 5 CS 326: Operating Systems Process Execution Lecture 5 Today s Schedule Process Creation Threads Limited Direct Execution Basic Scheduling 2/5/18 CS 326: Operating Systems 2 Today s Schedule Process Creation

More information

Cache memories The course that gives CMU its Zip! Cache Memories Oct 11, General organization of a cache memory

Cache memories The course that gives CMU its Zip! Cache Memories Oct 11, General organization of a cache memory 5-23 The course that gies CMU its Zip! Cache Memories Oct, 2 Topics Generic cache memory organization Direct mapped caches Set associatie caches Impact of caches on performance Cache memories Cache memories

More information

Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview

Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview Stefano Cozzini CNR/INFM Democritos and SISSA/eLab cozzini@democritos.it Agenda Tools for

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

211: Computer Architecture Summer 2016

211: Computer Architecture Summer 2016 211: Computer Architecture Summer 2016 Liu Liu Topic: Storage Project3 Digital Logic - Storage: Recap - Direct - Mapping - Fully Associated - 2-way Associated - Cache Friendly Code Rutgers University Liu

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Outline. How Fast is -fast? Performance Analysis of KKD Applications using Hardware Performance Counters on UltraSPARC-III

Outline. How Fast is -fast? Performance Analysis of KKD Applications using Hardware Performance Counters on UltraSPARC-III Outline How Fast is -fast? Performance Analysis of KKD Applications using Hardware Performance Counters on UltraSPARC-III Peter Christen and Adam Czezowski CAP Research Group Department of Computer Science,

More information