Optimization Space Exploration of the FastFlow Parallel Skeleton Framework
Transcription
1 Optimization Space Exploration of the FastFlow Parallel Skeleton Framework
Alexander Collins, Christian Fensch, Hugh Leather
University of Edinburgh, School of Informatics
2 What We Did
- Parallel skeletons provide an easy abstraction for parallel programs
- They contain many manually tuned parameters
- Automatic tuning provides better performance
- Preliminary results that make auto-tuning faster
3 Motivating Example
9 Tuning the Example
[Figure: speedup against maximum recursion depth; legend: Human expert, Best possible]
10 What Next?
- Humans failed
- Auto-tuning won
- Can we do even better?
- Multiple parameters
- Multiple programs
- Multiple platforms
11 Optimization Space Exploration
12 Parameters Investigated
- Number of workers
- Bounded or unbounded queues
- Size of queue's buffer
- Cache alignment
- Maximum recursion depth
- Batch size
13 Speedup over a Human Expert
16 Visualising the Optimisation Space
17 Visualisation of Optimisation Space
18 Reducing the Size of the Search Space
Two methods:
- Remove useless parameters
- Exploit linear dependencies
19 What is a Useless Parameter?
20 What is a Useful Parameter?
21 Removing Useless Parameters
[Figure: average performance loss (0% to 50%) per parameter: buffertype, cachealign, buffersize, seqthresh, batchsize, numworkers]
Reduces size of search space by 6
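The transcript does not preserve how "performance loss" was defined, so the following is only a sketch under an assumed definition: the loss for a parameter is the fraction of the best observed performance that is given up when that parameter is pinned to one value and every other parameter remains free. The function name, the data layout, and the toy measurements are all hypothetical.

```python
def performance_loss(measurements, parameter, fixed_value):
    """Fraction of the best performance lost when `parameter` is pinned.

    `measurements` is a list of (config dict, execution time) pairs.
    Assumed definition, not taken from the slides: performance is the
    reciprocal of execution time, so the loss is 1 - t_best / t_best_fixed.
    Raises ValueError if no measurement uses `fixed_value`.
    """
    best_overall = min(time for _, time in measurements)
    best_fixed = min(time for config, time in measurements
                     if config[parameter] == fixed_value)
    return 1.0 - best_overall / best_fixed


# Hypothetical toy data: time depends on numworkers but not on cachealign.
toy = [
    ({"numworkers": 1, "cachealign": 64}, 8.0),
    ({"numworkers": 1, "cachealign": 128}, 8.0),
    ({"numworkers": 4, "cachealign": 64}, 2.0),
    ({"numworkers": 4, "cachealign": 128}, 2.0),
]

if __name__ == "__main__":
    print("numworkers pinned to 1:", performance_loss(toy, "numworkers", 1))
    print("cachealign pinned to 64:", performance_loss(toy, "cachealign", 64))
```

Under this assumed definition, a parameter whose loss stays near zero across programs and platforms is a candidate for removal from the search.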
22 Exploiting Linear Dependencies
25 Conclusions
- Tuning parameters is very important
- Humans are bad at tuning
- Auto-tuning is much better
- Tuning is program and platform dependent
- Have shown preliminary results that make auto-tuning faster
27 Speedup over a Human Expert
[Figure: speedup per program (aquad, cwc, dt, fibonacci, mandelbrot, matmul, nqueens, pbzip2, quicksort, swps3, Average) for each platform (desktop, phantom, scuttle, xxxii16, xxxii, Average)]
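28 Principal Components Analysis
$N = 6$, $P = 5624$
$p = (\text{batchsize}, \text{buffersize}, \text{buffertype}, \text{cachealign}, \text{numworkers}, \text{seqthresh})$
$\lambda = (0.443, 0.419, 0.204, 0.138, 0.007, 0.003)$
$e =$ [matrix not preserved in the transcript]
$\nu = (36\%, 70\%, 87\%, 99\%, 99\%, 100\%)$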
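The ν values on this slide appear to be the cumulative share of variance explained by the leading principal components; that reading is an assumption, but recomputing it from the listed eigenvalues agrees with the slide to within about a percentage point (the eigenvalues are rounded). A minimal sketch of that calculation, with hypothetical function names:

```python
# Eigenvalues as listed on the slide (rounded to three decimals).
eigenvalues = [0.443, 0.419, 0.204, 0.138, 0.007, 0.003]


def cumulative_variance(values):
    """Return the running share of total variance, as fractions in [0, 1]."""
    total = sum(values)
    running = 0.0
    shares = []
    for value in values:
        running += value
        shares.append(running / total)
    return shares


def components_needed(values, target):
    """Smallest number of leading components whose share reaches `target`."""
    for count, share in enumerate(cumulative_variance(values), start=1):
        if share >= target:
            return count
    return len(values)


if __name__ == "__main__":
    for i, share in enumerate(cumulative_variance(eigenvalues), start=1):
        print(f"first {i} component(s): {share:.1%}")
    print("components for 99%:", components_needed(eigenvalues, 0.99))
```

With these eigenvalues the first four components already account for about 99% of the variance, which is consistent with the claim that linear dependencies between parameters can be exploited to shrink the search space.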
29 Is the subset representative?
[Figure: percentage of best program performance (60% to 100%) against number of iterations, one panel per platform: desktop, phantom, scuttle, xxxii16, xxxii]
30 Programs
Program      Description
aquad        Adaptive Quadrature algorithm
cwc          Implementation of CWC, a calculus for the representation and simulation of biological systems
dt           Implementation of the C4.5 decision tree algorithm
fibonacci    Naïve recursive algorithm, without memoization, to compute Fibonacci numbers
mandelbrot   Mandelbrot fractal generator
matmul       O(n^3) nested-loops matrix multiplication
nqueens      n-queens problem solver
pbzip2       Parallel bzip2 compression
quicksort    Parallel quicksort
swps3        Smith-Waterman algorithm for gene sequence alignment
31 Platforms
(? = value lost in transcription)
Platform   Processor                Cores   Freq.     Memory   L3         L2
xxxii      4 × Intel Xeon L7555     ?       ? GHz     64GB     4 × 24MB   32 × 256KB
xxxii16    2 × Intel Xeon L7555     ?       ? GHz     64GB     2 × 24MB   ?
scuttle    AMD Phenom II X6 1055T   6       3.3GHz    8GB      1 × 6MB    ? × 512KB
phantom    Intel Xeon E5430         ?       ? GHz     8GB      None       2 × 6MB
desktop    Intel Core 2 Duo E       ?       ? GHz     3GB      None       1 × 2MB
32 Parameter Values
Parameter                  Values
numworkers                 1, ..., #cores × 1.5
buffertype                 Bounded or unbounded
buffersize                 1, 2, 4, 8, ..., 2^20
batchsize                  1, 2, 4, 8, ..., 2^20
cachealign                 64, 128 or 256 bytes
seqthresh with aquad       0.02, 0.04, 0.06, ..., 1
seqthresh with fibonacci   10, 11, 12, ..., 44
seqthresh with nqueens     3, 4, 5, ..., 15
seqthresh with quicksort   1, 2, 4, 8, ..., 2^21
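To give a feel for how large the space described by this table is, the sketch below enumerates it as a Cartesian product. The function names are hypothetical, rounding the numworkers upper bound down is an assumption, and the 6-core machine in the usage example is only an illustration; nothing here implies that the authors measured the full product.

```python
import math
from itertools import product


def powers_of_two(max_exponent):
    """Return [1, 2, 4, ..., 2**max_exponent]."""
    return [2 ** e for e in range(max_exponent + 1)]


def parameter_space(cores, seqthresh_values=None):
    """Build the value lists from the slide for a machine with `cores` cores.

    Assumption: the numworkers upper bound (cores * 1.5) is rounded down.
    `seqthresh_values` is program specific and optional.
    """
    space = {
        "numworkers": list(range(1, math.floor(cores * 1.5) + 1)),
        "buffertype": ["bounded", "unbounded"],
        "buffersize": powers_of_two(20),
        "batchsize": powers_of_two(20),
        "cachealign": [64, 128, 256],
    }
    if seqthresh_values is not None:
        space["seqthresh"] = list(seqthresh_values)
    return space


def space_size(space):
    """Number of distinct configurations in the full Cartesian product."""
    return math.prod(len(values) for values in space.values())


def configurations(space):
    """Lazily yield every configuration as a dict."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    # Hypothetical 6-core machine, seqthresh range listed for fibonacci.
    space = parameter_space(6, range(10, 45))
    print("configurations:", space_size(space))
    print("first:", next(configurations(space)))
```

Even for this small machine the product runs to several hundred thousand configurations, which is why reducing the number of parameters matters.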
33 Outlier Removal
- Arithmetic mean is not a robust statistic
- An outlier will cause many more repeats
- Impractical
- Remove using interquartile range removal: $[Q_1 - k(Q_3 - Q_1),\ Q_3 + k(Q_3 - Q_1)]$ with $k = 3$
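The interquartile range rule on this slide can be sketched in a few lines. The function name is hypothetical, and the quartile convention (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption, since the slide does not say which one was used.

```python
import statistics


def remove_outliers(samples, k=3.0):
    """Keep samples inside [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)].

    Needs at least two samples. The quartile estimation method is an
    assumption; other conventions give slightly different bounds.
    """
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in samples if low <= x <= high]


if __name__ == "__main__":
    # Hypothetical timings in seconds with one obvious outlier.
    timings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0]
    print(remove_outliers(timings))
```

With k = 3 the filter is deliberately permissive: it only drops samples that lie far outside the bulk of the measurements.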
34 Quantifying Noise
Repeats allow quantification of noise:
- Perform between 10 and 100 repeats
- Stop if coefficient of variation drops below 1% for a 99% confidence interval
- Use the arithmetic mean as an estimator of execution time
- And confidence intervals to compare execution times
35 Skeletons Provided by FastFlow
- farm
- farm-with-feedback
- pipe
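One possible reading of the stopping rule above is sketched below. It assumes the criterion means "the half-width of the 99% confidence interval of the mean, divided by the mean, is below 1%", and it uses a normal approximation where a Student's t quantile would be more exact for few samples. Function names and the `run_once` callback are hypothetical.

```python
import math
import statistics


def relative_half_width(times, confidence=0.99):
    """Half-width of the confidence interval of the mean, divided by the mean."""
    # Normal approximation (assumption); a t quantile would be slightly wider.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * statistics.stdev(times) / math.sqrt(len(times))
    return half_width / statistics.mean(times)


def measure(run_once, min_repeats=10, max_repeats=100, target=0.01):
    """Call run_once() until the timing is stable or max_repeats is reached.

    run_once must return one positive execution time.
    Returns (mean time, number of repeats performed).
    """
    times = [run_once() for _ in range(min_repeats)]
    while len(times) < max_repeats and relative_half_width(times) >= target:
        times.append(run_once())
    return statistics.mean(times), len(times)


if __name__ == "__main__":
    # Hypothetical, nearly noise-free timings: stops after the minimum repeats.
    steady = iter([1.0, 1.001, 0.999] * 40)
    print(measure(lambda: next(steady)))
```

A stable program stops after the minimum of 10 repeats, while a noisy one keeps running up to the cap of 100, which is why a single outlier is expensive if it is not filtered first.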