For-loop equivalents and massive parallelization in CasADi
|
|
- Joshua Boone
- 6 years ago
- Views:
Transcription
1 For-loop equivalents and massive parallelization in CasADi Citius, Altius, Fortius Joris Gillis (MECO group, KU Leuven)
2 Outline Key ingredients of CasADi New concepts: Map, MapAccum Massive parallelization Software architecture
3 Core idea Computer-aided formulation of optimization problems symbolic primitive expression x = optivar() a = optivar() optimize(x^2+a^2, [ x+a^2>= 0 ]) 2 minimize x +a x,a subject to 2 2 x +a 0 Obtain gradients, Jacobians, Hessians automatically
4 Basic example Find initial condition to achieve a goal x =a x +cos x 2 minimize x N x0 x0 = optivar() a = optivar() dt = 1 x1 x2 x3 xn = x0 + dt*(a*x0 + cos(x0)) = x1 + dt*(a*x1 + cos(x1)) = x2 + dt*(a*x2 + cos(x2)) = optimize(xn**2)
5 Basic example Find initial condition to achieve a goal x =a x +cos x 2 minimize x N x0 x0 = optivar() a = optivar() dt = 1 x = x0 for i in range(n): x = x + dt*(a*x + cos(x)) xn = x optimize(xn**2)
6 Key ingredient: efficient symbolics Matlab symbolic toolbox (mupad) syms('x0') syms('a') dt = 1; x = x0; for i=1:n x = x+dt*(a*x+cos(x)); end
7 Key ingredient: efficient symbolics Matlab: yalmip x0 = sdpvar(1,1) a = sdpvar(1,1) dt = 1; x = x0; for i=1:n x = x+dt*(a*x+cos(x)); end
8 Key ingredient: efficient symbolics Python: sympy x0 = symbols('x0') a = symbols('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x))
9 Key ingredient: efficient symbolics Python: casadi x0 = MX.sym('x0') a = MX.sym('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x))
10 Key ingredient: efficient symbolics Construction time [s] x0 = MX.sym('x0') a = MX.sym('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x)) N
11 Key ingredient: efficient symbolics x0 = optivar() a = optivar() dt = 1 x0 + dt*(a*x0 + cos(x0)) x1 x2 x3 xn = x0 + dt*(a*x0 + cos(x0)) = x1 + dt*(a*x1 + cos(x1)) = x2 + dt*(a*x2 + cos(x2)) = x2 = x1 + dt*(a*x1 + cos(x1)) x2 = (x0 + dt*(a*x0 + cos(x0))) + dt*(a*(x0 + dt*(a*x0 + cos(x0))) + cos(x0 + dt*(a*x0 + cos(x0)))) Expression length ~ 3^N
12 Key ingredient: efficient symbolics x0 a cos graph representation x0 = MX.sym('x0') a = MX.sym('a') dt = 1 * * x = x0 cos * + + for i in range(n): x = x+dt*(a*x+cos(x)) * cos * #nodes: O(N) + + *
13 Key ingredient: algorithmic differentiation Cost of a (forward/reverse) derivative of: small multiple of original 2 minimize x N x0 d xn d x0
14 Key ingredient: algorithmic differentiation Construction time [s] x0 = MX.sym('x0') a = MX.sym('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x)) jacobian(x,x0) N d xn d x0
15 Key ingredient: matrix-valued graphs x = A x+cos x x ℝ n x = MX.sym('x',2,2) A = MX.sym('A',2) x = SX.sym('x',2,2) A = SX.sym('A',2,2) mul(a,x) mul(a,x) A x mul Scalar expansion ~ loop unrolling * a00 a01 x0 * a10 a11 * * + x1 +
16 Key ingredient: function embedding x0 a cos x0 = MX.sym('x0') a = MX.sym('a') dt = 1 * x = x0 * cos * + + for i in range(n): x = x+dt*(a*x+cos(x)) * cos * + + *
17 Key ingredient: function embedding x0 a x0 = MX.sym('x0') a = MX.sym('a') dt = 1 F x = x0 for i in range(n): x = F(x,a) F F
18 Key ingredient: function embedding x0 a x0 = MX.sym('x0') a = MX.sym('a') dt = 1 F x = x0 for i in range(n): x = F(x,a) F x F x = MX.sym('x') a = MX.sym('a') dt = 1 e = x+dt*(a*x+cos(x)) F = Function( F,[x,a],[e]) a F cos * *
19 Key ingredient: function embedding rootfinder linsol integrator...
20 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: Map, MapAccum Massive parallelization Software architecture
21 Towards large scale optimal control 2 x k +1= F ( x k,uk ) x0 minimize x N x 0, u0, u1... u N 1 U u x0 = MX.sym('x0') U = MX.sym('u',1,N) dt = 1 F w = horzsplit(u) x = x0 F for i in range(n): x = F(x,w[i]) x F MX graph (optimized for memory footprint) a F cos * 1 SX graph (optimized for speed) + + *
22 Towards large scale optimal control x0 U u... F F F N = , ,...
23 Towards large scale optimal control Not fast enough! Construction time [s] Painfully slow Lightning fast N = , ,...
24 New node: MapAccum x0 U u... F x0 U MapAccum(F,N) (x1) F x1 x2 x3 (x2) F (x3)
25 Practical example: sys id minimize x =v 3 v =u k x c v d x 2 x i x i 2 x 0, k, c, d... x k +1= F ( x k,uk, p) u_data x u params x0 F x_next MapAccum(F,N) k,c,d x1 x2 x3 Function( F,[x,u,vertcat([k,c,d])],[...])
26 Practical example: sys id minimize 2 x i x i 2 x 0, k, c, d x =v 3 v =u k x c v d x R = F.mapaccum("R", N) X_symbolic = R(x0, u_data, repmat(params,1,n)) e = y_data-x_symbolic[0,:].t; Measured data Known signals
27 Practical example: sys id Construction time [s] for i in range(n): X = F(X,u_data[:,i],params) R = F.mapaccum("R", N) N
28 Construction time evaluation time Up-front cost Numerical evaluation (iterations of nlp)
29 Construction time evaluation time So MapAccum is useless to speed up online computations? Evaluation time [s] Numerical evaluation (iterations of nlp)
30 Evaluation time [s] Code-generation
31 Code-generation First you need to run the compiler... Compilation time [s] Compilation memory [B]
32 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: MapAccum 2 orders of magnitude faster construction compile 2 orders of magnitude larger problems Software architecture
33 MapAccum ~ single shooting 2 minimize x N x 0, u0, u1... u N 1 x0 U u... F x0 U MapAccum(F,N) (x1) F x1 x2 x3 (x2) F (x3)
34 Map ~ multiple shooting 2 minimize x N x 0, x 1,... x N,u0, u1... u N 1 X U u x U Map(F,N) F x1 x2 x3 F... x1 x2 x3 F
35 Map ~ multiple shooting 2 minimize x N x 0, x 1,... x N,u0, u1... u N 1 s.t. F is a function that: - reads a few inputs - computes a lot (e.g. time integration) - writes a few outputs x 1 F ( x 0 ; u0 )=0 x 2 F ( x 1 ;u1 )=0... Ideal for parallelism: - openmp support builtin - opencl GPU (proof-of-concept) x N F (x N 1 ; u N 1 )=0 R = F.map("R","serial",N) X_n = R(X, u_data, repmat(params,1,n)) gaps = X[:,1:]-X_n[:,:-1]
36 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: Map, MapAccum Speedups Massive parallelization Software architecture
37 Operation on an image D f (D i, j, X ) i, j i j
38 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, serial,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)
39 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, openmp,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)
40 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, opencl,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)
41 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, opencl,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)
42 What is OpenCL? C library Compiler for computation kernels Low-level memory concepts Write once, run on CPU GPU FPGA
43 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) Init: Allocate buffer for x on GPU Allocate buffer for D on GPU Allocate buffer for result on GPU Compile GPU kernel Eval: Send x R = F.map( mymap, opencl,1024*768) Send D Compute kernel Retrieve sol Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)
44 Kernelsum node f (D i, j, X ) i, j Init: Allocate buffer for x on GPU Allocate buffer for D on GPU Allocate buffer for result on GPU Send D Compile GPU kernel Eval: Send x Compute kernel Sum result Retrieve sum
45 Kernelsum node For Flanders Make: x speedup on a Titan Black GPU (2000 cores)
46 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: Map, MapAccum Massive parallelization Software architecture
47 Software architecture: interfaces Python interface source casadipython_wrap.cxx C++ source casadi.cpp casadi.hpp SWIG interface header _casadi.so casadi.i Matlab interface source casadimatlab_wrap.cxx libcasadi.so casadimatlab_wrap.mexa64
48 Software architecture: before plugins C++ source casadi.cpp casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp libipopt.so (LGPL) libsnopt.so (commercial) libcasadi.so
49 Software architecture: before plugins C++ source casadi.cpp casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp Mr. Cheapskate from casadi import * Error: symbol snopt not found Mrs. Spender from casadi import * nlpsol( solver, snopt,nlp) libipopt.so (LGPL) libsnopt.so (commercial) libcasadi.so
50 Software architecture: before plugins C++ source casadi.cpp Mr. Cheapskate from casadi import * nlpsol( solver, ipopt,nlp) casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp Mrs. Spender from casadi import * nlpsol( solver, snopt,nlp) Error: solver snopt not compiled libipopt.so (LGPL) libsnopt.so (commercial) libcasadi.so
51 Software architecture: plugins from casadi import * C++ source nlpsol( ipopt,nlp) casadi.cpp Dynamically load (dlopen) libcasadi_nlpsolver_ipopt.so casadi.hpp nlp_solver.cpp from casadi import * ipopt_interface.cpp snopt_interface.cpp nlpsol( snopt,nlp) Dynamically load (dlopen) libcasadi_nlpsolver_snopt.so libcasadi_nlpsol_ipopt.so libcasadi_nlpsol_snopt.so libcasadi.so libipopt.so (LGPL) libsnopt.so (commercial)
52 Software architecture: version control C++ source casadi.cpp casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp Github
53 Software architecture: version control I added a flag here
54 Software architecture: continuous-integration Every change (commit) is unittested by travis-ci cpu-time Memory checks Green = good Documentation
55 Software architecture: continuous-integration
56 Software architecture: continuous-integration Both appveyor and travis are free for open-source projects Encryption support can test commercial plugins/interfaces e.g. Matlab
57 Software architecture: binaries Linux buildslaves g++ x86_64-w64-mingw32-g++
58 Software architecture: binaries (python) Binary = zip folder e.g. casadi-py27-np1.9.1-v3.0.0.tar.gz /home/me/software - casadi-py27-np1.9.1-v casadi - include - lib - casadi-py27-np1.9.1-v casadi - include - lib import sys sys.path.append( /home/me/software/casadi-py27-np1.9.1-v3.0.0 ) import casadi
59 Software architecture: binaries (matlab) Binary = zip folder e.g. casadi-matlabr2014b-v2.4.0.zip /home/me/software - casadi-matlabr2014b-v casadi - include - lib - casadi-matlabr2014b-v casadi - include - lib addpath('/home/me/software/casadi-matlabr2014b-v3.0.0') import casadi.*
60 SX, MX, Function, AD, JIT nlp, qpsol,... Ipopt plugin, Snopt plugin, OOQP plugin, Ecos plugin, Mosek plugin,. Developers: Joel Andersson, Joris Gillis Greg Horn, Niels van Duijkeren
61 Go write your dynamic optimization tool...
62 e.g. Optoy Go write your dynamic optimization tool...
63 Optoy Go write your dynamic optimization tool...
64 Go write your dynamic optimization tool...
65 Go write your dynamic optimization tool... Prototype-style speed of development State-of-the-art performance
66 Thanks for your attention Let's do some coding...
CasADi tutorial Introduction
Lund, 6 December 2011 CasADi tutorial Introduction Joel Andersson Department of Electrical Engineering (ESAT-SCD) & Optimization in Engineering Center (OPTEC) Katholieke Universiteit Leuven OPTEC (ESAT
More informationUser Documentation for CasADi v a1d1a5d78. Joel Andersson Joris Gillis Moritz Diehl
User Documentation for CasADi v3.3.0-194.a1d1a5d78 Joel Andersson Joris Gillis Moritz Diehl February 11, 2018 2 Contents 1 Introduction 5 1.1 What CasADi is and what it is not...................... 5 1.2
More informationautomatic differentiation techniques used in jump
automatic differentiation techniques used in jump Miles Lubin, Iain Dunning, and Joey Huchette June 22, 2016 MIT Operations Research Center 1 / 36 - Solver-independent, fast, extensible, open-source algebraic
More informationSolving Mixed-Integer Linear Programs with MATLAB
Solving Mixed-Integer Linear Programs with MATLAB Bowen Hua Department of Electrical and Computer Engineering The University of Texas at Austin November 2018 Outline Install MATLAB and YALMIP Example problem
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationOutline 19/3/ Introduction. 2. Current Status of qpoases. 3. Future Developments
Hans Joachim Ferreau, ABB Corporate Research, Switzerland Past, Present, Future Outline 1. Introduction 2. Current Status of 3. Future Developments Slide 2 1 Outline 1. Introduction 2. Current Status of
More informationDeep learning in MATLAB From Concept to CUDA Code
Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,
More informationUser Documentation for CasADi v Joel Andersson Joris Gillis Moritz Diehl
User Documentation for CasADi v2.0.0 Joel Andersson Joris Gillis Moritz Diehl July 25, 2014 2 Contents 1 Introduction 5 1.1 What CasADi is and what it is not...................... 5 1.2 Help and support................................
More informationLDPC Simulation With CUDA GPU
LDPC Simulation With CUDA GPU EE179 Final Project Kangping Hu June 3 rd 2014 1 1. Introduction This project is about simulating the performance of binary Low-Density-Parity-Check-Matrix (LDPC) Code with
More informationOn efficient Hessian computation using Edge Pushing algorithm in Julia
On efficient Hessian computation using Edge Pushing algorithm in Julia Feng Qiang Argonne National Laboratory Joint work with Cosmin G. Petra (LLNL), Joey Huchette (MIT), Miles Lubin (MIT), Mihai Anitescu
More informationMatCL - OpenCL MATLAB Interface
MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface Slide 1 MatCL - OpenCL MATLAB Interface OpenCL toolkit for Mathworks MATLAB/SIMULINK Compile & Run OpenCL Kernels Handles OpenCL memory management
More informationCODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS
CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS Dániel Berényi Wigner RCP, GPU Laboratory, Budapest, Hungary Perspectives of GPU Computing in Physics and Astrophysics Rome 2014. INTRODUCTION The most
More informationACADO Code Generation tool
ACADO Code Generation tool Rien Quirynen July 31, 2015 Introduction Automatic Code Generation Real-Time Iterations Application examples ACADO demo Outline Introduction Automatic Code Generation Real-Time
More informationEmbedded Convex Optimization with CVXPY
Embedded Convex Optimization with CVXPY Nicholas Moehle, Jacob Mattingley, Stephen Boyd Stanford University February 2018 Outline Convex optimization Embedded convex optimization DSLs for embedded convex
More informationSerial and Parallel Sobel Filtering for multimedia applications
Serial and Parallel Sobel Filtering for multimedia applications Gunay Abdullayeva Institute of Computer Science University of Tartu Email: gunay@ut.ee Abstract GSteamer contains various plugins to apply
More informationAccelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic
Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationThe Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.
The Power of Streams on the SRC MAP Wim Bohm Colorado State University RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RSRV. MAP C Pure C runs on the MAP Generated code: circuits Basic blocks in
More informationNumbaPro CUDA Python. Square matrix multiplication
NumbaPro Enables parallel programming in Python Support various entry points: Low-level (CUDA-C like) programming language High-level array oriented interface CUDA library bindings Also support multicore
More informationIntroduction to OpenCL!
Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model
More informationParalution & ViennaCL
Paralution & ViennaCL Clemens Schiffer June 12, 2014 Clemens Schiffer (Uni Graz) Paralution & ViennaCL June 12, 2014 1 / 32 Introduction Clemens Schiffer (Uni Graz) Paralution & ViennaCL June 12, 2014
More informationCS 179 Lecture 16. Logistic Regression & Parallel SGD
CS 179 Lecture 16 Logistic Regression & Parallel SGD 1 Outline logistic regression (stochastic) gradient descent parallelizing SGD for neural nets (with emphasis on Google s distributed neural net implementation)
More informationDeep Learning: Transforming Engineering and Science The MathWorks, Inc.
Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA
More informationComputational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?
Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationModeling a 4G LTE System in MATLAB
Modeling a 4G LTE System in MATLAB Part 2: Simulation acceleration Houman Zarrinkoub PhD. Signal Processing Product Manager MathWorks houmanz@mathworks.com 2011 The MathWorks, Inc. 1 Why simulation acceleration?
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationLecture 3. Introduction to Matlab
Lecture 3 Introduction to Matlab Programming Today s Lecture Matlab programming Programming environment and search path M-file scripts and functions Flow control statements Function functions Programming
More informationScientific computing platforms at PGI / JCNS
Member of the Helmholtz Association Scientific computing platforms at PGI / JCNS PGI-1 / IAS-1 Scientific Visualization Workshop Josef Heinen Outline Introduction Python distributions The SciPy stack Julia
More informationOpen Compute Stack (OpenCS) Overview. D.D. Nikolić Updated: 20 August 2018 DAE Tools Project,
Open Compute Stack (OpenCS) Overview D.D. Nikolić Updated: 20 August 2018 DAE Tools Project, http://www.daetools.com/opencs What is OpenCS? A framework for: Platform-independent model specification 1.
More informationTerminology & Basic Concepts
Terminology & Basic Concepts Language Processors The basic model of a language processor is the black box translator (or transducer) Has one input stream, one output stream, and a black box (program) that
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationGraph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.
Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal
More informationData Parallel Algorithmic Skeletons with Accelerator Support
MÜNSTER Data Parallel Algorithmic Skeletons with Accelerator Support Steffen Ernsting and Herbert Kuchen July 2, 2015 Agenda WESTFÄLISCHE MÜNSTER Data Parallel Algorithmic Skeletons with Accelerator Support
More informationGeneral Purpose GPU Programming (1) Advanced Operating Systems Lecture 14
General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels
More informationProfiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang
Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework
More informationAutomatic Code Generation TVM Stack
Automatic Code Generation TVM Stack CSE 599W Spring TVM stack is an active project by saml.cs.washington.edu and many partners in the open source community The Gap between Framework and Hardware Frameworks
More informationAccelerating System Simulations
Accelerating System Simulations 김용정부장 Senior Applications Engineer 2013 The MathWorks, Inc. 1 Why simulation acceleration? From algorithm exploration to system design Size and complexity of models increases
More informationINTEROPERABILITY WITH FMI TOOLS AND SOFTWARE COMPONENTS. Johan Åkesson
INTEROPERABILITY WITH FMI TOOLS AND SOFTWARE COMPONENTS Johan Åkesson 1 OUTLINE FMI Technology FMI tools Industrial FMI integration example THE FUNCTIONAL MOCK-UP INTERFACE Problems/needs Component development
More informationLABORATORIO DI ARCHITETTURE E PROGRAMMAZIONE DEI SISTEMI ELETTRONICI INDUSTRIALI
LABORATORIO DI ARCHITETTURE E PROGRAMMAZIONE DEI SISTEMI ELETTRONICI INDUSTRIALI Laboratory Lesson 10: CMSIS DSP Library and Functions Final Assignment Prof. Luca Benini Prof Davide
More informationBuild and Test. The COIN-OR Way Ted Ralphs. COIN forgery: Developing Open Source Tools for OR
Build and Test The COIN-OR Way Ted Ralphs COIN forgery: Developing Open Source Tools for OR Institute for Mathematics and Its Applications, Minneapolis, MN Outline 1 Build and Install 2 Unit Testing 3
More informationA Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT
A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationModel-Based Dynamic Optimization with OpenModelica and CasADi
Model-Based Dynamic Optimization with OpenModelica and CasADi Alachew Shitahun PELAB Programming Environment Lab, Dept. Computer Science Linköping University, SE-581 83 Linköping, Sweden Vitalij Ruge Mathematics
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationA Cross-Input Adaptive Framework for GPU Program Optimizations
A Cross-Input Adaptive Framework for GPU Program Optimizations Yixun Liu, Eddy Z. Zhang, Xipeng Shen Computer Science Department The College of William & Mary Outline GPU overview G-Adapt Framework Evaluation
More informationLarge-scale workflows for wave-equation based inversion in Julia
Large-scale workflows for wave-equation based inversion in Julia Philipp A. Witte, Mathias Louboutin and Felix J. Herrmann SLIM University of British Columbia Motivation Use Geophysics to understand the
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationSTRING KERNEL TESTING ACCELERATION USING MICRON S AUTOMATA PROCESSOR
STRING KERNEL TESTING ACCELERATION USING MICRON S AUTOMATA PROCESSOR Chunkun Bo 1,2, Ke Wang 1,2, Yanjun (Jane) Qi 1, Kevin Skadron 1,2 1 Department of Computer Science 2 Center for Automata Processing
More informationOptimization with Gradient and Hessian Information Calculated Using Hyper-Dual Numbers
Optimization with Gradient and Hessian Information Calculated Using Hyper-Dual Numbers Jeffrey A. Fike and Juan J. Alonso Department of Aeronautics and Astronautics, Stanford University, Stanford, CA 94305,
More informationSensitivity computation based on CasADi for dynamic optimization problems
Sensitivity computation based on CasADi for dynamic optimization problems 16 th European Workshop on Automatic Differentiation 8 9 December 2014, Jena, Germany Evgeny Lazutkin, Abebe Geletu, Siegbert Hopfgarten,
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationRed Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationGPUs and Einstein s Equations
GPUs and Einstein s Equations Tim Dewey Advisor: Dr. Manuel Tiglio AMSC Scientific Computing University of Maryland May 5, 2011 Outline 1 Project Summary 2 Evolving Einstein s Equations 3 Implementation
More informationINTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX
INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX Keith Ma ---------------------------------------- keithma@bu.edu Research Computing Services ----------- help@rcs.bu.edu Boston University ----------------------------------------------------
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationMOSEK Optimization Suite
MOSEK Optimization Suite Release 8.1.0.72 MOSEK ApS 2018 CONTENTS 1 Overview 1 2 Interfaces 5 3 Remote optimization 11 4 Contact Information 13 i ii CHAPTER ONE OVERVIEW The problem minimize 1x 1 + 2x
More informationOn Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy
On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,
More informationPLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters
PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,
More informationtrisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee
trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define
More informationOverview of ROCCC 2.0
Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment
More informationAccelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger
Accelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger The goal of my project was to develop an optimized linear system solver to shorten the
More informationHigh-Level API for GPGPU using Meta-programming
High-Level API for GPGPU using Meta-programming Joel Falcou University Paris-Sud LRI December 15, 2015 The Hardware/Software Trade-Off Single Core Era Performance Multi-Core/SIMD Era Performance Heterogenous
More informationOmpCloud: Bridging the Gap between OpenMP and Cloud Computing
OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University
More informationSDA: Software-Defined Accelerator for general-purpose big data analysis system
SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search
More informationRed Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,
More informationParallel implicit ordinary differential equation solver for cuda. Tomasz M. Kardaś
Parallel implicit ordinary differential equation solver for cuda Tomasz M. Kardaś August 11, 2014 Chapter 1 Parallel Implicit Ordinary Differential Equations Solver A simplest definition of stiffness,
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationL0 L1 L2 L3 T0 T1 T2 T3. Eth1-4. Eth1-4. Eth1-2 Eth1-2 Eth1-2 Eth Eth3-4 Eth3-4 Eth3-4 Eth3-4.
Click! N P 10.3.0.1 10.3.1.1 Eth33 Eth1-4 Eth1-4 C0 C1 Eth1-2 Eth1-2 Eth1-2 Eth1-2 10.2.0.1 10.2.1.1 10.2.2.1 10.2.3.1 Eth3-4 Eth3-4 Eth3-4 Eth3-4 L0 L1 L2 L3 Eth24-25 Eth1-24 Eth24-25 Eth24-25 Eth24-25
More informationAccelerating sequential computer vision algorithms using commodity parallel hardware
Accelerating sequential computer vision algorithms using commodity parallel hardware Platform Parallel Netherlands GPGPU-day, 28 June 2012 Jaap van de Loosdrecht NHL Centre of Expertise in Computer Vision
More informationIntroduction to Matlab
EL1150, Lecture 2 Matlab Programming Introduction to Matlab Based on lectures by F. Gustafsson, Linköping University http://www.kth.se/ees/utbildning/kurshemsidor/control/el1150 1 Today s Lecture Matlab
More informationFrom Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation Erik Schnetter, Perimeter Institute with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M.
More informationDomain Specific Languages for Convex Optimization
Domain Specific Languages for Convex Optimization Stephen Boyd joint work with E. Chu, J. Mattingley, M. Grant Electrical Engineering Department, Stanford University ROKS 2013, Leuven, 9 July 2013 1 Outline
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationOptimal Control Techniques for Dynamic Walking
Optimal Control Techniques for Dynamic Walking Optimization in Robotics & Biomechanics IWR, University of Heidelberg Presentation partly based on slides by Sebastian Sager, Moritz Diehl and Peter Riede
More informationIn-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units
Page 1 of 17 In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Niloo Ranjan Jibonananda Sanyal Joshua New Page 2 of 17 Table of Contents In-Situ Statistical Analysis
More informationA Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory
A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory Maximilian Seitzer, Michael Gruhn, Tilo Müller Friedrich Alexander Universität Erlangen-Nürnberg https://www1.cs.fau.de Introduction
More informationAccelerating image registration on GPUs
Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining
More informationEfficiency of second-order differentiation schemes by algorithmic differentiation: Case Study
Efficiency of second-order differentiation schemes by algorithmic differentiation: Case Study Thorsten Lajewski Supervisor: Johannes Lotz STCE, RWTH Aachen May 6, 2013 1 Introduction In numerical calculations
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationParallelizing Cryptography. Gordon Werner Samantha Kenyon
Parallelizing Cryptography Gordon Werner Samantha Kenyon Outline Security requirements Cryptographic Primitives Block Cipher Parallelization of current Standards AES RSA Elliptic Curve Cryptographic Attacks
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationMulti-threaded Optimization Toolbox
Multi-threaded Optimization Toolbox Release 0.3.11 Robbert Harms Feb 16, 2018 Contents 1 Introduction 1 1.1 Can MOT help me?......................................... 1 1.2 Summary...............................................
More informationOn the efficiency of the Accelerated Processing Unit for scientific computing
24 th High Performance Computing Symposium Pasadena, April 5 th 2016 On the efficiency of the Accelerated Processing Unit for scientific computing I. Said, P. Fortin, J.-L. Lamotte, R. Dolbeau, H. Calandra
More informationHow to Run an AMESim model with the RT-LAB Platform
How to Run an AMESim model with the RT-LAB Platform This document will describe how to run an AMESim model in real-time with the RT-LAB platform, or integrate it as a part of RT-LAB real-time model. 1.
More informationSubset Sum Problem Parallel Solution
Subset Sum Problem Parallel Solution Project Report Harshit Shah hrs8207@rit.edu Rochester Institute of Technology, NY, USA 1. Overview Subset sum problem is NP-complete problem which can be solved in
More informationJacket: Faster MATLAB Genomics Codes. by Chris McClanahan, GPU Engineer
Jacket: Faster MATLAB Genomics Codes by Chris McClanahan, GPU Engineer Outline Introduction to Jacket for MATLAB GFOR Comparison with GPU-PCT alternative Case Studies: Genomics Examples Matrix Types gdouble
More informationA Brief Look at Optimization
A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest
More informationACADO TOOLKIT - AUTOMATIC CONTROL AND DYNAMIC OPTIMIZATION
ACADO Toolkit - Automatic Control and Dynamic Optimization p. 1/24 ACADO TOOLKIT - AUTOMATIC CONTROL AND DYNAMIC OPTIMIZATION Moritz Diehl, Boris Houska, Hans Joachim Ferreau Optimization in Engineering
More informationMULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT
MULTI-CORE PROGRAMMING Dongrui She December 9, 2010 ASSIGNMENT Goal of the Assignment 1 The purpose of this assignment is to Have in-depth understanding of the architectures of real-world multi-core CPUs
More informationAccelerated Machine Learning Algorithms in Python
Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals
More informationShared Memory Consistency Models: A Tutorial
Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The
More informationFirst steps to solve MINLP problems with Muriqui Optimizer
First steps to solve MINLP problems with Muriqui Optimizer June 8, 2018 Wendel Melo College of Computer Science Federal University of Uberlandia, Brazil wendelmelo@ufu.br Marcia Fampa Institute of Mathematics
More informationExploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More information