For-loop equivalents and massive parallelization in CasADi

Size: px
Start display at page:

Download "For-loop equivalents and massive parallelization in CasADi"

Transcription

1 For-loop equivalents and massive parallelization in CasADi Citius, Altius, Fortius Joris Gillis (MECO group, KU Leuven)

2 Outline Key ingredients of CasADi New concepts: Map, MapAccum Massive parallelization Software architecture

3 Core idea Computer-aided formulation of optimization problems symbolic primitive expression x = optivar() a = optivar() optimize(x^2+a^2, [ x+a^2>= 0 ]) 2 minimize x +a x,a subject to 2 2 x +a 0 Obtain gradients, Jacobians, Hessians automatically

4 Basic example Find initial condition to achieve a goal x =a x +cos x 2 minimize x N x0 x0 = optivar() a = optivar() dt = 1 x1 x2 x3 xn = x0 + dt*(a*x0 + cos(x0)) = x1 + dt*(a*x1 + cos(x1)) = x2 + dt*(a*x2 + cos(x2)) = optimize(xn**2)

5 Basic example Find initial condition to achieve a goal x =a x +cos x 2 minimize x N x0 x0 = optivar() a = optivar() dt = 1 x = x0 for i in range(n): x = x + dt*(a*x + cos(x)) xn = x optimize(xn**2)

6 Key ingredient: efficient symbolics Matlab symbolic toolbox (mupad) syms('x0') syms('a') dt = 1; x = x0; for i=1:n x = x+dt*(a*x+cos(x)); end

7 Key ingredient: efficient symbolics Matlab: yalmip x0 = sdpvar(1,1) a = sdpvar(1,1) dt = 1; x = x0; for i=1:n x = x+dt*(a*x+cos(x)); end

8 Key ingredient: efficient symbolics Python: sympy x0 = symbols('x0') a = symbols('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x))

9 Key ingredient: efficient symbolics Python: casadi x0 = MX.sym('x0') a = MX.sym('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x))

10 Key ingredient: efficient symbolics Construction time [s] x0 = MX.sym('x0') a = MX.sym('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x)) N

11 Key ingredient: efficient symbolics x0 = optivar() a = optivar() dt = 1 x0 + dt*(a*x0 + cos(x0)) x1 x2 x3 xn = x0 + dt*(a*x0 + cos(x0)) = x1 + dt*(a*x1 + cos(x1)) = x2 + dt*(a*x2 + cos(x2)) = x2 = x1 + dt*(a*x1 + cos(x1)) x2 = (x0 + dt*(a*x0 + cos(x0))) + dt*(a*(x0 + dt*(a*x0 + cos(x0))) + cos(x0 + dt*(a*x0 + cos(x0)))) Expression length ~ 3^N

12 Key ingredient: efficient symbolics x0 a cos graph representation x0 = MX.sym('x0') a = MX.sym('a') dt = 1 * * x = x0 cos * + + for i in range(n): x = x+dt*(a*x+cos(x)) * cos * #nodes: O(N) + + *

13 Key ingredient: algorithmic differentiation Cost of a (forward/reverse) derivative of: small multiple of original 2 minimize x N x0 d xn d x0

14 Key ingredient: algorithmic differentiation Construction time [s] x0 = MX.sym('x0') a = MX.sym('a') dt = 1 x = x0 for i in range(n): x = x+dt*(a*x+cos(x)) jacobian(x,x0) N d xn d x0

15 Key ingredient: matrix-valued graphs x = A x+cos x x ℝ n x = MX.sym('x',2,2) A = MX.sym('A',2) x = SX.sym('x',2,2) A = SX.sym('A',2,2) mul(a,x) mul(a,x) A x mul Scalar expansion ~ loop unrolling * a00 a01 x0 * a10 a11 * * + x1 +

16 Key ingredient: function embedding x0 a cos x0 = MX.sym('x0') a = MX.sym('a') dt = 1 * x = x0 * cos * + + for i in range(n): x = x+dt*(a*x+cos(x)) * cos * + + *

17 Key ingredient: function embedding x0 a x0 = MX.sym('x0') a = MX.sym('a') dt = 1 F x = x0 for i in range(n): x = F(x,a) F F

18 Key ingredient: function embedding x0 a x0 = MX.sym('x0') a = MX.sym('a') dt = 1 F x = x0 for i in range(n): x = F(x,a) F x F x = MX.sym('x') a = MX.sym('a') dt = 1 e = x+dt*(a*x+cos(x)) F = Function( F,[x,a],[e]) a F cos * *

19 Key ingredient: function embedding rootfinder linsol integrator...

20 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: Map, MapAccum Massive parallelization Software architecture

21 Towards large scale optimal control 2 x k +1= F ( x k,uk ) x0 minimize x N x 0, u0, u1... u N 1 U u x0 = MX.sym('x0') U = MX.sym('u',1,N) dt = 1 F w = horzsplit(u) x = x0 F for i in range(n): x = F(x,w[i]) x F MX graph (optimized for memory footprint) a F cos * 1 SX graph (optimized for speed) + + *

22 Towards large scale optimal control x0 U u... F F F N = , ,...

23 Towards large scale optimal control Not fast enough! Construction time [s] Painfully slow Lightning fast N = , ,...

24 New node: MapAccum x0 U u... F x0 U MapAccum(F,N) (x1) F x1 x2 x3 (x2) F (x3)

25 Practical example: sys id minimize x =v 3 v =u k x c v d x 2 x i x i 2 x 0, k, c, d... x k +1= F ( x k,uk, p) u_data x u params x0 F x_next MapAccum(F,N) k,c,d x1 x2 x3 Function( F,[x,u,vertcat([k,c,d])],[...])

26 Practical example: sys id minimize 2 x i x i 2 x 0, k, c, d x =v 3 v =u k x c v d x R = F.mapaccum("R", N) X_symbolic = R(x0, u_data, repmat(params,1,n)) e = y_data-x_symbolic[0,:].t; Measured data Known signals

27 Practical example: sys id Construction time [s] for i in range(n): X = F(X,u_data[:,i],params) R = F.mapaccum("R", N) N

28 Construction time evaluation time Up-front cost Numerical evaluation (iterations of nlp)

29 Construction time evaluation time So MapAccum is useless to speed up online computations? Evaluation time [s] Numerical evaluation (iterations of nlp)

30 Evaluation time [s] Code-generation

31 Code-generation First you need to run the compiler... Compilation time [s] Compilation memory [B]

32 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: MapAccum 2 orders of magnitude faster construction compile 2 orders of magnitude larger problems Software architecture

33 MapAccum ~ single shooting 2 minimize x N x 0, u0, u1... u N 1 x0 U u... F x0 U MapAccum(F,N) (x1) F x1 x2 x3 (x2) F (x3)

34 Map ~ multiple shooting 2 minimize x N x 0, x 1,... x N,u0, u1... u N 1 X U u x U Map(F,N) F x1 x2 x3 F... x1 x2 x3 F

35 Map ~ multiple shooting 2 minimize x N x 0, x 1,... x N,u0, u1... u N 1 s.t. F is a function that: - reads a few inputs - computes a lot (e.g. time integration) - writes a few outputs x 1 F ( x 0 ; u0 )=0 x 2 F ( x 1 ;u1 )=0... Ideal for parallelism: - openmp support builtin - opencl GPU (proof-of-concept) x N F (x N 1 ; u N 1 )=0 R = F.map("R","serial",N) X_n = R(X, u_data, repmat(params,1,n)) gaps = X[:,1:]-X_n[:,:-1]

36 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: Map, MapAccum Speedups Massive parallelization Software architecture

37 Operation on an image D f (D i, j, X ) i, j i j

38 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, serial,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)

39 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, openmp,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)

40 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, opencl,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)

41 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) R = F.map( mymap, opencl,1024*768) Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)

42 What is OpenCL? C library Compiler for computation kernels Low-level memory concepts Write once, run on CPU GPU FPGA

43 Operation on an image f (D i, j, X ) i, j X = MX.sym( x,2) D = MX.sym( D,1024,768) Init: Allocate buffer for x on GPU Allocate buffer for D on GPU Allocate buffer for result on GPU Compile GPU kernel Eval: Send x R = F.map( mymap, opencl,1024*768) Send D Compute kernel Retrieve sol Dflat = vec(d).t terms = R(Dflat,X) e = sum2(terms)

44 Kernelsum node f (D i, j, X ) i, j Init: Allocate buffer for x on GPU Allocate buffer for D on GPU Allocate buffer for result on GPU Send D Compile GPU kernel Eval: Send x Compute kernel Sum result Retrieve sum

45 Kernelsum node For Flanders Make: x speedup on a Titan Black GPU (2000 cores)

46 Outline Key ingredients of CasADi Efficient symbolics Algorithmic differentiation Matrix-valued graphs Function embedding New concepts: Map, MapAccum Massive parallelization Software architecture

47 Software architecture: interfaces Python interface source casadipython_wrap.cxx C++ source casadi.cpp casadi.hpp SWIG interface header _casadi.so casadi.i Matlab interface source casadimatlab_wrap.cxx libcasadi.so casadimatlab_wrap.mexa64

48 Software architecture: before plugins C++ source casadi.cpp casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp libipopt.so (LGPL) libsnopt.so (commercial) libcasadi.so

49 Software architecture: before plugins C++ source casadi.cpp casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp Mr. Cheapskate from casadi import * Error: symbol snopt not found Mrs. Spender from casadi import * nlpsol( solver, snopt,nlp) libipopt.so (LGPL) libsnopt.so (commercial) libcasadi.so

50 Software architecture: before plugins C++ source casadi.cpp Mr. Cheapskate from casadi import * nlpsol( solver, ipopt,nlp) casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp Mrs. Spender from casadi import * nlpsol( solver, snopt,nlp) Error: solver snopt not compiled libipopt.so (LGPL) libsnopt.so (commercial) libcasadi.so

51 Software architecture: plugins from casadi import * C++ source nlpsol( ipopt,nlp) casadi.cpp Dynamically load (dlopen) libcasadi_nlpsolver_ipopt.so casadi.hpp nlp_solver.cpp from casadi import * ipopt_interface.cpp snopt_interface.cpp nlpsol( snopt,nlp) Dynamically load (dlopen) libcasadi_nlpsolver_snopt.so libcasadi_nlpsol_ipopt.so libcasadi_nlpsol_snopt.so libcasadi.so libipopt.so (LGPL) libsnopt.so (commercial)

52 Software architecture: version control C++ source casadi.cpp casadi.hpp nlp_solver.cpp ipopt_interface.cpp snopt_interface.cpp Github

53 Software architecture: version control I added a flag here

54 Software architecture: continuous-integration Every change (commit) is unittested by travis-ci cpu-time Memory checks Green = good Documentation

55 Software architecture: continuous-integration

56 Software architecture: continuous-integration Both appveyor and travis are free for open-source projects Encryption support can test commercial plugins/interfaces e.g. Matlab

57 Software architecture: binaries Linux buildslaves g++ x86_64-w64-mingw32-g++

58 Software architecture: binaries (python) Binary = zip folder e.g. casadi-py27-np1.9.1-v3.0.0.tar.gz /home/me/software - casadi-py27-np1.9.1-v casadi - include - lib - casadi-py27-np1.9.1-v casadi - include - lib import sys sys.path.append( /home/me/software/casadi-py27-np1.9.1-v3.0.0 ) import casadi

59 Software architecture: binaries (matlab) Binary = zip folder e.g. casadi-matlabr2014b-v2.4.0.zip /home/me/software - casadi-matlabr2014b-v casadi - include - lib - casadi-matlabr2014b-v casadi - include - lib addpath('/home/me/software/casadi-matlabr2014b-v3.0.0') import casadi.*

60 SX, MX, Function, AD, JIT nlp, qpsol,... Ipopt plugin, Snopt plugin, OOQP plugin, Ecos plugin, Mosek plugin,. Developers: Joel Andersson, Joris Gillis Greg Horn, Niels van Duijkeren

61 Go write your dynamic optimization tool...

62 e.g. Optoy Go write your dynamic optimization tool...

63 Optoy Go write your dynamic optimization tool...

64 Go write your dynamic optimization tool...

65 Go write your dynamic optimization tool... Prototype-style speed of development State-of-the-art performance

66 Thanks for your attention Let's do some coding...

CasADi tutorial Introduction

CasADi tutorial Introduction Lund, 6 December 2011 CasADi tutorial Introduction Joel Andersson Department of Electrical Engineering (ESAT-SCD) & Optimization in Engineering Center (OPTEC) Katholieke Universiteit Leuven OPTEC (ESAT

More information

User Documentation for CasADi v a1d1a5d78. Joel Andersson Joris Gillis Moritz Diehl

User Documentation for CasADi v a1d1a5d78. Joel Andersson Joris Gillis Moritz Diehl User Documentation for CasADi v3.3.0-194.a1d1a5d78 Joel Andersson Joris Gillis Moritz Diehl February 11, 2018 2 Contents 1 Introduction 5 1.1 What CasADi is and what it is not...................... 5 1.2

More information

automatic differentiation techniques used in jump

automatic differentiation techniques used in jump automatic differentiation techniques used in jump Miles Lubin, Iain Dunning, and Joey Huchette June 22, 2016 MIT Operations Research Center 1 / 36 - Solver-independent, fast, extensible, open-source algebraic

More information

Solving Mixed-Integer Linear Programs with MATLAB

Solving Mixed-Integer Linear Programs with MATLAB Solving Mixed-Integer Linear Programs with MATLAB Bowen Hua Department of Electrical and Computer Engineering The University of Texas at Austin November 2018 Outline Install MATLAB and YALMIP Example problem

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

Outline 19/3/ Introduction. 2. Current Status of qpoases. 3. Future Developments

Outline 19/3/ Introduction. 2. Current Status of qpoases. 3. Future Developments Hans Joachim Ferreau, ABB Corporate Research, Switzerland Past, Present, Future Outline 1. Introduction 2. Current Status of 3. Future Developments Slide 2 1 Outline 1. Introduction 2. Current Status of

More information

Deep learning in MATLAB From Concept to CUDA Code

Deep learning in MATLAB From Concept to CUDA Code Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,

More information

User Documentation for CasADi v Joel Andersson Joris Gillis Moritz Diehl

User Documentation for CasADi v Joel Andersson Joris Gillis Moritz Diehl User Documentation for CasADi v2.0.0 Joel Andersson Joris Gillis Moritz Diehl July 25, 2014 2 Contents 1 Introduction 5 1.1 What CasADi is and what it is not...................... 5 1.2 Help and support................................

More information

LDPC Simulation With CUDA GPU

LDPC Simulation With CUDA GPU LDPC Simulation With CUDA GPU EE179 Final Project Kangping Hu June 3 rd 2014 1 1. Introduction This project is about simulating the performance of binary Low-Density-Parity-Check-Matrix (LDPC) Code with

More information

On efficient Hessian computation using Edge Pushing algorithm in Julia

On efficient Hessian computation using Edge Pushing algorithm in Julia On efficient Hessian computation using Edge Pushing algorithm in Julia Feng Qiang Argonne National Laboratory Joint work with Cosmin G. Petra (LLNL), Joey Huchette (MIT), Miles Lubin (MIT), Mihai Anitescu

More information

MatCL - OpenCL MATLAB Interface

MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface MatCL - OpenCL MATLAB Interface Slide 1 MatCL - OpenCL MATLAB Interface OpenCL toolkit for Mathworks MATLAB/SIMULINK Compile & Run OpenCL Kernels Handles OpenCL memory management

More information

CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS

CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS CODE-GENERATION FOR DIFFERENTIAL EQUATION SOLVERS Dániel Berényi Wigner RCP, GPU Laboratory, Budapest, Hungary Perspectives of GPU Computing in Physics and Astrophysics Rome 2014. INTRODUCTION The most

More information

ACADO Code Generation tool

ACADO Code Generation tool ACADO Code Generation tool Rien Quirynen July 31, 2015 Introduction Automatic Code Generation Real-Time Iterations Application examples ACADO demo Outline Introduction Automatic Code Generation Real-Time

More information

Embedded Convex Optimization with CVXPY

Embedded Convex Optimization with CVXPY Embedded Convex Optimization with CVXPY Nicholas Moehle, Jacob Mattingley, Stephen Boyd Stanford University February 2018 Outline Convex optimization Embedded convex optimization DSLs for embedded convex

More information

Serial and Parallel Sobel Filtering for multimedia applications

Serial and Parallel Sobel Filtering for multimedia applications Serial and Parallel Sobel Filtering for multimedia applications Gunay Abdullayeva Institute of Computer Science University of Tartu Email: gunay@ut.ee Abstract GSteamer contains various plugins to apply

More information

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

The Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.

The Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED. The Power of Streams on the SRC MAP Wim Bohm Colorado State University RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RSRV. MAP C Pure C runs on the MAP Generated code: circuits Basic blocks in

More information

NumbaPro CUDA Python. Square matrix multiplication

NumbaPro CUDA Python. Square matrix multiplication NumbaPro Enables parallel programming in Python Support various entry points: Low-level (CUDA-C like) programming language High-level array oriented interface CUDA library bindings Also support multicore

More information

Introduction to OpenCL!

Introduction to OpenCL! Lecture 6! Introduction to OpenCL! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! OpenCL Architecture Defined in four parts Platform Model

More information

Paralution & ViennaCL

Paralution & ViennaCL Paralution & ViennaCL Clemens Schiffer June 12, 2014 Clemens Schiffer (Uni Graz) Paralution & ViennaCL June 12, 2014 1 / 32 Introduction Clemens Schiffer (Uni Graz) Paralution & ViennaCL June 12, 2014

More information

CS 179 Lecture 16. Logistic Regression & Parallel SGD

CS 179 Lecture 16. Logistic Regression & Parallel SGD CS 179 Lecture 16 Logistic Regression & Parallel SGD 1 Outline logistic regression (stochastic) gradient descent parallelizing SGD for neural nets (with emphasis on Google s distributed neural net implementation)

More information

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

Deep Learning: Transforming Engineering and Science The MathWorks, Inc. Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA

More information

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors?

Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? Computational Graphics: Lecture 15 SpMSpM and SpMV, or, who cares about complexity when we have a thousand processors? The CVDLab Team Francesco Furiani Tue, April 3, 2014 ROMA TRE UNIVERSITÀ DEGLI STUDI

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Modeling a 4G LTE System in MATLAB

Modeling a 4G LTE System in MATLAB Modeling a 4G LTE System in MATLAB Part 2: Simulation acceleration Houman Zarrinkoub PhD. Signal Processing Product Manager MathWorks houmanz@mathworks.com 2011 The MathWorks, Inc. 1 Why simulation acceleration?

More information

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

Lecture 3. Introduction to Matlab

Lecture 3. Introduction to Matlab Lecture 3 Introduction to Matlab Programming Today s Lecture Matlab programming Programming environment and search path M-file scripts and functions Flow control statements Function functions Programming

More information

Scientific computing platforms at PGI / JCNS

Scientific computing platforms at PGI / JCNS Member of the Helmholtz Association Scientific computing platforms at PGI / JCNS PGI-1 / IAS-1 Scientific Visualization Workshop Josef Heinen Outline Introduction Python distributions The SciPy stack Julia

More information

Open Compute Stack (OpenCS) Overview. D.D. Nikolić Updated: 20 August 2018 DAE Tools Project,

Open Compute Stack (OpenCS) Overview. D.D. Nikolić Updated: 20 August 2018 DAE Tools Project, Open Compute Stack (OpenCS) Overview D.D. Nikolić Updated: 20 August 2018 DAE Tools Project, http://www.daetools.com/opencs What is OpenCS? A framework for: Platform-independent model specification 1.

More information

Terminology & Basic Concepts

Terminology & Basic Concepts Terminology & Basic Concepts Language Processors The basic model of a language processor is the black box translator (or transducer) Has one input stream, one output stream, and a black box (program) that

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

Data Parallel Algorithmic Skeletons with Accelerator Support

Data Parallel Algorithmic Skeletons with Accelerator Support MÜNSTER Data Parallel Algorithmic Skeletons with Accelerator Support Steffen Ernsting and Herbert Kuchen July 2, 2015 Agenda WESTFÄLISCHE MÜNSTER Data Parallel Algorithmic Skeletons with Accelerator Support

More information

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14 Lecture Outline Heterogenous multi-core systems and general purpose GPU programming Programming models Heterogenous multi-kernels

More information

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework

More information

Automatic Code Generation TVM Stack

Automatic Code Generation TVM Stack Automatic Code Generation TVM Stack CSE 599W Spring TVM stack is an active project by saml.cs.washington.edu and many partners in the open source community The Gap between Framework and Hardware Frameworks

More information

Accelerating System Simulations

Accelerating System Simulations Accelerating System Simulations 김용정부장 Senior Applications Engineer 2013 The MathWorks, Inc. 1 Why simulation acceleration? From algorithm exploration to system design Size and complexity of models increases

More information

INTEROPERABILITY WITH FMI TOOLS AND SOFTWARE COMPONENTS. Johan Åkesson

INTEROPERABILITY WITH FMI TOOLS AND SOFTWARE COMPONENTS. Johan Åkesson INTEROPERABILITY WITH FMI TOOLS AND SOFTWARE COMPONENTS Johan Åkesson 1 OUTLINE FMI Technology FMI tools Industrial FMI integration example THE FUNCTIONAL MOCK-UP INTERFACE Problems/needs Component development

More information

LABORATORIO DI ARCHITETTURE E PROGRAMMAZIONE DEI SISTEMI ELETTRONICI INDUSTRIALI

LABORATORIO DI ARCHITETTURE E PROGRAMMAZIONE DEI SISTEMI ELETTRONICI INDUSTRIALI LABORATORIO DI ARCHITETTURE E PROGRAMMAZIONE DEI SISTEMI ELETTRONICI INDUSTRIALI Laboratory Lesson 10: CMSIS DSP Library and Functions Final Assignment Prof. Luca Benini Prof Davide

More information

Build and Test. The COIN-OR Way Ted Ralphs. COIN forgery: Developing Open Source Tools for OR

Build and Test. The COIN-OR Way Ted Ralphs. COIN forgery: Developing Open Source Tools for OR Build and Test The COIN-OR Way Ted Ralphs COIN forgery: Developing Open Source Tools for OR Institute for Mathematics and Its Applications, Minneapolis, MN Outline 1 Build and Install 2 Unit Testing 3

More information

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Model-Based Dynamic Optimization with OpenModelica and CasADi

Model-Based Dynamic Optimization with OpenModelica and CasADi Model-Based Dynamic Optimization with OpenModelica and CasADi Alachew Shitahun PELAB Programming Environment Lab, Dept. Computer Science Linköping University, SE-581 83 Linköping, Sweden Vitalij Ruge Mathematics

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

A Cross-Input Adaptive Framework for GPU Program Optimizations

A Cross-Input Adaptive Framework for GPU Program Optimizations A Cross-Input Adaptive Framework for GPU Program Optimizations Yixun Liu, Eddy Z. Zhang, Xipeng Shen Computer Science Department The College of William & Mary Outline GPU overview G-Adapt Framework Evaluation

More information

Large-scale workflows for wave-equation based inversion in Julia

Large-scale workflows for wave-equation based inversion in Julia Large-scale workflows for wave-equation based inversion in Julia Philipp A. Witte, Mathias Louboutin and Felix J. Herrmann SLIM University of British Columbia Motivation Use Geophysics to understand the

More information

Supporting Data Parallelism in Matcloud: Final Report

Supporting Data Parallelism in Matcloud: Final Report Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

STRING KERNEL TESTING ACCELERATION USING MICRON S AUTOMATA PROCESSOR

STRING KERNEL TESTING ACCELERATION USING MICRON S AUTOMATA PROCESSOR STRING KERNEL TESTING ACCELERATION USING MICRON S AUTOMATA PROCESSOR Chunkun Bo 1,2, Ke Wang 1,2, Yanjun (Jane) Qi 1, Kevin Skadron 1,2 1 Department of Computer Science 2 Center for Automata Processing

More information

Optimization with Gradient and Hessian Information Calculated Using Hyper-Dual Numbers

Optimization with Gradient and Hessian Information Calculated Using Hyper-Dual Numbers Optimization with Gradient and Hessian Information Calculated Using Hyper-Dual Numbers Jeffrey A. Fike and Juan J. Alonso Department of Aeronautics and Astronautics, Stanford University, Stanford, CA 94305,

More information

Sensitivity computation based on CasADi for dynamic optimization problems

Sensitivity computation based on CasADi for dynamic optimization problems Sensitivity computation based on CasADi for dynamic optimization problems 16 th European Workshop on Automatic Differentiation 8 9 December 2014, Jena, Germany Evgeny Lazutkin, Abebe Geletu, Siegbert Hopfgarten,

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

GPUs and Einstein s Equations

GPUs and Einstein s Equations GPUs and Einstein s Equations Tim Dewey Advisor: Dr. Manuel Tiglio AMSC Scientific Computing University of Maryland May 5, 2011 Outline 1 Project Summary 2 Evolving Einstein s Equations 3 Implementation

More information

INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX

INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX Keith Ma ---------------------------------------- keithma@bu.edu Research Computing Services ----------- help@rcs.bu.edu Boston University ----------------------------------------------------

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

MOSEK Optimization Suite

MOSEK Optimization Suite MOSEK Optimization Suite Release 8.1.0.72 MOSEK ApS 2018 CONTENTS 1 Overview 1 2 Interfaces 5 3 Remote optimization 11 4 Contact Information 13 i ii CHAPTER ONE OVERVIEW The problem minimize 1x 1 + 2x

More information

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy Jan Verschelde joint with Genady Yoffe and Xiangcheng Yu University of Illinois at Chicago Department of Mathematics, Statistics,

More information

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters

PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,

More information

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee

trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell 05/12/2015 IWOCL 2015 SYCL Tutorial Khronos OpenCL SYCL committee trisycl Open Source C++17 & OpenMP-based OpenCL SYCL prototype Ronan Keryell Khronos OpenCL SYCL committee 05/12/2015 IWOCL 2015 SYCL Tutorial OpenCL SYCL committee work... Weekly telephone meeting Define

More information

Overview of ROCCC 2.0

Overview of ROCCC 2.0 Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment

More information

Accelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger

Accelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger Accelerating a Simulation of Type I X ray Bursts from Accreting Neutron Stars Mark Mackey Professor Alexander Heger The goal of my project was to develop an optimized linear system solver to shorten the

More information

High-Level API for GPGPU using Meta-programming

High-Level API for GPGPU using Meta-programming High-Level API for GPGPU using Meta-programming Joel Falcou University Paris-Sud LRI December 15, 2015 The Hardware/Software Trade-Off Single Core Era Performance Multi-Core/SIMD Era Performance Heterogenous

More information

OmpCloud: Bridging the Gap between OpenMP and Cloud Computing

OmpCloud: Bridging the Gap between OpenMP and Cloud Computing OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University

More information

SDA: Software-Defined Accelerator for general-purpose big data analysis system

SDA: Software-Defined Accelerator for general-purpose big data analysis system SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search

More information

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Red Fox: An Execution Environment for Relational Query Processing on GPUs Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,

More information

Parallel implicit ordinary differential equation solver for cuda. Tomasz M. Kardaś

Parallel implicit ordinary differential equation solver for cuda. Tomasz M. Kardaś Parallel implicit ordinary differential equation solver for cuda Tomasz M. Kardaś August 11, 2014 Chapter 1 Parallel Implicit Ordinary Differential Equations Solver A simplest definition of stiffness,

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

L0 L1 L2 L3 T0 T1 T2 T3. Eth1-4. Eth1-4. Eth1-2 Eth1-2 Eth1-2 Eth Eth3-4 Eth3-4 Eth3-4 Eth3-4.

L0 L1 L2 L3 T0 T1 T2 T3. Eth1-4. Eth1-4. Eth1-2 Eth1-2 Eth1-2 Eth Eth3-4 Eth3-4 Eth3-4 Eth3-4. Click! N P 10.3.0.1 10.3.1.1 Eth33 Eth1-4 Eth1-4 C0 C1 Eth1-2 Eth1-2 Eth1-2 Eth1-2 10.2.0.1 10.2.1.1 10.2.2.1 10.2.3.1 Eth3-4 Eth3-4 Eth3-4 Eth3-4 L0 L1 L2 L3 Eth24-25 Eth1-24 Eth24-25 Eth24-25 Eth24-25

More information

Accelerating sequential computer vision algorithms using commodity parallel hardware

Accelerating sequential computer vision algorithms using commodity parallel hardware Accelerating sequential computer vision algorithms using commodity parallel hardware Platform Parallel Netherlands GPGPU-day, 28 June 2012 Jaap van de Loosdrecht NHL Centre of Expertise in Computer Vision

More information

Introduction to Matlab

Introduction to Matlab EL1150, Lecture 2 Matlab Programming Introduction to Matlab Based on lectures by F. Gustafsson, Linköping University http://www.kth.se/ees/utbildning/kurshemsidor/control/el1150 1 Today s Lecture Matlab

More information

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation Erik Schnetter, Perimeter Institute with M. Blazewicz, I. Hinder, D. Koppelman, S. Brandt, M. Ciznicki, M.

More information

Domain Specific Languages for Convex Optimization

Domain Specific Languages for Convex Optimization Domain Specific Languages for Convex Optimization Stephen Boyd joint work with E. Chu, J. Mattingley, M. Grant Electrical Engineering Department, Stanford University ROKS 2013, Leuven, 9 July 2013 1 Outline

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Optimal Control Techniques for Dynamic Walking

Optimal Control Techniques for Dynamic Walking Optimal Control Techniques for Dynamic Walking Optimization in Robotics & Biomechanics IWR, University of Heidelberg Presentation partly based on slides by Sebastian Sager, Moritz Diehl and Peter Riede

More information

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Page 1 of 17 In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Niloo Ranjan Jibonananda Sanyal Joshua New Page 2 of 17 Table of Contents In-Situ Statistical Analysis

More information

A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory

A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory A Bytecode Interpreter for Secure Program Execution in Untrusted Main Memory Maximilian Seitzer, Michael Gruhn, Tilo Müller Friedrich Alexander Universität Erlangen-Nürnberg https://www1.cs.fau.de Introduction

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Efficiency of second-order differentiation schemes by algorithmic differentiation: Case Study

Efficiency of second-order differentiation schemes by algorithmic differentiation: Case Study Efficiency of second-order differentiation schemes by algorithmic differentiation: Case Study Thorsten Lajewski Supervisor: Johannes Lotz STCE, RWTH Aachen May 6, 2013 1 Introduction In numerical calculations

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

JCudaMP: OpenMP/Java on CUDA

JCudaMP: OpenMP/Java on CUDA JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems

More information

Parallelizing Cryptography. Gordon Werner Samantha Kenyon

Parallelizing Cryptography. Gordon Werner Samantha Kenyon Parallelizing Cryptography Gordon Werner Samantha Kenyon Outline Security requirements Cryptographic Primitives Block Cipher Parallelization of current Standards AES RSA Elliptic Curve Cryptographic Attacks

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

Multi-threaded Optimization Toolbox

Multi-threaded Optimization Toolbox Multi-threaded Optimization Toolbox Release 0.3.11 Robbert Harms Feb 16, 2018 Contents 1 Introduction 1 1.1 Can MOT help me?......................................... 1 1.2 Summary...............................................

More information

On the efficiency of the Accelerated Processing Unit for scientific computing

On the efficiency of the Accelerated Processing Unit for scientific computing 24 th High Performance Computing Symposium Pasadena, April 5 th 2016 On the efficiency of the Accelerated Processing Unit for scientific computing I. Said, P. Fortin, J.-L. Lamotte, R. Dolbeau, H. Calandra

More information

How to Run an AMESim model with the RT-LAB Platform

How to Run an AMESim model with the RT-LAB Platform How to Run an AMESim model with the RT-LAB Platform This document will describe how to run an AMESim model in real-time with the RT-LAB platform, or integrate it as a part of RT-LAB real-time model. 1.

More information

Subset Sum Problem Parallel Solution

Subset Sum Problem Parallel Solution Subset Sum Problem Parallel Solution Project Report Harshit Shah hrs8207@rit.edu Rochester Institute of Technology, NY, USA 1. Overview Subset sum problem is NP-complete problem which can be solved in

More information

Jacket: Faster MATLAB Genomics Codes. by Chris McClanahan, GPU Engineer

Jacket: Faster MATLAB Genomics Codes. by Chris McClanahan, GPU Engineer Jacket: Faster MATLAB Genomics Codes by Chris McClanahan, GPU Engineer Outline Introduction to Jacket for MATLAB GFOR Comparison with GPU-PCT alternative Case Studies: Genomics Examples Matrix Types gdouble

More information

A Brief Look at Optimization

A Brief Look at Optimization A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest

More information

ACADO TOOLKIT - AUTOMATIC CONTROL AND DYNAMIC OPTIMIZATION

ACADO TOOLKIT - AUTOMATIC CONTROL AND DYNAMIC OPTIMIZATION ACADO Toolkit - Automatic Control and Dynamic Optimization p. 1/24 ACADO TOOLKIT - AUTOMATIC CONTROL AND DYNAMIC OPTIMIZATION Moritz Diehl, Boris Houska, Hans Joachim Ferreau Optimization in Engineering

More information

MULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT

MULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT MULTI-CORE PROGRAMMING Dongrui She December 9, 2010 ASSIGNMENT Goal of the Assignment 1 The purpose of this assignment is to Have in-depth understanding of the architectures of real-world multi-core CPUs

More information

Accelerated Machine Learning Algorithms in Python

Accelerated Machine Learning Algorithms in Python Accelerated Machine Learning Algorithms in Python Patrick Reilly, Leiming Yu, David Kaeli reilly.pa@husky.neu.edu Northeastern University Computer Architecture Research Lab Outline Motivation and Goals

More information

Shared Memory Consistency Models: A Tutorial

Shared Memory Consistency Models: A Tutorial Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The

More information

First steps to solve MINLP problems with Muriqui Optimizer

First steps to solve MINLP problems with Muriqui Optimizer First steps to solve MINLP problems with Muriqui Optimizer June 8, 2018 Wendel Melo College of Computer Science Federal University of Uberlandia, Brazil wendelmelo@ufu.br Marcia Fampa Institute of Mathematics

More information

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information