Sparse Training Data Tutorial of Parameter Server


Carnegie Mellon University
Sparse Training Data: Tutorial of Parameter Server
Mu Li, CSD@CMU & IDL@Baidu
muli@cs.cmu.edu

High-dimensional data are sparse
Why high dimension?
- it makes the classifier's job easier
- a linear method is often good enough
Why sparse?
- easy to store (only the non-zero entries are kept)
- affordable computation cost
Key difference to dense data when used:
- requires a lot of random reads/writes

Storage and Computation

Compressed storage
Example matrix (rows 0-3, columns 0-2):
       col0  col1  col2
row0     .    10    20
row1     3     .     .
row2     .    87     .
row3    57     .     .
Sparse row-major (CSR):
  offset = [0, 2, 3, 4, 5]
  index  = [1, 2, 0, 1, 0]
  value  = [10, 20, 3, 87, 57]
Sparse column-major (CSC):
  offset = [0, 2, 4, 5]
  index  = [1, 3, 0, 2, 0]
  value  = [3, 57, 10, 87, 20]
Access a(i,j) under row-major:
  k = binary_search(index[offset[i]], index[offset[i+1]], j)
  return valid(k) ? value[offset[i]+k] : 0
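
As a concrete illustration of this layout and of the a(i,j) lookup, here is a minimal C++ sketch; the struct SparseRowMajor and its get() method are illustrative names rather than code from the tutorial, and it assumes the column indices within each row are sorted:

#include <cstdint>
#include <vector>
#include <algorithm>

struct SparseRowMajor {
  std::vector<uint64_t> offset;  // size = #rows + 1
  std::vector<uint32_t> index;   // column index of each non-zero entry
  std::vector<double>   value;   // value of each non-zero entry

  // Return a(i, j), or 0 if that entry is not stored.
  double get(size_t i, uint32_t j) const {
    // binary search for j among the column indices of row i
    auto first = index.begin() + offset[i];
    auto last  = index.begin() + offset[i + 1];
    auto it = std::lower_bound(first, last, j);
    if (it == last || *it != j) return 0.0;
    return value[offset[i] + (it - first)];
  }
};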

y = Ax
Sample C++ code: a loop over rows (row-major) or over columns (column-major); see the sketch below.
- all of offset, index, and value are read sequentially
- row-major: write y sequentially, but read x at random
- column-major: read x sequentially, but write y at random
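
Here is a minimal sketch of both loops, using the offset/index/value arrays from the previous slide; the function names spmv_csr and spmv_csc are illustrative, not from the tutorial:

#include <cstdint>
#include <vector>

// Row-major (CSR): y is written sequentially, x is read at random.
void spmv_csr(const std::vector<uint64_t>& offset,
              const std::vector<uint32_t>& index,
              const std::vector<double>& value,
              const std::vector<double>& x,
              std::vector<double>* y) {
  for (size_t i = 0; i + 1 < offset.size(); ++i) {
    double sum = 0;
    for (uint64_t k = offset[i]; k < offset[i + 1]; ++k)
      sum += value[k] * x[index[k]];  // random read of x
    (*y)[i] = sum;                    // sequential write of y
  }
}

// Column-major (CSC): x is read sequentially, y is written at random.
void spmv_csc(const std::vector<uint64_t>& offset,
              const std::vector<uint32_t>& index,
              const std::vector<double>& value,
              const std::vector<double>& x,
              std::vector<double>* y) {
  y->assign(y->size(), 0.0);              // accumulate into a zeroed y
  for (size_t j = 0; j + 1 < offset.size(); ++j) {
    for (uint64_t k = offset[j]; k < offset[j + 1]; ++k)
      (*y)[index[k]] += value[k] * x[j];  // random write of y
  }
}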

[Table of access latency numbers, each level roughly 10 times slower than the previous one; from slides by Jeff Dean]

Cost of y = Ax
The computation cost is O(nnz(A)).
The random access dominates the cost: roughly nnz(A) L2-cache references.
- In theory: process 1.4e8 nnz entries per second
- In reality: 8.4e7 nnz entries per second
  (4.3M x 17.4M sparse matrix; MacBook Pro, Intel i7 2.3GHz CPU; single thread)
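
A quick sanity check on the in-theory number, assuming the commonly quoted ~7 ns per L2 cache reference (the 7 ns figure is an assumption taken from the latency numbers above, not stated on this slide):

  1 entry / 7e-9 seconds per L2 reference ≈ 1.4e8 entries per second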

Real data (from Sibyl)
- 100 features per example is reasonable
- feature groups
- for y = Ax, process 1e6 examples per second
- for a linear method with 1000 cores, 100 billion examples, and 100 iterations, finish in about 3h in the ideal case:
  1e11 examples x 100 iterations / 1000 cores / 1e6 examples per second per core = 1e4 seconds ≈ 3 hours

Patterns of Sparsity
[Histogram of the number of nnz features per example (0-300)]
- non-zero entries are distributed irregularly over the features
- imbalanced workload partition
- ill-conditioning

Multi-threaded Implementation

CPU
- Multiple cores (4-8)
- Multiple sockets (1-4)
- 2-4 GHz clock
- Memory interface: 20-40 GB/s
- Internal bandwidth: >100 GB/s

Benefits of multi-threading
- Fetching data from main memory costs ~100 cycles
- Use more computation units: registers, floating-point units
- Hide the memory latency: run something else while the data are not ready
[Diagram of the memory hierarchy: registers, L1, L2, L3, main memory]

Using ThreadPool
A pool of threads, each one keeps fetching and executing unfinished tasks.
- Create a pool with n threads: ThreadPool pool(n)
- Add a task into the pool: pool.add(task)
- Start executing: pool.startworkers()
Example schedule:
  thread 0: task 0, task 2, task 4
  thread 1: task 1, task 3, task 5
  (time ->)
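
A minimal sketch of a thread pool with this interface, using C++11 threads; the real ThreadPool used in the tutorial's code base likely differs (for example, this simplified version requires all tasks to be added before startworkers() is called):

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
 public:
  explicit ThreadPool(int num_workers) : num_workers_(num_workers) {}
  ~ThreadPool() { for (auto& t : workers_) t.join(); }  // wait for all tasks

  // Queue a task; in this sketch, add() must be called before startworkers().
  void add(std::function<void()> task) { tasks_.push(std::move(task)); }

  // Launch the workers; each one keeps popping and running tasks until the
  // queue is empty.
  void startworkers() {
    for (int i = 0; i < num_workers_; ++i) {
      workers_.emplace_back([this]() {
        for (;;) {
          std::function<void()> task;
          {
            std::lock_guard<std::mutex> lock(mu_);
            if (tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run outside the lock
        }
      });
    }
  }

 private:
  int num_workers_;
  std::mutex mu_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
};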

Multi-threaded y = Ax
- Assume row-major
- Divide y into several segments; each one is assigned to a thread
- Each thread computes its segment of y (tasks written as C++11 lambda functions; see the sketch below)
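
A sketch of this scheme on top of the ThreadPool and CSR arrays sketched above: the rows (and thus the segments of y) are split evenly into nthreads tasks, each expressed as a C++11 lambda; names are illustrative:

void spmv_csr_parallel(const std::vector<uint64_t>& offset,
                       const std::vector<uint32_t>& index,
                       const std::vector<double>& value,
                       const std::vector<double>& x,
                       std::vector<double>* y, int nthreads) {
  size_t nrows = offset.size() - 1;
  ThreadPool pool(nthreads);
  for (int t = 0; t < nthreads; ++t) {
    size_t begin = nrows * t / nthreads;
    size_t end   = nrows * (t + 1) / nthreads;
    // each task owns rows [begin, end), so no two threads write the same y[i]
    pool.add([&, begin, end]() {
      for (size_t i = begin; i < end; ++i) {
        double sum = 0;
        for (uint64_t k = offset[i]; k < offset[i + 1]; ++k)
          sum += value[k] * x[index[k]];
        (*y)[i] = sum;
      }
    });
  }
  pool.startworkers();  // workers are joined when pool goes out of scope
}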

How about column-major?
- Equivalent to computing x = A^T y for a row-major A
- Multiple threads would concurrently write the same y
Several possible solutions:
- convert into a row-major matrix first
- lock y[i] (or a segment of y) before writing it
- each thread writes only a segment of y (see the sketch below)
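
As one example, here is a sketch of the last option for the column-major (CSC) case: every thread scans all columns but updates only the entries of y in its own row range, so no locking is needed; names are illustrative:

void spmv_csc_parallel(const std::vector<uint64_t>& offset,
                       const std::vector<uint32_t>& index,
                       const std::vector<double>& value,
                       const std::vector<double>& x,
                       std::vector<double>* y, int nthreads) {
  size_t nrows = y->size();
  ThreadPool pool(nthreads);
  for (int t = 0; t < nthreads; ++t) {
    size_t row_begin = nrows * t / nthreads;
    size_t row_end   = nrows * (t + 1) / nthreads;
    pool.add([&, row_begin, row_end]() {
      for (size_t i = row_begin; i < row_end; ++i) (*y)[i] = 0;
      for (size_t j = 0; j + 1 < offset.size(); ++j) {
        for (uint64_t k = offset[j]; k < offset[j + 1]; ++k) {
          uint32_t i = index[k];
          if (i >= row_begin && i < row_end)  // only this thread's segment
            (*y)[i] += value[k] * x[j];
        }
      }
    });
  }
  pool.startworkers();
}

The trade-off is that every thread still scans all of A, which is why converting to row-major first or locking segments of y are also listed as options.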

Experiments
- data: CTRa, row-major, 4.3M rows, 17.4M columns, 354M nnz entries
- MacBook Pro, Intel i7 2.3GHz, 4 cores, 8 hyper-threads
[Plot: speedup (1x to ~3.5x) vs. number of threads (1-8), for y = Ax and y = A^T x]

Coding Practice
Implement x = A^T y.
- You can reuse the code at https://github.com/mli/mlss14_a
- CTRa in binary format is provided:
  CTRa_X.index: 354700138 uint32
  CTRa_X.offset: 4349786 uint64
  CTRa_X.info: information
- the entries are 0-1 valued, so the value array can be ignored

Row-major or column-major?
- No big difference for accessing individual entries or the whole matrix; choose the one matching how the data are stored
- Use row-major when you need to read individual rows: SGD, minibatch SGD, online learning
- Use column-major when you need to read columns: (block) coordinate descent
- Converting between the two costs 2*nnz(A) random accesses (see the sketch below)
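
A sketch of such a conversion (CSR to CSC) that makes the cost visible: one counting pass and one scatter pass, each performing nnz(A) random accesses; the function name and signature are illustrative:

#include <cstdint>
#include <vector>

void csr_to_csc(size_t nrows, size_t ncols,
                const std::vector<uint64_t>& row_offset,
                const std::vector<uint32_t>& col_index,
                const std::vector<double>& val,
                std::vector<uint64_t>* col_offset,
                std::vector<uint32_t>* row_index,
                std::vector<double>* out_val) {
  size_t nnz = col_index.size();
  col_offset->assign(ncols + 1, 0);
  row_index->resize(nnz);
  out_val->resize(nnz);
  // pass 1: count entries per column (nnz random increments)
  for (size_t k = 0; k < nnz; ++k) ++(*col_offset)[col_index[k] + 1];
  // prefix sum -> start position of each column
  for (size_t j = 0; j < ncols; ++j) (*col_offset)[j + 1] += (*col_offset)[j];
  // pass 2: scatter the entries into their columns (nnz random writes)
  std::vector<uint64_t> next(col_offset->begin(), col_offset->end() - 1);
  for (size_t i = 0; i < nrows; ++i) {
    for (uint64_t k = row_offset[i]; k < row_offset[i + 1]; ++k) {
      uint64_t dst = next[col_index[k]]++;
      (*row_index)[dst] = static_cast<uint32_t>(i);
      (*out_val)[dst] = val[k];
    }
  }
}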

More
Other operations? BLAS, LAPACK; see Timothy A. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
Existing packages: SuiteSparse, Eigen3
- Use them as much as possible
- However, problem-specific optimizations may improve performance a lot, as we will see later

Eigen3
- Easy to install: all header files, just copy them to a proper place
- Not easy to read: a lot of templates
- Good performance on dense data
- Somewhat convenient to use
- Jeff Dean is using it