A Shared Memory Parallel Algorithm for Data Reduction Using the Singular Value Decomposition Rhonda Phillips, Layne Watson, Randolph Wynne


A Shared Memory Parallel Algorithm for Data Reduction Using the Singular Value Decomposition
Rhonda Phillips, Layne Watson, Randolph Wynne
April 16, 2008

Outline
Motivation, Algorithms, Study Area, Results and Analysis, Implementation Details, Conclusions

Motivation
Satellite images are increasing in size as spectral and spatial resolutions increase. Having too many features relative to the size of the training data set degrades classification accuracy (overfitting). The SVD can be expensive to compute, especially for a large data set.

SVD-based feature reduction
SVDReduce: Let X = UΣV^t be m × n with m << n. To represent X using k dimensions (k < m), set the singular values in positions k+1 through m of Σ to zero, forming Σ̃; X is then represented by the coordinates Σ̃V^t.
SVDTrainingReduce: Instead of factoring the image X as in SVDReduce, a representative training data set T is factored, T = ÛΣ̂V̂^t, and X is projected onto the basis set Û.
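The following Fortran 95 sketch illustrates SVDTrainingReduce, assuming LAPACK's DGESVD and the BLAS routine DGEMM are linked in; the subroutine and its argument names are illustrative, not the code described in the talk.

    ! Factor the training set T = U S V^t, keep the first k left singular
    ! vectors, and express every image pixel (column of X) in that basis.
    subroutine svd_training_reduce(m, nt, n, k, T, X, coords)
      implicit none
      integer, intent(in) :: m, nt, n, k          ! m bands, nt training pixels, n image pixels
      double precision, intent(inout) :: T(m, nt) ! overwritten by DGESVD
      double precision, intent(in)    :: X(m, n)
      double precision, intent(out)   :: coords(k, n)
      double precision :: U(m, m), S(m), VT(1, 1)
      double precision, allocatable :: work(:)
      integer :: lwork, info

      ! Workspace query, then the factorization (right singular vectors not needed).
      allocate(work(1)); lwork = -1
      call dgesvd('A', 'N', m, nt, T, m, S, U, m, VT, 1, work, lwork, info)
      lwork = int(work(1)); deallocate(work); allocate(work(lwork))
      call dgesvd('A', 'N', m, nt, T, m, S, U, m, VT, 1, work, lwork, info)

      ! coords = U(:,1:k)^t * X : the k-dimensional representation of the image.
      call dgemm('T', 'N', k, n, m, 1.0d0, U, m, X, m, 0.0d0, coords, k)
    end subroutine svd_training_reduce

SVDReduce differs only in that the image X itself, rather than the training set T, is passed to the SVD routine, which is why its factorization time on the next slides is so much larger.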

Study Area
[Figure: map of the study area, with a scale bar in meters.] A hyperspectral image containing 224 bands taken over the Appomattox-Buckingham State Forest was used to obtain all execution times listed.

Execution Times for SVDReduce
U, the left singular vectors, are computed using the image X, and X is projected onto U.
    Operation                 Time (seconds)
    SVD Factorization         340.04
    Matrix-Vector Multiply    184.33
    Other                       8.05
    Total                     532.42

Execution Times for SVDTrainingReduce
Û, the left singular vectors, are computed using the training data T, and X is projected onto Û.
    Operation                 Time (seconds)
    SVD Factorization           0.24
    Matrix-Vector Multiply    180.07
    Other                       4.71
    Total                     185.02

Parallel SVD-based feature reduction
SVDTrainingReduce is faster than SVDReduce and is better suited to a parallel computer. The SVD factorization and the projection of X onto Û can both be computed in parallel, but the factorization requires much less execution time than the projection. In practice, a (shared memory) parallel SVD factorization was slower than the serial SVD factorization, so the effort goes into parallelizing the projection.
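As a concrete illustration of the parallelized projection step, the dominant cost above, the following Fortran 95 / OpenMP sketch distributes the image pixels (columns of X) across threads; the subroutine and its arguments are illustrative rather than the talk's actual code.

    ! Each thread projects a subset of the pixels onto the k retained basis
    ! vectors in U; columns are independent, so no synchronization is needed.
    subroutine parallel_project(m, n, k, U, X, coords)
      implicit none
      integer, intent(in) :: m, n, k
      double precision, intent(in)  :: U(m, k), X(m, n)
      double precision, intent(out) :: coords(k, n)
      integer :: i, j
    !$omp parallel do schedule(guided) default(shared) private(i, j)
      do j = 1, n
         do i = 1, k
            coords(i, j) = dot_product(U(:, i), X(:, j))
         end do
      end do
    !$omp end parallel do
    end subroutine parallel_project

The guided schedule shown here is one of the strategies compared in the speedup plots that follow; the static and dynamic variants differ only in this clause.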

Parallel Speedup
[Figure: speedup vs. number of processors for ideal, static, guided, and default scheduling.] Parallel speedup of pSVDTrainingReduce including all input/output operations.

Parallel Speedup
[Figure: speedup vs. number of processors for ideal, static, guided, and default scheduling.] Parallel speedup of pSVDTrainingReduce without input/output execution times included.

Implementation Issues
Scheduling, Data Placement, Private vs. Shared Variables, Cache

Scheduling
The speedup increase using all processors can be repeated on smaller SGI Altixes. [Figure: speedup vs. number of processors, up to 12, for ideal, static, guided, and default scheduling.] Speedup without input/output using 12 processors.

Scheduling
Referring to the previous speedup graphs, there is an increase in speedup when guided scheduling is used on all processors. Guided scheduling is a type of dynamic scheduling that varies the chunk size, and the same speedup is observed for plain dynamic scheduling. Although these are dynamic scheduling strategies, data initialization and placement (using dplace) remain important. Allowing the underlying hardware and software to assign work to processors results in a large speedup, but only when all processors are used (see the sketch after the chunk-size plot below).

Scheduling
[Figure: execution time (roughly 12 to 18 seconds) vs. chunk size (2 to 8) for static, dynamic, guided, and guided without dplace.] Illustration of execution time differences using various scheduling and data placement strategies when all processors are involved in the computation.
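For reference, the scheduling strategies and chunk sizes above enter the code only through the schedule clause on the projection loop. A hedged sketch, reusing the loop from the earlier example but with schedule(runtime) so the strategy can be chosen per run through the standard OMP_SCHEDULE environment variable (for example OMP_SCHEDULE="guided,4" or OMP_SCHEDULE="dynamic,8"):

    ! Identical to parallel_project except for the schedule clause; the
    ! schedule kind and chunk size are read from OMP_SCHEDULE at run time.
    subroutine project_runtime_schedule(m, n, k, U, X, coords)
      implicit none
      integer, intent(in) :: m, n, k
      double precision, intent(in)  :: U(m, k), X(m, n)
      double precision, intent(out) :: coords(k, n)
      integer :: i, j
    !$omp parallel do schedule(runtime) default(shared) private(i, j)
      do j = 1, n
         do i = 1, k
            coords(i, j) = dot_product(U(:, i), X(:, j))
         end do
      end do
    !$omp end parallel do
    end subroutine project_runtime_schedule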

Data Placement
[Figure: zoom of the parallel speedup (without input/output) at higher processor counts, for ideal, static, guided, and default scheduling.] In this zoom, the static and guided scheduling strategies using dplace and consistent memory access outperform a naive approach.

Variable Storage
U is the basis set onto which every vector in X is projected. U can be physically stored as static or dynamic memory, and can be private to each processor or shared across processors during parallel execution. Execution times for different methods of storing U using 12 processors:
    Variable Type     Time (seconds)
    static shared     20.586
    static private    17.741
    dynamic shared    17.419
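In OpenMP terms, the private variant amounts to giving each thread its own copy of the basis. The talk timed static/dynamic storage variants in Fortran; the hedged sketch below shows only the shared-vs-private distinction, differing from the earlier projection loop only in the firstprivate(U) clause (illustrative, not the talk's code).

    ! firstprivate(U) copies the basis into each thread at region entry, so
    ! no thread reads a cache line that another processor's copy lives on.
    subroutine project_private_basis(m, n, k, U, X, coords)
      implicit none
      integer, intent(in) :: m, n, k
      double precision, intent(in)  :: U(m, k), X(m, n)
      double precision, intent(out) :: coords(k, n)
      integer :: i, j
    !$omp parallel do schedule(guided) default(shared) private(i, j) firstprivate(U)
      do j = 1, n
         do i = 1, k
            coords(i, j) = dot_product(U(:, i), X(:, j))
         end do
      end do
    !$omp end parallel do
    end subroutine project_private_basis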

Cache optimization
The previous slide shows that cache coherency overhead increased execution times when U was stored in static memory shared across processors. Because the algorithm is implemented in Fortran 95, which uses column-major order, all data are accessed by column: each processor then touches data elements in the order they are laid out in memory, maximizing cache hits. The data are also distributed across processors by column, minimizing the likelihood that two processors operate on the same cache line.
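To make the column-major point concrete, here is a small, self-contained Fortran contrast between the two traversal orders; the array sizes are arbitrary and the program is only an illustration of the access pattern the projection loops above rely on.

    ! Row-wise vs. column-wise traversal of a column-major array.  The second
    ! nest visits consecutive memory addresses (unit stride) and therefore
    ! makes far better use of the cache.
    program column_major_demo
      implicit none
      integer, parameter :: m = 224, n = 100000
      double precision, allocatable :: a(:, :)
      double precision :: s
      integer :: i, j
      allocate(a(m, n));  a = 1.0d0;  s = 0.0d0
      ! Poor: inner loop jumps m elements between consecutive accesses.
      do i = 1, m
         do j = 1, n
            s = s + a(i, j)
         end do
      end do
      ! Good: inner loop runs down a column, touching adjacent elements.
      do j = 1, n
         do i = 1, m
            s = s + a(i, j)
         end do
      end do
      print *, s
    end program column_major_demo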

Conclusions
pSVDTrainingReduce is faster than SVDTrainingReduce, which is faster and parallelizes better than SVDReduce. If input/output is excluded from the execution times, pSVDTrainingReduce scales well to 128 processors, since the SVD factorization is a small portion of the algorithm. Even for straightforward parallel algorithms such as pSVDTrainingReduce, performance tuning is essential. There is something interesting happening under the hood of the SGI Altix when dynamic scheduling strategies are employed and all processors are utilized.