GPU-Accelerated Incremental Correlation Clustering of Large Data with Visual Feedback

GPU-Accelerated Incremental Correlation Clustering of Large Data with Visual Feedback. Eric Papenhausen and Bing Wang (Stony Brook University), Sungsoo Ha (SUNY Korea), Alla Zelenyuk (Pacific Northwest National Lab), Dan Imre (Imre Consulting), Klaus Mueller (Stony Brook University and SUNY Korea)

The Internet of Things and People

The Large Synoptic Survey Telescope
  Will survey the entire visible sky deeply in multiple colors every week with its three-billion-pixel digital camera
  Probe the mysteries of Dark Matter & Dark Energy
  10x more galaxies than the Sloan Digital Sky Survey
  Movie-like window on objects that change or move rapidly

Our Data: Aerosol Science
  Acquired by a state-of-the-art single particle mass spectrometer (SPLAT II), often deployed in an aircraft
  Used in atmospheric chemistry to
    understand the processes that control the atmospheric aerosol life cycle
    find the origins of climate change
    uncover and model the relationship between atmospheric aerosols and climate

Our Data: Aerosol Science
  SPLAT II can acquire up to 100 particles per second, at sizes between 50 and 3,000 nm with a precision of 1 nm
  Creates a 450-D mass spectrum for each particle
  SpectraMiner: builds a hierarchy of particles based on their spectral composition
  The hierarchy is used in subsequent automated classification of new particle acquisitions in the field or in the lab

SpectraMiner
  Tightly integrates the scientist into the data analytics
  Interactive clustering: cluster sculpting
  Interaction is needed since the data are extremely noisy; fully automated clustering tools typically do not return satisfactory results
  Strategy:
    Determine leaf nodes
    Merge using a correlation metric via heap sort
    Correlation is sensitive to particle composition ratios (or mixing state)

SpectraMiner Scale-Up
  The CPU-based solution worked well for some time
  SPLAT II and new large campaigns present problems:
    At 100 particles/s, the number of particles gathered in a single acquisition run can easily reach 100,000; that takes just a 15-minute time window
    Large campaigns are much longer & more frequent; datasets of 5-10M particles have become the norm
    Recently SPLAT II operated 24/7 for one month and had to reduce the acquisition rate to 20 particles/s
    The CPU-based solution took days/weeks to compute

Interlude: Big Data. What Do You Need?
  #1: Well, data!! data = $$; look at LinkedIn, Facebook, Google, Amazon
  #2: High-performance computing: parallel computing (GPUs), cloud computing
  #3: Nifty computer algorithms for
    noise removal
    redundancy elimination and importance sampling
    missing data estimation
    outlier detection
    natural language processing and analysis
    image and video analysis
    learning a classification model

Incremental k-means: Sequential
  Basis of our trusted CPU-based solution (10 years old)

    Make the first point a cluster center
    While number of unclustered points > 0
      Pt = next unclustered point
      Compare Pt to all cluster centers
      Find the cluster with the shortest distance
      If (distance < threshold)
        Cluster Pt into that cluster center
      Else
        Make Pt a new cluster center
      End If
    End While
    Second pass to cluster outliers
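As a concrete reference, here is a minimal host-side sketch of this sequential pass, assuming the Pearson correlation distance d = 1 - r and 450-D spectra stored in a flat float array; the names (pearsonDistance, incrementalKMeans, Clustering) are illustrative, not taken from the SpectraMiner code.

```cuda
#include <cfloat>
#include <cmath>
#include <vector>

// Pearson correlation distance d = 1 - r between two spectra of length dim.
float pearsonDistance(const float* a, const float* b, int dim)
{
    double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
    for (int i = 0; i < dim; ++i) {
        sa  += a[i];        sb  += b[i];
        saa += a[i] * a[i]; sbb += b[i] * b[i]; sab += a[i] * b[i];
    }
    double cov = sab - sa * sb / dim;
    double var = std::sqrt((saa - sa * sa / dim) * (sbb - sb * sb / dim));
    return var > 0 ? (float)(1.0 - cov / var) : 1.0f;
}

struct Clustering {
    std::vector<int>          assignment; // cluster id per point
    std::vector<const float*> centers;    // one representative spectrum per cluster
};

// Sequential incremental k-means: each point either joins the nearest
// existing cluster (if closer than threshold) or becomes a new center.
Clustering incrementalKMeans(const float* points, int n, int dim, float threshold)
{
    Clustering out;
    out.assignment.assign(n, -1);
    for (int p = 0; p < n; ++p) {
        const float* pt = points + (size_t)p * dim;
        float best = FLT_MAX;
        int bestC = -1;
        for (int c = 0; c < (int)out.centers.size(); ++c) {
            float d = pearsonDistance(pt, out.centers[c], dim);
            if (d < best) { best = d; bestC = c; }
        }
        if (bestC >= 0 && best < threshold) {
            out.assignment[p] = bestC;              // cluster Pt into that center
        } else {
            out.centers.push_back(pt);              // make Pt a new cluster center
            out.assignment[p] = (int)out.centers.size() - 1;
        }
    }
    // The original scheme follows this with a second pass that clusters outliers.
    return out;
}
```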

Incremental k-means: Parallel
  New parallelizable version of the previous algorithm

    Do
      Perform sequential k-means until C clusters emerge
      Num_Iterations = 0
      While Num_Iterations < Max_Iterations
        In parallel: compare all points to the C centers
        In parallel: update the C cluster centers
        Num_Iterations++
      End While
      Output the C clusters
    While number of unclustered points > 0
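Sketched below is the corresponding outer driver loop. The two steps marked "in parallel" are written as plain loops so the sketch stays self-contained; in the actual system they are CUDA kernel launches (see the kernel sketch further below). The Euclidean distance and the exact control-flow details are illustrative assumptions, not the authors' implementation.

```cuda
#include <cfloat>
#include <cmath>
#include <vector>

using Spectrum = std::vector<float>;

static float euclidean(const Spectrum& a, const Spectrum& b)   // Euclidean, for brevity only;
{                                                               // the real system uses d = 1 - r
    double s = 0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return (float)std::sqrt(s);
}

// One pass of the do-loop: seed C centers sequentially, refine them for a fixed
// number of parallel sweeps, emit those C clusters, and continue on the rest.
void incrementalKMeansParallel(std::vector<Spectrum> points, float threshold,
                               int C = 96, int maxIterations = 5)
{
    while (!points.empty()) {
        // 1. Sequential pass until C cluster centers emerge.
        std::vector<Spectrum> centers;
        for (const Spectrum& p : points) {
            bool close = false;
            for (const Spectrum& c : centers)
                if (euclidean(p, c) < threshold) { close = true; break; }
            if (!close) centers.push_back(p);
            if ((int)centers.size() == C) break;
        }

        // 2. Fixed number of sweeps; on the GPU both steps are data-parallel.
        std::vector<int> assign(points.size(), -1);
        for (int it = 0; it < maxIterations; ++it) {
            for (size_t p = 0; p < points.size(); ++p) {       // compare all points to the C centers
                float best = FLT_MAX; int bestC = -1;
                for (size_t c = 0; c < centers.size(); ++c) {
                    float d = euclidean(points[p], centers[c]);
                    if (d < best) { best = d; bestC = (int)c; }
                }
                assign[p] = (best < threshold) ? bestC : -1;
            }
            for (size_t c = 0; c < centers.size(); ++c) {      // update the C cluster centers
                Spectrum mean(centers[c].size(), 0.0f); int cnt = 0;
                for (size_t p = 0; p < points.size(); ++p) {
                    if (assign[p] != (int)c) continue;
                    for (size_t d = 0; d < mean.size(); ++d) mean[d] += points[p][d];
                    ++cnt;
                }
                if (cnt > 0) { for (float& v : mean) v /= cnt; centers[c] = mean; }
            }
        }

        // 3. Output the C clusters (omitted here) and keep only the unclustered points.
        std::vector<Spectrum> rest;
        for (size_t p = 0; p < points.size(); ++p)
            if (assign[p] < 0) rest.push_back(points[p]);
        if (rest.size() == points.size()) break;               // safety guard against stalling
        points.swap(rest);
    }
}
```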

Comments and Observations
  The algorithm merges the incremental k-means algorithm with a parallel implementation (k = C)
  Design choices:
    C = 96 gives a good balance between CPU and GPU utilization
      With C > 96 the algorithm becomes CPU-bound
      With C < 96 the GPU would be underutilized
      A multiple of 32 avoids divergent warps on the GPU
    Max_Iterations = 5 worked best
  Advantages of the new scheme:
    The second pass of the previous scheme is no longer needed

GPU Implementation
  Platform
    1-4 Tesla K20 GPUs, installed in a remote cloud server
    Future implementations will emphasize this cloud aspect more
  Parallelism
    Launch N/32 thread blocks of size 32 x 32 each
    Each thread compares a point with 3 cluster centers
    Use shared memory to avoid non-coalesced memory accesses

GPU Implementation: Algorithm

  c1 = Centers[tid.y]        // first 32 of the 96 centers
  c2 = Centers[tid.y + 32]   // second 32 of the 96 centers
  c3 = Centers[tid.y + 64]   // final 32 of the 96 centers
  pt = Points[tid.x]         // one of the 32 points loaded by the thread block
  [clust, dist] = PearsonDist(pt, c1, c2, c3)        // d_xy = 1 - r_xy
  [clust, dist] = IntraColumnReduction(clust, dist)
  If (tid.y == 0)                                    // first thread in each column writes the result
    Points.clust[tid.x] = clust
    Points.dist[tid.x] = dist
  End If
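A hedged CUDA sketch of this kernel: each 32 x 32 thread block handles 32 points, thread (x, y) compares block point x against centers y, y+32, and y+64, and a shared-memory min-reduction down each column leaves the winning cluster in row y = 0. For brevity it reads points and centers directly from global memory rather than staging them through shared memory as the real implementation does; DIM = 450, NUM_CENTERS = 96, and all names are illustrative.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

#define DIM          450   // mass-spectrum dimensionality
#define NUM_CENTERS   96   // cluster centers per round (3 per thread row)

// Pearson correlation distance d = 1 - r between a point and one center.
__device__ float pearsonDist(const float* p, const float* c)
{
    float sp = 0.f, sc = 0.f, spp = 0.f, scc = 0.f, spc = 0.f;
    for (int i = 0; i < DIM; ++i) {
        float a = p[i], b = c[i];
        sp += a; sc += b; spp += a * a; scc += b * b; spc += a * b;
    }
    float cov = spc - sp * sc / DIM;
    float var = sqrtf((spp - sp * sp / DIM) * (scc - sc * sc / DIM));
    return (var > 0.f) ? 1.f - cov / var : 1.f;
}

// One 32 x 32 block handles 32 points: thread (x, y) compares point x of the
// block against centers y, y+32 and y+64; an intra-column min-reduction then
// leaves the best cluster for each point in row y == 0, which writes it out.
__global__ void assignPoints(const float* points, const float* centers,
                             int* outClust, float* outDist, int nPoints)
{
    __shared__ float sDist[32][32];
    __shared__ int   sClust[32][32];

    int pt = blockIdx.x * 32 + threadIdx.x;          // global point index
    float best = FLT_MAX;
    int bestC = -1;

    if (pt < nPoints) {
        const float* p = points + (size_t)pt * DIM;
        for (int k = 0; k < 3; ++k) {                // centers y, y+32, y+64
            int c = threadIdx.y + 32 * k;
            float d = pearsonDist(p, centers + (size_t)c * DIM);
            if (d < best) { best = d; bestC = c; }
        }
    }
    sDist[threadIdx.y][threadIdx.x]  = best;
    sClust[threadIdx.y][threadIdx.x] = bestC;
    __syncthreads();

    for (int s = 16; s > 0; s >>= 1) {               // min-reduction down each column
        if (threadIdx.y < s &&
            sDist[threadIdx.y + s][threadIdx.x] < sDist[threadIdx.y][threadIdx.x]) {
            sDist[threadIdx.y][threadIdx.x]  = sDist[threadIdx.y + s][threadIdx.x];
            sClust[threadIdx.y][threadIdx.x] = sClust[threadIdx.y + s][threadIdx.x];
        }
        __syncthreads();
    }

    if (threadIdx.y == 0 && pt < nPoints) {          // first thread in the column writes
        outClust[pt] = sClust[0][threadIdx.x];
        outDist[pt]  = sDist[0][threadIdx.x];
    }
}

// Launch configuration: N/32 thread blocks of 32 x 32 threads (rounded up), e.g.
//   dim3 block(32, 32);
//   dim3 grid((nPoints + 31) / 32);
//   assignPoints<<<grid, block>>>(dPoints, dCenters, dClust, dDist, nPoints);
```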

Quality Measures
  Measure cluster quality with the Davies-Bouldin (DB) index:

    DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{M_{ij}}

  σ_i and σ_j are the intra-cluster distances of clusters i and j
  M_ij is the inter-cluster distance of clusters i and j
  DB should be as small as possible
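For reference, a small host-side sketch of the index, assuming the per-cluster scatters σ_i and the pairwise inter-cluster distances M_ij have already been computed; the function name and containers are illustrative:

```cuda
#include <algorithm>
#include <vector>

// Davies-Bouldin index: for every cluster i take the worst ratio
// (sigma_i + sigma_j) / M_ij over all other clusters j, then average
// those worst cases over all n clusters. Smaller values are better.
double daviesBouldin(const std::vector<double>& sigma,            // intra-cluster distances
                     const std::vector<std::vector<double>>& M)   // inter-cluster distances
{
    const size_t n = sigma.size();
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double worst = 0.0;
        for (size_t j = 0; j < n; ++j) {
            if (j == i) continue;
            worst = std::max(worst, (sigma[i] + sigma[j]) / M[i][j]);
        }
        sum += worst;
    }
    return n > 0 ? sum / n : 0.0;
}
```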

Acceleration by Sub-Thresholding
  The size of the data was a large bottleneck: data points had to be kept around for a long time
  Cull points that were tightly clustered early on, i.e., the points with a low Pearson distance to their cluster center
  This also improved the DB index
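A minimal sketch of this culling step, assuming each working point carries the cluster id and Pearson distance produced by the assignment kernel; the record layout and the cullThreshold parameter are illustrative assumptions:

```cuda
#include <vector>

// A working-set point: its original index, the cluster it was assigned to
// (-1 if still unclustered), and its Pearson distance to that cluster center.
struct PointRec { int id; int clust; float dist; };

// Sub-thresholding: drop points that are already tightly clustered, since they
// barely influence further center updates; returns the number of points culled.
size_t cullTightPoints(std::vector<PointRec>& working, float cullThreshold)
{
    size_t kept = 0;
    for (const PointRec& p : working)
        if (p.clust < 0 || p.dist >= cullThreshold)   // keep unclustered or loose points
            working[kept++] = p;
    size_t removed = working.size() - kept;
    working.resize(kept);                             // compact the working set in place
    return removed;
}
```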

Results: Sub-Thresholding
  About a 33x speedup

Results: Multi-GPU
  The 4-GPU configuration gives about a 100x speedup over the sequential version

In-Situ Visual Feedback (1)
  Visualize cluster centers as summary snapshots
    The Glimmer MDS algorithm was used
    Intuitive 2D layout for non-visualization experts
  Color map:
    Small clusters map to mostly white
    Large clusters map to saturated blue
  We find that early visualizations are already quite revealing, as shown by the cluster size histogram
  A cluster size of M > 10 is considered significant
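A small sketch of such a size-to-color mapping, interpolating from white toward saturated blue; the logarithmic scaling and normalization by the largest cluster are illustrative choices, not necessarily what SpectraMiner uses:

```cuda
#include <algorithm>
#include <cmath>

struct RGB { float r, g, b; };

// Map a cluster's size to a color for the MDS snapshot: small clusters stay
// near white, large clusters approach saturated blue. Log scaling keeps the
// many small clusters from washing out next to a few very large ones.
RGB clusterColor(int size, int maxSize)
{
    if (maxSize < 1) maxSize = 1;
    float t = std::log1p((float)size) / std::log1p((float)maxSize);  // normalize to 0..1
    t = std::min(std::max(t, 0.0f), 1.0f);
    return { 1.0f - t, 1.0f - t, 1.0f };   // interpolate white (1,1,1) -> blue (0,0,1)
}
```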

In-Situ Visual Feedback (2)
  Snapshots labeled 79/96, 998/3360, and 2004/13920

In-Situ Visual Feedback (3)
  Snapshots labeled 3001/52800, 4002/165984, and 4207/336994

Relation To Previous Work (1)
  Main difference: we perform k-means clustering for data reduction
  Previous work often uses map-reduce approaches, most often built on MPI/OpenMP:
    Distribute the points onto a set of machines
    Compute (map) one iteration of local k-means in parallel
    Send the local k means to a set of reducers
    Compute their averages in parallel and send them back to the mappers
    Optionally skip the reduction step and instead broadcast to the mappers for local averaging

Relation To Previous Work (2)
  GPU solutions:
    Often only parallelize the point-cluster assignments
    Compute the new cluster centers on the CPU due to low parallelism

Conclusions and Future Work
  The current approach is quite promising:
    Good speedup
    In-situ visualization of the data reduction process with early, valuable feedback
  Future work:
    Load-balancing point removal for multi-GPU
    Anchored visualization so the layout is preserved
    Enable visual steering of point reduction
    Extension to streaming data
    Also accelerate hierarchy building

Final Slide
  Thanks to NSF and DOE for funding
  Additional support from the Korean Ministry of Knowledge Economy (MKE)
  Any questions?