Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Size: px

Start display at page:

Download "Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies"

Mervin Mitchell
5 years ago
Views:

1 Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies Talk at Intel HPC Developer Conference 2017 (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University Xiaoyi Lu The Ohio State University

2 Big Data Processing and Deep Learning on Modern Clusters Intel HPC Developer Conference Multiple tiers + Workflow Front-end data accessing and serving (Online) Memcached + DB (e.g. MySQL), HBase, etc. Back-end data analytics and deep learning model training (Offline) HDFS, MapReduce, Spark, TensorFlow, BigDL, Caffe, etc.

3 Drivers of Modern HPC Cluster Architectures Multi-core Processors High Performance Interconnects - InfiniBand <1usec latency, 100Gbps Bandwidth> Multi-core/many-core technologies Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) Accelerators / Coprocessors high compute density, high performance/watt >1 TFlop DP on a chip SSD, NVMe-SSD, NVRAM Tianhe 2 Titan Stampede Tianhe 1A Intel HPC Developer Conference

Interconnects and Protocols in OpenFabrics Stack for HPC (http://openfabrics.

4 Interconnects and Protocols in OpenFabrics Stack for HPC ( Application / Middleware Interface Protocol Kernel Space TCP/IP Sockets TCP/IP RSockets Application / Middleware SDP TCP/IP Verbs RDMA RDMA Adapter Switch Ethernet Driver Ethernet Adapter Ethernet Switch 1/10/40/100 GigE IPoIB InfiniBand Adapter InfiniBand Switch IPoIB Hardware Offload Ethernet Adapter Ethernet Switch 10/40 GigE- TOE User Space InfiniBand Adapter InfiniBand Switch RSockets RDMA InfiniBand Adapter InfiniBand Switch SDP User Space iwarp Adapter Ethernet Switch iwarp User Space RoCE Adapter Ethernet Switch RoCE User Space InfiniBand Adapter InfiniBand Switch IB Native Intel HPC Developer Conference

5 Intel HPC Developer Conference Large-scale InfiniBand Installations 177 IB Clusters (35%) in the Jun 17 Top500 list ( Installations in the Top 50 (18 systems): 241,108 cores (Pleiades) at NASA/Ames (15 th ) 152,692 cores (Thunder) at AFRL/USA (36 th ) 220,800 cores (Pangea) in France (19 th ) 99,072 cores (Mistral) at DKRZ/Germany (38 th ) 522,080 cores (Stampede) at TACC (20 th ) 147,456 cores (SuperMUC) in Germany (40 th ) 144,900 cores (Cheyenne) at NCAR/USA (22 nd ) 86,016 cores (SuperMUC Phase 2) in Germany (41 st ) 72,800 cores Cray CS-Storm in US (27 th ) 74,520 cores (Tsubame 2.5) at Japan/GSIC (44 th ) 72,800 cores Cray CS-Storm in US (28 th ) 66,000 cores (HPC3) in Italy (47s th ) 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (30 th ) 194,616 cores (Cascade) at PNNL (49 th ) 60,512 cores (DGX-1) at Facebook/USA (31 st ) 85,824 cores (Occigen2) at GENCI/CINES in France (50 th ) 60,512 cores (DGX SATURNV) at NVIDIA/USA (32 nd ) 73,902 cores (Centennial) at ARL/USA (52 nd ) 72,000 cores (HPC2) in Italy (33 rd ) and many more!

6 Intel HPC Developer Conference Increasing Usage of HPC, Big Data and Deep Learning HPC (MPI, RDMA, Lustre, etc.) Big Data (Hadoop, Spark, HBase, Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) Convergence of HPC, Big Data, and Deep Learning!!!

How Can HPC Clusters with High-Performance

Big Data and Deep Learning Applications?

bottlenecks be alleviated with new designs by

What are the major bottlenecks in current Big

benefit Big Data processing and Deep Learning?

7 How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications? Intel HPC Developer Conference Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies? What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g. Hadoop, Spark)? Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning? Can HPC Clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data and Deep Learning applications? How much performance benefits can be achieved through enhanced designs? How to design benchmarks for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters? Bring HPC, Big Data processing, and Deep Learning into a convergent trajectory!

Intel HPC Developer Conference 2017 8 Can We Run Big

8 Intel HPC Developer Conference Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

9 Intel HPC Developer Conference Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

10 Intel HPC Developer Conference Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

11 Intel HPC Developer Conference Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

Intel HPC Developer Conference 2017 12 Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Big Data Middleware (HDFS,

Point-to-Point Communication I/O and File Systems Communication and I/O Library Threaded Models and Synchronization QoS Virtualization Fault-Tolerance

12 Intel HPC Developer Conference Designing Communication and I/O Libraries for Big Data Systems: Challenges Applications Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Programming Models (Sockets) Benchmarks Other Protocols? Upper level Changes? Point-to-Point Communication I/O and File Systems Communication and I/O Library Threaded Models and Synchronization QoS Virtualization Fault-Tolerance Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Storage Technologies (HDD, SSD, and NVMe-SSD)

13 The High-Performance Big Data (HiBD) Project RDMA for Apache Spark RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions RDMA for Apache HBase RDMA for Memcached (RDMA-Memcached) RDMA for Apache Hadoop 1.x (RDMA-Hadoop) OSU HiBD-Benchmarks (OHB) Available for InfiniBand and RoCE Also run on Ethernet HDFS, Memcached, HBase, and Spark Micro-benchmarks Users Base: 260 organizations from 31 countries More than 23,900 downloads from the project site Intel HPC Developer Conference

14 Acceleration Case Studies and Performance Evaluation Basic Designs Hadoop Spark Memcached Advanced Designs Memcached with Hybrid Memory and Non-blocking APIs Efficient Indexing with RDMA-HBase TensorFlow with RDMA-gRPC Deep Learning over Big Data BigData + HPC Cloud Intel HPC Developer Conference

15 Design Overview of HDFS with RDMA Applications Design Features Others Java Socket Interface HDFS Write Java Native Interface (JNI) OSU Design Verbs RDMA-based HDFS write RDMA-based HDFS replication Parallel replication support On-demand connection setup 1/10/40/100 GigE, IPoIB Network RDMA Capable Networks (IB, iwarp, RoCE..) InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Java based HDFS with communication library written in native code N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012 N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014 Intel HPC Developer Conference

Enhanced HDFS with In-Memory and Heterogeneous Storage Applications Triple-H Data Placement Policies Hybrid Replication Eviction/Promotion Heterogeneous Storage RAM Disk SSD HDD Lustre Design

16 Enhanced HDFS with In-Memory and Heterogeneous Storage Applications Triple-H Data Placement Policies Hybrid Replication Eviction/Promotion Heterogeneous Storage RAM Disk SSD HDD Lustre Design Features Three modes Default (HHH) In-Memory (HHH-M) Lustre-Integrated (HHH-L) Policies to efficiently utilize the heterogeneous storage devices RAM, SSD, HDD, Lustre Eviction/Promotion based on data usage pattern Hybrid Replication Lustre-Integrated mode: Lustre-based fault-tolerance N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid 15, May 2015 Intel HPC Developer Conference

17 Design Overview of MapReduce with RDMA MapReduce Job Tracker Java Socket Interface 1/10/40/100 GigE, IPoIB Network Applications Task Tracker Map Reduce Java Native Interface (JNI) OSU Design Verbs RDMA Capable Networks (IB, iwarp, RoCE..) Design Features RDMA-based shuffle Prefetching and caching map output Efficient Shuffle Algorithms In-memory merge On-demand Shuffle Adjustment Advanced overlapping map, shuffle, and merge shuffle, merge, and reduce On-demand connection setup InfiniBand/RoCE support Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Java based MapReduce with communication library written in native code M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014 Intel HPC Developer Conference

Intel HPC Developer Conference 2017 18 Performance Numbers of RDMA for Apache Hadoop 2.

18 Intel HPC Developer Conference Performance Numbers of RDMA for Apache Hadoop 2.x RandomWriter & TeraGen in OSU-RI2 (EDR) Execution Time (s) IPoIB (EDR) OSU-IB (EDR) Data Size (GB) Reduced by 3x Execution Time (s) IPoIB (EDR) OSU-IB (EDR) RandomWriter Cluster with 8 Nodes with a total of 64 maps Data Size (GB) TeraGen Reduced by 4x RandomWriter 3x improvement over IPoIB for GB file size TeraGen 4x improvement over IPoIB for GB file size

19 Execution Time (s) Performance Numbers of RDMA for Apache Hadoop 2.x Sort & TeraSort in OSU-RI2 (EDR) Sort IPoIB (EDR) OSU-IB (EDR) Data Size (GB) 61% improvement over IPoIB for GB data Reduced by 61% Sort Cluster with 8 Nodes with a total of 64 maps and 14 reduces Execution Time (s) TeraSort IPoIB (EDR) OSU-IB (EDR) TeraSort Cluster with 8 Nodes with a total of 64 maps and 32 reduces 18% improvement over IPoIB for GB data Reduced by 18% Data Size (GB) Intel HPC Developer Conference

Design Overview of Spark with RDMA Apache Spark Benchmarks/Applications/Libraries/Frameworks Netty Server Spark Core Shuffle Manager (Sort, Hash, Tungsten-Sort) Block Transfer Service (Netty, NIO,

20 Design Overview of Spark with RDMA Apache Spark Benchmarks/Applications/Libraries/Frameworks Netty Server Spark Core Shuffle Manager (Sort, Hash, Tungsten-Sort) Block Transfer Service (Netty, NIO, RDMA-Plugin) NIO Server Java Socket Interface RDMA Server Netty Client NIO Client RDMA Client Java Native Interface (JNI) Native RDMA-based Comm. Engine Design Features RDMA based shuffle plugin SEDA-based architecture Dynamic connection management and sharing Non-blocking data transfer Off-JVM-heap buffer management InfiniBand/RoCE support 1/10/40/100 GigE, IPoIB Network Enables high performance RDMA communication, while supporting traditional socket interface JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData 16, Dec RDMA Capable Networks (IB, iwarp, RoCE..) Intel HPC Developer Conference

21 Performance Evaluation on SDSC Comet HiBench PageRank Time (sec) IPoIB RDMA Huge BigData Gigantic Data Size (GB) 37% 32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R) RDMA-based design for Spark RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) Intel HPC Developer Conference Time (sec) IPoIB RDMA Huge BigData Gigantic Data Size (GB) 43%

Time (us) 1000 100 Memcached Performance (FDR Interconnect) 10 1 Memcached GET Latency OSU-IB (FDR) 1 2 4 8 16 32 64 128 256 512 1K 2K 4K Memcached Get latency Latency Reduced by nearly 20X Message

22 Time (us) Memcached Performance (FDR Interconnect) 10 1 Memcached GET Latency OSU-IB (FDR) K 2K 4K Memcached Get latency Latency Reduced by nearly 20X Message Size Thousands of Transactions per Second (TPS) 4 bytes OSU-IB: 2.84 us; IPoIB: us, 2K bytes OSU-IB: 4.49 us; IPoIB: us Memcached Throughput (4bytes) Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) 0 Memcached Throughput No. of Clients 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s, Nearly 2X improvement in throughput J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP 11 J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid 12 2X Intel HPC Developer Conference

23 Acceleration Case Studies and Performance Evaluation Basic Designs Hadoop Spark Memcached Advanced Designs Memcached with Hybrid Memory and Non-blocking APIs Efficient Indexing with RDMA-HBase TensorFlow with RDMA-gRPC Deep Learning over Big Data BigData + HPC Cloud Intel HPC Developer Conference

24 Performance Evaluation on IB FDR + SATA/NVMe SSDs (Hybrid Memory) slab allocation (SSD write) cache check+load (SSD read) cache update server response client wait miss-penalty Latency (us) Set Get Set Get Set Get Set Get Set Get Set Get Set Get Set Get IPoIB-Mem RDMA-Mem RDMA-Hybrid-SATA RDMA-Hybrid- NVMe IPoIB-Mem RDMA-Mem RDMA-Hybrid-SATA RDMA-Hybrid- NVMe Data Fits In Memory Data Does Not Fit In Memory Memcached latency test with Zipf distribution, server with 1 GB memory, 32 KB key-value pair size, total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory) When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x over IPoIB/RDMA-Mem Intel HPC Developer Conference

25 Average Latency (us) Performance Evaluation with Non-Blocking Memcached API Set Get Set Get Set Get Set Get Set Get Set Get Data does not fit in memory: Non-blocking Memcached Set/Get API Extensions can achieve >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid Data fits in memory: Non-blocking Extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x improvement over IPoIB-Mem Miss Penalty (Backend DB Access Overhead) Client Wait Server Response Cache Update Cache check+load (Memory and/or SSD read) Slab Allocation (w/ SSD write on Out-of-Mem) IPoIB-Mem RDMA-Mem H-RDMA-Def H-RDMA-Opt-Block H-RDMA-Opt-NonBb H-RDMA-Opt-NonB- H = Hybrid Memcached over SATA SSD Opt = Adaptive slab manager Block = Default Blocking API NonB-i = Non-blocking iset/iget API NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee Intel HPC Developer Conference

Agg. Throughput (MBps) Performance Evaluation with Boldio for Lustre + Burst-Buffer 70000.00 60000.00 50000.00 40000.00 30000.00 20000.00 10000.00 0.

throughput and 7x for read throughput execution time of Hadoop benchmarks over Lustre, e.g. Wordcount, Cloudburst by >21% Contrasting with Alluxio (formerly Tachyon) Performance degrades about 15x when Alluxio cannot leverage local storage (Alluxio-Local vs.

26 Agg. Throughput (MBps) Performance Evaluation with Boldio for Lustre + Burst-Buffer Lustre-Direct Alluxio-Remote ~3x InfiniBand QDR, 24GB RAM + PCIe-SSDs, 12 nodes, 32/48 Map/Reduce Tasks, 4-node Memcached cluster Boldio can improve throughput over Lustre by about 3x for write throughput and 7x for read throughput execution time of Hadoop benchmarks over Lustre, e.g. Wordcount, Cloudburst by >21% Contrasting with Alluxio (formerly Tachyon) Performance degrades about 15x when Alluxio cannot leverage local storage (Alluxio-Local vs. Alluxio-Remote) Boldio can improve throughput over Alluxio with all remote workers by about 3.5x - 8.8x (Alluxio-Remote vs. Boldio) D. Shankar, X. Lu, D. K. Panda, Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O, IEEE Big Data GB 40 GB 20 GB 40 GB Write Alluxio-Local Boldio DFSIO Throughput Read ~7x Latency (sec) Lustre-Direct Alluxio-Remote Boldio 21% WordCount InvIndx CloudBurst Spark TeraGen Hadoop/Spark Workloads Intel HPC Developer Conference

27 Accelerating Indexing Techniques on HBase with RDMA Challenges Operations on Distributed Ordered Table (DOT) with indexing techniques are network intensive Additional overhead of creating and maintaining secondary indices Can RDMA benefit indexing techniques (Apache Phoenix and CCIndex) on HBase? Results Evaluation with Apache Phoenix and CCIndex Up to 2x improvement in query throughput Up to 35% reduction in application workload execution time Collaboration with Institute of Computing Technology, Chinese Academy of Sciences S. Gugnani, X. Lu, L. Zha, and D. K. Panda, Characterizing and Accelerating Indexing Techniques on Distributed Ordered Tables, IEEE BigData, Intel HPC Developer Conference THROUGHPUT EXECUTION TIME TPC-H Query Benchmarks HBase RDMA-HBase HBase-Phoenix RDMA-HBase-Phoenix HBase-CCIndex RDMA-HBase-CCIndex Query1 Increased by 2x Query2 Ad Master Application Workloads HBase-Phoenix RDMA-HBase-Phoenix HBase-CCIndex RDMA-HBase-CCIndex Reduced by 35% Workload1 Workload2 Workload3 Workload4

28 Intel HPC Developer Conference Overview of RDMA-gRPC with TensorFlow Worker /job:ps/task:0 CPU GPU RDMA grpc server/ client Client Master RDMA grpc server/ client Worker /job:worker/task:0 CPU GPU Worker services communicate among each other using RDMA-gRPC

29 Time (Seconds) Performance Benefit for TensorFlow 0 Default grpc grpc + Verbs grpc + MPI OSU RDMA-gRPC2 sigmoid net TensorFlow performance evaluation on RI2 Up to 19% performance speedup over IPoIB for Sigmoid net (20 epochs). Up to 35% and 30% performance speedup over IPoIB for resnet50 and Inception3 (batch size 8). 2 Nodes (2 GPUS) Images / Second Default grpcresnet50 grpc + Verbs grpc + MPI OSU-RDMA-gRPC 2 Nodes Intel HPC Developer Conference CNN inception3 R. Biswas, X. Lu, and D. K. Panda, Accelerating grpc and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review.

30 High-Performance Deep Learning over Big Data (DLoBD) Stacks Challenges of Deep Learning over Big Data (DLoBD) Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs? What are the performance characteristics of representative DLoBD stacks on RDMA networks? Characterization on DLoBD Stacks CaffeOnSpark, TensorFlowOnSpark, and BigDL IPoIB vs. RDMA; In-band communication vs. Outof-band communication; CPU vs. GPU; etc. Performance, accuracy, scalability, and resource utilization RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB based scheme, while maintain similar accuracy Epochs Time (secs) IPoIB-Time RDMA-Time IPoIB-Accuracy RDMA-Accuracy Epoch Number 2.6X Accuracy (%) X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI Intel HPC Developer Conference

31 Acceleration Case Studies and Performance Evaluation Basic Designs Hadoop Spark Memcached Advanced Designs Memcached with Hybrid Memory and Non-blocking APIs Efficient Indexing with RDMA-HBase TensorFlow with RDMA-gRPC Deep Learning over Big Data BigData + HPC Cloud Intel HPC Developer Conference

32 Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand Challenges Design Existing designs in Hadoop not virtualizationaware No support for automatic topology detection Automatic Topology Detection using MapReduce-based utility Requires no user input Can detect topology changes during runtime without affecting running jobs Virtualization and topology-aware 0 communication through map task scheduling and Default Distributed Default Distributed Mode Mode Mode Mode YARN container allocation policy extensions CloudBurst Self-join S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom 16, December 2016 Intel HPC Developer Conference EXECUTION TIME EXECUTION TIME Hadoop Benchmarks RDMA-Hadoop Hadoop-Virt 40 GB 60 GB 40 GB 60 GB 40 GB 60 GB Sort WordCount PageRank Hadoop Applications RDMA-Hadoop Hadoop-Virt Reduced by 34% Reduced by 55%

33 Intel HPC Developer Conference Concluding Remarks Discussed challenges in accelerating Big Data middleware with HPC technologies Presented basic and advanced designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase, Memcached, Spark, grpc, and TensorFlow Results are promising Many other open issues need to be solved Will enable Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner Looking forward to collaboration with the community

34 Intel HPC Developer Conference OSU Participating at Multiple Events on BigData Acceleration Tutorial BoF Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management (Sunday, 1:30-5:00 pm, Room #201) BigData and Deep Learning (Tuesday, 5:15-6:45pm, Room #702) SigHPC Big Data BoF (Wednesday, 12:15-1:15pm, Room #603) Clouds for HPC, Big Data, and Deep Learning (Wednesday, 5:15-7:00pm, Room #701) Booth Talks OSU Booth (Tuesday, 10:00-11:00am, Booth #1875) Mellanox Theater (Wednesday, 3:00-3:30pm, Booth #653) OSU Booth (Thursday, 1:00-2:00pm, Booth #1875) Student Poster Presentation Accelerating Big Data processing in Cloud (Tuesday, 5:15-7:00pm, Four Seasons Ballroom) Details at

35 Intel HPC Developer Conference Funding Acknowledgments Funding Support by Equipment Support by

36 Personnel Acknowledgments Current Students A. Awan (Ph.D.) M. Bayatpour (Ph.D.) S. Chakraborthy (Ph.D.) C.-H. Chu (Ph.D.) S. Guganani (Ph.D.) J. Hashmi (Ph.D.) N. Islam (Ph.D.) M. Li (Ph.D.) M. Rahman (Ph.D.) D. Shankar (Ph.D.) A. Venkatesh (Ph.D.) J. Zhang (Ph.D.) Current Research Scientists X. Lu H. Subramoni Current Post-doc A. Ruhela Current Research Specialist J. Smith M. Arnold Past Students A. Augustine (M.S.) P. Balaji (Ph.D.) S. Bhagvat (M.S.) A. Bhat (M.S.) D. Buntinas (Ph.D.) L. Chai (Ph.D.) B. Chandrasekharan (M.S.) N. Dandapanthula (M.S.) V. Dhanraj (M.S.) T. Gangadharappa (M.S.) K. Gopalakrishnan (M.S.) W. Huang (Ph.D.) W. Jiang (M.S.) J. Jose (Ph.D.) S. Kini (M.S.) M. Koop (Ph.D.) K. Kulkarni (M.S.) R. Kumar (M.S.) S. Krishnamoorthy (M.S.) K. Kandalla (Ph.D.) P. Lai (M.S.) J. Liu (Ph.D.) M. Luo (Ph.D.) A. Mamidala (Ph.D.) G. Marsh (M.S.) V. Meshram (M.S.) A. Moody (M.S.) S. Naravula (Ph.D.) R. Noronha (Ph.D.) X. Ouyang (Ph.D.) S. Pai (M.S.) S. Potluri (Ph.D.) R. Rajachandrasekar (Ph.D.) G. Santhanaraman (Ph.D.) A. Singh (Ph.D.) J. Sridhar (M.S.) S. Sur (Ph.D.) H. Subramoni (Ph.D.) K. Vaidyanathan (Ph.D.) A. Vishnu (Ph.D.) J. Wu (Ph.D.) W. Yu (Ph.D.) Past Research Scientist K. Hamidouche S. Sur Past Programmers D. Bureddy J. Perkins Past Post-Docs D. Banerjee X. Besseron H.-W. Jin J. Lin M. Luo E. Mancini S. Marcarelli J. Vienne H. Wang Intel HPC Developer Conference

37 Intel HPC Developer Conference Thank You! {panda, Network-Based Computing Laboratory The High-Performance Big Data Project

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management

Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management SigHPC BigData BoF (SC 17) by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu