IIT, Chicago, November 4, 2011


IIT, Chicago, November 4, 2011 SALSA HPC Group http://salsahpc.indiana.edu Indiana University

A New Book from Morgan Kaufmann Publishers, an imprint of Elsevier, Inc., Burlington, MA 01803, USA (ISBN: 9780123858801): Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, by Kai Hwang, Geoffrey Fox, and Jack Dongarra.

Projects and funding:
- Twister: Bingjing Zhang, Richard Teng. Funded by Microsoft, Indiana University's Faculty Research Support Program, and NSF Grant OCI-1032677.
- Twister4Azure: Thilina Gunarathne. Funded by Microsoft.
- High-Performance Visualization Algorithms for Data-Intensive Analysis: Seung-Hee Bae and Jong Youl Choi. Funded by NIH Grant 1RC2HG005806-01.
- DryadLINQ CTP Evaluation: Hui Li, Yang Ruan, and Yuduo Zhou. Funded by Microsoft.
- Cloud Storage, FutureGrid: Xiaoming Gao, Stephen Wu. Funded by Indiana University's Faculty Research Support Program and National Science Foundation Grant 0910812.
- Million Sequence Challenge: Saliya Ekanayake, Adam Hughes, Yang Ruan. Funded by NIH Grant 1RC2HG005806-01.
- Cyberinfrastructure for Remote Sensing of Ice Sheets: Jerome Mitchell. Funded by NSF Grant OCI-0636361.


Slide credit: Alex Szalay, The Johns Hopkins University

Paradigm Shift in Data Intensive Computing

Intel's Application Stack

SALSA software stack, from applications down to hardware:
- Applications: scientific simulations (data mining and data analysis); kernels, genomics, proteomics, information retrieval, polar science, scientific simulation data analysis and management, dissimilarity computation, clustering, multidimensional scaling, generative topographic mapping.
- Programming Model: cross-platform iterative MapReduce (collectives, fault tolerance, scheduling); high level language.
- Runtime: distributed file systems; security, provenance, portal; services and workflow.
- Storage Support: object store, data parallel file system.
- Infrastructure: Linux HPC bare-system, Amazon cloud, virtualization; Windows Server HPC bare-system, Azure cloud, virtualization; Grid Appliance.
- Hardware: CPU nodes, GPU nodes.


What are the challenges? Providing both cost effectiveness and powerful parallel programming paradigms that are capable of handling the incredible increases in dataset sizes (large-scale data analysis for data-intensive applications). These challenges must be met for both computation and storage: if computation and storage are separated, it is not possible to bring computing to data.
Research issues: portability between HPC and cloud systems; scaling performance; fault tolerance.
Data locality: its impact on performance; the factors that affect data locality; the maximum degree of data locality that can be achieved; factors beyond data locality that improve performance. Achieving the best data locality is not always the optimal scheduling decision. For instance, if the node where the input data of a task are stored is overloaded, running the task on it will result in performance degradation.
Task granularity and load balance: in MapReduce, task granularity is fixed. This mechanism has two drawbacks: 1) a limited degree of concurrency, and 2) load imbalance resulting from the variation of task execution time.



Clouds hide complexity; cyberinfrastructure is "Research as a Service." SaaS: Software as a Service (e.g., clustering is a service). PaaS: Platform as a Service, i.e., IaaS plus core software capabilities on which you build SaaS (e.g., Azure is a PaaS; MapReduce is a platform). IaaS (HaaS): Infrastructure as a Service (get computer time with a credit card and a Web interface, like EC2).

Workshop logistics: Please sign and return your video waiver. Plan to arrive early to your session in order to copy your presentation to the conference PC. Poster drop-off is at Scholars Hall on Wednesday from 7:30 am to noon; please take your poster with you after the session on Wednesday evening.
Submissions by topic (chart of number of submissions, axis 0 to 100): Cloud/Grid architecture; Middleware frameworks; Cloud-based Services and Education; High-performance computing; Virtualization technologies; Security and Risk; Software as a Service (SaaS); Auditing, monitoring and scheduling; Web services; Load balancing; Optimal deployment configuration; Fault tolerance and reliability; Novel Programming Models for Large [systems]; Utility computing; Hardware as a Service (HaaS); Scalable Scheduling on Heterogeneous [architectures]; Autonomic Computing; Peer to peer computing; Data grid and Semantic web; New and Innovative Pedagogical Approaches; Scalable Fault Resilience Techniques for Large [systems]; IT Service and Relationship Management; Power-aware Profiling, Modeling, and [simulation]; Integration of Mainframe and Large Systems; Consistency models; Innovations in IP (esp. Open Source) Systems.

Gartner 2009 Hype Curve (source: Gartner, August 2009). Where does HPC fall on it?

Typical operation latencies:
L1 cache reference: 0.5 ns
Branch mispredict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 25 ns
Main memory reference: 100 ns
Compress 1K bytes with cheap compression algorithm: 3,000 ns
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from disk: 20,000,000 ns
Send packet CA -> Netherlands -> CA: 150,000,000 ns

Programming Models and Tools: MapReduce in a Heterogeneous Environment

Next Generation Sequencing Pipeline on Cloud. Pipeline: a FASTA file of N sequences goes through BLAST and block pairings into the pairwise distance calculation (MapReduce), producing a dissimilarity matrix of N(N-1)/2 values; this feeds pairwise clustering and MDS (MPI), and the results are visualized with PlotViz. Users submit their jobs to the pipeline and the results are shown in a visualization tool. This chart illustrates a hybrid model with MapReduce and MPI; Twister will be a unified solution for the pipeline model. The components are services, and so is the whole pipeline. We could research which stages of the pipeline are suitable for private or commercial clouds.

Motivation: the data deluge is being experienced in many domains. MapReduce offers a data-centered approach with QoS, while classic parallel runtimes (MPI) provide efficient and proven techniques. The goal is to expand the applicability of MapReduce to more classes of applications, moving from Map-Only and classic MapReduce to Iterative MapReduce and further extensions.

Twister features: distinction between static and variable data; configurable long-running (cacheable) map/reduce tasks; pub/sub messaging based communication and data transfers; a broker network for facilitating communication.

Main program's process space (the driver); worker nodes run the cacheable map/reduce tasks and read static data from local disk, with communications and data transfers via the pub/sub broker network and direct TCP:

    configureMaps(..)
    configureReduce(..)
    while(condition){
        runMapReduce(..)    // cacheable Map() and Reduce() tasks; may send <Key,Value> pairs directly
        updateCondition()
    } // end while
    close()

A Combine() operation returns reduce outputs to the main program, which may contain many MapReduce invocations or iterative MapReduce invocations.
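
To make the iterative driver loop above concrete, here is a small, self-contained Java sketch. This is not the actual Twister API: the class and the map/reduce structure below are illustrative stand-ins that mirror the configureMaps / runMapReduce / updateCondition pattern on the slide, using an in-memory K-means as the iterative workload (static data = cached points, variable data = centroids broadcast each iteration).

    import java.util.*;

    // Illustrative sketch only: mirrors a Twister-style iterative driver loop,
    // not the real Twister API.
    public class IterativeKMeansSketch {
        public static void main(String[] args) {
            // Static data: cached across iterations (configureMaps on the slide).
            double[][] points = { {1, 1}, {1.2, 0.8}, {8, 8}, {8.2, 7.9}, {0.9, 1.1}, {7.8, 8.1} };
            // Variable data: broadcast to the map tasks at each iteration.
            double[][] centroids = { {0, 0}, {10, 10} };

            double delta = Double.MAX_VALUE;
            int iteration = 0;
            while (delta > 1e-6 && iteration < 100) {               // while(condition)
                // "Map": assign each cached point to its nearest centroid.
                Map<Integer, List<double[]>> groups = new HashMap<>();
                for (double[] p : points) {
                    int nearest = 0;
                    for (int c = 1; c < centroids.length; c++) {
                        if (dist(p, centroids[c]) < dist(p, centroids[nearest])) nearest = c;
                    }
                    groups.computeIfAbsent(nearest, k -> new ArrayList<>()).add(p);
                }
                // "Reduce"/"Combine": recompute centroids from the grouped <key, value> pairs.
                double[][] updated = centroids.clone();
                for (Map.Entry<Integer, List<double[]>> e : groups.entrySet()) {
                    updated[e.getKey()] = mean(e.getValue());
                }
                // updateCondition(): measure centroid movement to decide whether to iterate again.
                delta = 0;
                for (int c = 0; c < centroids.length; c++) delta += dist(centroids[c], updated[c]);
                centroids = updated;
                iteration++;
            }
            System.out.println("Converged after " + iteration + " iterations: "
                    + Arrays.deepToString(centroids));
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        static double[] mean(List<double[]> pts) {
            double[] m = new double[pts.get(0).length];
            for (double[] p : pts) for (int i = 0; i < m.length; i++) m[i] += p[i];
            for (int i = 0; i < m.length; i++) m[i] /= pts.size();
            return m;
        }
    }

In a real Twister run the map tasks would be long-running and the cached points would stay resident on the workers between iterations; only the centroids would be re-broadcast.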

Twister architecture: the master node runs the Twister Driver and the main program; each worker node runs a Twister Daemon with a worker pool of cacheable map/reduce tasks and local disk storage. A pub/sub broker network connects them, with one broker serving several Twister daemons. Scripts perform data distribution, data collection, and partition file creation.


Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.

Twister4Azure iterative job flow: a new job enters the scheduling queue and the job starts; Map tasks run with Combine, followed by Reduce and Merge; if another iteration is needed it is scheduled, otherwise the job finishes. Worker roles host Map workers (Map 1..n) and Reduce workers (Reduce 1..n) with an in-memory data cache. Hybrid scheduling of the new iteration uses a job bulletin board together with the in-memory cache and execution history (a map task meta-data cache); leftover tasks that did not get scheduled through the bulletin board fall back to the queue.

Distributed, highly scalable and highly available cloud services as the building blocks. Utilize eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes. Decentralized architecture with global queue based dynamic task scheduling. Minimal management and maintenance overhead. Supports dynamically scaling the compute resources up and down. MapReduce fault tolerance.
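
A hedged sketch of the global-queue idea: Twister4Azure uses Azure Queues, but the same decentralized, pull-based dynamic scheduling can be illustrated with a plain Java concurrent queue that several workers drain at their own pace, so faster or idle workers automatically pick up more tasks. The Azure-specific details (visibility timeouts, table-based monitoring) are omitted; this is not the Twister4Azure implementation.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of decentralized, pull-based dynamic task scheduling from a global queue.
    public class GlobalQueueSchedulingSketch {
        public static void main(String[] args) throws InterruptedException {
            ConcurrentLinkedQueue<Integer> taskQueue = new ConcurrentLinkedQueue<>();
            for (int taskId = 0; taskId < 40; taskId++) taskQueue.add(taskId);   // enqueue map tasks

            int workers = 4;
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            AtomicInteger completed = new AtomicInteger();

            for (int w = 0; w < workers; w++) {
                final int workerId = w;
                pool.submit(() -> {
                    Integer task;
                    // Each worker pulls tasks until the global queue is empty:
                    // faster workers naturally process more tasks (dynamic load balance).
                    while ((task = taskQueue.poll()) != null) {
                        try {
                            Thread.sleep(10L * (workerId + 1));  // simulate uneven task durations
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                        System.out.printf("worker %d finished task %d%n", workerId, task);
                        completed.incrementAndGet();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println("Completed tasks: " + completed.get());
        }
    }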

Parallel efficiency of Twister4Azure, Amazon EMR, and Apache Hadoop for BLAST sequence search, CAP3 sequence assembly, and Smith-Waterman sequence alignment, plotted against the number of cores times the number of files (efficiency axis 50% to 100%).

Figure panels: task execution time histogram; number of executing map tasks histogram; strong scaling with 128M data points; weak scaling.

Figure panels: weak scaling; data size scaling; Azure instance type study; number of executing map tasks histogram.

Parallel Visualization Algorithms PlotViz

GTM versus MDS (SMACOF):
Purpose: both perform non-linear dimension reduction, finding an optimal configuration in a lower dimension via an iterative optimization method.
Input: GTM takes vector-based data; MDS takes non-vector input (a pairwise similarity matrix).
Objective function: GTM maximizes the log-likelihood; MDS minimizes STRESS or SSTRESS.
Complexity: GTM is O(KN) with K << N; MDS is O(N^2).
Optimization method: GTM uses EM; MDS (SMACOF) uses iterative majorization (EM-like).
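
For reference, the objective functions named in the comparison above are the standard STRESS and SSTRESS criteria from the MDS literature (standard definitions, not reproduced from the slides). With delta_ij the original pairwise dissimilarities, d_ij(X) the Euclidean distances of the low-dimensional configuration X, and w_ij optional weights:

    \sigma(X) = \sum_{i<j \le N} w_{ij} \, \bigl( d_{ij}(X) - \delta_{ij} \bigr)^{2}   \quad \text{(STRESS)}

    \sigma^{2}(X) = \sum_{i<j \le N} w_{ij} \, \bigl( d_{ij}(X)^{2} - \delta_{ij}^{2} \bigr)^{2}   \quad \text{(SSTRESS)}

SMACOF minimizes STRESS by iterative majorization, which is why the comparison describes its optimization method as EM-like.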

GTM / GTM-Interpolation software stack: Parallel HDF5, ScaLAPACK, MPI / MPI-IO, and a parallel file system on Cray / Linux / Windows clusters. GTM finds K clusters (latent points) for N data points; the relationship is a bipartite graph (bi-graph) represented by a K-by-N matrix (K << N). Decomposing the problem over a P-by-Q compute grid reduces the per-process memory requirement to 1/PQ of the total.

Parallel MDS: O(N^2) memory and computation are required; 100k data points need about 480 GB of memory. A balanced decomposition of the N x N matrices over a P-by-Q process grid reduces the memory and computing requirement per process to 1/PQ, with communication via MPI primitives.
MDS Interpolation: finds an approximate mapping position with respect to the prior mappings of a point's k nearest neighbors. Per point it requires O(M) memory and O(k) computation and is pleasingly parallel. Mapping 2M points took 1,450 seconds versus 27,000 seconds for 100k points with full MDS, roughly 7,500 times faster than the estimated cost of running the full MDS.
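
A rough back-of-the-envelope check of the quoted memory figure, assuming double-precision storage; the factor of six N x N arrays is an assumption consistent with the 480 GB number, not something stated on the slide:

    N = 10^{5}: \quad 8\ \text{bytes} \times N^{2} = 80\ \text{GB per } N \times N \text{ matrix}, \qquad 6 \times 80\ \text{GB} \approx 480\ \text{GB}

    \text{With a } P \times Q \text{ decomposition, each of the } PQ \text{ processes holds} \approx \frac{8 N^{2}}{PQ} \ \text{bytes per matrix.}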

Full data processing by GTM or MDS is computing- and memory-intensive, so a two-step procedure is used over the total of N data points: Training, in which the model is trained with M in-sample points out of the N data (using MPI or Twister), and Interpolation, in which the remaining N-M out-of-sample points are approximated without training (using Twister) to produce the interpolated map.

Chemical compounds reported in the literature, visualized by MDS (left) and GTM (right): 234,000 chemical compounds that may be related to a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2), based on a dataset collected from major journal literature and also stored in the Chem2Bio2RDF system. PubChem data with CTD visualization by MDS (left) and GTM (right): about 930,000 chemical compounds are visualized as points in 3D space, annotated by the related genes in the Comparative Toxicogenomics Database (CTD).

This demo shows real-time visualization of a running multidimensional scaling (MDS) calculation. We use Twister to do the parallel calculation inside the cluster and PlotViz to show the intermediate results on the user's client computer. The process of computation and monitoring is automated by the program.

MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children's Research Institute

Twister-MDS monitoring flow: (I) the client node sends a message through the ActiveMQ broker to the master node (Twister Driver running Twister-MDS) to start the job; (II) the master node sends intermediate results back through the broker; (III) the MDS Monitor on the client node writes the data to local disk; (IV) PlotViz reads the data and renders it.
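
A minimal JMS/ActiveMQ sketch of step II, publishing intermediate MDS results from the driver side to a topic that the client-side monitor subscribes to. The broker URL, topic name, and message payload are illustrative assumptions, not the actual Twister-MDS implementation.

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import javax.jms.Topic;
    import org.apache.activemq.ActiveMQConnectionFactory;

    // Illustrative publisher: sends intermediate MDS coordinates to an ActiveMQ topic.
    // Broker URL, topic name, and payload format are assumptions for this sketch.
    public class MdsResultPublisherSketch {
        public static void main(String[] args) throws Exception {
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("mds.intermediate.results");   // hypothetical topic
            MessageProducer producer = session.createProducer(topic);

            // In the real pipeline this would be the current 3D coordinates after an iteration;
            // here we send a tiny placeholder payload.
            TextMessage message = session.createTextMessage("iteration=1; point0=0.1,0.2,0.3");
            producer.send(message);

            producer.close();
            session.close();
            connection.close();
        }
    }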

Twister-MDS architecture: the master node runs the Twister Driver and Twister-MDS, connected through the pub/sub broker network to the MDS output monitoring interface and to the worker nodes. Each worker node runs a Twister Daemon with a worker pool of map and reduce tasks for the two MDS steps, calculateBC and calculateStress.

Gene sequence pipeline (N = 1 million sequences): select a reference sequence set of M = 100K sequences; run pairwise alignment and distance calculation on the reference set to obtain a distance matrix; run multidimensional scaling (MDS), which is O(N^2), on the reference set to get reference coordinates (x, y, z); for the remaining N - M sequences (900K), run interpolative MDS with pairwise distance calculation against the reference set to obtain their coordinates; finally visualize all coordinates as a 3D plot.
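
A simplified illustration of the interpolation step for the 900K out-of-sample sequences: each new point is placed using only its k nearest reference sequences, which is what makes the step pleasingly parallel (each point depends only on the fixed reference coordinates). The real MDS interpolation minimizes a small per-point STRESS term; the sketch below substitutes a distance-weighted average of the k nearest reference coordinates, an assumption made purely for brevity.

    import java.util.Arrays;
    import java.util.Comparator;

    // Simplified out-of-sample placement: weighted average of the k nearest reference
    // points' 3D coordinates. Stand-in for MDS interpolation, which instead minimizes a
    // per-point STRESS term against the k reference distances.
    public class InterpolationSketch {
        public static double[] place(double[] distToRefs, double[][] refCoords, int k) {
            Integer[] order = new Integer[distToRefs.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, Comparator.comparingDouble(i -> distToRefs[i]));

            double[] result = new double[3];
            double weightSum = 0;
            for (int n = 0; n < k; n++) {
                int ref = order[n];
                double w = 1.0 / (distToRefs[ref] + 1e-9);       // closer references weigh more
                for (int d = 0; d < 3; d++) result[d] += w * refCoords[ref][d];
                weightSum += w;
            }
            for (int d = 0; d < 3; d++) result[d] /= weightSum;
            return result;   // no dependence on other out-of-sample points
        }

        public static void main(String[] args) {
            double[][] refCoords = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0}, {5, 5, 5} };
            double[] distances = { 0.2, 0.3, 0.4, 6.0 };          // distances to the reference set
            System.out.println(Arrays.toString(place(distances, refCoords, 3)));
        }
    }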

Broker network topologies tested with Twister Daemon nodes, ActiveMQ broker nodes, and the Twister Driver node: one configuration with 7 brokers and 32 computing nodes in total, and another with 5 brokers and 4 computing nodes in total, showing the broker-driver, broker-daemon, and broker-broker connections.

Twister-MDS total execution time in seconds (100 iterations, 40 nodes) under different input data sizes, comparing the original configuration (1 broker only) with the current configuration (7 brokers, the best broker number):
38,400 data points: 189.288 s vs 148.805 s
51,200 data points: 359.625 s vs 303.432 s
76,800 data points: 816.364 s vs 737.073 s
102,400 data points: 1508.487 s vs 1404.431 s

Broadcasting time in seconds for different data sizes (in Method C, the centroids are split into 160 blocks and sent through 40 brokers in 4 rounds). At 400 MB, 600 MB, and 800 MB respectively: Method C takes 13.07, 18.79, and 24.50 seconds; Method B takes 46.19, 70.56, and 93.14 seconds.

Twister new architecture: the master node runs the Twister Driver and a broker; each worker node runs a Twister Daemon and a broker. Configure Mapper adds data to the MemCache through a map broadcasting chain; map, reduce, and merge run as cacheable tasks; reduce output returns through a collection chain.

Chain/ring broadcasting among Twister Daemon nodes, driven by the Twister Driver node. Driver sender: send a broadcasting data block, wait for the acknowledgement, then send the next block. Daemon sender: receive data from the previous daemon (or the driver), cache the data at the daemon, send the data to the next daemon (waiting for its ACK), and send an acknowledgement back to the previous daemon.

Chain broadcasting protocol (sequence between the Driver and Daemons 0, 1, 2): the driver sends a cache block and waits for an acknowledgement; each daemon receives the block, handles the data, forwards it to the next daemon in the chain, and sends an ACK back; the last daemon recognizes the end of the daemon chain; the exchange repeats per block until the end of the cache blocks.
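
To make the chain protocol concrete, here is a small, self-contained Java simulation of pipelined chain broadcasting: the driver pushes data blocks to daemon 0, each daemon caches a block and forwards it to the next daemon, and later blocks flow through the chain while earlier ones are still being relayed. Threads and bounded blocking queues stand in for the TCP connections and per-block acknowledgements of the real protocol; this is a sketch, not Twister's implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Simulation of pipelined chain broadcasting: driver -> daemon 0 -> daemon 1 -> ... -> daemon N-1.
    public class ChainBroadcastSketch {
        private static final String END = "__END_OF_BROADCAST__";   // end-of-cache-block marker

        public static void main(String[] args) throws InterruptedException {
            int daemons = 4;
            int blocks = 8;

            // links.get(i) carries data from node i-1 (or the driver for i == 0) to daemon i.
            List<BlockingQueue<String>> links = new ArrayList<>();
            for (int i = 0; i < daemons; i++) links.add(new ArrayBlockingQueue<>(2));

            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < daemons; i++) {
                final int id = i;
                Thread t = new Thread(() -> {
                    List<String> cache = new ArrayList<>();            // data cached at this daemon
                    try {
                        while (true) {
                            String block = links.get(id).take();       // receive from the previous hop
                            if (id + 1 < daemons) links.get(id + 1).put(block);   // forward down the chain
                            if (END.equals(block)) break;
                            cache.add(block);                          // cache the data locally
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.printf("daemon %d cached %d blocks%n", id, cache.size());
                });
                threads.add(t);
                t.start();
            }

            // Driver: send blocks one after another; the bounded queues provide back-pressure,
            // playing the role of the per-block acknowledgements in the protocol.
            for (int b = 0; b < blocks; b++) links.get(0).put("block-" + b);
            links.get(0).put(END);

            for (Thread t : threads) t.join();
        }
    }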

Broadcasting time comparison on 80 nodes with 600 MB of data in 160 pieces: broadcast time in seconds over 50 executions, comparing chain broadcasting against all-to-all broadcasting with 40 brokers.

Applications and different interconnection patterns:
- Map Only (input, map, output): CAP3 analysis, document conversion (PDF to HTML), brute-force searches in cryptography, parametric sweeps; e.g., CAP3 gene assembly and PolarGrid MATLAB data analysis.
- Classic MapReduce (input, map, reduce): High Energy Physics (HEP) histograms, SWG gene alignment, distributed search, distributed sorting, information retrieval; e.g., information retrieval, HEP data analysis, calculation of pairwise distances for ALU sequences.
- Iterative Reductions, Twister (input, map, reduce, iterations): expectation maximization algorithms, clustering, linear algebra; e.g., K-means, deterministic annealing clustering, multidimensional scaling (MDS).
- Loosely Synchronous (Pij): many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions; e.g., solving differential equations and particle dynamics with short-range forces.
The first three patterns are the domain of MapReduce and its iterative extensions; the last is the domain of MPI.

Future work: development of a library of collectives to use at the Reduce phase (broadcast and gather are needed by current applications; discover other important ones; implement them efficiently on each platform, especially Azure). Better software message routing with broker networks, using asynchronous I/O with communication fault tolerance. Support nearby location of data and computing using data parallel file systems. A clearer application fault tolerance model based on implicit synchronization points at iteration end points. Later: investigate GPU support, and a runtime for data parallel languages like Sawzall, Pig Latin, and LINQ.

Convergence is happening among data-intensive paradigms, clouds, and multicore. Data-intensive applications involve three basic activities: capture, curation, and analysis (visualization). Clouds supply the infrastructure and runtime; multicore supplies parallel threading and processes.

FutureGrid: a Grid Testbed. IU Cray operational; IU IBM (iDataPlex) completed stability test May 6; UCSD IBM operational; UF IBM stability test completes ~May 12; network, NID, and PU HTC system operational; UC IBM stability test completes ~May 27; TACC Dell awaiting delivery of components. The testbed spans private and public resources over the FG network (NID: Network Impairment Device).

SALSAHPC Dynamic Virtual Cluster on FutureGrid, demo at SC09: demonstrates the concept of Science on Clouds on FutureGrid. The dynamic cluster architecture includes a monitoring and control infrastructure (pub/sub broker network, monitoring interface, summarizer, switcher) and SW-G running on Hadoop (Linux bare-system and Linux on Xen) and on DryadLINQ (Windows Server 2008 bare-system), provisioned by the XCAT infrastructure on iDataPlex bare-metal nodes (32 nodes) as virtual or physical clusters. Clusters are switchable on the same hardware (roughly 5 minutes to move between different OS configurations such as Linux+Xen and Windows+HPCS), with support for virtual clusters. SW-G (Smith-Waterman-Gotoh) dissimilarity computation is a pleasingly parallel problem suitable for MapReduce-style applications.

SALSAHPC Dynamic Virtual Cluster on FutureGrid, demo at SC09: demonstrates the concept of Science on Clouds using a FutureGrid iDataPlex cluster. Top: three clusters switch applications on a fixed environment, taking approximately 30 seconds. Bottom: a cluster switches between environments (Linux; Linux+Xen; Windows+HPCS), taking approximately 7 minutes.

Experimenting with a Lucene Index on HBase in an HPC Environment. Background: data-intensive computing requires storage solutions for huge amounts of data. One proposed solution: HBase, the Hadoop implementation of Google's BigTable.

System design. Table schemas:
- title index table: <term value> --> {frequencies: [<doc id>, <doc id>, ...]}
- texts index table: <term value> --> {frequencies: [<doc id>, <doc id>, ...]}
- texts term position vector table: <term value> --> {positions: [<doc id>, <doc id>, ...]}
Benefits: natural integration with HBase; reliable and scalable index data storage; real-time document addition and deletion; MapReduce programs for building the index and analyzing index data.
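
A hedged sketch of how one row of the texts index table could be written with the HBase Java client. The table name and column-family name follow the schema above; the use of the modern client API (ConnectionFactory, Table, Put.addColumn) rather than the 2011-era HTable API is an assumption made for readability.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: store one posting of the inverted index, keyed by term value, with one
    // column per document id in the "frequencies" family (schema as described above).
    public class LuceneOnHBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table textsIndex = connection.getTable(TableName.valueOf("textsIndexTable"))) {

                String term = "mapreduce";          // row key: the term value
                Put put = new Put(Bytes.toBytes(term));
                // frequencies:<doc id> -> term frequency of this term in that document
                put.addColumn(Bytes.toBytes("frequencies"), Bytes.toBytes("doc42"), Bytes.toBytes(3L));
                put.addColumn(Bytes.toBytes("frequencies"), Bytes.toBytes("doc97"), Bytes.toBytes(1L));
                textsIndex.put(put);
            }
        }
    }

A MapReduce index-building job would emit many such Puts, one reducer call per term, which is what gives the design its scalability.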

System implementation: experiments were completed on the Alamo HPC cluster of FutureGrid, using a MyHadoop to MyHBase deployment workflow.

Index data analysis

SALSA HPC Group Indiana University http://salsahpc.indiana.edu