ADR and DataCutter
Sergey Koren
CMSC818S
Thursday, March 4th, 2004

Active Data Repository
- Used for building parallel databases from multi-dimensional data sets
- Integrates storage, retrieval, and processing of multi-dimensional data sets
- Customization for application-specific processing (Initialize, Map, Aggregate, Output)
- Supports common operations such as memory management, data retrieval, and scheduling

Application Properties
- Irregular multi-dimensional datasets (meshes, sensor data)
- Input and output are often disk resident
- Applications use a subset of the data, accessed through a range query (bounding box)
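
A range query here is just an axis-aligned bounding box. A minimal C++ sketch (illustrative names, not the ADR API) of selecting the chunks whose MBRs intersect a query box:

```cpp
#include <vector>

struct MBR {
    double lo[2], hi[2];  // 2-D for brevity; ADR handles higher dimensions
};

// True when two axis-aligned boxes overlap in every dimension.
bool intersects(const MBR& a, const MBR& b) {
    for (int d = 0; d < 2; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

// Select the chunks whose MBRs intersect the query's bounding box.
std::vector<int> selectChunks(const std::vector<MBR>& chunks, const MBR& query) {
    std::vector<int> hits;
    for (int i = 0; i < (int)chunks.size(); ++i)
        if (intersects(chunks[i], query)) hits.push_back(i);
    return hits;
}
```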

Application Processing
- Retrieve the data that matches the range query
- Map input elements to output elements
- An accumulator holds the intermediate result
- Aggregate each input value with the intermediate result
- The aggregation operation is usually commutative and associative, so values can be aggregated in any order
- The output is usually much smaller than the input, so these processing steps are called a reduction

Processing Loop (figure: the ADR query processing loop)
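
The loop in the figure can be sketched in C++ with hypothetical stand-ins for the user-supplied customization points (Initialize, Map, Aggregate, Output); sum is used as the example aggregate:

```cpp
#include <cstddef>
#include <vector>

struct Element { double coord[2]; double value; };
using Accumulator = std::vector<double>;

// Hypothetical user-supplied operations; real ADR applications register these.
Accumulator initialize(std::size_t n) { return Accumulator(n, 0.0); }
std::size_t mapToOutput(const Element& e) {
    // Project input coordinates to an output cell; assumes the result
    // lands inside the output for this sketch.
    return static_cast<std::size_t>(e.coord[0]);
}
void aggregate(Accumulator& acc, std::size_t cell, const Element& e) {
    acc[cell] += e.value;  // commutative and associative, so order is free
}
void output(const Accumulator&) { /* convert accumulator to final output */ }

void processQuery(const std::vector<std::vector<Element>>& matchingChunks,
                  std::size_t outputSize) {
    Accumulator acc = initialize(outputSize);   // Initialize
    for (const auto& chunk : matchingChunks)    // chunks retrieved by the range query
        for (const auto& e : chunk)
            aggregate(acc, mapToOutput(e), e);  // Map, then Aggregate
    output(acc);                                // Output
}
```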

Architecture
- Front-end: interacts with clients and forwards range queries to the back-end
- Parallel back-end: retrieves the data items that intersect the query and performs user-defined functions to generate output

Services
- Attribute Space Service: manages the use of attribute spaces and user map functions
- Dataset Service: manages datasets stored on the ADR back-end
- Indexing Service: manages indexes (user-defined and ADR)
- Data Aggregation Service: manages user-provided functions for aggregation
- Query Planning Service: determines the query plan to process a set of queries
- Query Execution Service: manages resources and carries out the query plan

ADR Application Suite (figure)

Loading Data
- The user decomposes the data into chunks
- Placement information is computed
- Data is moved into place and an index is computed
- A Minimum Bounding Rectangle (MBR) is computed for each chunk
- Chunks are distributed across disks using a declustering algorithm
- An R-tree index is built over the MBRs
- Chunks are the units of access and are only read/written by the local processor
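
A sketch of the loading steps, under stated assumptions: the MBR computation is the real idea, but plain round-robin stands in for ADR's declustering algorithm, and the R-tree build is elided:

```cpp
#include <algorithm>
#include <vector>

struct Element { double coord[2]; };
struct Chunk { std::vector<Element> elems; };
struct MBR { double lo[2], hi[2]; };

// Compute the minimum bounding rectangle over a chunk's elements.
MBR computeMBR(const Chunk& c) {
    MBR m{{1e300, 1e300}, {-1e300, -1e300}};
    for (const auto& e : c.elems)
        for (int d = 0; d < 2; ++d) {
            m.lo[d] = std::min(m.lo[d], e.coord[d]);
            m.hi[d] = std::max(m.hi[d], e.coord[d]);
        }
    return m;
}

// Placeholder for declustering: a real algorithm tries to put spatially
// adjacent chunks on different disks so a range query hits many disks at once.
int placeChunk(int chunkIndex, int numDisks) {
    return chunkIndex % numDisks;
}
```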

Query Planning
- Tiling: if the output is too large to fit entirely into memory, each tile is a subset of the output chunks; this implicitly tiles the input chunks as well
- Workload partitioning: each processor is responsible for processing a subset of the input/accumulator chunks
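
A sketch of the tiling decision, assuming each output chunk individually fits in the memory budget; output chunks are packed greedily into tiles:

```cpp
#include <cstddef>
#include <vector>

// Group output chunks into tiles so each tile's accumulator fits in memory.
std::vector<std::vector<int>> tileOutput(const std::vector<std::size_t>& chunkBytes,
                                         std::size_t memoryBudget) {
    std::vector<std::vector<int>> tiles;
    std::size_t used = memoryBudget;  // forces a new tile on the first chunk
    for (int i = 0; i < (int)chunkBytes.size(); ++i) {
        if (used + chunkBytes[i] > memoryBudget) {  // current tile is full
            tiles.push_back({});
            used = 0;
        }
        tiles.back().push_back(i);
        used += chunkBytes[i];
    }
    return tiles;
}
```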

Query Execution
- Processing operates directly on the storage manager's buffers to avoid copying
- Four steps:
  1. Initialization: accumulator elements for the current tile are allocated
  2. Local Reduction: processors retrieve chunks on local disk and aggregate them into the accumulator
  3. Global Combine: partial results from step 2 are combined
  4. Output Handling: output is computed from the accumulator
- Disk, network, and processing operations are overlapped
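
A sketch of the four phases, using MPI as a stand-in for ADR's internal communication and sum for the user aggregate (the real system also overlaps these phases with I/O, which this sketch does not):

```cpp
#include <mpi.h>
#include <vector>

// Sum stands in for the user-defined aggregation; a non-trivial aggregate
// would need a custom MPI_Op (or ADR's own combine machinery).
void runTile(const std::vector<std::vector<double>>& localChunks, int accSize) {
    // 1. Initialization: allocate the accumulator elements for the current tile.
    std::vector<double> acc(accSize, 0.0);

    // 2. Local Reduction: aggregate chunks retrieved from local disks.
    for (const auto& chunk : localChunks)
        for (int i = 0; i < accSize && i < (int)chunk.size(); ++i)
            acc[i] += chunk[i];

    // 3. Global Combine: merge the partial results across processors.
    std::vector<double> combined(accSize, 0.0);
    MPI_Reduce(acc.data(), combined.data(), accSize,
               MPI_DOUBLE, MPI_SUM, /*root=*/0, MPI_COMM_WORLD);

    // 4. Output Handling: the root converts the accumulator to final output.
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) { /* write combined[] */ }
}
```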

Processing Strategies: FRA (Fully Replicated Accumulator)
- Each processor performs the processing associated with its local input chunks
- Output is partitioned among processors: when an output chunk is assigned to a tile, the corresponding accumulator chunk is put into the set of local chunks of the processor that owns the output chunk
- The accumulator chunk is assigned as a ghost chunk on all other processors
- This effectively replicates all accumulator chunks on all processors
- Each processor generates partial results using its input chunks

Processing Strategies (2): SRA (Sparsely Replicated Accumulator)
- FRA is wasteful of memory because it replicates all accumulator chunks, even those that no input chunks map to
- This adds overhead in the Initialization phase, and communication and computation overhead in the Global Combine phase
- SRA is the same as FRA, but a ghost chunk is allocated only if the processor has a local input chunk that projects to the accumulator chunk

Processing Strategies (3): DA (Distributed Accumulator)
- Each processor is responsible for the work associated with a set of output chunks
- Accumulator chunks are not replicated to remote processors
- Remote input chunks that map to local output chunks must be forwarded to the owning processor

Processing Strategies (4): Comparison
- DA avoids inter-processor communication for accumulator chunks in Initialization and for ghost chunks in the Global Combine
- FRA and SRA avoid inter-processor communication for input chunks
- DA introduces communication for input chunks in the Local Reduction phase

Results
- Communication volume and computation time per processor for DA decrease as the number of processors increases
- The Initialization and Global Combine phases for FRA and SRA remain constant
- DA has higher communication volume and more load imbalance as the number of processors and the data size increase
- DA's communication volume is proportional to the number of input chunks and the fan-out; FRA's is proportional to the number of output chunks

Current Version
- http://www.cs.umd.edu/projects/hpsl/chaos/researchareas/adr/
- Version 1.0, last updated in December 2000

DataCutter
- Indexing: works with multi-dimensional data
- Filter-based programming model
- Location independence
- Users define connectivity
- Filters communicate through uni-directional streams

Indexing
- Data files and index files
- Segments contain data tied to the multi-dimensional space
- Each segment has an MBR (Minimum Bounding Rectangle) that encompasses the data points in the segment
- Segments are the units of access

Indexing (cont.)
- R-trees are used, but a lot of data makes a single index large
- Both summary index files and detailed index files are used
- A summary index file is associated with multiple segments and detailed index files
- A detailed index file specifies segments
- Index files can be organized into groups
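
A sketch of the two-level lookup, with linear scans over MBRs standing in for the R-trees; all names are illustrative:

```cpp
#include <vector>

struct MBR { double lo[2], hi[2]; };

bool intersects(const MBR& a, const MBR& b) {
    for (int d = 0; d < 2; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

struct DetailedIndex { std::vector<MBR> segmentMBRs; };  // one MBR per segment
struct SummaryEntry  { MBR extent; const DetailedIndex* detail; };

// Consult the summary index first; only summary entries whose extent
// intersects the query cause their detailed index to be scanned.
std::vector<const MBR*> lookup(const std::vector<SummaryEntry>& summary,
                               const MBR& query) {
    std::vector<const MBR*> segments;
    for (const auto& s : summary) {
        if (!intersects(s.extent, query)) continue;  // prune a whole group
        for (const auto& seg : s.detail->segmentMBRs)
            if (intersects(seg, query))
                segments.push_back(&seg);
    }
    return segments;
}
```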

Filters
- Component-based model; the components are called filters
- Filters communicate through uni-directional pipes
- Location independent; connections are made through names
- Optimize the use of limited resources

Filter Implementation
A filter provides three functions (sketched below):
- init (initializework in the new version): called whenever a new unit of work arrives; creates structures
- process: called to process incoming data (can return end-of-work or end-of-filter)
- finalize (finalizework in the new version): called to clean up
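
A minimal filter skeleton, assuming hypothetical Buffer/Stream types and plain function names; the real DataCutter base class and signatures differ:

```cpp
#include <cstdio>

// Stand-in buffer/stream types; the real DataCutter types differ.
struct Buffer { const char* data; int size; };
struct Stream {
    bool read(Buffer&)        { return false; }  // false signals end of work
    void write(const Buffer&) {}
};

class ClipFilter {  // would derive from the DataCutter filter base class
public:
    int init() {                 // called when a new unit of work arrives
        count = 0;               // create/reset per-work structures here
        return 0;
    }
    int process(Stream& in, Stream& out) {
        Buffer b;
        while (in.read(b)) {     // consume buffers until end of work
            ++count;
            out.write(b);        // a real filter would clip/transform here
        }
        return 0;                // could also signal end-of-filter
    }
    int finalize() {             // called to clean up
        std::printf("processed %d buffers\n", count);
        return 0;
    }
private:
    int count = 0;
};
```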

Placement
- Communication: place filters with a large volume of communication close together (on the same host); place filters close to the data they use; minimize communication on slow links
- Computation: place heavy computations on less loaded hosts

Streams
- Streams are used for files and for inter-filter communication
- A buffer abstraction is used to send data
- The default transport is TCP; if Globus is installed, GSS TCP can be used

Applications
- Decompose the application into a set of filters
- A console process can initiate work or forward work it receives; it may or may not receive output from filters
- Filters are placed on hosts; one host can run multiple filters
- Filters can move between units of work
- However, moving a filter doesn't move state/code; a buffer can be sent to restore state, or a filter can be pinned to a host

Execution
- dird: a directory daemon at a well-known location
- appd: registers its location with the dird; must be running on every host where a filter will exist
- A placement specification says which filter goes on which hosts
- Code/binaries are not copied automatically; this must be done manually
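
A purely illustrative sketch of what a placement specification amounts to (this is not the real DataCutter format; hosts and filter names are made up): each filter is paired with a host, and each of those hosts must be running an appd known to the dird:

```cpp
#include <map>
#include <string>

int main() {
    // filter name -> host (which must be running an appd)
    std::map<std::string, std::string> placement = {
        {"read",       "storage1.example.edu"},  // near the data
        {"decompress", "compute1.example.edu"},  // most computational
        {"view",       "client.example.edu"},    // next to the client driver
    };
    // The dird would be consulted to resolve each host's appd; binaries
    // must already be present on every host (no automatic code shipping).
    (void)placement;
    return 0;
}
```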

Examples
- Virtual Microscope: Read, Decompress, Clip, Zoom, View filters
- External Sort: Partitioner and Sorter filters

Virtual Microscope
- Shows that splitting the dataset into multiple files can reduce load times (loading a file from HPSS forces the entire file to be placed in the cache)
- Too much splitting introduces the overhead of opening too many files
- Indexing the data properly is important

Results (figure)

Virtual Microscope (cont.)
- Two machines: a server and a client
- The placement of the Zoom and Decompress filters is varied
- The relation between placement and performance is complex: Decompress is the most computational filter, and there is communication both between the two machines and between the View filter and the client driver

Results (figure)

External Sort
- Available memory is varied
- Baseline: a comparison where memory is reduced on all machines
- Several results with half the nodes at reduced memory and half at full memory
- Placement can balance heterogeneity, but it has to be done carefully

Results (figure)

Parallel Support
- Groups of filters: multiple instances handle different units of work at the same time; when to create new instances versus reusing them?
- Multiple copies of a filter: either a filter is a parallel program (1-filter), or there are multiple instances of a filter (n-filter)
- Transparent copies: the filter is unaware of the copy

Transparent Copies
- The filter service decides which of the copies gets the work data
- Not applicable for all filters; a filter can request not to be copied
- Performance results on an application where one filter is the most expensive show significant improvement

Transparent Copies (cont.)
How to decide which of the copies gets data?
- Copies on one machine share a queue
- Copies on different machines use one of several policies (two are sketched below):
  - Round Robin
  - Weighted Round Robin
  - Demand-based Sliding Window
  - User Defined
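
A sketch of two of these policies with hypothetical bookkeeping: plain round robin, and a demand-based choice that sends the next buffer to the copy with the fewest outstanding (sent but not yet consumed) buffers:

```cpp
#include <cstddef>
#include <vector>

struct Copy { std::size_t outstanding = 0; };  // buffers sent but not yet consumed

// Round Robin: cycle through the copies in order.
std::size_t roundRobin(std::size_t& next, std::size_t numCopies) {
    return next++ % numCopies;
}

// Demand-based: pick the copy with the fewest outstanding buffers,
// i.e., the one consuming its window fastest.
std::size_t demandBased(const std::vector<Copy>& copies) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < copies.size(); ++i)
        if (copies[i].outstanding < copies[best].outstanding) best = i;
    return best;
}
```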

Additions to DataCutter
- DataCutter has been integrated with SRB
- Java filters: Java and C++ filters can be mixed and matched in an application
- There is, of course, the overhead of virtual machine creation
- But running multiple Java filters on one machine is much better; JIT compilation helps
- Testing was done with version 1.3
- The most expensive part is the JNI calls

Current Version
- http://www.datacutter.org/
- Version 3.1.1 is now available
- Support for external tools to add functionality (like the Globus toolkit)
- Improvements to function names and filter setup

References
Assigned papers:
- Michael Beynon, Chialin Chang, Umit Catalyurek, Tahsin Kurc, Alan Sussman, Henrique Andrade, Renato Ferreira, and Joel Saltz. Processing Large-Scale Multi-dimensional Data in Parallel and Distributed Environments. Parallel Computing, 28(5):827-859, May 2002. Special issue on data intensive computing.
Other sources:
- DataCutter 3.0.0 Manual, from http://www.datacutter.org
- NPACI All Hands Meeting Tutorial, San Diego Supercomputing Center, San Diego, CA, February 9, 2000.
- Very Large Dataset Access and Manipulation, Lawrence Livermore National Laboratory, March 2000.
- Performance Optimization of Component-based Data Intensive Applications, Lawrence Livermore National Laboratory, July 2001.