BigDataBench: a Benchmark Suite for Big Data Applications


BigDataBench: a Benchmark Suite for Big Data Applications
Wanling Gao, Institute of Computing Technology, Chinese Academy of Sciences
HVC tutorial in conjunction with the 19th IEEE International Symposium on High Performance Computer Architecture (HPCA 2013)

Outline
- Motivation
- BigDataBench Overview
- Usage Guide
- Generating Big Data
- Configuring Workloads
- Use Cases
- Future Work

Requirements for a Big Data Benchmark
- To truly reflect use cases and requirements
- To rapidly evolve with new workloads and use cases
- To widely cover application domains, data types, and use-case scenarios
("Proposal for a Big Data Benchmark Repository", Andries Engelbrecht, WBDB 2012)

Benchmark for Big Data: State of Practice
Sort Benchmark (http://sortbenchmark.org/)
- Only one application: MinuteSort, JouleSort, TeraByte Sort
- Sort is mainly integer comparison operations with an I/O-bound pattern
- Fixed data scale and fixed data format: 100-byte input records with a 10-byte random key to be sorted
Is sort a one-size-fits-all solution? No!
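For concreteness, the record format the slide quotes can be sketched in a few lines of Python. This is only a toy illustration of "100-byte records with a 10-byte random key" (the name make_record is made up here), not the official gensort data generator:

```python
import os

RECORD_LEN = 100  # each record is exactly 100 bytes
KEY_LEN = 10      # the first 10 bytes are a random sort key

def make_record() -> bytes:
    """Build one sort-benchmark-style record: a 10-byte random
    key followed by a 90-byte payload (zero bytes here)."""
    return os.urandom(KEY_LEN) + bytes(RECORD_LEN - KEY_LEN)

records = [make_record() for _ in range(1000)]
# The benchmark's work is dominated by comparing the 10-byte keys.
records.sort(key=lambda r: r[:KEY_LEN])
```

Sorting such records reduces to byte-wise key comparison plus I/O, which is exactly why the slides argue that one sort benchmark cannot represent all big data workloads.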

Benchmark for Big Data: State of Practice
GridMix, Hadoop microbenchmarks
- Generate data sets randomly
- Con: ignore the characteristics of real-world data

Challenges
Current state: immature
- "We Don't Know Enough to Make a Big Data Benchmark Suite: An Academia-Industry View", Yanpei Chen, UC Berkeley/Cloudera, WBDB 2012
Our incremental solution
- First single out the foundational applications in the most important domain
- Then expand


Benchmark Overview
- Methodology
- BigDataBench

Methodology Overview
- Step 1: Investigate application domains
- Step 2: Choose typical workloads
- Step 3: Create big data benchmarks

Step 1: Investigate Application Domains
So many application domains; which one should we choose?

Step 1: Investigate Application Domains
92% of online adults use search engines to find information on the web (Pew Internet study).

Step 1 (Cont'd)
Search Engine is a key data center application: 40% of the top 20 websites are search engines.
[Pie chart: breakdown of the top 20 websites into Search Engine, Social Network, Electronic Commerce, Media Streaming, and Others; source: http://www.alexa.com/topsites/global]

Step 1 (Cont'd)
- Step 1: Investigate application domains → Search Engine
- Step 2: Choose typical workloads
- Step 3: Create big data benchmarks

Step 2: Choose Typical Workloads
Criteria for typical workloads:
- Representative workloads in the most important application domain
- Widely used in other domains
- Having representative features

Step 2: Choose Typical Workloads
Typical workloads: Sort, Grep, Wordcount, Naïve Bayes

Step 2: Representative Workloads in Search Engine
Workloads and their roles in the search engine:
- Sort: URL sorting; word frequency sorting; other sorting
- Wordcount: word frequency count
- Grep: extracting content from HTML; extracting content from text files; string replacement
- Naïve Bayes: web page classification; news classification
2013/5/29

Step 2: Widely Used in Other Domains
Workloads and their application scenarios in industry:
- Sort: sorting pages by ID (web storage in a database)
- Wordcount: calculating base information for TF-IDF, such as term frequency; obtaining user operation counts to analyze social behavior (as in Wolfram Alpha)
- Grep: log analysis; web information extraction; fuzzy search
- Naïve Bayes: spam recognition (spam filtering); bioinformatics (naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy)

Step 2: Having Representative Features
Workload: resource characteristic; computing complexity; dominant instructions; read/write
- Sort: I/O bound; O(n*lg n); integer comparison domination; 0.75
- Wordcount: CPU bound; O(n); integer comparison and calculation domination; 1.15
- Grep: hybrid; O(n); integer comparison domination; 1.50
- Naïve Bayes: (not given); O(m*n), where m is the length of the dictionary; floating-point computation domination; 1.70

Step 2 (Cont'd)
- Step 1: Investigate application domains → Search Engine
- Step 2: Choose typical workloads → Sort, Wordcount, Grep, Naïve Bayes
- Step 3: Create big data benchmarks

Step 3: The Big Data Puzzle
- Real big data is confidential to companies
- Big data is difficult to download
- Small-scale data is easy to get

Step 3: Generating Big Data
Goal: preserve the characteristics of real-world data.
Small-scale data (usually less than 100 MB) → characteristic analysis → expand → big data
Characteristics analyzed:
- Semantic: word frequency
- Locality (temporal and spatial): word reuse distance
- Dependency of words
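The characteristic analysis can be illustrated with a small in-memory sketch; analyze_seed is a hypothetical name (the real analysis runs as the Hadoop job count.jar), and it measures two of the listed characteristics, word frequency and word reuse distance:

```python
from collections import Counter, defaultdict

def analyze_seed(text: str):
    """Measure two seed-data characteristics from the slides:
    word frequency (semantic) and word reuse distance (locality)."""
    words = text.split()
    freq = Counter(words)          # semantic: how often each word occurs
    last_pos = {}
    reuse = defaultdict(list)      # locality: gaps between reuses of a word
    for i, w in enumerate(words):
        if w in last_pos:
            reuse[w].append(i - last_pos[w])
        last_pos[w] = i
    return freq, reuse

freq, reuse = analyze_seed("the cat sat on the mat the end")
# freq["the"] == 3; reuse["the"] == [4, 2]
```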

Step 3: Create Big Data Benchmarks
- Data generation tool: generate big data from arbitrary small-scale seed data
- Storage: HDFS
- Representative workloads in the search engine domain: Sort, Wordcount, Grep, Naïve Bayes

Step 3
- Step 1: Investigate application domains → Search Engine
- Step 2: Choose typical workloads → Sort, Wordcount, Grep, Naïve Bayes
- Step 3: Create big data benchmarks → generate big data from small-scale data

BigDataBench Overview
- Methodology
- BigDataBench

BigDataBench
[Architecture diagram: seed data (Seed Data 1, Seed Data 2, ...) goes through characteristic analysis and is expanded into data sets (Data Set 1, Data Set 2, ...) with the same characteristics. The data generation tool produces text data from 10 GB to 1 PB; after format conversion to sequence files, the data feeds the Sort, Wordcount, Grep, and Naïve Bayes workloads.]
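A minimal sketch of the expand step, under the simplifying assumption that only the word-frequency characteristic is preserved (the real tool also preserves locality and word dependency, and runs on Hadoop; expand is a made-up name):

```python
import random
from collections import Counter

def expand(seed_text: str, n_words: int, rng=random.Random(42)) -> str:
    """Generate n_words of new text whose word-frequency
    distribution matches the seed's: a toy stand-in for the
    BigDataBench data generation tool."""
    freq = Counter(seed_text.split())
    words = list(freq)
    weights = [freq[w] for w in words]   # sample proportionally to seed frequency
    return " ".join(rng.choices(words, weights=weights, k=n_words))

seed = "the cat sat on the mat the end"
big = expand(seed, 1000)   # same vocabulary, same relative frequencies, more data
```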


Usage Guide
Prerequisite software:
- Java (version 1.6.0_20 or later)
- Hadoop (version 1.0.2 or later)


Analyzing the Seed Data
$HADOOP_HOME/bin/hadoop jar count.jar arg1 arg2
- arg1: input file, which holds the seed data
- arg2: output file, which will hold the characteristics of the seed data
Example:
$HADOOP_HOME/bin/hadoop jar count.jar /gwl/seed test

Generating Data
$HADOOP_HOME/bin/hadoop jar TextProduce.jar arg1 arg2 arg3 arg4 arg5
- arg1: input file, i.e. the output file of the first program
- arg2: output file, which will hold the new data
- arg3: the number of types in the seed data
- arg4: the number of news items to be generated
- arg5: the number of groups each type of news will be divided into

About arg5
arg5 should be tuned to the number of reduce slots in the cluster: the more reduce slots, the larger arg5 should be. For example:
Data size:     100 GB   1 TB   10 TB
Reduce slots:  100      180    280
arg5:          5        9      14
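As a hypothetical helper (not part of BigDataBench), the tuning table above can be turned into a lookup that suggests arg5 from the nearest reduce-slot count:

```python
# Suggested arg5 values from the slide's tuning table,
# keyed by the cluster's reduce-slot count.
TUNING = {100: 5, 180: 9, 280: 14}

def suggest_arg5(reduce_slots: int) -> int:
    """Pick the arg5 of the table row whose reduce-slot
    count is closest to the cluster's."""
    nearest = min(TUNING, key=lambda s: abs(s - reduce_slots))
    return TUNING[nearest]

print(suggest_arg5(100))  # 5
print(suggest_arg5(200))  # 9  (200 is closest to 180)
```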

An Execution Example
$HADOOP_HOME/bin/hadoop jar TextProduce.jar bayes-input file-100g 20 75000000 5
- bayes-input: the file that stores the characteristics of the seed data
- file-100g: the filename of the output file
- 20: the number of types in the seed data
- 75000000: the number of news items to be generated; generally, 75 million corresponds to 100 GB
- 5: arg5


How to Convert the Data Format
sort-transfer.sh arg1 arg2
- arg1: the input file, which contains the original data
- arg2: the output file, which will hold the converted data


Choosing Among the Four Workloads
How to benchmark:
1. run-sort.sh arg
2. run-wordcount.sh arg
3. run-grep.sh arg
4. run-bayes.sh arg
(arg: the data scale)
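The four scripts share one calling convention, which a hypothetical wrapper (not part of BigDataBench) can make explicit:

```python
import shlex

# Script names come from the slide; the wrapper itself is an assumption.
SCRIPTS = {
    "sort": "run-sort.sh",
    "wordcount": "run-wordcount.sh",
    "grep": "run-grep.sh",
    "bayes": "run-bayes.sh",
}

def build_command(workload: str, data_scale: str) -> str:
    """Build the shell command for one workload; each script
    takes a single argument, the data scale."""
    if workload not in SCRIPTS:
        raise ValueError(f"unknown workload: {workload}")
    return f"{SCRIPTS[workload]} {shlex.quote(data_scale)}"

print(build_command("sort", "100g"))  # run-sort.sh 100g
```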



Use Case 1: Optimizing Hadoop
"I have 64 Hadoop nodes (Xeon servers) to process 100 TB of data; how should I configure the parameters?"

Use Case 1: Optimizing Hadoop
[Workflow diagram: the user runs BigDataBench under 100 TB with different Hadoop configurations (map slots, reduce slots, java_opts). Configuration 1 yields Result 1 and Configuration 2 yields Result 2; comparing the observation metrics leads to the optimal configuration of Hadoop under 100 TB.]

Use Case 2: OS Performance
"I want to build a brand-new OS for big data! Which parts of the kernel should I optimize? How do I evaluate the OS kernel?"

Use Case 2: OS Performance
- Step 1: Big data selection (text data, webpage data, ranking data, trading data)
- Step 2: Increase the data set
- Step 3: Fetch OS performance metrics (I/O capacity, latency, scheduling, bandwidth, interrupt handling)


Future Work
- Include workloads from other important domains: recommendation systems, graph computing, etc.
- Release BigDataBench soon at http://prof.ict.ac.cn/ictbench
- Please give us your email address; we will notify you of updates.

Thank you!
gaowanling@ict.ac.cn