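Big Data Benchmarking: Principles, Methodology, and Applications. Jianfeng Zhan (詹剑锋), Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences. Trusted Cloud Service Conference, Beijing.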

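Big Data Benchmarking: Principles, Methodology, and Applications
Jianfeng Zhan (詹剑锋), http://prof.ict.ac.cn/bigdatabench
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
July 31, 2015, 2015 Trusted Cloud Service Conference, Beijing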

Outline: Principles, Methodology, BigDataBench

Why measurement matters
- It is the foundation of science and of everyday human life
- Newton (force), Kelvin (temperature), Watt (power)

Kelvin's maxim: "If you can't measure it, you can't improve it."

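The essence of big data benchmarking: building a ruler to measure big data systems
Unfortunately:
- Systems are too complex
- Applications are too diverse
- Metrics are not intuitive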

What is a benchmark?
"The process of running a specific program or workload on a specific machine or system and measuring the resulting performance."
-- Saavedra, R. H., Smith, A. J.: Analysis of benchmark characteristics and benchmark performance prediction, ACM Transactions on Computer Systems, vol. 14, no. 4 (1996), 344-384
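The definition above (run a specific workload on a specific system and measure the result) can be sketched as a tiny timing harness. This is only an illustration: `run_benchmark` and `sort_workload` are hypothetical names, and wall-clock time is just one possible metric.

```python
import statistics
import time


def run_benchmark(workload, repeats=5):
    """Run `workload` (a zero-argument callable) `repeats` times and
    return the wall-clock time of each run in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return times


def sort_workload():
    # Hypothetical workload: sort 100,000 integers given in descending order.
    data = list(range(100_000, 0, -1))
    data.sort()


times = run_benchmark(sort_workload, repeats=5)
print(f"median: {statistics.median(times):.6f} s, min: {min(times):.6f} s")
```

Repeating the run and reporting a median rather than a single value is one simple way to make the measurement more repeatable.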

What is a benchmark suite?
- A popular measure of performance using a variety of applications
- It overcomes the danger of placing too many eggs in one basket: the weakness of any one benchmark is lessened by the presence of the other benchmarks
- It characterizes relative performance
- Examples: EEMBC, SPEC
-- Computer Architecture: A Quantitative Approach

Benchmarking principles
- Few works explicitly discuss benchmarking principles; they tend to be an afterthought
- However, implicit principles do exist

What makes a good benchmark? Relevant, portable, scalable, simple.

TPC benchmarking principles
- Relevant: meaningful within the target domain
- Simple
- Good metric(s): linear, orthogonal, monotonic
- Portable: applicable to a broad spectrum of hardware/architecture
- Coverage: does not oversimplify the typical environment
- Acceptance: vendors and users embrace it
-- Charles Levine: TPC-C: The OLTP Benchmark, SIGMOD, 1997

SPEC principles
SPEC: Standard Performance Evaluation Corporation (founded as the System Performance Evaluation Cooperative)
- Application-oriented: tests real-life situations
- Portability: written in a platform-neutral programming language
- Repeatable and reliable
- Consistency and fairness: each specification must define clear rules for executing and reporting results
-- Renzo Angles: Benchmark principles and Methods

Pros and cons of benchmarks
- Good benchmarks define the playing field
- Good benchmarks accelerate progress: engineers do a great job once the objective is measurable and repeatable
- Good benchmarks set the performance agenda and measure release-to-release progress
- Benchmark abuse: benchmarketing
- Benchmark abuse: benchmark wars, with more $ spent on ads than on development
-- TPC Benchmarks, talk by Charles Levine, 1997

Challenges in big data benchmarking
- One-size-fits-all vs. one-size-fits-a-bunch. Hardware: general-purpose vs. special-purpose. Data management: OLTP, NoSQL, DW, offline/interactive analytics, streaming
- Diverse/representative vs. benchmark cost. Open problem: increasing workloads, data, and software stacks
- Simple (understandable) vs. complex
- Specific vs. abstract: semantic-specific

Outline: Principles, Methodology, BigDataBench

Benchmark construction methods
- Top-down: representative program selection. It can yield accurate representations of the program space of interest, but it is usually impossible to make any hard statement about representativeness.
- Bottom-up: a diverse range of characteristics. Program characteristics are quantities that can be measured and compared, but not all portions of the characteristics space are equally important.
-- C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, 2011.

TPC functional workload model
- Application domain: encapsulates use cases
- Functions of abstraction: abstractions of the implementations of use cases in different application domains
- Systems view and physical view: different systems and hardware
-- Yanpei Chen, Francois Raab, Randy Katz: From TPC-C to Big Data Benchmarks: A Functional Workload Model, WBDB, 2012

Functional workload model
The functional view:
- enables a large range of similarly targeted systems to be compared
- allows the benchmark to scale and evolve

TPC-C methodology
Functions of abstraction:
- a mid-weight read-write transaction (i.e., New-Order)
- a light-weight read-write transaction (i.e., Payment)
- a mid-weight read-only transaction (i.e., Order-Status)
- a batch of mid-weight read-write transactions (i.e., Delivery)
- a heavy-weight read-only transaction (i.e., Stock-Level)
Functional workload model: captures, in an implementation-independent manner, the load that the system needs to service
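To illustrate what "implementation-independent" can mean here, the following sketch describes the load purely as a weighted mix of the five abstract transactions and then replays it against whatever handlers a system under test supplies. The percentages in `MIX` and the names `generate_load` and `replay` are assumptions made for illustration; they are not taken from the slides.

```python
import random
from collections import Counter

# Assumed transaction mix in percent (illustrative values only).
MIX = {
    "New-Order": 45,
    "Payment": 43,
    "Order-Status": 4,
    "Delivery": 4,
    "Stock-Level": 4,
}


def generate_load(n, mix=MIX, seed=0):
    """Return a reproducible sequence of n abstract transaction names,
    drawn according to the weights in `mix`."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def replay(load, handlers):
    """Dispatch each abstract transaction to a system-specific handler
    and count how many of each kind were executed."""
    executed = Counter()
    for name in load:
        handlers[name]()
        executed[name] += 1
    return executed


# A stand-in "system" whose handlers do nothing; a real system would
# plug in its own implementation of each function.
handlers = {name: (lambda: None) for name in MIX}
load = generate_load(1000)
print(replay(load, handlers))
```

The same generated load can be replayed against different systems, which is the property that lets similarly targeted systems be compared.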

The relational model for structured data
E. F. Codd, A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, vol. 13, no. 6, 1970.
- Set concept: general mathematical meaning
- General representation of data
- Basis of relational algebra (the theoretical foundation of databases)
- 5 basic operations: Select, Project, Product, Union, Difference
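The five basic operations listed above can be sketched over plain Python sets. This is a minimal illustration, assuming a relation is an attribute tuple plus a set of row tuples; the `Relation` type and the sample data are hypothetical.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    attrs: tuple      # attribute names, in column order
    rows: frozenset   # set of row tuples


def select(r, pred):
    """Keep the rows for which pred(row as dict) is true."""
    return Relation(r.attrs, frozenset(
        t for t in r.rows if pred(dict(zip(r.attrs, t)))))


def project(r, attrs):
    """Keep only the named attributes; duplicate rows collapse."""
    idx = [r.attrs.index(a) for a in attrs]
    return Relation(tuple(attrs), frozenset(
        tuple(t[i] for i in idx) for t in r.rows))


def product(r, s):
    """Cartesian product; attribute names are assumed to be disjoint."""
    return Relation(r.attrs + s.attrs, frozenset(
        a + b for a in r.rows for b in s.rows))


def union(r, s):
    assert r.attrs == s.attrs, "union needs identical schemas"
    return Relation(r.attrs, r.rows | s.rows)


def difference(r, s):
    assert r.attrs == s.attrs, "difference needs identical schemas"
    return Relation(r.attrs, r.rows - s.rows)


emp = Relation(("name", "dept"), frozenset({("Ann", "db"), ("Bob", "os")}))
dept = Relation(("dname", "floor"), frozenset({("db", 1), ("os", 2)}))

# A join expressed with the basic operations: product, then select, then project.
joined = select(product(emp, dept), lambda t: t["dept"] == t["dname"])
print(sorted(project(joined, ["name", "floor"]).rows))
```

Composite operations such as joins are built from these five, which is why a small set of abstract operations can stand in for a large space of queries.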

Parallel computing abstraction
- Proposed by a multidisciplinary group of well-known researchers, e.g. Jim Gray, Michael Jordan, David A. Patterson
- Operations and patterns: abstracted from 13 representative parallel computation patterns
- Parallel computation is an inherent demand of big data processing (volume and complexity)

Outline: Principles, Methodology, BigDataBench

What is BigDataBench?
An open-source big data benchmarking project: http://prof.ict.ac.cn/bigdatabench (or search Google for "BigDataBench")

BigDataBench 3.1 overview
- BDGS (Big Data Generator Suite) for scalable data
- 14 real-world data sets: Wikipedia Entries, Amazon Movie Reviews, Google Web Graph, Facebook Social Network, E-commerce Transaction, ImageNet, English broadcasting audio, ProfSearch Resumes, DVD Input Streams, Image scene, SoGou Data, Genome sequence data, Assembly of the human genome, MNIST
- 33 workloads across five domains: Search Engine, Multimedia, Social Network, E-commerce, Bioinformatics
- Software stacks: NoSql, Impala, Shark, Hadoop, RDMA, MPI, DataMPI

Why use BigDataBench?

Benchmark | Specification | Application domains | Workload types | Workloads | Scalable data sets (from real data) | Multiple implementations | Multitenancy | Subsets | Simulator version
BigDataBench | Y | Five | Four [1] | 33 | 8 | Y | Y | Y | Y
BigBench | Y | One | Three | 10 | 3 | N | N | N | N
Cloud-Suite | N | N/A | Two | 8 | 3 | N | N | N | Y
HiBench | N | N/A | Two | 10 | 3 | N | N | N | N
CALDA | Y | N/A | One | 5 | N/A | Y | N | N | N
YCSB | Y | N/A | One | 6 | N/A | Y | N | N | N
LinkBench | Y | N/A | One | 10 | N/A | Y | N | N | N
AMP Benchmarks | Y | N/A | One | 4 | N/A | Y | N | N | N

[1] The four workload types are Offline Analytics, Cloud OLTP, Interactive Analytics, and Online Service.

BigDataBench users
http://prof.ict.ac.cn/bigdatabench/users/
- Industry users: Accenture, Broadcom, Samsung, Huawei, IBM
- About 20 academic groups have published papers using BigDataBench
- BigDataBench support for Flink

Industry standard: BigDataBench-DCA
China's first industry-standard big data benchmark suite
http://prof.ict.ac.cn/bigdatabench/industrystandard-benchmarks/
Participants: Telecom Research Institute of Ministry of Industry and Information Technology, ICT, CAS, Huawei, China Mobile, Sina, ZTE, Intel (China), Microsoft (China), IBM CDL, Baidu, INSPUR, 21Vianet, and UCloud

BigDataBench papers and technical reports
- BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014).
- Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013) (Best paper award).
- BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014).
- BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth Workshop on Big Data Benchmarking (WBDB 2014).
- Identifying Dwarfs Workloads in Big Data Analytics. arXiv preprint arXiv:1505.06872.
- BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads. arXiv preprint arXiv:1504.02205.

BigDataBench principles
- Data-centric: supports different types of raw data
- Application-centric: independent of specific HW/SW components
- Coverage: representative workloads that reflect the diversity of application scenarios; representative software stacks
- Scalable and extensible: easy to add new workloads and support new software stacks
- Usability: easy for users to deploy, configure, and run

BigDataBench methodology (diagram)
- Application domains 1 to N, each with its own benchmark specification (1 to N)
- Each specification covers data models of different types and semantics, and data operations and workload patterns
- Data: real-world data sets plus data generation tools
- Workloads with diverse implementations
- Multi-tenancy version: mix workloads with different percentages
- BigDataBench subset: reduce benchmarking cost

Five application domains (figures omitted)
- Internet services: search engine, social network, e-commerce. These take up 80% of internet services according to page views and daily visitors of the top 20 websites (http://www.alexa.com/topsites/global;0)
- Multimedia: data growth in images, videos, and documents, e.g. new videos on YouTube, music streaming on Pandora, new photos on Flickr, voice calls on Skype, and video feeds from surveillance cameras (http://www.oldcolony.us/wp-content/uploads/2014/11/whatisbigdata-dkb-v2.pdf)
- Bioinformatics: growth of the DDBJ/EMBL/GenBank database, plotted as nucleotides (billion) and entries (million) (http://www.ddbj.nig.ac.jp/breakdown_stats/dbgrowth-e.html#dbgrowth-graph)

Dwarfs of big data analytics: a minimum set to represent maximum patterns of big data analytics

Dwarfs of offline big data analytics
- Linear algebra
- Sampling
- Transform operation
- Graph operation
- Logic operation
- Set operation
- Statistic operation
- Sort

DAG-like combination (example): feature extraction with the SIFT algorithm

Workloads and data sets
- Data model: structured, semi-structured, unstructured
- Semantics: text, graph, table, multimedia
- Data operations: unit of computation
- Workload patterns: different combinations of units of computation

Benchmark specification
Guidelines for BigDataBench implementation (data model and workloads):
- Describe the data model
- Model typical application scenarios
- Extract important workloads

Specification 1: search engine
- General search and vertical search
- Online server and offline analytics

Specification: multimedia
Stages shown in the diagram: voice data extraction, speech recognition, video data, MPEG decoder, frame data extraction, feature extraction, image segmentation, face detection, three-dimensional reconstruction, tracing

Big data generator suite
- 3 kinds of big data generators: text, graph, and table
- They preserve the original characteristics of real data
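As a rough illustration of "preserving the original characteristics of real data" while scaling the volume, the sketch below fits a word-frequency model to a small seed text and then samples as many words as requested from it. This is a deliberately simple stand-in, not the model the generator suite itself uses; `fit_word_model` and `generate_text` are hypothetical names.

```python
import random
from collections import Counter


def fit_word_model(seed_text):
    """Learn the word-frequency distribution of the seed text."""
    counts = Counter(seed_text.split())
    words = sorted(counts)
    weights = [counts[w] for w in words]
    return words, weights


def generate_text(model, n_words, seed=0):
    """Generate n_words words whose frequencies follow the fitted model."""
    words, weights = model
    rng = random.Random(seed)
    return " ".join(rng.choices(words, weights=weights, k=n_words))


seed_text = "big data big benchmark data big"
model = fit_word_model(seed_text)
synthetic = generate_text(model, 10_000)
print(Counter(synthetic.split()).most_common(3))
```

The output can be made arbitrarily large, yet the relative word frequencies stay close to those of the seed data, which is the property a scalable generator needs.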

BigDataBench multi-tenancy version
- Scenario: multiple tenants running heterogeneous applications in cloud datacenters, with latency-critical online services alongside latency-insensitive offline batch applications
- Benchmarking scenarios: mixed workloads in public clouds; data analytical workloads in private clouds
- Approach: mining real-world workload traces (Google and Facebook); profiling real-world workload traces; workload matching using machine learning techniques; a parametric workload generation tool

BigDataBench subset
- Motivation: it is expensive to run all the benchmarks for system and architecture research, and the cost is multiplied by the different implementations. BigDataBench 3.0 provides about 77 workloads.
- Identify workload characteristics from a specific perspective
- Eliminate correlated data: dimension reduction (PCA)
- Clustering (K-Means)
- Subset
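The subsetting steps above can be sketched in plain Python: standardize the per-workload metrics, reduce dimensions with PCA, cluster with K-Means, and keep one workload per cluster. Several things here are assumptions for illustration: the workload names and metric values are invented, PCA is done by power iteration rather than a library routine, and picking the workload nearest each centroid is one plausible selection rule, not necessarily the one used by BigDataBench.

```python
import math
import random


def standardize(rows):
    """Z-score each metric (column) so that no single metric dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def principal_components(rows, k, iters=200):
    """Top-k principal directions of standardized data, found by power
    iteration with deflation."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    rng = random.Random(0)
    comps = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(cov, v)))
        comps.append(v)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return comps


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Lloyd's K-Means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        centroids.append(list(max(
            points, key=lambda p: min(dist2(p, c) for c in centroids))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids


def select_subset(names, metrics, n_components=2, k=2):
    """Return one representative workload per cluster."""
    z = standardize(metrics)
    comps = principal_components(z, n_components)
    pts = [[sum(a * b for a, b in zip(r, c)) for c in comps] for r in z]
    labels, centroids = kmeans(pts, k)
    subset = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            best = min(members, key=lambda i: dist2(pts[i], centroids[c]))
            subset.append(names[best])
    return subset


# Hypothetical workloads and metrics: [IPC, cache misses per kilo-instruction,
# branch misprediction %]. The values are invented for this example.
names = ["cpu-a", "cpu-b", "cpu-c", "mem-a", "mem-b", "mem-c"]
metrics = [[1.9, 2.0, 0.6], [2.1, 1.5, 0.5], [2.0, 1.8, 0.7],
           [0.6, 21.0, 3.1], [0.5, 19.0, 2.9], [0.7, 20.0, 3.0]]
print(select_subset(names, metrics))
```

Running only the selected representatives instead of every workload is what reduces the benchmarking cost.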

Ongoing work: streaming, with ECNU and Renmin University of China

Conclusion
- Reviewed and summarized benchmarking principles and methodology
- Introduced an open-source big data benchmark suite: BigDataBench