How to use BigDataBench workloads and data sets


1 How to use BigDataBench workloads and data sets
Gang Lu
Institute of Computing Technology, Chinese Academy of Sciences
BigDataBench Tutorial, MICRO 2014, Cambridge, UK

2 General steps to use BigDataBench
- Current release
  - 3.1 on http://prof.ict.ac.cn/bigdatabench
- General steps to run the benchmarks
  - Prepare the package of BigDataBench
  - Prepare the environments of the selected software stack
  - Generate data sets as you need
    You can find a gendata* or a prepare* shell script in each directory of the benchmarks
  - Run the scripts or commands (see the handbook!)
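The data-generation step above can be sketched as a small helper that lists the generator scripts in one benchmark directory. This is only an illustration: the function name list_data_scripts is hypothetical, and the gendata* and prepare* patterns are taken from the script names mentioned in the slides, so adjust them if your release names the scripts differently.

```shell
#!/bin/sh
# List data-generation helpers (gendata* / prepare*) in one benchmark directory.
# list_data_scripts is a hypothetical helper, not part of BigDataBench.
list_data_scripts() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # Look only at the top level of the directory; sort for stable output.
    find "$dir" -maxdepth 1 -type f \( -name 'gendata*' -o -name 'prepare*' \) | sort
}
```

For example, list_data_scripts BigDataBench_MPI_V3.0/SearchEngine/MPI_Pagerank would print the generator scripts found in that directory, one per line.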

3 A glance at the directory structure
Root directory of BigDataBench, organized by domain and, within each domain, by programming model:
- Search Engine: Hadoop, Spark, MPI, BDGS
- Social network: Hadoop, Spark, MPI, BDGS
- E-commerce: Hadoop, Spark, MPI, BDGS
- Multi-media: MPI
- Bioinformatics: Work Queue, MPI

4 Domain: Search Engine
Microbenchmarks
Operations or Algorithms | Types | Data Sets | Software Stacks
Read, Write, Scan | Cloud OLTP | ProfSearch Resumes | HBase, MySQL
Sort, Grep, WordCount | Offline analytics | Wikipedia | Hadoop, Spark, MPI
Index, PageRank | Offline analytics | Google Web Graph | Hadoop, Spark, MPI
Nutch Server | Online service | SouGou Index | Nutch

5 Example: Cloud OLTP with HBase (Hadoop)
- Target: run write operations using HBase
- General steps:
  - Prepare HBase according to the official guide
      sh /hbase /bin/hbase shell
      create 'usertable','f1','f2','f3'
  - Prepare YCSB as the workload generator
    YCSB is in the directory BasicDatastoreOperations/ycsb
  - Run YCSB commands like this:
      sh bin/ycsb load hbase -P workloads/workloadc -p threads=<thread-numbers> -p columnfamily=<family> -p recordcount=<recordcount-value> -p hosts=<hostip> -s > load.dat

6 Example: Cloud OLTP with HBase (Hadoop)
- Important parameters for running YCSB:
  - <thread-numbers>: The number of client threads. Raising it is often done to increase the amount of load offered against the database.
  - <family>: In the HBase case, this sets the database column family. The database table usertable must already exist with this column family before you run the command; all data is then loaded into usertable under that column family.
  - <recordcount-value>: The total number of records for this benchmark. For example, when you want to load 10G of records, you should set it to [value missing in the source].
  - <hostip>: The IP of the HBase master node.
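As an illustration of how the four parameters fit into the load command, the sketch below assembles the command line and prints it instead of running it. The function name ycsb_load_cmd and the integer check are assumptions added for this example; the flags themselves are the ones shown on the previous slide.

```shell
#!/bin/sh
# Print (do not run) a YCSB load command for HBase.
# ycsb_load_cmd is a hypothetical helper; all four values come from the caller.
ycsb_load_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: ycsb_load_cmd <threads> <family> <recordcount> <hostip>" >&2
        return 1
    fi
    threads=$1 family=$2 recordcount=$3 hostip=$4
    # threads and recordcount must be plain decimal integers.
    for n in "$threads" "$recordcount"; do
        case $n in
            ''|*[!0-9]*) echo "threads and recordcount must be integers" >&2; return 1 ;;
        esac
    done
    printf 'sh bin/ycsb load hbase -P workloads/workloadc -p threads=%s -p columnfamily=%s -p recordcount=%s -p hosts=%s -s\n' \
        "$threads" "$family" "$recordcount" "$hostip"
}
```

Calling ycsb_load_cmd 4 f1 1000 10.0.0.1 (all four values are made up for the example) prints the command so it can be reviewed before it is run from the YCSB directory with its output redirected to load.dat.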

7 Example: PageRank with MPI
- Target: run PageRank using MPI
- General steps:
  - Prepare MPI environments
  - Run the data generation script
      cd BigDataBench_MPI_V3.0/SearchEngine/MPI_Pagerank
      sh gendata_pagerank.sh
  - Run the script:
      sh run_pagerank.sh <# Iterations of GenGraph>
    You can also try mpirun as follows:
      mpirun -n <process number> ./run_pagerank <InputGraphfile> <num of vertex> <num of edges> <iterations>
- Steps are almost the same for other programming models. Refer to the handbook!
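The mpirun form needs the vertex and edge counts of the input graph. The sketch below shows one way to derive them and print the resulting command, assuming the graph file is a plain edge list with one "src dst" pair per line and '#' comment lines. That format is an assumption, not something the slides state, and graph_stats and pagerank_cmd are hypothetical helper names.

```shell
#!/bin/sh
# Print "<vertices> <edges>" for an edge-list file.
# Assumed format: one "src dst" pair per line; lines starting with '#' are comments.
graph_stats() {
    awk '!/^#/ && NF >= 2 { edges++; seen[$1]; seen[$2] }
         END { n = 0; for (v in seen) n++; printf "%d %d\n", n, edges }' "$1"
}

# Print (do not run) the mpirun command for a process count, graph file, and iteration count.
pagerank_cmd() {
    nproc=$1 graph=$2 iters=$3
    stats=$(graph_stats "$graph")
    # stats is "<vertices> <edges>"; split it on the single space.
    printf 'mpirun -n %s ./run_pagerank %s %s %s %s\n' \
        "$nproc" "$graph" "${stats% *}" "${stats#* }" "$iters"
}
```

If the generated graph uses a different layout, only graph_stats needs to change; the printed command keeps the argument order shown in the slide.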

8 Domain: E-commerce
Operations or Algorithm | Types | Data Sets | Software Stacks
Bayes, CF | Offline analytics | Amazon Movie Review | Hadoop, Spark, MPI
Microbenchmarks: Project, Filter, Cross Product, OrderBy, Union, Difference, Aggregation | Interactive analytics | CALDA and E-commerce | Hive, Shark, Impala
Complex queries: Join Query, Select Query, Aggregation Query | Interactive analytics | CALDA and E-commerce | Hive, Shark, Impala

9 Example: Complex Queries With Shark
- General steps:
  - Prepare Shark environments
  - Run the data generation script
      cd ./bigdatageneratorsuite/table_datagen/
      java -XX:NewRatio=1 -jar pdgf.jar -l demo-schema.xml -l demo-generation.xml -c -s -sf $number
    Don't forget to upload the output file to HDFS, where it will be used by Shark tasks
  - Start Shark and create three tables which will be used for follow-up queries
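The generate-then-upload sequence can be sketched as a dry run that prints both commands for review. The default local directory (output) and the HDFS target (/bigdatabench/tables) are hypothetical placeholders, since the slides do not say where pdgf.jar writes or where Shark expects the data, and plan_table_datagen is an illustrative name.

```shell
#!/bin/sh
# Dry-run sketch: print the generate and upload commands instead of executing them.
# The default out_dir and hdfs_dir values are hypothetical; check them against your setup.
plan_table_datagen() {
    scale=$1
    out_dir=${2:-output}
    hdfs_dir=${3:-/bigdatabench/tables}
    case $scale in
        ''|*[!0-9]*) echo "scale factor must be an integer" >&2; return 1 ;;
    esac
    echo "java -XX:NewRatio=1 -jar pdgf.jar -l demo-schema.xml -l demo-generation.xml -c -s -sf $scale"
    # hadoop fs -put copies a local file or directory into HDFS.
    echo "hadoop fs -put $out_dir $hdfs_dir"
}
```

Running plan_table_datagen 10 prints the two commands for a scale factor of 10, which can then be executed by hand from the table_datagen directory.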

10 Example: Complex Queries With Shark
- General steps:
  - Prepare Shark environments
  - Run the data generation script
  - Start Shark and create three tables which will be used for follow-up queries
    Detailed statements are in the handbook
  - Run the queries
      sh runquery.sh

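11 Domain: Multi-media
Operations or Algorithm | Types | Data Sets | Software Stacks
BasicMPEG | Offline analytics | Stream Data | MPI
SIFT | Offline analytics | ImageNet | MPI
Speech Recognition | Offline analytics | Audio files | MPI
Ray Tracing | Offline analytics | Scene description files | MPI
Image Segmentation | Offline analytics | ImageNet | MPI
Face Detection | Offline analytics | ImageNet | MPI
DBN | Offline analytics | MNIST | MPI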

12 Example: SIFT with MPI
- Target: run SIFT using MPI
- General steps:
  - Prepare MPI environments
  - Run the data generation script
      sh getpath $ ImageNet_1G /BigDataBench_Media
    (The output file will be ImageNet_1G.path)
  - Run SIFT using the generated file as input
      mpirun -n process_number -f node_file ./siftfeat_mpi <input file>
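Before starting the MPI job it can help to sanity-check the generated input file. The sketch below assumes the .path file holds one image path per line. The slides do not describe the file's format, so treat this as an assumption, and check_path_list is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<total> <missing>" for a path-list file.
# Assumed format: one file path per line; blank lines are ignored.
check_path_list() {
    list=$1
    total=0
    missing=0
    while IFS= read -r p; do
        [ -n "$p" ] || continue
        total=$((total + 1))
        [ -e "$p" ] || missing=$((missing + 1))
    done < "$list"
    echo "$total $missing"
}
```

A non-zero second number from check_path_list ImageNet_1G.path would mean some listed images are not reachable from the current node, which is worth resolving before launching mpirun.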

13 Other domains
Domains | Operations or Algorithm | Types | Data Sets | Software Stacks
Social Network | BFS | Offline analytics | Graph500 Data | MPI
Social Network | Kmeans | Offline analytics | Facebook Social Network | Hadoop, Spark, MPI
Social Network | CC | Offline analytics | Facebook Social Network | Hadoop, Spark, MPI
Bioinformatics | SAND | Offline analytics | Genome sequence Data | MPI
Bioinformatics | BLAST | Offline analytics | Assembly of the human genome | MPI
Details can be found in the handbook of BigDataBench:

14 Any questions?
