Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Size: px

Start display at page:

Download "Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment"

Elvin Fowler
5 years ago
Views:

1 DEIM Forum 213 F2-1 Adaptive indexing MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment Shohei OKUDERA, Daisaku YOKOYMA, Miyuki NAKANO, and Masaru KITSUREGAWA Abstract Institute of Industrial Science the University of Tokyo Komaba Meguro-ku Tokyo Japan {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp We are required to process large-scale data efficiently as the volume of digital data we generate increases rapidly. MapReduce environment is suitable for large-scale distributed data processing and getting more important as a ad-hoc data analysis platform. However, all data scan is required every query when performing ad-hoc analysis in MapReduce environment, so the I/O cost will be badly huge. In this paper, in order to reduce the I/O cost, we propose an integrated MapReduce framework with adaptive index. This framework gradually shrinks the processing cost, as those queries are issued, each of which has a selection condition on the same column but different selection values. And the framework can keep the performance when data updates happens. We applied our proposed adaptive indexing technique into basic data operations projection, aggregation, join and evaluated the reduction of the execution time. We conducted the simulation and compared the accumulated execution time in the workload where the queries with the selection conditions slightly changed are repeatedly issued. As a result of simulation, our proposed framework showed higher performance than those processing without adaptive indexing. Key words MapReduce Adaptive indexing Ad-hoc query Simulation 1. SNS Facebook [1], [2], ebay 2 3 5

Web. MapReduce [3] Hadoop [4] MapReduce [5] MapReduce MapReduce MapReduce 2 MapReduce 3 4 MapReduce 5 6 7 2.

2 Web. MapReduce [3] Hadoop [4] MapReduce [5] MapReduce MapReduce MapReduce 2 MapReduce 3 4 MapReduce MapReduce MapReduce 3 discount (1) SELECT id, discount FROM R WHERE x <= discount < y ; (2) (3) SELECT S. partid, S. partname, R. discount FROM R, S WHERE x <= R. discount < y AND R. partid = S. partid 2. 1 Hadoop MapReduce. MapReduce MapReduce Map Reduce 2 Map Hadoop HDFS map Reduce Map Reduce HDFS 1 Map 2 < = discount < 8 Map Map Map Collect Spill Merge Reduce. Reduce Map Reduce HDFS < = discount < 5 MapReduce SELECT partid, AVE( d i s c o u n t ) FROM R WHERE x <= discount < y GROUP BY parrid

3. MapReduce Adaptive index [6], [7] MapReduce Adaptive indexing I/O 3.

3 3. MapReduce Adaptive index [6], [7] MapReduce Adaptive indexing I/O MapReduce Map Map Map Map HDFS HDFS Adaptive indexing Database Cracking [6] Map Map 2 < = discount < 8 discount < 2, 2 < = discount < 8, 8 < = discount 3 2, 8 ) HDFS

4 Map Map HDFS HDFS all-delete HDFS split-deleteHDFS HDFS hash hash hash hash 5 3.partial-update HDFS a ) Map b )

5 4. [8] Hadoop [8] I/O CPU Map 1 Scan,Map, Collect,Spill,Merge 5 Map Scan I/O CPU Intel Xeon E GHz 2P/8C DDR3 24GB, HDD 1 HDD 115MBps OS Linux2.6.32, Hadoop.2.3 1GBps MapRedcue R 1TB S 1GB R id S partid 2 2 x, y.1% 8% x, y.1% discount MapReduce Map 512MB, Reduce (1) 6 6 (no-index) ,8 (adaptiveindex) , 1 5% 2

6 no-index adaptive index (2) 1% 7 (no-index) , ,24 1 5% no-index adaptive index (3) (3) R S 8 (no-index) , , % no-index adaptive index R 1TB 5 2 ID 9 8%.1% 9 (no-index) (partial-update) split-delete

7 (partial-update) (partial-update) no-index all-delete split-delete partial-update [9] with-value 1stQuery sort 1stQuery index pointer MapReduce with-value 1stQuery sort 1stQuery index pointer

8 with-value 1stQuery sort 1stQuery index pointer 6. [9] [12] Dittrich [11] IO 12 Ritcher [9] Adaptive index [6], [7] Dremel [13] Dremel Dremel MapReduce I/O Map [1] M. Zuckerberg, One billion people on facebook, Oct One Billion People on Facebook [2] Big data: The next frontier for innovation, competition, and productivity, 211. [3] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6, pp.1 1, OSDI 4, USENIX Association, 24. [4] hadoop [5] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, Data warehousing and analytics infrastructure at facebook, Proceedings of the 21 ACM SIGMOD International Conference on Management of data, pp , SIGMOD 1, ACM, 21. [6] S. Idreos, M.L. Kersten, and S. Manegold, Database cracking, CIDR, pp.68 78, 27. [7] S. Idreos, M.L. Kersten, and S. Manegold, Self-organizing tuple reconstruction in column-stores, Proceedings of the 35th SIGMOD international conference on Management of data, pp , SIGMOD 9, ACM, New York, NY, USA, [8] H. Herodotou, Hadoop performance models, Dec [9] S. Richter, J.-A. Quiané-Ruiz, S. Schuh, and J. Dittrich, Towards zero-overhead adaptive indexing in hadoop, 212. [1] D. Jiang, B.C. Ooi, L. Shi, and S. Wu, The performance of mapreduce: an in-depth study, Proc. VLDB Endow., vol.3, no.1-2, pp , Sept. 21. [11] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, Hadoop++: making a yellow elephant run like a cheetah (without it even noticing), Proc. VLDB Endow., vol.3, no.1-2, pp , Sept [12] J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad, Only aggressive elephants are fast elephants, Proc. VLDB Endow., vol.5, no.11, pp , July [13] S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive analysis of web-scale datasets, Proc. of the 36th Int l Conf on Very Large Data Bases, pp , MapRedcue

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY *S. ANUSUYA,*R.B. ARUNA,*V. DEEPASRI,**DR.T. AMITHA *UG Students, **Professor Department Of Computer Science and Engineering Dhanalakshmi College of