Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment Shohei OKUDERA, Daisaku YOKOYMA, Miyuki NAKANO, and Masaru KITSUREGAWA Abstract Institute of Industrial Science the University of Tokyo Komaba 4 6 1 Meguro-ku Tokyo 153 855 Japan E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp We are required to process large-scale data efficiently as the volume of digital data we generate increases rapidly. MapReduce environment is suitable for large-scale distributed data processing and getting more important as a ad-hoc data analysis platform. However, all data scan is required every query when performing ad-hoc analysis in MapReduce environment, so the I/O cost will be badly huge. In this paper, in order to reduce the I/O cost, we propose an integrated MapReduce framework with adaptive index. This framework gradually shrinks the processing cost, as those queries are issued, each of which has a selection condition on the same column but different selection values. And the framework can keep the performance when data updates happens. We applied our proposed adaptive indexing technique into basic data operations projection, aggregation, join and evaluated the reduction of the execution time. We conducted the simulation and compared the accumulated execution time in the workload where the queries with the selection conditions slightly changed are repeatedly issued. As a result of simulation, our proposed framework showed higher performance than those processing without adaptive indexing. Key words MapReduce Adaptive indexing Ad-hoc query Simulation 1. SNS Facebook 28 8 1 213 1 1 3 [1], [2], ebay 2 3 5

Web. MapReduce [3] Hadoop [4] MapReduce [5] MapReduce MapReduce MapReduce 2 MapReduce 3 4 MapReduce 5 6 7 2. MapReduce MapReduce 3 discount (1) SELECT id, discount FROM R WHERE x <= discount < y ; (2) (3) SELECT S. partid, S. partname, R. discount FROM R, S WHERE x <= R. discount < y AND R. partid = S. partid 2. 1 Hadoop MapReduce. MapReduce MapReduce Map Reduce 2 Map Hadoop HDFS map Reduce Map Reduce HDFS 1 Map 2 < = discount < 8 Map Map Map Collect Spill Merge Reduce. Reduce Map Reduce HDFS 1 2. 2 1 3 < = discount < 5 MapReduce SELECT partid, AVE( d i s c o u n t ) FROM R WHERE x <= discount < y GROUP BY parrid

3. MapReduce Adaptive index [6], [7] MapReduce Adaptive indexing I/O 3. 1 2 2 MapReduce Map Map Map Map HDFS HDFS Adaptive indexing Database Cracking [6] 2 2 2 2 8 8 3 2 3. 2 3 Map 1. 2. Map 2 < = discount < 8 discount < 2, 2 < = discount < 8, 8 < = discount 3 2, 8 ) HDFS 3 3. 2. 1 5 7 4 2 8

Map Map HDFS 4 3. 2. 2 HDFS 5 3 1 5 1.all-delete HDFS 2 5 2.split-deleteHDFS HDFS hash hash hash hash 5 3.partial-update 3. 2. 3 2 1 5 3 HDFS 2 1 2 a ) Map b )

4. [8] Hadoop [8] I/O CPU Map 1 Scan,Map, Collect,Spill,Merge 5 Map Scan I/O CPU Intel Xeon E553 2.4GHz 2P/8C DDR3 24GB, HDD 1 HDD 115MBps OS Linux2.6.32, Hadoop.2.3 1GBps 1 4 1 1 5. 4. 1 MapRedcue 2 5. 1 2 17 17 R 1TB 1 168 S 1GB R id S partid 2 2 x, y.1% 8% x, y.1% discount MapReduce Map 512MB, Reduce 1 5. 2 1 5. 2. 1 2 (1) 6 6 (no-index) 691 2 13,8 (adaptiveindex) 1 725 2 2382 1 2, 1 5% 2

6 15 1 5 no-index adaptive index 6 5. 2. 2 2 (2) 1% 7 (no-index) 692 2 13,822 718 2 2,24 1 5% 2 6.1 15 1 5 no-index adaptive index 7 5. 2. 3 2 (3) (3) R S 8 (no-index) 1 2 774 15,478 1 81 2 3,895 3.5% 2 4 15 1 5 no-index adaptive index 8 5. 2. 4 R 1TB 5 2 ID 9 8%.1% 9 (no-index) 5 2 5 (partial-update) split-delete

(partial-update) (partial-update) 15 1 5 9 no-index all-delete split-delete partial-update 5. 3 2 2 4 [9] 4 3 5. 4 1 1 2 4 5 2 1.2 1 25 2 15 1 with-value 1stQuery sort 1stQuery index pointer 5. 5 11 2 MapReduce 2 11 6 4 2 with-value 1stQuery sort 1stQuery index pointer 5. 6 12 5 2 4

5 2 4 5 12 4 3 2 1 with-value 1stQuery sort 1stQuery index pointer 6. [9] [12] Dittrich [11] IO 12 Ritcher [9] Adaptive index [6], [7] Dremel [13] Dremel Dremel MapReduce I/O Map [1] M. Zuckerberg, One billion people on facebook, Oct. 212. One Billion People on Facebook [2] Big data: The next frontier for innovation, competition, and productivity, 211. [3] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6, pp.1 1, OSDI 4, USENIX Association, 24. [4] hadoop http://hadoop.apache.org/. [5] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, Data warehousing and analytics infrastructure at facebook, Proceedings of the 21 ACM SIGMOD International Conference on Management of data, pp.113 12, SIGMOD 1, ACM, 21. [6] S. Idreos, M.L. Kersten, and S. Manegold, Database cracking, CIDR, pp.68 78, 27. [7] S. Idreos, M.L. Kersten, and S. Manegold, Self-organizing tuple reconstruction in column-stores, Proceedings of the 35th SIGMOD international conference on Management of data, pp.297 38, SIGMOD 9, ACM, New York, NY, USA, 29. http://doi.acm.org/1.1145/1559845.1559878 [8] H. Herodotou, Hadoop performance models, Dec. 211. [9] S. Richter, J.-A. Quiané-Ruiz, S. Schuh, and J. Dittrich, Towards zero-overhead adaptive indexing in hadoop, 212. [1] D. Jiang, B.C. Ooi, L. Shi, and S. Wu, The performance of mapreduce: an in-depth study, Proc. VLDB Endow., vol.3, no.1-2, pp.472 483, Sept. 21. [11] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, Hadoop++: making a yellow elephant run like a cheetah (without it even noticing), Proc. VLDB Endow., vol.3, no.1-2, pp.515 529, Sept. 21. http://dl.acm.org/citation.cfm?id=192841.19298 [12] J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad, Only aggressive elephants are fast elephants, Proc. VLDB Endow., vol.5, no.11, pp.1591 162, July 212. http://dl.acm.org/citation.cfm?id=235229.235272 [13] S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive analysis of web-scale datasets, Proc. of the 36th Int l Conf on Very Large Data Bases, pp.33 339, 21. 7. MapRedcue