DATA MINING II - 1DL460


1 DATA MINING II - 1DL460
Spring 2017
A second course in data mining
Kjell Orsborn
Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden

2 Introduction to NoSQL in Data Mining
Intro to MapReduce
Kjell Orsborn
Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden

3 What is a NoSQL Database?
- A key/value store
  - Basic index manager, no complete query language
  - E.g. Google BigTable, Amazon Dynamo
- A DBMS with a limited query language
  - Provides for high volume small business transactions
  - Sometimes called cloud databases
  - E.g. Google App Engine, Microsoft Azure, Amazon SimpleDB; MongoDB
- A DBMS with a new query language for new kinds of applications
  - Amos II, Streambase, Virtuoso, Neo4J

4 What is a NoSQL Database?
- A batch database processing system where mapreduce is used instead of queries to query, transform, and analyze large datasets
  - Hand-written programs iterate over entire data sets in batch
  - E.g. Hadoop, Spark
- A mapreduce engine with a query language on top:
  - HIVE on top of Hadoop provides HiveQL
  - Provides non-procedural data analytics (select - from - groupby) without detailed programming
  - Executed in batch as parallel Hadoop jobs

5 MapReduce
- Parallel batch processing using mapreduce
  - Highly scalable implementation of parallel batch processing of the same (e.g. Java, C++, Python) program over large amounts of data stored in different files
  - Based on a scalable file system (e.g. HDFS)
- The mapreduce function:
  - Applies a (costly) user function, the mapper, which produces key/value pairs in parallel on many nodes accessing files in a cluster
  - Applies a user aggregate function, the reducer, to the key/value pairs produced by the mapper
  - Very similar to GROUP BY in SQL
- Reference article:

6 Mapreduce
[Figure: data flow from six data files (file I/O) through six Map tasks and six Partition steps into three Reduce tasks, then an Output Writer that produces the result file]

7 Mapreduce code (Python)

    # emit() is assumed to be supplied by the mapreduce framework
    def map(name, document):
        # name: document name, i.e. HDFS file name
        # document: document contents, parsed HDFS file tokens
        # Can make own parser as preprocessor
        for w in document:
            emit(w, 1)

    def reduce(word, partialcounts):
        # word: a word
        # partialcounts: a list of aggregated partial counts for the word
        total = 0
        for pc in partialcounts:
            total += int(pc)
        emit(word, total)
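The word count code above can be tried out without a cluster. The following is a minimal sketch, assuming a single process stands in for the mapreduce system: the driver `run_wordcount` and the names `map_fn` and `reduce_fn` are hypothetical, and the grouping dictionary plays the role of the temporary key/value files.

```python
from collections import defaultdict


def map_fn(name, document):
    """Yield a (word, 1) pair for every word in the document text."""
    for w in document.split():
        yield (w, 1)


def reduce_fn(word, partial_counts):
    """Sum the partial counts collected for one word."""
    return (word, sum(int(pc) for pc in partial_counts))


def run_wordcount(files):
    """Map every file, group values by key, then reduce each group.

    `files` maps a document name to its contents (an assumption that
    replaces the HDFS input reader).
    """
    groups = defaultdict(list)
    for name, document in files.items():
        for key, value in map_fn(name, document):
            groups[key].append(value)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


files = {"a.txt": "to be or not to be", "b.txt": "to do"}
counts = run_wordcount(files)
print(counts)
```

In a real deployment the map calls run on different nodes and the grouping step is done by the framework; only `map_fn` and `reduce_fn` correspond to user code.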

8 Mapreduce stages
- Input reader
  - System component that reads files from scalable file system (e.g. HDFS) and sends them to map functions applied in parallel
- Map function
  - Applied in parallel on many different files
  - Reads input file data from HDFS
  - Does some (expensive) computation
  - Emits key/value pairs as result
  - Key/value pairs stored by MapReduce system as temporary files
- Partition function (optional)
  - Partitions output key/value pairs from map function into groups of key/value pairs to be reduced in parallel
  - Usually hash partitioning
- Reduce function
  - Iterates over set of key/value pairs to produce a reduced set of key/value pairs stored in the file system
  - Cf. aggregate functions
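The partition stage can be illustrated with a small sketch of hash partitioning. The function names `partition` and `shuffle` are hypothetical, and `zlib.crc32` is used only as an example of a hash that is stable between runs; a real mapreduce system has its own partitioner.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Hash-partition a key to a reducer index in [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to its reducer, grouped by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


# Mapper output for the text "to be or not to be"
mapped = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
buckets = shuffle(mapped, num_reducers=3)

# Each reducer sums the values of the keys it received.
reduced = [{k: sum(v) for k, v in b.items()} for b in buckets]
```

Because the reducer index depends only on the key, all pairs with the same key end up at the same reducer, which is what lets the reducers work in parallel without coordinating.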

9 HIVE
- SQL-like query language HiveQL support on top of Hadoop
- Developed by Facebook, now an Apache project that is also used and developed by other companies such as Netflix
- Queries very often involve grouping and statistics (as in SQL)
- Queries compiled into mapreduce jobs
  - Handles 80% of mapreduce analysis tasks at Facebook
- Much easier to use compared to raw mapreduce coding
  - Substantial improvement of programming productivity
  - Non-programmers can analyze data

10 Hive architecture

11 Wordcount in HiveQL

    FROM (
      MAP docs.doctext USING 'map.py' AS (word, cnt)
      FROM docs
      CLUSTER BY word
    )
    REDUCE word, cnt USING 'reduce.py';
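The slides do not show the contents of 'map.py' and 'reduce.py'. The following is a minimal sketch of what the two scripts might do, assuming Hive's streaming convention where rows arrive as tab-separated lines on standard input and results are written as tab-separated lines on standard output. The function names `map_lines` and `reduce_lines` are hypothetical; both pieces of logic are placed in one block here so that they can be run locally.

```python
from itertools import groupby


def map_lines(lines):
    """map.py logic: one input row of text -> one 'word<TAB>1' row per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """reduce.py logic: sum the counts of consecutive rows sharing a word.

    Relies on rows for the same word arriving next to each other,
    which CLUSTER BY word is expected to guarantee.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(rows, key=lambda r: r[0]):
        yield f"{word}\t{sum(int(cnt) for _, cnt in group)}"


# Local dry run; sorted() stands in for CLUSTER BY word.
mapped = sorted(map_lines(["to be or not", "to be"]))
result = list(reduce_lines(mapped))
print(result)
```

In an actual script each function would be fed `sys.stdin` and every yielded row would be printed.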

12 HIVEQL example
Log file:
    :26:41 SampleClass3 [TRACE] verbose detail for id
    :26:41 SampleClass2 [TRACE] verbose detail for id
    Java.lang.Exception: :27:10 SampleClass7 [ERROR] incorrect format for id
    :29:38 SampleClass1 [DEBUG] detail for id
HIVEQL:
    CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
    LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
    SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
Result:
    [TRACE] 2
    [DEBUG] 1

13 HIVEQL example 3
Log file:
    :26:41 SampleClass3 [TRACE] verbose detail for id
    :26:41 SampleClass2 [TRACE] verbose detail for id
    Java.lang.Exception: :27:10 SampleClass7 [ERROR] incorrect format for id
    :29:38 SampleClass1 [DEBUG] detail for id
HIVEQL:
    SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5;
Result:
    [ERROR] 1
HIVEQL:
    SELECT L.sev, SUM(L.cnt)
    FROM ( SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4
           UNION ALL
           SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5) L
    GROUP BY L.sev;
Result:
    [TRACE] 2
    [DEBUG] 1
    [ERROR] 1
=> 3 Mapreduce jobs!
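The logic of the severity queries can be checked in plain Python. This is a sketch only: the log lines below are hypothetical stand-ins in which the words DATE, TIME and ID replace the actual values, and the helper `severity_counts` is not part of Hive.

```python
from collections import Counter

# Hypothetical log lines; DATE, TIME and ID are placeholders.
sample_log = [
    "DATE TIME SampleClass3 [TRACE] verbose detail for id ID",
    "DATE TIME SampleClass2 [TRACE] verbose detail for id ID",
    "Java.lang.Exception: DATE TIME SampleClass7 [ERROR] incorrect format for id ID",
    "DATE TIME SampleClass1 [DEBUG] detail for id ID",
]


def severity_counts(lines, column):
    """Count values of a 1-based, space-delimited column that start with '['."""
    counts = Counter()
    for line in lines:
        fields = line.split(" ")
        if len(fields) >= column and fields[column - 1].startswith("["):
            counts[fields[column - 1]] += 1
    return counts


by_t4 = severity_counts(sample_log, 4)  # WHERE t4 LIKE '[%' GROUP BY t4
by_t5 = severity_counts(sample_log, 5)  # WHERE t5 LIKE '[%' GROUP BY t5
combined = by_t4 + by_t5                # UNION ALL, then SUM(cnt) GROUP BY sev
print(dict(combined))
```

The exception prefix shifts every field of that line one position to the right, which is why the severity of the error line is found in the fifth column rather than the fourth.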

14 Word count in SQL
- Alt 1, assume the words of the documents are stored in a table:
    CREATE TABLE docs(ID INTEGER, word VARCHAR(20), PRIMARY KEY(ID))
  - The query becomes:
    SELECT d.word, COUNT(d.word) FROM docs d GROUP BY d.word
  - Problem: DOCS is a table, not stored in a file
- Alt 2, use a user defined table function (UDF) to access documents in a file:
    SELECT d.word, COUNT(d.word) FROM mydocuments('C:/mydocuments') AS d GROUP BY d.word
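Alt 1 can be tried out with any SQL engine. The following sketch uses Python's built-in `sqlite3` module with an in-memory database; the sample words are made up, and an ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs(ID INTEGER PRIMARY KEY, word VARCHAR(20))")

# One row per word occurrence; ID is assigned automatically.
words = "to be or not to be".split()
conn.executemany("INSERT INTO docs(word) VALUES (?)", [(w,) for w in words])

rows = conn.execute(
    "SELECT d.word, COUNT(d.word) FROM docs d GROUP BY d.word ORDER BY d.word"
).fetchall()
print(rows)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
conn.close()
```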

15 HIVEQL vs raw mapreduce
- Raw mapreduce: Java (Python, C++, etc.) program does map and reduce
- Very common use of mapreduce: statistics collection over files (count, sum, stdev, etc.)
  - HIVEQL handles basic statistics: 80% of applications
- When advanced statistics are not supported in HIVEQL (or SQL):
  - Alt 1: User defined aggregate functions in HIVE (UDAF)
    - Can be generally used in other queries too
  - Alt 2: Raw mapreduce
    - Code may be complicated
    - Code cannot be used in queries
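The idea behind Alt 1 can be sketched with a user defined aggregate in SQLite, used here purely as an analogy: this is not Hive's UDAF interface, and the aggregate name `stdev`, the class `StdevAgg` and the table `m` are hypothetical. The point is that once registered, the aggregate can be called from any query, including one with GROUP BY.

```python
import math
import sqlite3


class StdevAgg:
    """Population standard deviation as a user defined aggregate."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is None:
            return
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def finalize(self):
        # Called once per group to produce the aggregate value.
        if self.n == 0:
            return None
        mean = self.total / self.n
        return math.sqrt(max(self.total_sq / self.n - mean * mean, 0.0))


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stdev", 1, StdevAgg)
conn.execute("CREATE TABLE m(grp TEXT, x REAL)")
conn.executemany(
    "INSERT INTO m VALUES (?, ?)", [("a", 2.0), ("a", 4.0), ("b", 5.0)]
)
result = dict(conn.execute("SELECT grp, stdev(x) FROM m GROUP BY grp"))
conn.close()
```

The `step`/`finalize` split mirrors the partial aggregation that a reduce function performs over the values of one key.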

16 Reading raw data files to DBMS
- Major point with mapreduce: saves the time needed to load a database and build indexes, compared to a DBMS
- Modern DBMSs have bulk load facilities
  - Never use insert command to bulk load
  - Bulk load speed approaches file copy time
  - Orders of magnitude faster than naïve inserts
  - Automatic parallelization of bulk loading by DBMS
  - Indexes can be built if needed

17 DBMSs vs mapreduce
- Modern DBMSs have built-in aggregate functions and user defined functions (UDFs), too
  - Can read data from files using UDFs
- DBMSs have indexing and high parallelism to provide scalability
  - Notice that it takes time to build indexes during database loading => may slow down database loading considerably
- When is HIVE with mapreduce better than RDBs?
  - One-shot queries when no indexing is needed
  - Massively parallel, very expensive brute-force computations
  - Embarrassingly parallel computations
