Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science. September 21, 2017

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science. September 21, 2017"

Transcription

1 E6893 Big Data Analytics Lecture 3: Big Data Storage and Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science September 21, E6893 Big Data Analytics Lecture 3: Storage and Processing

2 Demos Hive, HBase, and Spark by TA 2 E6893 Big Data Analytics Lecture 3: Storage and Processing

3 Analytics 1 Recommendation 3 E6893 Big Data Analytics Lecture 3: Storage and Processing

4 Key Components of Mahout 4 E6893 Big Data Analytics Lecture 3: Big Data Analytics

5 Mahout reference book 5

6 Setting Up Mahout Step 1: Java JVM and IDEs (e.g., Eclipse) Step 2: Maven Step 3: Mahout Eclipse 6

7 Motivation Information Overload News Books, Journals, Research papers Movies Books Electronics! Provide users with the information they need at the right time 7

8 Recommender Systems: General Idea User profile (info about the user) comparison Set of items Recommendation e.g. news articles, books, music, movies, products, Sample applications p E-commerce Product recommender - Amazon p Enterprise Activity Intelligence Domain expert finder, p Digital Libraries Pages/documents/books recommendation p Personal Assistance Museum guidance, Algorithms p Content-based Filtering p Collaborative Filtering p Hybrid Filtering (combination of the above two) 8

9 Content-based Filtering (CBF) Recommending items based on the content and properties Preference: and Matching Item Stream sports Tech World World Health Health Recommending 9

10 Collaborative Filtering (CF) Leveraging opinions of like-minded users Given Recommend People with similar tastes 10

11 Exploiting Dynamic Patterns for Recommender Systems Dynamic nature from both items and users Items expire over time with types of Short-term Long-term Users intentions of Updating breaking information Looking for long-term items Users interests evolve over time User adoption patterns Some users are earlier adopters 11

12 Recommender Inputs Solid lines: positively related Dashed lines: negatively related Input Data: User, Item, Rating 12

13 User-based Recommendation Scenario I gettofail.com 13

14 User-based Recommendation Scenario II 14

15 User-based Recommendation Scenario III 15

16 User-based Recommendation Algorithms 16

17 Example Recommender Code via Mahout 17

18 Process and output of the example Recommendation for Person 1: Item 104 > Item 106 Item 107 is not favored 18

19 User Similarity Measurements Pearson Correlation Similarity Euclidean Distance Similarity Cosine Measure Similarity Spearman Correlation Similarity Tanimoto Coefficient Similarity (Jaccard coefficient) Log-Likelihood Similarity 19

20 Pearson Correlation Similarity Data: missing data 20

21 On Pearson Similarity Three problems with the Pearson Similarity: 1. Not take into account of the number of items in which two users preferences overlap. (e.g., 2 overlap items ==> 1, more items may not be better.) 2. If two users overlap on only one item, no correlation can be computed. 3. The correlation is undefined if either series of preference values are identical. Adding Weighting.WEIGHTED as 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0, or -1.0, depending on how many points are used. 21

22 Euclidean Distance Similarity Similarity = 1 / ( 1 + d ) 22

23 Cosine Similarity Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0). 23

24 Spearman Correlation Similarity Example for ties Pearson value on the relative ranks 24

25 Caching User Similarity Spearman Correlation Similarity is time consuming. Need to use Caching ==> remember s user-user similarity which was previously computed. 25

26 Tanimoto (Jaccard) Coefficient Similarity Discard preference values 26 Tanimoto similarity is the same as Jaccard similarity. But, Tanimoto distance is not the same as Jaccard distance.

27 Log-Likelihood Similarity Asses how unlikely it is that the overlap between the two users is just due to chance. 27

28 Performance measurements Using GroupLens data ( 10 million rating MovieLens dataset. Spearnman: 0.8 Tanimoto: 0.82 Log-Likelihood: 0.73 Euclidean: 0.75 Pearson (weighted): 0.77 Pearson:

29 Performance measurements 10 nearest neighbors: nearest neighbors: nearest neighbors: % of training; 5% of testing 29

30 Selecting the number of neighbors Based on number of neighbors Based on a fixed threshold, e.g., 0.7 or

31 Item-based recommendation 31

32 Item-based recommendation algorithm 32

33 Code and Performance of Item-Based Recommendation performance 33

34 Slope-One Recommender 34

35 Slope-One Algorithm Difference values from the example Slope-One got a result of near 0.65 on the GroupLens data 35

36 Other recommenders SVD recommender number of features number of training step lambda: factor for regularization SVD method got 0.69 on the GroupLens data 36

37 Linear Interpolation Item-based recommender SVD method got 0.76 on the GroupLens data 37

38 Cluster-based Recommendation 38

39 Other Recommenders not in Mahout Groups (SDM 06) A 3 rd party Knowledge Repository: 30K users and 20K documents. Study the most active 697 users who have at least 20 download in a year. Results: beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering) CB DR CB DR CB DR 39

40 In E-Commerce: Rogers Diffusion of Innovations Theory Innovators Early adopters Early majority Late majority Laggards Users adoption patterns: Some users tend to adopt innovations earlier than others! Information virtually flows from early adopters to late adopters 40

41 Recommendation Driven by Information Flow Innovator adopt Early adopter Innovator adopt?? Early adopter Late majority Early majority Late majority? Early majority Laggard Laggard adopt People with similar tastes People with similar tastes Influence is not symmetric! 41

42 Scheme Overview p Leverage the uneven influence Dataset EABIF Information Propagation Model Application: Personalized Recommendation EABIF: Early Adoption based Information Flow Network 42

43 EABIF (1) -- Markov Chain p p A Markov chain has two components A network structure where each node is called a state A transition probability of traversing a link given that the chain is in a state (P: transition probability matrix) A stationary distribution is a probability distribution q such that q = qp how likely you will stay at one node i P ij j p Application -- PageRank [Brin and Page 98] Assumption: A link from page A to page B is a recommendation of page B by the author of A A B Quality of a page is related to p Number of pages linking to it C p The quality of pages linking to it Assume the web is a Markov chain p PageRank = stationary distribution of this Markov chain 43

44 EABIF (2) p Early Adoption Matrix (EAB) Count how many items one user accesses earlier than the other pairwise comparison p Markov Chain Model Normalize EAB to a transition matrix F of a Markov chain Adjustment F to guarantee the existence of stationary distribution of the Markov chain p p Make the matrix stochastic F ij = 1 Make the Markov chain irreducible j F 0, 1 i, j N ij 44

45 Information Propagation Models 1. Summation of various propagation steps ( ) ( ( ) ( )) 2 m! F = F + F + + F if m m v r 1 u 2. Direct summation F if ( d ) ( ( 1 )) N F I F = I F r 2 r 3 N: number of the nodes 3. Exponential weighted summation F # 1 1 $ if (exp) = & F 2! F N 1! F ' ( ) ( ) ( ) 2 ( ) ( N 1 β β! β ) )(!' exp( β ) 45

46 Topic-Sensitive Early Adoption Based Information Flow (TEABIF) Network p Adoption is typically category specific An early adopter of fashion may not be an early adopter of technology Application: Personalized Recommendation LDA TOPIC 1 TOPIC 2 TOPIC 1 TOPIC 2 TOPIC 3 TOPIC 3 46

47 Experimental Setup ER dataset 2004 Apr. to 2005 Apr. as training data 2005 May to 2005 Jul. as test data 1033 users, 586 documents Process Construct information flow network based on the training data Trigger earliest users to start the process Predict who will be also interested in these documents Evaluation Precision & Recall 47

48 Experimental Results --Recommendation Quality Precision Precision Precision Comparison (Number of T riggered Users = 1, Propagation Steps = 1) No. of retrieved users Precision Comparison (Number of T riggered Users = 2, Propagation Steps = 1) CF EABIF TEABIF CF EABIF TEABIF Recall Recall Recall Comparison (Number of T riggered Users = 1, Propagation Steps = 1) CF EABIF TEABIF No. of retrieved users Recall Comparison (Number of T riggered Users = 2, Propagation Steps = 1) CF EABIF TEABIF No. of retrieved users No. of retrieved users p Comparing to Collaborative Filtering (CF), precision: EABIF is 91.0% better, TEABIF is 108.5% better Recall: EABIF is 87.1% better, TEABIF is 112.8% better 48

49 Experimental Results -- Propagation Performance Ratio (x100%) Precision Improvement Comparison (Number of triggered users = 1, Baseline: CF) m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) EABIF TEABIF exp(β= 8) exp(β= 16) Ratio (x100%) Recall Improvement Comparison (Number of triggered users = 1, Baseline: CF) EABIF TEABIF m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) exp(β= 8) exp(β= 16) Ratio (x100%) Precision Improvement Comparison (Number of triggered users = 2, Baseline: CF) m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) EABIF TEABIF exp(β= 8) exp(β= 16) Ratio (x100%) Recall Improvement Comparison (Number of triggered users = 2, Baseline: CF) m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) EABIF TEABIF exp(β= 8) exp(β= 16) p TEABIF with exponential weighted summation ( β = 3 ) p achieves the best performance: p improves 108.5% on precision and 116.9% on recall comparing to CF 49

50 Summary Exploit dynamic patterns including Leverage dynamic patterns from both documents and users perspective Analyzing documents accessing types Predicting documents expiration date Detecting users intentions Identifying users interests evolving over time CTC model Ranking the documents adaptively Time-sensitive Adaboost Utilize users adoption patterns Information virtually flows from early adopters to late adopters Experimental results demonstrate Dynamic factors are important for recommendations 50

51 Selected Publications X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, Modeling Evolutionary Behaviors for Community-based Dynamic Recommendation, SIAM Conf. on Data Mining, Bethesda, MD, Apr X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, Personalized Recommendation Driven by Information Flow, ACM SIGIR, Aug X. Song, C.-Y. Lin, B. L. Tseng and M.-T. Sun, Modeling and Predicting Personal Information Dissemination Behavior, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug X. Song, B. L. Tseng, C.-Y. Lin, and M.-T. Sun, "ExpertiseNet: Relational and Evolutionary Expert Modeling," International Conference on User Modeling, Edinburgh, UK, Jul ,

52 Distributed Item-based Recommender 52

53 Distributed recommender get co-occurrence matrix Data: 53

54 Multiply the co-occurrence matrix with user preference The highest is 103 (101, 104, 105, 107 have been purchased by user 3) 54

55 Translating to MapReduce: generating user vectors 55

56 Translating to MapReduce: calculating co-occurrence 56

57 Translating to MapReduce: matrix multiplication 57

58 Translating to MapReduce: partial products 58

59 Translating to MapReduce: partial product II 59

60 Running Recommender on MapReduce and HDFS 60

61 Questions? 61 E6893 Big Data Analytics Lecture 3: Storage and Processing

62 E6893 Big Data Analytics: Appendix references 62

63 Part I: Pig installation and Demo Pig is a platform for analyzing large data sets that consists of a highlevel language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. 63

64 1. Installation of Pig: setup.html Download pig and run following sentence: export PATH=/<my-path-to-pig>/pign.n.n/bin:$PATH 64

65 65

66 2. Running pig in local mode: pig -x local movies = LOAD '/Users/Rich/ Documents/Courses/Fall2014/ BigData/Pig/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration); 66 DUMP movies

67 Filter: List the movies that were released between 1950 and 1960 movies_1950_1960 = FILTER movies BY (float)year>1949 and (float)year<1961; store movies_1950_1960 into '/Users/Rich/ Desktop/Demo/movies_1950_1960'; 67

68 Foreach Generate: List movie names and their duration (in minutes) movies_name_duration = foreach movies generate name, (float)duration/3600; store movies_name_duration into '/Users/Rich/ Desktop/Demo/movies_name_duration'; 68

69 Order: List all movies in descending order of year movies_year_sort =order movies by year desc; store movies_year_sort into '/Users/Rich/ Desktop/Demo/movies_year_sort'; 69

70 3. Running pig with HDFS: pig Run HDFS first: ssh localhost cd /usr/local/cellar/hadoop/2.5.0 sbin/start-dfs.sh 70

71 3. Upload file to HDFS: Make a directory: bin/hdfs dfs -mkdir /PigSource Upload a file: bin/hdfs dfs -put /Users/Rich/Documents/Courses/ Spring2014/CloudComputing/HW/MINI5/ movies_data.csv /PigSource 71

72 4. Run pig in grunt with HDFS: pig movies = LOAD '/PigSource/ movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration); DUMP movies 72

73 5. Run.pig file with HDFS: pig pig /Users/Rich/Documents/Courses/ Fall2014/BigData/Pig/run1.pig 73

74 Part II: Hbase installation and Demo HBase is an open source, non-relational, distributed database modeled after Google's Big Table and written in Java. It runs on top of HDFS, and can serve as input and output for MapReduce jobs run in Hadoop. Access through Java API and Pig. 74

75 Hbase Configuration: Download and configure Hbase by following the instruction online. To run Hbase, under your Hbase path: $bin/start-hbase.sh Enter shell $bin/hbase shell Also visit localhost:60010 to check Hbase webui 75

76 Hbase Command: Create: hbase>create table, cf hbase>put table', r1', cf:c1', 'value1 hbase>scan table Also we can do count, get, delete and drop. etc much like other DB systems. 76

77 Part III: Hive installation and Demo 77

78 Simple demo for Hive Install steps: Mac: you can first install brew $ brew install Hive Linux: cd ~/Downloads wget hive /apache-hive bin.tar.gz cp apache-hive bin.tar.gz ~ cd ~ tar zxvf apache-hive bin.tar.gz mv apache-hive bin hive cd hive cd bin./hive 78

79 Simple demo for Hive Step 1. start Hadoop, create a file under /user/ yourname hadoop fs -mkdir test Step 2. put your dataset into hdfs hadoop fs -put (your dataset path) test hadoop fs -ls test 79

80 Simple demo for Hive Step 3. create the table (SQL) in Hive shell hive Step 4. store the data into table load data inpath '/user/yourname/test' into table client; You can see your data has already been put into the table 80

81 Simple demo for Hive 1. select * from client where name='henry'; 2. select avg(age) from client where gender='male'; 81

82 Simple demo for Hive Use JDBC(JAVA) to access data in Hive Server Other choice: Python, PHP etc. 82

83 Simple demo for Hive Open the Hive (Starting Hive Thrift Server) serverhive --service hiveserver -p URL: jdbc:hive://localhost:10000/default Result in the console 83

84 Simple demo for Hive Example. Jetty Server with Jersey, restful web service Create simple API (Http request, XML/JSON format output) 84

85 Simple demo for Hive Example. Jetty Server with Jersey, restful web service 85

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

More information

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT. Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem

More information

Distributed Computing with Spark

Distributed Computing with Spark Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Aims. Background. This exercise aims to get you to:

Aims. Background. This exercise aims to get you to: Aims This exercise aims to get you to: Import data into HBase using bulk load Read MapReduce input from HBase and write MapReduce output to HBase Manage data using Hive Manage data using Pig Background

More information

Introduction to Hive Cloudera, Inc.

Introduction to Hive Cloudera, Inc. Introduction to Hive Outline Motivation Overview Data Model Working with Hive Wrap up & Conclusions Background Started at Facebook Data was collected by nightly cron jobs into Oracle DB ETL via hand-coded

More information

Similarity Ranking in Large- Scale Bipartite Graphs

Similarity Ranking in Large- Scale Bipartite Graphs Similarity Ranking in Large- Scale Bipartite Graphs Alessandro Epasto Brown University - 20 th March 2014 1 Joint work with J. Feldman, S. Lattanzi, S. Leonardi, V. Mirrokni [WWW, 2014] 2 AdWords Ads Ads

More information

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Alessandro Epasto J. Feldman*, S. Lattanzi*, S. Leonardi, V. Mirrokni*. *Google Research Sapienza U. Rome Motivation Recommendation

More information

Matrix Co-factorization for Recommendation with Rich Side Information HetRec 2011 and Implicit 1 / Feedb 23

Matrix Co-factorization for Recommendation with Rich Side Information HetRec 2011 and Implicit 1 / Feedb 23 Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback Yi Fang and Luo Si Department of Computer Science Purdue University West Lafayette, IN 47906, USA fangy@cs.purdue.edu

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS Vandita Jain 1, Prof. Tripti Saxena 2, Dr. Vineet Richhariya 3 1 M.Tech(CSE)*,LNCT, Bhopal(M.P.)(India) 2 Prof. Dept. of CSE, LNCT, Bhopal(M.P.)(India)

More information

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

Column Stores and HBase. Rui LIU, Maksim Hrytsenia Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase

More information

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc. PROFESSIONAL NoSQL Shashank Tiwari WILEY John Wiley & Sons, Inc. Examining CONTENTS INTRODUCTION xvil CHAPTER 1: NOSQL: WHAT IT IS AND WHY YOU NEED IT 3 Definition and Introduction 4 Context and a Bit

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Oracle Data Integrator 12c: Integration and Administration

Oracle Data Integrator 12c: Integration and Administration Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive

More information

Data Science Bootcamp Curriculum. NYC Data Science Academy

Data Science Bootcamp Curriculum. NYC Data Science Academy Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations

More information

Sandbox Setup Guide for HDP 2.2 and VMware

Sandbox Setup Guide for HDP 2.2 and VMware Waterline Data Inventory Sandbox Setup Guide for HDP 2.2 and VMware Product Version 2.0 Document Version 10.15.2015 2014-2015 Waterline Data, Inc. All rights reserved. All other trademarks are the property

More information

Recommender Systems. Collaborative Filtering & Content-Based Recommending

Recommender Systems. Collaborative Filtering & Content-Based Recommending Recommender Systems Collaborative Filtering & Content-Based Recommending 1 Recommender Systems Systems for recommending items (e.g. books, movies, CD s, web pages, newsgroup messages) to users based on

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

Oracle Big Data Discovery

Oracle Big Data Discovery Oracle Big Data Discovery Turning Data into Business Value Harald Erb Oracle Business Analytics & Big Data 1 Safe Harbor Statement The following is intended to outline our general product direction. It

More information

Section 7.12: Similarity. By: Ralucca Gera, NPS

Section 7.12: Similarity. By: Ralucca Gera, NPS Section 7.12: Similarity By: Ralucca Gera, NPS Motivation We talked about global properties Average degree, average clustering, ave path length We talked about local properties: Some node centralities

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.

Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved. Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources

More information

R Language for the SQL Server DBA

R Language for the SQL Server DBA R Language for the SQL Server DBA Beginning with R Ing. Eduardo Castro, PhD, Principal Data Analyst Architect, LP Consulting Moderated By: Jose Rolando Guay Paz Thank You microsoft.com idera.com attunity.com

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of

More information

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 21. Graph Computing Frameworks Paul Krzyzanowski Rutgers University Fall 2016 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Can we make MapReduce easier? November 21, 2016 2014-2016

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

Web Personalization & Recommender Systems

Web Personalization & Recommender Systems Web Personalization & Recommender Systems COSC 488 Slides are based on: - Bamshad Mobasher, Depaul University - Recent publications: see the last page (Reference section) Web Personalization & Recommender

More information

Multi-domain Predictive AI. Correlated Cross-Occurrence with Apache Mahout and GPUs

Multi-domain Predictive AI. Correlated Cross-Occurrence with Apache Mahout and GPUs Multi-domain Predictive AI Correlated Cross-Occurrence with Apache Mahout and GPUs Pat Ferrel ActionML, Chief Consultant Apache Mahout, PMC & Committer Apache PredictionIO, PMC & Committer pat@apache.org

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

The Evolution of Big Data Platforms and Data Science

The Evolution of Big Data Platforms and Data Science IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering

More information

Politecnico di Torino. Porto Institutional Repository

Politecnico di Torino. Porto Institutional Repository Politecnico di Torino Porto Institutional Repository [Proceeding] A Recommender System for Telecom Users: Evaluation of Recommendation Algorithms Experimental Original Citation: Falcarin P.; Vetro A.;

More information

RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING

RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING Yunkyoung Lee Follow this and additional

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,

More information

ANALYSIS OF DOMAIN INDEPENDENT STATISTICAL KEYWORD EXTRACTION METHODS FOR INCREMENTAL CLUSTERING

ANALYSIS OF DOMAIN INDEPENDENT STATISTICAL KEYWORD EXTRACTION METHODS FOR INCREMENTAL CLUSTERING ANALYSIS OF DOMAIN INDEPENDENT STATISTICAL KEYWORD EXTRACTION METHODS FOR INCREMENTAL CLUSTERING Rafael Geraldeli Rossi 1, Ricardo Marcondes Marcacini 1,2, Solange Oliveira Rezende 1 1 Institute of Mathematics

More information

Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures. Hiroshi Yamaguchi & Hiroyuki Adachi

Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures. Hiroshi Yamaguchi & Hiroyuki Adachi Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures Hiroshi Yamaguchi & Hiroyuki Adachi About Us 2 Hiroshi Yamaguchi Hiroyuki Adachi Hadoop DevOps Engineer Hadoop Engineer

More information

A Time-based Recommender System using Implicit Feedback

A Time-based Recommender System using Implicit Feedback A Time-based Recommender System using Implicit Feedback T. Q. Lee Department of Mobile Internet Dongyang Technical College Seoul, Korea Abstract - Recommender systems provide personalized recommendations

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

CSE 494: Information Retrieval, Mining and Integration on the Internet

CSE 494: Information Retrieval, Mining and Integration on the Internet CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Recommender Systems II Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Recommender Systems Recommendation via Information Network Analysis Hybrid Collaborative Filtering

More information

Higher level data processing in Apache Spark

Higher level data processing in Apache Spark Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Collaborative Filtering Nearest es Neighbor Approach Jeff Howbert Introduction to Machine Learning Winter 2012 1 Bad news Netflix Prize data no longer available to public. Just after contest t ended d

More information

Getting Started. Table of contents. 1 Pig Setup Running Pig Pig Latin Statements Pig Properties Pig Tutorial...

Getting Started. Table of contents. 1 Pig Setup Running Pig Pig Latin Statements Pig Properties Pig Tutorial... Table of contents 1 Pig Setup... 2 2 Running Pig... 3 3 Pig Latin Statements... 6 4 Pig Properties... 8 5 Pig Tutorial... 9 1. Pig Setup 1.1. Requirements Mandatory Unix and Windows users need the following:

More information

Big Data Technologies and Geospatial Data Processing:

Big Data Technologies and Geospatial Data Processing: Big Data Technologies and Geospatial Data Processing: A perfect fit Albert Godfrind Spatial and Graph Solutions Architect Oracle Corporation Agenda 1 2 3 4 The Data Explosion Big Data? Big Data and Geo

More information

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University {tedhong, dtsamis}@stanford.edu Abstract This paper analyzes the performance of various KNNs techniques as applied to the

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system

More information

Developer s Manual. Version May, Computer Science Department, Texas Christian University

Developer s Manual. Version May, Computer Science Department, Texas Christian University Developer s Manual Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Data Mining Concepts & Tasks

Data Mining Concepts & Tasks Data Mining Concepts & Tasks Duen Horng (Polo) Chau Georgia Tech CSE6242 / CX4242 Sept 9, 2014 Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos Last Time

More information

Oracle R Technologies

Oracle R Technologies Oracle R Technologies R for the Enterprise Mark Hornick, Director, Oracle Advanced Analytics @MarkHornick mark.hornick@oracle.com Safe Harbor Statement The following is intended to outline our general

More information

Using Google s PageRank Algorithm to Identify Important Attributes of Genes

Using Google s PageRank Algorithm to Identify Important Attributes of Genes Using Google s PageRank Algorithm to Identify Important Attributes of Genes Golam Morshed Osmani Ph.D. Student in Software Engineering Dept. of Computer Science North Dakota State Univesity Fargo, ND 58105

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

Design and Implement of Bigdata Analysis Systems

Design and Implement of Bigdata Analysis Systems Design and Implement of Bigdata Analysis Systems Jeong-Joon Kim *Department of Computer Science & Engineering, Korea Polytechnic University, Gyeonggi-do Siheung-si 15073, Korea. Abstract The development

More information

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials

More information

Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback

Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback ABSTRACT Yi Fang Department of Computer Science Purdue University West Lafayette, IN 47907, USA fangy@cs.purdue.edu

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

E6885 Network Science Lecture 10: Graph Database (II)

E6885 Network Science Lecture 10: Graph Database (II) E 6885 Topics in Signal Processing -- Network Science E6885 Network Science Lecture 10: Graph Database (II) Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University November 18th, 2013 Course

More information

Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017

Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017 Distributed Systems 09r. Map-Reduce Programming on AWS/EMR (Part I) Paul Krzyzanowski TA: Long Zhao Rutgers University Fall 2017 November 21, 2017 2017 Paul Krzyzanowski 1 Setting Up AWS/EMR November 21,

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Progress DataDirect For Business Intelligence And Analytics Vendors

Progress DataDirect For Business Intelligence And Analytics Vendors Progress DataDirect For Business Intelligence And Analytics Vendors DATA SHEET FEATURES: Direction connection to a variety of SaaS and on-premises data sources via Progress DataDirect Hybrid Data Pipeline

More information

CS /21/2016. Paul Krzyzanowski 1. Can we make MapReduce easier? Distributed Systems. Apache Pig. Apache Pig. Pig: Loading Data.

CS /21/2016. Paul Krzyzanowski 1. Can we make MapReduce easier? Distributed Systems. Apache Pig. Apache Pig. Pig: Loading Data. Distributed Systems 1. Graph Computing Frameworks Can we make MapReduce easier? Paul Krzyzanowski Rutgers University Fall 016 1 Apache Pig Apache Pig Why? Make it easy to use MapReduce via scripting instead

More information

Web Personalization & Recommender Systems

Web Personalization & Recommender Systems Web Personalization & Recommender Systems COSC 488 Slides are based on: - Bamshad Mobasher, Depaul University - Recent publications: see the last page (Reference section) Web Personalization & Recommender

More information

Recommendation with Differential Context Weighting

Recommendation with Differential Context Weighting Recommendation with Differential Context Weighting Yong Zheng Robin Burke Bamshad Mobasher Center for Web Intelligence DePaul University Chicago, IL USA Conference on UMAP June 12, 2013 Overview Introduction

More information

Hadoop Quickstart. Table of contents

Hadoop Quickstart. Table of contents Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone

More information

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008 The Agenda

More information

Available online at ScienceDirect. Procedia Technology 17 (2014 )

Available online at  ScienceDirect. Procedia Technology 17 (2014 ) Available online at www.sciencedirect.com ScienceDirect Procedia Technology 17 (2014 ) 528 533 Conference on Electronics, Telecommunications and Computers CETC 2013 Social Network and Device Aware Personalized

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

New Features and Enhancements in Big Data Management 10.2

New Features and Enhancements in Big Data Management 10.2 New Features and Enhancements in Big Data Management 10.2 Copyright Informatica LLC 2017. Informatica, the Informatica logo, Big Data Management, and PowerCenter are trademarks or registered trademarks

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important

More information

Solr Installation User Guide. Solr Installation Brainvire Infotech Pvt. Ltd

Solr Installation User Guide. Solr Installation Brainvire Infotech Pvt. Ltd Solr Installation 1 Lets see how to Install Solr, it is very easy and we are going to do that in this Solr Installation guideline. So let s start! Let s start for Linux and Mac. We ll guide you trough

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Experiences from Implementing Collaborative Filtering in a Web 2.0 Application

Experiences from Implementing Collaborative Filtering in a Web 2.0 Application Experiences from Implementing Collaborative Filtering in a Web 2.0 Application Wolfgang Woerndl, Johannes Helminger, Vivian Prinz TU Muenchen, Chair for Applied Informatics Cooperative Systems Boltzmannstr.

More information

Distributed Machine Learning" on Spark

Distributed Machine Learning on Spark Distributed Machine Learning" on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Outline Data flow vs. traditional network programming Spark computing engine Optimization Example Matrix Computations

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information

Using the Force of Python and SAS Viya on Star Wars Fan Posts

Using the Force of Python and SAS Viya on Star Wars Fan Posts SESUG Paper BB-170-2017 Using the Force of Python and SAS Viya on Star Wars Fan Posts Grace Heyne, Zencos Consulting, LLC ABSTRACT The wealth of information available on the Internet includes useful and

More information

SOLUTION BRIEF BIG DATA SECURITY

SOLUTION BRIEF BIG DATA SECURITY SOLUTION BRIEF BIG DATA SECURITY Get maximum value and insight from your Big Data initiatives while maintaining robust data security THE CHALLENGE More and more companies are finding that Big Data strategies

More information

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Deep Character-Level Click-Through Rate Prediction for Sponsored Search Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as

More information

The Simularity High Performance Correlation Engine

The Simularity High Performance Correlation Engine The Simularity High Performance Correlation Engine Simularity s High Performance Correlation Engine (HPCE) is a purpose-built analytics engine for similarity and correlation analytics optimized to do real-time,

More information

Mining Distributed Frequent Itemset with Hadoop

Mining Distributed Frequent Itemset with Hadoop Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,

More information

Feature-weighted User Model for Recommender Systems

Feature-weighted User Model for Recommender Systems Feature-weighted User Model for Recommender Systems Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos Aristotle University, Department of Informatics, Thessaloniki 54124, Greece {symeon,

More information

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.11 Last Updated: 1/10/2018 Please note: This appliance is for testing and educational purposes only;

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Processing Big Data with Hadoop in Azure HDInsight

Processing Big Data with Hadoop in Azure HDInsight Processing Big Data with Hadoop in Azure HDInsight Lab 1 - Getting Started with HDInsight Overview In this lab, you will provision an HDInsight cluster. You will then run a sample MapReduce job on the

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information