Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science. September 21, 2017
|
|
- Derrick John Malone
- 1 years ago
- Views:
Transcription
1 E6893 Big Data Analytics Lecture 3: Big Data Storage and Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science September 21, E6893 Big Data Analytics Lecture 3: Storage and Processing
2 Demos Hive, HBase, and Spark by TA 2 E6893 Big Data Analytics Lecture 3: Storage and Processing
3 Analytics 1 Recommendation 3 E6893 Big Data Analytics Lecture 3: Storage and Processing
4 Key Components of Mahout 4 E6893 Big Data Analytics Lecture 3: Big Data Analytics
5 Mahout reference book 5
6 Setting Up Mahout Step 1: Java JVM and IDEs (e.g., Eclipse) Step 2: Maven Step 3: Mahout Eclipse 6
7 Motivation Information Overload News Books, Journals, Research papers Movies Books Electronics! Provide users with the information they need at the right time 7
8 Recommender Systems: General Idea User profile (info about the user) comparison Set of items Recommendation e.g. news articles, books, music, movies, products, Sample applications p E-commerce Product recommender - Amazon p Enterprise Activity Intelligence Domain expert finder, p Digital Libraries Pages/documents/books recommendation p Personal Assistance Museum guidance, Algorithms p Content-based Filtering p Collaborative Filtering p Hybrid Filtering (combination of the above two) 8
9 Content-based Filtering (CBF) Recommending items based on the content and properties Preference: and Matching Item Stream sports Tech World World Health Health Recommending 9
10 Collaborative Filtering (CF) Leveraging opinions of like-minded users Given Recommend People with similar tastes 10
11 Exploiting Dynamic Patterns for Recommender Systems Dynamic nature from both items and users Items expire over time with types of Short-term Long-term Users intentions of Updating breaking information Looking for long-term items Users interests evolve over time User adoption patterns Some users are earlier adopters 11
12 Recommender Inputs Solid lines: positively related Dashed lines: negatively related Input Data: User, Item, Rating 12
13 User-based Recommendation Scenario I gettofail.com 13
14 User-based Recommendation Scenario II 14
15 User-based Recommendation Scenario III 15
16 User-based Recommendation Algorithms 16
17 Example Recommender Code via Mahout 17
18 Process and output of the example Recommendation for Person 1: Item 104 > Item 106 Item 107 is not favored 18
19 User Similarity Measurements Pearson Correlation Similarity Euclidean Distance Similarity Cosine Measure Similarity Spearman Correlation Similarity Tanimoto Coefficient Similarity (Jaccard coefficient) Log-Likelihood Similarity 19
20 Pearson Correlation Similarity Data: missing data 20
21 On Pearson Similarity Three problems with the Pearson Similarity: 1. Not take into account of the number of items in which two users preferences overlap. (e.g., 2 overlap items ==> 1, more items may not be better.) 2. If two users overlap on only one item, no correlation can be computed. 3. The correlation is undefined if either series of preference values are identical. Adding Weighting.WEIGHTED as 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0, or -1.0, depending on how many points are used. 21
22 Euclidean Distance Similarity Similarity = 1 / ( 1 + d ) 22
23 Cosine Similarity Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0). 23
24 Spearman Correlation Similarity Example for ties Pearson value on the relative ranks 24
25 Caching User Similarity Spearman Correlation Similarity is time consuming. Need to use Caching ==> remember s user-user similarity which was previously computed. 25
26 Tanimoto (Jaccard) Coefficient Similarity Discard preference values 26 Tanimoto similarity is the same as Jaccard similarity. But, Tanimoto distance is not the same as Jaccard distance.
27 Log-Likelihood Similarity Asses how unlikely it is that the overlap between the two users is just due to chance. 27
28 Performance measurements Using GroupLens data ( 10 million rating MovieLens dataset. Spearnman: 0.8 Tanimoto: 0.82 Log-Likelihood: 0.73 Euclidean: 0.75 Pearson (weighted): 0.77 Pearson:
29 Performance measurements 10 nearest neighbors: nearest neighbors: nearest neighbors: % of training; 5% of testing 29
30 Selecting the number of neighbors Based on number of neighbors Based on a fixed threshold, e.g., 0.7 or
31 Item-based recommendation 31
32 Item-based recommendation algorithm 32
33 Code and Performance of Item-Based Recommendation performance 33
34 Slope-One Recommender 34
35 Slope-One Algorithm Difference values from the example Slope-One got a result of near 0.65 on the GroupLens data 35
36 Other recommenders SVD recommender number of features number of training step lambda: factor for regularization SVD method got 0.69 on the GroupLens data 36
37 Linear Interpolation Item-based recommender SVD method got 0.76 on the GroupLens data 37
38 Cluster-based Recommendation 38
39 Other Recommenders not in Mahout Groups (SDM 06) A 3 rd party Knowledge Repository: 30K users and 20K documents. Study the most active 697 users who have at least 20 download in a year. Results: beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering) CB DR CB DR CB DR 39
40 In E-Commerce: Rogers Diffusion of Innovations Theory Innovators Early adopters Early majority Late majority Laggards Users adoption patterns: Some users tend to adopt innovations earlier than others! Information virtually flows from early adopters to late adopters 40
41 Recommendation Driven by Information Flow Innovator adopt Early adopter Innovator adopt?? Early adopter Late majority Early majority Late majority? Early majority Laggard Laggard adopt People with similar tastes People with similar tastes Influence is not symmetric! 41
42 Scheme Overview p Leverage the uneven influence Dataset EABIF Information Propagation Model Application: Personalized Recommendation EABIF: Early Adoption based Information Flow Network 42
43 EABIF (1) -- Markov Chain p p A Markov chain has two components A network structure where each node is called a state A transition probability of traversing a link given that the chain is in a state (P: transition probability matrix) A stationary distribution is a probability distribution q such that q = qp how likely you will stay at one node i P ij j p Application -- PageRank [Brin and Page 98] Assumption: A link from page A to page B is a recommendation of page B by the author of A A B Quality of a page is related to p Number of pages linking to it C p The quality of pages linking to it Assume the web is a Markov chain p PageRank = stationary distribution of this Markov chain 43
44 EABIF (2) p Early Adoption Matrix (EAB) Count how many items one user accesses earlier than the other pairwise comparison p Markov Chain Model Normalize EAB to a transition matrix F of a Markov chain Adjustment F to guarantee the existence of stationary distribution of the Markov chain p p Make the matrix stochastic F ij = 1 Make the Markov chain irreducible j F 0, 1 i, j N ij 44
45 Information Propagation Models 1. Summation of various propagation steps ( ) ( ( ) ( )) 2 m! F = F + F + + F if m m v r 1 u 2. Direct summation F if ( d ) ( ( 1 )) N F I F = I F r 2 r 3 N: number of the nodes 3. Exponential weighted summation F # 1 1 $ if (exp) = & F 2! F N 1! F ' ( ) ( ) ( ) 2 ( ) ( N 1 β β! β ) )(!' exp( β ) 45
46 Topic-Sensitive Early Adoption Based Information Flow (TEABIF) Network p Adoption is typically category specific An early adopter of fashion may not be an early adopter of technology Application: Personalized Recommendation LDA TOPIC 1 TOPIC 2 TOPIC 1 TOPIC 2 TOPIC 3 TOPIC 3 46
47 Experimental Setup ER dataset 2004 Apr. to 2005 Apr. as training data 2005 May to 2005 Jul. as test data 1033 users, 586 documents Process Construct information flow network based on the training data Trigger earliest users to start the process Predict who will be also interested in these documents Evaluation Precision & Recall 47
48 Experimental Results --Recommendation Quality Precision Precision Precision Comparison (Number of T riggered Users = 1, Propagation Steps = 1) No. of retrieved users Precision Comparison (Number of T riggered Users = 2, Propagation Steps = 1) CF EABIF TEABIF CF EABIF TEABIF Recall Recall Recall Comparison (Number of T riggered Users = 1, Propagation Steps = 1) CF EABIF TEABIF No. of retrieved users Recall Comparison (Number of T riggered Users = 2, Propagation Steps = 1) CF EABIF TEABIF No. of retrieved users No. of retrieved users p Comparing to Collaborative Filtering (CF), precision: EABIF is 91.0% better, TEABIF is 108.5% better Recall: EABIF is 87.1% better, TEABIF is 112.8% better 48
49 Experimental Results -- Propagation Performance Ratio (x100%) Precision Improvement Comparison (Number of triggered users = 1, Baseline: CF) m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) EABIF TEABIF exp(β= 8) exp(β= 16) Ratio (x100%) Recall Improvement Comparison (Number of triggered users = 1, Baseline: CF) EABIF TEABIF m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) exp(β= 8) exp(β= 16) Ratio (x100%) Precision Improvement Comparison (Number of triggered users = 2, Baseline: CF) m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) EABIF TEABIF exp(β= 8) exp(β= 16) Ratio (x100%) Recall Improvement Comparison (Number of triggered users = 2, Baseline: CF) m = 1 m = 2 m = 3 m = 4 m = 5 sum exp(β= 1) exp(β= 1.5) exp(β= 2) exp(β= 3) exp(β= 4) exp(β= 5) EABIF TEABIF exp(β= 8) exp(β= 16) p TEABIF with exponential weighted summation ( β = 3 ) p achieves the best performance: p improves 108.5% on precision and 116.9% on recall comparing to CF 49
50 Summary Exploit dynamic patterns including Leverage dynamic patterns from both documents and users perspective Analyzing documents accessing types Predicting documents expiration date Detecting users intentions Identifying users interests evolving over time CTC model Ranking the documents adaptively Time-sensitive Adaboost Utilize users adoption patterns Information virtually flows from early adopters to late adopters Experimental results demonstrate Dynamic factors are important for recommendations 50
51 Selected Publications X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, Modeling Evolutionary Behaviors for Community-based Dynamic Recommendation, SIAM Conf. on Data Mining, Bethesda, MD, Apr X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun, Personalized Recommendation Driven by Information Flow, ACM SIGIR, Aug X. Song, C.-Y. Lin, B. L. Tseng and M.-T. Sun, Modeling and Predicting Personal Information Dissemination Behavior, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug X. Song, B. L. Tseng, C.-Y. Lin, and M.-T. Sun, "ExpertiseNet: Relational and Evolutionary Expert Modeling," International Conference on User Modeling, Edinburgh, UK, Jul ,
52 Distributed Item-based Recommender 52
53 Distributed recommender get co-occurrence matrix Data: 53
54 Multiply the co-occurrence matrix with user preference The highest is 103 (101, 104, 105, 107 have been purchased by user 3) 54
55 Translating to MapReduce: generating user vectors 55
56 Translating to MapReduce: calculating co-occurrence 56
57 Translating to MapReduce: matrix multiplication 57
58 Translating to MapReduce: partial products 58
59 Translating to MapReduce: partial product II 59
60 Running Recommender on MapReduce and HDFS 60
61 Questions? 61 E6893 Big Data Analytics Lecture 3: Storage and Processing
62 E6893 Big Data Analytics: Appendix references 62
63 Part I: Pig installation and Demo Pig is a platform for analyzing large data sets that consists of a highlevel language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. 63
64 1. Installation of Pig: setup.html Download pig and run following sentence: export PATH=/<my-path-to-pig>/pign.n.n/bin:$PATH 64
65 65
66 2. Running pig in local mode: pig -x local movies = LOAD '/Users/Rich/ Documents/Courses/Fall2014/ BigData/Pig/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration); 66 DUMP movies
67 Filter: List the movies that were released between 1950 and 1960 movies_1950_1960 = FILTER movies BY (float)year>1949 and (float)year<1961; store movies_1950_1960 into '/Users/Rich/ Desktop/Demo/movies_1950_1960'; 67
68 Foreach Generate: List movie names and their duration (in minutes) movies_name_duration = foreach movies generate name, (float)duration/3600; store movies_name_duration into '/Users/Rich/ Desktop/Demo/movies_name_duration'; 68
69 Order: List all movies in descending order of year movies_year_sort =order movies by year desc; store movies_year_sort into '/Users/Rich/ Desktop/Demo/movies_year_sort'; 69
70 3. Running pig with HDFS: pig Run HDFS first: ssh localhost cd /usr/local/cellar/hadoop/2.5.0 sbin/start-dfs.sh 70
71 3. Upload file to HDFS: Make a directory: bin/hdfs dfs -mkdir /PigSource Upload a file: bin/hdfs dfs -put /Users/Rich/Documents/Courses/ Spring2014/CloudComputing/HW/MINI5/ movies_data.csv /PigSource 71
72 4. Run pig in grunt with HDFS: pig movies = LOAD '/PigSource/ movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration); DUMP movies 72
73 5. Run.pig file with HDFS: pig pig /Users/Rich/Documents/Courses/ Fall2014/BigData/Pig/run1.pig 73
74 Part II: Hbase installation and Demo HBase is an open source, non-relational, distributed database modeled after Google's Big Table and written in Java. It runs on top of HDFS, and can serve as input and output for MapReduce jobs run in Hadoop. Access through Java API and Pig. 74
75 Hbase Configuration: Download and configure Hbase by following the instruction online. To run Hbase, under your Hbase path: $bin/start-hbase.sh Enter shell $bin/hbase shell Also visit localhost:60010 to check Hbase webui 75
76 Hbase Command: Create: hbase>create table, cf hbase>put table', r1', cf:c1', 'value1 hbase>scan table Also we can do count, get, delete and drop. etc much like other DB systems. 76
77 Part III: Hive installation and Demo 77
78 Simple demo for Hive Install steps: Mac: you can first install brew $ brew install Hive Linux: cd ~/Downloads wget hive /apache-hive bin.tar.gz cp apache-hive bin.tar.gz ~ cd ~ tar zxvf apache-hive bin.tar.gz mv apache-hive bin hive cd hive cd bin./hive 78
79 Simple demo for Hive Step 1. start Hadoop, create a file under /user/ yourname hadoop fs -mkdir test Step 2. put your dataset into hdfs hadoop fs -put (your dataset path) test hadoop fs -ls test 79
80 Simple demo for Hive Step 3. create the table (SQL) in Hive shell hive Step 4. store the data into table load data inpath '/user/yourname/test' into table client; You can see your data has already been put into the table 80
81 Simple demo for Hive 1. select * from client where name='henry'; 2. select avg(age) from client where gender='male'; 81
82 Simple demo for Hive Use JDBC(JAVA) to access data in Hive Server Other choice: Python, PHP etc. 82
83 Simple demo for Hive Open the Hive (Starting Hive Thrift Server) serverhive --service hiveserver -p URL: jdbc:hive://localhost:10000/default Result in the console 83
84 Simple demo for Hive Example. Jetty Server with Jersey, restful web service Create simple API (Http request, XML/JSON format output) 84
85 Simple demo for Hive Example. Jetty Server with Jersey, restful web service 85
Creating a Recommender System. An Elasticsearch & Apache Spark approach
Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused
Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.
Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem
Distributed Computing with Spark
Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing
Talend Big Data Sandbox. Big Data Insights Cookbook
Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is
Talend Big Data Sandbox. Big Data Insights Cookbook
Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is
Aims. Background. This exercise aims to get you to:
Aims This exercise aims to get you to: Import data into HBase using bulk load Read MapReduce input from HBase and write MapReduce output to HBase Manage data using Hive Manage data using Pig Background
Introduction to Hive Cloudera, Inc.
Introduction to Hive Outline Motivation Overview Data Model Working with Hive Wrap up & Conclusions Background Started at Facebook Data was collected by nightly cron jobs into Oracle DB ETL via hand-coded
Similarity Ranking in Large- Scale Bipartite Graphs
Similarity Ranking in Large- Scale Bipartite Graphs Alessandro Epasto Brown University - 20 th March 2014 1 Joint work with J. Feldman, S. Lattanzi, S. Leonardi, V. Mirrokni [WWW, 2014] 2 AdWords Ads Ads
Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs
Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs Alessandro Epasto J. Feldman*, S. Lattanzi*, S. Leonardi, V. Mirrokni*. *Google Research Sapienza U. Rome Motivation Recommendation
Matrix Co-factorization for Recommendation with Rich Side Information HetRec 2011 and Implicit 1 / Feedb 23
Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback Yi Fang and Luo Si Department of Computer Science Purdue University West Lafayette, IN 47906, USA fangy@cs.purdue.edu
Hortonworks Data Platform
Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks
LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS
LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS Vandita Jain 1, Prof. Tripti Saxena 2, Dr. Vineet Richhariya 3 1 M.Tech(CSE)*,LNCT, Bhopal(M.P.)(India) 2 Prof. Dept. of CSE, LNCT, Bhopal(M.P.)(India)
Column Stores and HBase. Rui LIU, Maksim Hrytsenia
Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase
Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop
Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce
Map Reduce & Hadoop Recommended Text:
Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.
PROFESSIONAL NoSQL Shashank Tiwari WILEY John Wiley & Sons, Inc. Examining CONTENTS INTRODUCTION xvil CHAPTER 1: NOSQL: WHAT IT IS AND WHY YOU NEED IT 3 Definition and Introduction 4 Context and a Bit
Mining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
Oracle Data Integrator 12c: Integration and Administration
Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive
Data Science Bootcamp Curriculum. NYC Data Science Academy
Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations
Sandbox Setup Guide for HDP 2.2 and VMware
Waterline Data Inventory Sandbox Setup Guide for HDP 2.2 and VMware Product Version 2.0 Document Version 10.15.2015 2014-2015 Waterline Data, Inc. All rights reserved. All other trademarks are the property
Recommender Systems. Collaborative Filtering & Content-Based Recommending
Recommender Systems Collaborative Filtering & Content-Based Recommending 1 Recommender Systems Systems for recommending items (e.g. books, movies, CD s, web pages, newsgroup messages) to users based on
Hadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
Oracle Big Data Discovery
Oracle Big Data Discovery Turning Data into Business Value Harald Erb Oracle Business Analytics & Big Data 1 Safe Harbor Statement The following is intended to outline our general product direction. It
Section 7.12: Similarity. By: Ralucca Gera, NPS
Section 7.12: Similarity By: Ralucca Gera, NPS Motivation We talked about global properties Average degree, average clustering, ave path length We talked about local properties: Some node centralities
A Tutorial on Apache Spark
A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:
Gain Insights From Unstructured Data Using Pivotal HD. Copyright 2013 EMC Corporation. All rights reserved.
Gain Insights From Unstructured Data Using Pivotal HD 1 Traditional Enterprise Analytics Process 2 The Fundamental Paradigm Shift Internet age and exploding data growth Enterprises leverage new data sources
R Language for the SQL Server DBA
R Language for the SQL Server DBA Beginning with R Ing. Eduardo Castro, PhD, Principal Data Analyst Architect, LP Consulting Moderated By: Jose Rolando Guay Paz Thank You microsoft.com idera.com attunity.com
Enhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of
Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 21. Graph Computing Frameworks Paul Krzyzanowski Rutgers University Fall 2016 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Can we make MapReduce easier? November 21, 2016 2014-2016
International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
Web Personalization & Recommender Systems
Web Personalization & Recommender Systems COSC 488 Slides are based on: - Bamshad Mobasher, Depaul University - Recent publications: see the last page (Reference section) Web Personalization & Recommender
Multi-domain Predictive AI. Correlated Cross-Occurrence with Apache Mahout and GPUs
Multi-domain Predictive AI Correlated Cross-Occurrence with Apache Mahout and GPUs Pat Ferrel ActionML, Chief Consultant Apache Mahout, PMC & Committer Apache PredictionIO, PMC & Committer pat@apache.org
SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism
Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and
The Evolution of Big Data Platforms and Data Science
IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering
Politecnico di Torino. Porto Institutional Repository
Politecnico di Torino Porto Institutional Repository [Proceeding] A Recommender System for Telecom Users: Evaluation of Recommendation Algorithms Experimental Original Citation: Falcarin P.; Vetro A.;
RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 RECOMMENDATION SYSTEM USING COLLABORATIVE FILTERING Yunkyoung Lee Follow this and additional
General Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,
How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS
Enhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,
ANALYSIS OF DOMAIN INDEPENDENT STATISTICAL KEYWORD EXTRACTION METHODS FOR INCREMENTAL CLUSTERING
ANALYSIS OF DOMAIN INDEPENDENT STATISTICAL KEYWORD EXTRACTION METHODS FOR INCREMENTAL CLUSTERING Rafael Geraldeli Rossi 1, Ricardo Marcondes Marcacini 1,2, Solange Oliveira Rezende 1 1 Institute of Mathematics
Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures. Hiroshi Yamaguchi & Hiroyuki Adachi
Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures Hiroshi Yamaguchi & Hiroyuki Adachi About Us 2 Hiroshi Yamaguchi Hiroyuki Adachi Hadoop DevOps Engineer Hadoop Engineer
A Time-based Recommender System using Implicit Feedback
A Time-based Recommender System using Implicit Feedback T. Q. Lee Department of Mobile Internet Dongyang Technical College Seoul, Korea Abstract - Recommender systems provide personalized recommendations
Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci
Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules
CSE 494: Information Retrieval, Mining and Integration on the Internet
CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:
CS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Recommender Systems II Instructor: Yizhou Sun yzsun@cs.ucla.edu May 31, 2017 Recommender Systems Recommendation via Information Network Analysis Hybrid Collaborative Filtering
Higher level data processing in Apache Spark
Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL
Jeff Howbert Introduction to Machine Learning Winter
Collaborative Filtering Nearest es Neighbor Approach Jeff Howbert Introduction to Machine Learning Winter 2012 1 Bad news Netflix Prize data no longer available to public. Just after contest t ended d
Getting Started. Table of contents. 1 Pig Setup Running Pig Pig Latin Statements Pig Properties Pig Tutorial...
Table of contents 1 Pig Setup... 2 2 Running Pig... 3 3 Pig Latin Statements... 6 4 Pig Properties... 8 5 Pig Tutorial... 9 1. Pig Setup 1.1. Requirements Mandatory Unix and Windows users need the following:
Big Data Technologies and Geospatial Data Processing:
Big Data Technologies and Geospatial Data Processing: A perfect fit Albert Godfrind Spatial and Graph Solutions Architect Oracle Corporation Agenda 1 2 3 4 The Data Explosion Big Data? Big Data and Geo
Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University
Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University {tedhong, dtsamis}@stanford.edu Abstract This paper analyzes the performance of various KNNs techniques as applied to the
Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data
Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics
Click Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors
Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017
Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system
Developer s Manual. Version May, Computer Science Department, Texas Christian University
Developer s Manual Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
Data Mining Concepts & Tasks
Data Mining Concepts & Tasks Duen Horng (Polo) Chau Georgia Tech CSE6242 / CX4242 Sept 9, 2014 Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos Last Time
Oracle R Technologies
Oracle R Technologies R for the Enterprise Mark Hornick, Director, Oracle Advanced Analytics @MarkHornick mark.hornick@oracle.com Safe Harbor Statement The following is intended to outline our general
Using Google s PageRank Algorithm to Identify Important Attributes of Genes
Using Google s PageRank Algorithm to Identify Important Attributes of Genes Golam Morshed Osmani Ph.D. Student in Software Engineering Dept. of Computer Science North Dakota State Univesity Fargo, ND 58105
Cloud, Big Data & Linear Algebra
Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes
Design and Implement of Bigdata Analysis Systems
Design and Implement of Bigdata Analysis Systems Jeong-Joon Kim *Department of Computer Science & Engineering, Korea Polytechnic University, Gyeonggi-do Siheung-si 15073, Korea. Abstract The development
Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials
Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback
Matrix Co-factorization for Recommendation with Rich Side Information and Implicit Feedback ABSTRACT Yi Fang Department of Computer Science Purdue University West Lafayette, IN 47907, USA fangy@cs.purdue.edu
INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)
INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)
E6885 Network Science Lecture 10: Graph Database (II)
E 6885 Topics in Signal Processing -- Network Science E6885 Network Science Lecture 10: Graph Database (II) Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University November 18th, 2013 Course
Distributed Systems. 09r. Map-Reduce Programming on AWS/EMR (Part I) 2017 Paul Krzyzanowski. TA: Long Zhao Rutgers University Fall 2017
Distributed Systems 09r. Map-Reduce Programming on AWS/EMR (Part I) Paul Krzyzanowski TA: Long Zhao Rutgers University Fall 2017 November 21, 2017 2017 Paul Krzyzanowski 1 Setting Up AWS/EMR November 21,
Information Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
Progress DataDirect For Business Intelligence And Analytics Vendors
Progress DataDirect For Business Intelligence And Analytics Vendors DATA SHEET FEATURES: Direction connection to a variety of SaaS and on-premises data sources via Progress DataDirect Hybrid Data Pipeline
CS /21/2016. Paul Krzyzanowski 1. Can we make MapReduce easier? Distributed Systems. Apache Pig. Apache Pig. Pig: Loading Data.
Distributed Systems 1. Graph Computing Frameworks Can we make MapReduce easier? Paul Krzyzanowski Rutgers University Fall 016 1 Apache Pig Apache Pig Why? Make it easy to use MapReduce via scripting instead
Web Personalization & Recommender Systems
Web Personalization & Recommender Systems COSC 488 Slides are based on: - Bamshad Mobasher, Depaul University - Recent publications: see the last page (Reference section) Web Personalization & Recommender
Recommendation with Differential Context Weighting
Recommendation with Differential Context Weighting Yong Zheng Robin Burke Bamshad Mobasher Center for Web Intelligence DePaul University Chicago, IL USA Conference on UMAP June 12, 2013 Overview Introduction
Hadoop Quickstart. Table of contents
Table of contents 1 Purpose...2 2 Pre-requisites...2 2.1 Supported Platforms... 2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster...3 5 Standalone
Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology
Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008 The Agenda
Available online at ScienceDirect. Procedia Technology 17 (2014 )
Available online at www.sciencedirect.com ScienceDirect Procedia Technology 17 (2014 ) 528 533 Conference on Electronics, Telecommunications and Computers CETC 2013 Social Network and Device Aware Personalized
Using PageRank in Feature Selection
Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important
New Features and Enhancements in Big Data Management 10.2
New Features and Enhancements in Big Data Management 10.2 Copyright Informatica LLC 2017. Informatica, the Informatica logo, Big Data Management, and PowerCenter are trademarks or registered trademarks
Using PageRank in Feature Selection
Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy {ienco,meo,botta}@di.unito.it Abstract. Feature selection is an important
Solr Installation User Guide. Solr Installation Brainvire Infotech Pvt. Ltd
Solr Installation 1 Lets see how to Install Solr, it is very easy and we are going to do that in this Solr Installation guideline. So let s start! Let s start for Linux and Mac. We ll guide you trough
Semi supervised clustering for Text Clustering
Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering
50 Must Read Hadoop Interview Questions & Answers
50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?
Introduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
Experiences from Implementing Collaborative Filtering in a Web 2.0 Application
Experiences from Implementing Collaborative Filtering in a Web 2.0 Application Wolfgang Woerndl, Johannes Helminger, Vivian Prinz TU Muenchen, Chair for Applied Informatics Cooperative Systems Boltzmannstr.
Distributed Machine Learning" on Spark
Distributed Machine Learning" on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Outline Data flow vs. traditional network programming Spark computing engine Optimization Example Matrix Computations
Accelerate Big Data Insights
Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not
Using the Force of Python and SAS Viya on Star Wars Fan Posts
SESUG Paper BB-170-2017 Using the Force of Python and SAS Viya on Star Wars Fan Posts Grace Heyne, Zencos Consulting, LLC ABSTRACT The wealth of information available on the Internet includes useful and
SOLUTION BRIEF BIG DATA SECURITY
SOLUTION BRIEF BIG DATA SECURITY Get maximum value and insight from your Big Data initiatives while maintaining robust data security THE CHALLENGE More and more companies are finding that Big Data strategies
Deep Character-Level Click-Through Rate Prediction for Sponsored Search
Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as
The Simularity High Performance Correlation Engine
The Simularity High Performance Correlation Engine Simularity s High Performance Correlation Engine (HPCE) is a purpose-built analytics engine for similarity and correlation analytics optimized to do real-time,
Mining Distributed Frequent Itemset with Hadoop
Mining Distributed Frequent Itemset with Hadoop Ms. Poonam Modgi, PG student, Parul Institute of Technology, GTU. Prof. Dinesh Vaghela, Parul Institute of Technology, GTU. Abstract: In the current scenario
The amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
Document Clustering: Comparison of Similarity Measures
Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation
iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN
iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,
Feature-weighted User Model for Recommender Systems
Feature-weighted User Model for Recommender Systems Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos Aristotle University, Department of Informatics, Thessaloniki 54124, Greece {symeon,
Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine
Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine Version 4.11 Last Updated: 1/10/2018 Please note: This appliance is for testing and educational purposes only;
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
Processing Big Data with Hadoop in Azure HDInsight
Processing Big Data with Hadoop in Azure HDInsight Lab 1 - Getting Started with HDInsight Overview In this lab, you will provision an HDInsight cluster. You will then run a sample MapReduce job on the
Apache Flink Big Data Stream Processing
Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017