Data Processing on Large Clusters. By: Stephen Cardina


1 Data Processing on Large Clusters By: Stephen Cardina

2 Introduction MapReduce is a programming model used on clusters to process large data sets and pull out the data you are specifically looking for. Google published it in 2004 to reduce the complexity of the large-scale processing behind its search engine. A typical use is to find every word on a set of web pages and count how many times each word appears. By 2014 Google had stopped using MapReduce, as better alternatives had come along. The purpose of this presentation is to go through some of those alternatives and get an idea of which works best in which situation.
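The word-counting job mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps on a single machine, not Google's implementation; the function names and the two sample pages are assumptions made up for the example.

```python
import re
from collections import defaultdict


def map_phase(page_text):
    """Emit a (word, 1) pair for every word found on one page."""
    return [(word, 1) for word in re.findall(r"[a-z']+", page_text.lower())]


def shuffle(pairs):
    """Group the emitted counts by word, as a framework would between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(pages):
    pairs = []
    for page in pages:  # on a real cluster, each page could be mapped on a different worker
        pairs.extend(map_phase(page))
    return reduce_phase(shuffle(pairs))


# Hypothetical sample pages, used only for illustration.
pages = ["the quick brown fox", "the lazy dog and the fox"]
counts = word_count(pages)
print(counts["the"], counts["fox"])  # prints "3 2"
```

Because each page is mapped independently and each word is reduced independently, both phases can be spread across many machines, which is the point of running this on a cluster.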

3 MapReduce Pros: Excellent for one-time or simple use. Cons: Google has essentially moved away from it, and Apache Mahout stopped accepting new MapReduce-based algorithms in April 2014, so support for it has been phased out. It's limited in machine learning. When a job runs constantly or is very complex, there are better alternatives.

4 Apache Mahout Is an ongoing project of a nonprofit organization called the Apache Software Foundation. The first version was released in February. Moved away from MapReduce, which contributed to MapReduce getting phased out. Is being worked on by volunteers. Is currently being used by Google.

5 Apache Mahout Pros: Is very good for machine learning, such as product recommendations for a site. Gets new versions at a semi-regular rate; the last version was in April. Cons: Doesn't scale the best.

6 Apache Spark Is an open source cluster computing framework. Was originally made at the AMPLab at the University of California, Berkeley. Was donated to the Apache Software Foundation in 2013, which has had it since. Has become one of the top-level projects at Apache and one of the most popular projects to work on, exceeding 1,000 contributions. Is also being worked on by volunteers. Used by Amazon and Groupon.

7 Apache Spark Pros: Gets new versions at a decent pace; the last release was in October. Writes as little as possible to disk, which lets it finish tasks faster. Also works well for machine learning. Is generally considered better than Apache Mahout. Is very well known, so it is easier to learn than the options covered later. Cons: Doesn't scale the best.

8 H2O Is open source software for big data analysis. Was first released in 2011 by H2O.ai. Focuses solely on machine learning algorithms instead of providing a whole framework. Can be integrated into Apache Spark. Capital One and eBay currently use H2O.

9 H2O Pros: Scales very well. Can handle a lot of data at once. Cons: Not the most well known, so odds are you will have to learn it on your own.

10 XGBoost Is an open source software library. It was first made as a research project by Tianqi Chen. It gained widespread attention in the machine learning community after its use in the Higgs Machine Learning Challenge, held in 2014. It was later extended to integrate with Apache Spark and Apache Hadoop.

11 XGBoost Pros: One of the best when it comes to scalability. Can handle the most data of the options here. Cons: Hasn't been around as long as the others, so it is harder to find someone to learn from and you will likely have to learn it on your own.

12 How to figure out what's the best For the purpose of this presentation we won't be comparing MapReduce and Apache Mahout with the other options. The reason is that they aren't the best for large projects: they are fine to work with if the job is relatively simple, but they can't handle as much as the other three options. So we'll compare Apache Spark, H2O and XGBoost based on scalability and accuracy.

13 The First Test We will be using a random forest for our first test. A random forest trains a set number of decision trees, 500 in this case, each on a random sample of the data points, and combines what the trees predict into a single result. We will test this with 10,000, 100,000, 1,000,000 and 10,000,000 data points. N equals 1 million for the following charts.
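To make the random forest idea concrete, here is a minimal sketch in plain Python. It is heavily simplified and is not the code used in the test: each "tree" is only a one-split stump, the forest has 25 trees instead of 500, and the data is a made-up toy set with two well-separated classes. All function names are assumptions for the example.

```python
import random


def train_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for above in (0, 1):  # label predicted when the value is above the threshold
            errors = sum(
                (above if row[feature] > threshold else 1 - above) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return feature, threshold, above


def predict_stump(stump, row):
    feature, threshold, above = stump
    return above if row[feature] > threshold else 1 - above


def train_forest(rows, labels, n_trees, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sample rows with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = sum(predict_stump(stump, row) for stump in forest)
    return 1 if 2 * votes > len(forest) else 0


# Toy data (an assumption for illustration): class 0 has low values, class 1 has high values.
data_rng = random.Random(42)
rows, labels = [], []
for _ in range(20):
    rows.append((data_rng.uniform(0.0, 0.4), data_rng.uniform(0.0, 0.4)))
    labels.append(0)
    rows.append((data_rng.uniform(0.6, 1.0), data_rng.uniform(0.6, 1.0)))
    labels.append(1)

forest = train_forest(rows, labels, n_trees=25)
print(predict_forest(forest, (0.0, 0.0)), predict_forest(forest, (1.0, 1.0)))
```

The trees do not depend on each other, so they can be trained in parallel, which is one reason random forests are a common benchmark for cluster tools.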

14 The First Test Results These are the results from the test. Spark crashed before it finished the run with all 10 million data points.

15 The First Test Results

16 The Second Test We will be using gradient boosted trees in our second test, and this time we will run it twice. It's a lot like a random forest, but the trees are built one after another, each correcting the errors of the ones before it, and the max depth limits how much any single tree can sway the result. Test A will be 1,000 trees with a max depth of 16. Test B will be 300 trees with a max depth of 6.
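The idea behind gradient boosted trees can also be sketched in plain Python. Again this is a simplified illustration and not the code used in the test: every tree here has a depth of 1 (a single split) rather than 6 or 16, the data is a made-up curve, and the learning rate of 0.5 and all names are assumptions for the example.

```python
def fit_stump(xs, residuals):
    """Best single split on a 1-D feature, predicting the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the last value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Build trees one after another, each fitted to what the earlier trees got wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return predictions


def mse(ys, predictions):
    """Mean squared error between the targets and the predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data (an assumption for illustration): points on the curve y = x^2.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

few = boost(xs, ys, n_trees=5)
many = boost(xs, ys, n_trees=50)
print(mse(ys, many) < mse(ys, few))  # prints "True"
```

Unlike the random forest, each tree needs the predictions of the trees before it, so the trees cannot all be trained at the same time. Deeper trees and more trees fit the data more closely but cost more time and memory, which is why the two test configurations differ in both settings.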

17 The Second Test Results

18 The Second Test

19 In Conclusion MapReduce and Apache Mahout are only good for small, one-time projects. H2O and XGBoost are considered the two leading options at the moment, so they are the best to work with if you know how. XGBoost is generally the fastest and requires the least RAM of the options compared. If you don't feel confident about figuring them out yourself, it's best to use Apache Spark, as it's better known and thus easier to learn.

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

For Volunteers An Elvanto Guide

For Volunteers An Elvanto Guide For Volunteers An Elvanto Guide www.elvanto.com Volunteers are what keep churches running! This guide is for volunteers who use Elvanto. If you re in charge of volunteers, why not check out our Volunteer

More information

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina

MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina MapReduce: Simplified Data Processing on Large Clusters By Stephen Cardina The Problem You have a large amount of raw data, such as a database or a web log, and you need to get some sort of derived data

More information

Databases and Big Data Today. CS634 Class 22

Databases and Big Data Today. CS634 Class 22 Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.

More information

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

IMPORTANCE OF A MINISTRY WEBSITE

IMPORTANCE OF A MINISTRY WEBSITE SUMMARY In 2018, the internet is everything, even our appliances are starting to connect. People today are more comfortable emailing or texting than calling and face time. Although, I hate to admit it,

More information

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017 UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Distributed Machine Learning Week #9 Today Distributed computing for machine learning Background MapReduce/Hadoop & Spark Theory

More information

SEO KEYWORD SELECTION

SEO KEYWORD SELECTION SEO KEYWORD SELECTION Building Your Online Marketing Campaign on Solid Keyword Foundations TABLE OF CONTENTS Introduction Why Keyword Selection is Important 01 Chapter I Different Types of Keywords 02

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these

More information

WINDOWS 8.X SIG SEPTEMBER 22, 2014

WINDOWS 8.X SIG SEPTEMBER 22, 2014 New Start Screen: Top RIGHT corner next to your Sign in Name is the OFF button. To the Right of Off button is a Search icon You can click on Search icon OR just start typing anywhere in open area of Start

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

CMO Briefing Google+:

CMO Briefing Google+: www.bootcampdigital.com CMO Briefing Google+: How Google s New Social Network Can Impact Your Business Facts Google+ had over 30 million users in the first month and was the fastest growing social network

More information

Parallel learning of content recommendations using map- reduce

Parallel learning of content recommendations using map- reduce Parallel learning of content recommendations using map- reduce Michael Percy Stanford University Abstract In this paper, machine learning within the map- reduce paradigm for ranking

More information

AND BlackBerry JUL13 ISBN

AND BlackBerry JUL13 ISBN AND BlackBerry 806-29JUL13 ISBN 978-0-9819900-1-9 Contents 1 2 3 The Essentials of GTD and BlackBerry What is GTD?... 1 Which tools are best for GTD?... 1 If GTD is not about any particular tool, why a

More information

SQLite vs. MongoDB for Big Data

SQLite vs. MongoDB for Big Data SQLite vs. MongoDB for Big Data In my latest tutorial I walked readers through a Python script designed to download tweets by a set of Twitter users and insert them into an SQLite database. In this post

More information

Strong signs your website needs a professional redesign

Strong signs your website needs a professional redesign Strong signs your website needs a professional redesign Think - when was the last time that your business website was updated? Better yet, when was the last time you looked at your website? When the Internet

More information

what is cloud computing?

what is cloud computing? what is cloud computing? (Private) Cloud Computing with Mesos at Twi9er Benjamin Hindman @benh scalable virtualized self-service utility managed elastic economic pay-as-you-go what is cloud computing?

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Web Server Setup Guide

Web Server Setup Guide SelfTaughtCoders.com Web Server Setup Guide How to set up your own computer for web development. Setting Up Your Computer for Web Development Our web server software As we discussed, our web app is comprised

More information

Project Design. Version May, Computer Science Department, Texas Christian University

Project Design. Version May, Computer Science Department, Texas Christian University Project Design Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that he

More information

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

How to Sign Up for a Volunteer Activity

How to Sign Up for a Volunteer Activity How to Sign Up for a Volunteer Activity Visit www.catholiccharitiesdc.org/volunteer Click the One-Time volunteer button to see the upcoming volunteer activities On the Calendar, click the activity where

More information

Adding content to your Blackboard 9.1 class

Adding content to your Blackboard 9.1 class Adding content to your Blackboard 9.1 class There are quite a few options listed when you click the Build Content button in your class, but you ll probably only use a couple of them most of the time. Note

More information

Extreme Computing. Introduction to MapReduce. Cluster Outline Map Reduce

Extreme Computing. Introduction to MapReduce. Cluster Outline Map Reduce Extreme Computing Introduction to MapReduce 1 Cluster We have 12 servers: scutter01, scutter02,... scutter12 If working outside Informatics, first: ssh student.ssh.inf.ed.ac.uk Then log into a random server:

More information

Content Curation Mistakes

Content Curation Mistakes Table of Contents Table of Contents... 2 Introduction... 3 Mistake #1 Linking to Poor Quality Content... 4 Mistake #2 Using the Same Few Sources... 5 Mistake #3 Curating Only Blog Posts... 6 Mistake #4

More information

Scalable Tools - Part I Introduction to Scalable Tools

Scalable Tools - Part I Introduction to Scalable Tools Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session

More information

THE GOOD, THE BAD AND THE UGLY. How Your Donation Process Impacts Your Workflow (and How To Fix It)

THE GOOD, THE BAD AND THE UGLY. How Your Donation Process Impacts Your Workflow (and How To Fix It) THE, THE AND THE How Your Donation Process Impacts Your Workflow (and How To Fix It) / It s great to be popular. Your company is getting a ton of donation requests from worthy charities, and you re doing

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

ANALYZING THE MILLION SONG DATASET USING MAPREDUCE

ANALYZING THE MILLION SONG DATASET USING MAPREDUCE PROGRAMMING ASSIGNMENT 3 ANALYZING THE MILLION SONG DATASET USING MAPREDUCE Version 1.0 DUE DATE: Wednesday, October 18 th, 2017 @ 5:00 pm OBJECTIVE You will be developing MapReduce programs that parse

More information

Speed Up Windows by Disabling Startup Programs

Speed Up Windows by Disabling Startup Programs Speed Up Windows by Disabling Startup Programs Increase Your PC s Speed by Preventing Unnecessary Programs from Running Windows All S 630 / 1 When you look at the tray area beside the clock, do you see

More information

Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data

Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data Shiratani Unsui forest by Σ64 Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data Oscar J. Luo Health Data Analytics 12 th October 2016 HEALTH & BIOSECURITY Transformational

More information

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2)

Data Structures and Algorithm Analysis (CSC317) Hash tables (part2) Data Structures and Algorithm Analysis (CSC317) Hash tables (part2) Hash table We have elements with key and satellite data Operations performed: Insert, Delete, Search/lookup We don t maintain order information

More information

Organising . page 1 of 8. bbc.co.uk/webwise/accredited-courses/level-one/using- /lessons/your- s/organising-

Organising  . page 1 of 8. bbc.co.uk/webwise/accredited-courses/level-one/using- /lessons/your- s/organising- Organising email Reading emails When someone sends you an email it gets delivered to your inbox, which is where all your emails are stored. Naturally the first thing you ll want to do is read it. In your

More information

Efficient and Scalable Friend Recommendations

Efficient and Scalable Friend Recommendations Efficient and Scalable Friend Recommendations Comparing Traditional and Graph-Processing Approaches Nicholas Tietz Software Engineer at GraphSQL nicholas@graphsql.com January 13, 2014 1 Introduction 2

More information

How to Get a Help Desk Up and Running in a Day. May, 2011

How to Get a Help Desk Up and Running in a Day. May, 2011 How to Get a Help Desk Up and Running in a Day May, 2011 Table of Contents Introduction... 3 Easy to Get Started: Free 30- Day Trial... 3 Jumping In and Solving Your First Ticket... 3 Seeing and Replying

More information

GPS // Guide to Practice Success

GPS // Guide to Practice Success ways to use mobile technology to grow your practice in 2013 A Sesame You ve worked hard to make your practice website look great online, but how does it display on your smartphone? Take a moment to pull

More information

Installing Ubuntu Server

Installing Ubuntu Server CHAPTER 1 Installing Ubuntu Server You probably chose Ubuntu as a server solution because of either your gratifying experience using it on the desktop or the raves you ve heard from others about its user-friendly

More information

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko RAMCloud Scalable High-Performance Storage Entirely in DRAM 2009 by John Ousterhout et al. Stanford University presented by Slavik Derevyanko Outline RAMCloud project overview Motivation for RAMCloud storage:

More information

CIS220 In Class/Lab 1: Due Sunday night at midnight. Submit all files through Canvas (25 pts)

CIS220 In Class/Lab 1: Due Sunday night at midnight. Submit all files through Canvas (25 pts) CIS220 In Class/Lab 1: Due Sunday night at midnight. Submit all files through Canvas (25 pts) Problem 0: Install Eclipse + CDT (or, as an alternative, Netbeans). Follow the instructions on my web site.

More information

Chee Kiam. to sieve through. and the next one. relevant. The advances in Big. (NLB) of Singapore.

Chee Kiam. to sieve through. and the next one. relevant. The advances in Big. (NLB) of Singapore. Submitted on: May 31, 2013 Connecting library content using data mining and text analytics on structured and unstructured dataa Chee Kiam Lim Technology and Innovation, National Library Board, Singapore.

More information

How to Add or Invite Colleagues

How to Add or Invite Colleagues Page 1 of 5 How to Add or Invite Colleagues This how-to document contains four sections, addressing the most common questions about Point K collaboration features: Do my colleagues have to be co-workers

More information

Promo Buddy 2.0. Internet Marketing Database Software (Manual)

Promo Buddy 2.0. Internet Marketing Database Software (Manual) Promo Buddy 2.0 Internet Marketing Database Software (Manual) PromoBuddy has been developed by: tp:// INTRODUCTION From the computer of Detlev Reimer Dear Internet marketer, More than 6 years have passed

More information

Microsoft Access: Let s create the tblperson. Today we are going to use advanced properties for the table fields and use a Query.

Microsoft Access: Let s create the tblperson. Today we are going to use advanced properties for the table fields and use a Query. : Let s create the tblperson. Today we are going to use advanced properties for the table fields and use a Query. Add a SSN input mask to the PersonID field using the Wizard. Limit the first and last name

More information

Real-time Data Engineering in the Cloud Exercise Guide

Real-time Data Engineering in the Cloud Exercise Guide Real-time Data Engineering in the Cloud Exercise Guide Jesse Anderson 2017 SMOKING HAND LLC ALL RIGHTS RESERVED Version 1.12.a9779239 1 Contents 1 Lab Notes 3 2 Kafka HelloWorld 6 3 Streaming ETL 8 4 Advanced

More information

Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy. David Toback

Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy. David Toback Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy Texas A&M Big Data Workshop October 2011 January 2015, Texas A&M University Research Topics Seminar 1 Outline Overview of

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Create quick link URLs for a candidate merge Turn off external ID links in candidate profiles... 4

Create quick link URLs for a candidate merge Turn off external ID links in candidate profiles... 4 Credential Manager 1603 March 2016 In this issue Pearson Credential Management is proud to announce Generate quick link URLs for a candidate merge in the upcoming release of Credential Manager 1603, scheduled

More information

Design Like a Pro. Boost Your Skills in HMI / SCADA Project Development. Part 2: Developing Dynamic HMI / SCADA Projects with Speed and Precision

Design Like a Pro. Boost Your Skills in HMI / SCADA Project Development. Part 2: Developing Dynamic HMI / SCADA Projects with Speed and Precision INDUCTIVE AUTOMATION DESIGN SERIES Design Like a Pro Boost Your Skills in HMI / SCADA Project Development Part 2: Developing Dynamic HMI / SCADA Projects with Speed and Precision The design phase is the

More information

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1 Matrix-Vector Multiplication by MapReduce From Rajaraman / Ullman- Ch.2 Part 1 Google implementation of MapReduce created to execute very large matrix-vector multiplications When ranking of Web pages that

More information

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET SOLUTION SHEET Syncsort DMX-h Simplifying Big Data Integration Goals of the Modern Data Architecture Data warehouses and mainframes are mainstays of traditional data architectures and still play a vital

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7.

Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7. Distributed Itembased Collaborative Filtering with Apache Mahout Sebastian Schelter ssc@apache.org twitter.com/sscdotopen 7. October 2010 Overview 1. What is Apache Mahout? 2. Introduction to Collaborative

More information

Making a PowerPoint Accessible

Making a PowerPoint Accessible Making a PowerPoint Accessible Purpose The purpose of this document is to help you to create an accessible PowerPoint, or to take a nonaccessible PowerPoint and make it accessible. You are probably reading

More information

DIRECTV Message Board

DIRECTV Message Board DIRECTV Message Board DIRECTV Message Board is an exciting new product for commercial customers. It is being shown at DIRECTV Revolution 2012 for the first time, but the Solid Signal team were lucky enough

More information

Google Drive. Move Fully to Google Docs

Google Drive. Move Fully to Google Docs Google Drive Fully move to the Google Drive ecosystem Use Google OCR to recreate text documents from a variety of sources Sharing files and folders Collaborating on Documents Revision History Downloading

More information

Case study on PhoneGap / Apache Cordova

Case study on PhoneGap / Apache Cordova Chapter 1 Case study on PhoneGap / Apache Cordova 1.1 Introduction to PhoneGap / Apache Cordova PhoneGap is a free and open source framework that allows you to create mobile applications in a cross platform

More information

Windows 10 Hardware and Software

Windows 10 Hardware and Software Windows 10 Hardware and Software Presented by: G. ALLEN SONNTAG, RDR, CRR, FAPR Tucson, Arizona First a Little Philosophy One of my favorite isms: Do or Do Not. There is no Try Yoda. I m going to try to

More information

Myths about Links, Links and More Links:

Myths about Links, Links and More Links: Myths about Links, Links and More Links: CedarValleyGroup.com Myth 1: You have to pay to be submitted to Google search engine. Well let me explode that one myth. When your website is first launched Google

More information

. social? better than. 7 reasons why you should focus on . to GROW YOUR BUSINESS...

. social? better than. 7 reasons why you should focus on  . to GROW YOUR BUSINESS... Is EMAIL better than social? 7 reasons why you should focus on email to GROW YOUR BUSINESS... 1 EMAIL UPDATES ARE A BETTER USE OF YOUR TIME If you had to choose between sending an email and updating your

More information

Why All Column Stores Are Not the Same Twelve Low-Level Features That Offer High Value to Analysts

Why All Column Stores Are Not the Same Twelve Low-Level Features That Offer High Value to Analysts White Paper Analytics & Big Data Why All Column Stores Are Not the Same Twelve Low-Level Features That Offer High Value to Analysts Table of Contents page Compression...1 Early and Late Materialization...1

More information

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 2nd International Conference on Materials Science, Machinery and Energy Engineering (MSMEE 2017) Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2 1 Information Engineering

More information

To Barcode or Not To Barcode?
