Machine Learning and SystemML. Nikolay Manchev Data Scientist Europe E-
Transcription
1 Machine Learning and SystemML Nikolay Manchev Data Scientist Europe E-
2 A Simple Problem. In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS by examining a scatter plot and the linear model that best fits it, and by solving problems using the linear equation. Table (Educational Attainment, Median Income in USD): Less than high school graduate; High school graduate; Some college or associate's degree; Bachelor's degree; Graduate or professional degree
3 Machine Learning "Field of study that gives computers the ability to learn without being explicitly programmed" Arthur Samuel,
4 Advantages: Machines can handle larger amounts of data; machines can work with high-dimensional data; machines can work it out faster
5 Enneract (9-dimensional hypercube)
6 Use case #1: Detecting potential "lemon cars". 2 million cars; cars reacquired; 10 million repair cases; 25 million parts exchanges. Logistic regression model; input features. Improved precision/recall by an order of magnitude
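Precision and recall, the two metrics quoted for this use case, can be computed from actual and predicted labels. The sketch below is only an illustration: the label vectors are made up for the example and are not taken from the lemon-car data, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels, where 1 marks a lemon car."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # lemons correctly flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # good cars wrongly flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # lemons that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, for illustration only.
actual = [1, 0, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "how many flagged cars were really lemons", recall answers "how many lemons were flagged"; with rare positives such as reacquired cars, both usually say more than plain accuracy.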
7 Machine Learning. Supervised Machine Learning: We provide a training set of labelled examples and fit a model to predict the correct labels using the features. Unsupervised Machine Learning: No desired output is provided. The model finds similarities in the data based on the features alone.
8 Use case #2: Large holiday operator looking to enrich their web shop with custom recommendations. Search, Result, Recommend. Sardinia, Sicily, Majorca, Ibiza, all inclusive, Canary Islands
9 Piece of cake: Collaborative filtering. Based on a user-to-item rating matrix; computes a similarity measure between users. Rating matrix (columns: Sardinia, Majorca, Aspen): User #1 4 - 1; User #; User #n. Make a prediction
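The collaborative filtering idea on this slide (a user-to-item rating matrix, a similarity measure between users, then a prediction) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the operator's actual system: the ratings are invented, cosine similarity is one of several possible similarity measures, and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale; a missing key means "not rated yet".
ratings = {
    "user1": {"Sardinia": 4, "Majorca": 5, "Aspen": 1},
    "user2": {"Sardinia": 5, "Majorca": 4},
    "user3": {"Sardinia": 1, "Majorca": 2, "Aspen": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = sqrt(sum(a[item] ** 2 for item in common))
    norm_b = sqrt(sum(b[item] ** 2 for item in common))
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    weighted_sum, similarity_sum = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        similarity_sum += similarity
    return weighted_sum / similarity_sum if similarity_sum else None


prediction = predict("user2", "Aspen")
print(f"predicted rating of Aspen for user2: {prediction:.2f}")
```

The approach depends on having enough explicit ratings per user; when the rating matrix is mostly empty, the similarities carry little information.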
10 Unsupervised learning to the rescue: Mixture of Gaussians model; based on search strings; n fixed classes; hand-crafted rules tailored to classes
11 Use case #2: Large holiday operator in the UK looking to enrich their web shop with custom recommendations. Search, Classifier, Recommend. 1. Corralejo 2. Costa Calma 3. Barracuda Point. all inclusive, H10 Rubicon, Regency Country Club, Taurito Princess. Sardinia, Sicily, Majorca, Ibiza
12 It's Big Data
13 Why Spark. Traditional approach (MapReduce jobs): Input -> HDFS Read -> Iteration 1 (CPU, Memory) -> HDFS Write -> HDFS Read -> Iteration 2 (CPU, Memory) -> HDFS Write -> Result. The Spark approach (keep data in memory, distribute the execution): Input -> HDFS Read -> Iteration 1 (CPU, Memory) -> Iteration 2 (CPU, Memory), chaining job output into new job input. Zero read/write disk bottleneck; faster than network & disk
14 IBM's Commitment to Spark. Official announcement (15th June 2015): IBM will build Spark into the core of its analytics and commerce platforms. IBM will commit over 3,500 researchers & developers to work on Spark-related projects
15 A Simple Problem. In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS by examining a scatter plot and the linear model that best fits it, and by solving problems using the linear equation. Table (Educational Attainment, Median Income in USD): Less than high school graduate; High school graduate; Some college or associate's degree; Bachelor's degree; Graduate or professional degree
16 Find the best fitting line
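For a single predictor, the best fitting line has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below assumes the five education levels are coded 1 to 5 and uses invented income figures, since the ACS values are not part of this transcript; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Education level coded 1..5; incomes are made-up numbers, NOT ACS data.
levels = [1, 2, 3, 4, 5]
income = [20000, 28000, 34000, 50000, 66000]
slope, intercept = fit_line(levels, income)
print(f"income = {slope:.0f} * level + {intercept:.0f}")
```

Coding an ordered category as 1 to 5 treats the steps between levels as equally spaced, which is a modelling assumption rather than a property of the data.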
17 We always look for patterns
18 Use case #3: Predictive model for a bank campaign. We want to predict successful outcomes
19 You need Data Scientists. Algorithms are NOT the problem. Understanding what data goes into those algorithms and how to interpret the results is the crux of the matter. Be very, very careful. Involving a data scientist after you've gathered the data is like involving a doctor after the patient...
20 IBM's Commitment to Spark. Official announcement (15th June 2015): IBM will build Spark into the core of its analytics and commerce platforms. IBM will commit over 3,500 researchers & developers to work on Spark-related projects. IBM will educate more than data scientists on Spark
21 Big Data University - free online training
22 Data Science before Big Data
23 Enter Big Data
24 Obvious solution: Big Data
25 IBM's Commitment to Spark. Official announcement (15th June 2015): IBM will build Spark into the core of its analytics and commerce platforms. IBM will commit over 3,500 researchers & developers to work on Spark-related projects. IBM will educate more than data scientists on Spark. IBM will open source SystemML and collaborate with Databricks to advance Spark's machine learning capabilities
26 Linear Regression Refresher. Simple Linear Regression: dependent variable (y), independent variables (X). In order to estimate the parameters we have to minimize the cost function. There is an elegant closed-form solution that minimizes it, which we can compute using R: a = t(X) %*% X + diag(lambda); b = t(X) %*% y; theta = solve(a,b);
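Read literally, the three R lines solve the linear system (XᵀX + diag(lambda)) theta = Xᵀy. The sketch below reproduces that computation in plain Python so it runs without external libraries. It makes two assumptions that are not in the slides: the regularizer is a single scalar `lam` added to every diagonal entry, and the small Gaussian-elimination `solve` stands in for R's built-in `solve`. The data is invented.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Move the row with the largest pivot into place for stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_fit(X, y, lam):
    """theta = solve(X^T X + lam * I, X^T y), mirroring the three R lines."""
    Xt = transpose(X)
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in Xt] for ci in Xt]
    for i in range(len(a)):
        a[i][i] += lam
    b = [sum(p * q for p, q in zip(col, y)) for col in Xt]
    return solve(a, b)


# Made-up data generated from y = 1 + 2x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = ridge_fit(X, y, lam=0.0)
print(theta)
```

With `lam=0.0` this is ordinary least squares; a positive `lam` shrinks the coefficients toward zero.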
27 Linear Regression - Execution. a = t(X) %*% X + diag(lambda); b = t(X) %*% y; theta = solve(a,b); Data: X with 500 features and 300M observations (4TB text file); y with 300M observations (9GB text file). MAP: compute XTX and yTX for each 1k x 1k block. REDUCE: aggregate the partial results into a and bT. In-memory computation ((a,b) < 2 MB): 1. get b 2. call solve(a,b). Cluster configuration: 3.5 GB Map Task JVM, 7 GB In-memory Master JVM, 128 MB HDFS block size
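The execution plan on this slide works because XᵀX and Xᵀy are sums over rows: each map task can compute them for its own block of rows, and the reduce step only has to add the small partial results. The sketch below illustrates that decomposition in plain Python on invented data. The function names are hypothetical, and the real system operates on blocked matrices in HDFS rather than Python lists.

```python
def map_block(x_block, y_block):
    """MAP: partial X^T X and X^T y for one block of rows."""
    k = len(x_block[0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for row, target in zip(x_block, y_block):
        for i in range(k):
            xty[i] += row[i] * target
            for j in range(k):
                xtx[i][j] += row[i] * row[j]
    return xtx, xty


def reduce_partials(partials):
    """REDUCE: element-wise sum of the per-block partial results."""
    k = len(partials[0][1])
    a = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for xtx, xty in partials:
        for i in range(k):
            b[i] += xty[i]
            for j in range(k):
                a[i][j] += xtx[i][j]
    return a, b


# Made-up data split into two row blocks, standing in for HDFS blocks.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
blocks = [(X[:2], y[:2]), (X[2:], y[2:])]
a, b = reduce_partials([map_block(xb, yb) for xb, yb in blocks])
print(a, b)
```

Only the k x k matrix `a` and the length-k vector `b` reach the final step, which is why the solve itself can run in memory on a single node even when X is far too large to do so.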
28 Changes that impact our implementation: 3 times more attributes; 300M more observations (600M x 500); the dataset fits in memory (1M, 100); cluster configuration change (300M x 500, Map Task JVM reduced from 3.5 GB to 1.5 GB). Baseline cluster configuration: 3.5 GB Map Task JVM, 7 GB In-memory Master JVM, 128 MB HDFS block size. [Diagram: where the XTX, XTy and solve(a,b) steps run in each resulting execution plan]
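Why data size and memory budgets change the plan can be seen from simple size arithmetic. The sketch below estimates dense double-precision matrix sizes and applies a toy rule: run in memory if X fits in the 7 GB master JVM, otherwise distribute. The rule, the function names and the 1M x 100 example are assumptions made for illustration; this is not how SystemML's optimizer actually decides.

```python
BYTES_PER_DOUBLE = 8
MIB = 2 ** 20
GIB = 2 ** 30


def dense_size_bytes(rows, cols):
    """Size of a dense matrix of 8-byte doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def choose_plan(rows, cols, master_budget_bytes=7 * GIB):
    """Toy rule (an assumption): in-memory only if X fits in the master JVM."""
    if dense_size_bytes(rows, cols) <= master_budget_bytes:
        return "in-memory"
    return "distributed"


# X^T X for 500 features is tiny, consistent with "(a,b) < 2 MB".
xtx_mib = dense_size_bytes(500, 500) / MIB
plan_large = choose_plan(300_000_000, 500)  # about 1.2 TB as dense doubles
plan_small = choose_plan(1_000_000, 100)    # about 0.8 GB as dense doubles
print(f"XtX: {xtx_mib:.2f} MiB, large: {plan_large}, small: {plan_small}")
```

A real cost-based compiler also has to weigh sparsity, the cost of each operator and the memory of every task, which is why doing this by hand for each dataset and cluster does not scale.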
29 To Summarize: 3 lines of code. Minor changes in the data set / cluster configuration result in: 4 dramatically different execution plans; a major change in performance; the best solution becoming a non-working solution. How can we manage this?
30 What's in the SystemML box: High-level language front-ends. High-Level Operations (HOPs): general representation of statements in the data analysis language. Low-Level Operations (LOPs): general representation of operations in the runtime framework. Multiple execution environments
31 Backend performance
32 Out-of-the-box algorithms
Category: Description
Descriptive Statistics: Univariate, Bivariate, Stratified Bivariate
Classification: Logistic Regression, Multi-class SVM, Naïve Bayes, Decision Trees, Random Forest
Clustering: k-Means
Regression: Linear Regression (System of equations, SGD)
Generalised Linear Models: Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli. Links for all distributions: identity, log, sq. root, inverse, 1/μ^2. Links for Binomial/Bernoulli: logit, probit, cloglog, cauchit
Stepwise: Linear, GLM
Dimensionality Reduction: PCA
Matrix Factorization: ALS
Survival Models: Kaplan Meier, Cox
Predict: Scoring
Transformation: Recoding, dummy coding, binning, scaling, missing value imputation
33 Summary. Key features: cost-based compilation; out-of-the-box scalable machine learning algorithms; support for custom algorithms (write your own code and don't worry about scalability, numeric stability, and optimization); use it standalone, with MR backend, or with Spark backend; fits into Spark APIs, consumes and produces DataFrames; ML Pipeline integration; use SystemML from Scala, Java, Python, R/SparkR; BigR integration (package)
34 Additional Resources. SystemML is available on GitHub. An in-depth scientific perspective: Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." ICDE 2011. Boehm, Matthias, et al. "SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs." IEEE Data Eng. Bull. 37.3 (2014). Huang, Botong, et al. "Resource Elasticity for Large-Scale Machine Learning." SIGMOD
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationCERN openlab & IBM Research Workshop Trip Report
CERN openlab & IBM Research Workshop Trip Report Jakob Blomer, Javier Cervantes, Pere Mato, Radu Popescu 2018-12-03 Workshop Organization 1 full day at IBM Research Zürich ~25 participants from CERN ~10
More informationIvy s Business Analytics Foundation Certification Details (Module I + II+ III + IV + V)
Ivy s Business Analytics Foundation Certification Details (Module I + II+ III + IV + V) Based on Industry Cases, Live Exercises, & Industry Executed Projects Module (I) Analytics Essentials 81 hrs 1. Statistics
More informationADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA
INSIGHTS@SAS: ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA AGENDA 09.00 09.15 Intro 09.15 10.30 Analytics using SAS Enterprise Guide Ellen Lokollo 10.45 12.00 Advanced Analytics using SAS
More informationClosing Thoughts on Machine Learning (ML in Practice)
Closing Thoughts on (ML in Practice) 1 Closing Thoughts on (ML in Practice) 1 When someone asks What is? Learning is any process by which a system improves performance from experience. - Herbert Simon
More informationEmbedded Technosolutions
Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication
More informationSTATISTICS (STAT) 200 Level Courses. 300 Level Courses. Statistics (STAT) 1
Statistics (STAT) 1 STATISTICS (STAT) 200 Level Courses STAT 250: Introductory Statistics I. 3 credits. Elementary introduction to statistics. Topics include descriptive statistics, probability, and estimation
More information