ML pipelines at OK. Dmitry Bugaychenko
|
|
- Coral Banks
- 6 years ago
- Views:
Transcription
1 ML pipelines at OK Dmitry Bugaychenko
2
3 OK is monthly unique users
4 OK is family links in the social graph
5 OK is A place where people share their posi9ve feelings
6 OK is servers around the globe 1+Tb/s of outgoing traffic 400+ sodware components High Load, Big Data, Fault Tolerance
7 OK is 3 Hadoop clusters 30+ petabytes storage (+20Tb daily) cores 40+ TB RAM 300+ regular jobs
8 Data Mining in 3 Steps Get the data
9 Data Mining in 3 Steps Get the data Find unobvious dependencies
10 Data Mining in 3 Steps Get the data Find unobvious dependencies Use them to make decisions
11 Data Mining with Machine Learning Get the data Find unobvious dependencies By fi"ng parameters of a mathema9cal model Use them to make decisions
12 ML Tasks at OK Find the good stuff Recommender systems Find the bad stuff An9-spam systems Understand the users Explora9ve analy9cs
13 Smart Data Toolbox at OK
14 Spark ML Pipelines
15 Spark ML Pipelines Transformer
16 Spark ML Pipelines Transformer Estimator
17 Spark ML Pipelines
18 Spark ML Pipelines
19 Spark ML Pipelines
20 Meet the OK ML Pipelines Open source project: hyps://github.com/odnoklassniki/ok-ml-pipelines Extends Spark ML pipelines for Flexible dataflow organiza9on BeYer cluster u9liza9on ML scaling and opera9onalizing
21 Example task Churn predic9on Input: 10M users with 200+ ayributes Output: user s reten9on Method: Find unobvious dependencies Build a predic9on model Analyze the models weights Use them to make decisions Profit!
22 The first pipeline: Extract features
23 The first pipeline: Convert to vector
24 The first pipeline: Choose the ML method
25 Logistic Regression at a glance 1 1+ e ( a x+b )
26 Logistic Regression at a glance 1 1+ e ( a x+b )
27 Logistic Regression at a glance 1 1+ e ( a x+b )
28 The first pipeline: Train the model
29 The first pipeline: Get the predictions
30 The first pipeline: Extract Features
31 The first pipeline: Convert to vector
32 The first pipeline: Train the model
33 The most known histogram in ML world
34 Caching example: the simple way
35 Caching example: the simple way failure
36 OK ML Pipelines: Unwrapped Stage
37 OK ML Pipelines: Unwrapped Stage
38 More cases of this kind Cache Repar99oning Random sampling Ordered cut Persist data to temp folder
39 Back to the goal: exploring weights We do not need the model itself We need some insights about the churn Lets look at the weights
40 Back to the goal: exploring weights We do not need the model itself We need some insights about the churn Lets look at the weights
41 Back to the goal: exploring weights
42 But wait! Spark ML does NOT work this way!
43 Back to the goal: exploring weights We do not need the model itself We need some insights about the churn Lets look at the weights
44 OK ML Pipelines: Model Summary Collec9on of named DataFrame s OK ML es9mators produces summary Saved as parquet files
45 OK ML Pipelines: Model Summary Collec9on of named DataFrame s OK ML es9mators produces summary Saved as parquet files Spark ML es9mator might be wrapped to produce summary
46 Back to the goal: how good is the model? Does the model make sense? We need to cross-validate the quality
47 Cross-validation
48 OK ML Pipelines: Evaluation
49 OK ML Pipelines: Evaluation
50 OK ML Pipelines: Evaluation
51 OK ML Pipelines: Evaluation
52 Evaluation: Spark ML vs. OK ML Spark ML Evaluator, producing a single scalar Used for hyper parameters tuning OK ML Pipelines Evaluator is a transformer conver9ng predic9ons to metrics Injected using unwrapped stage Metrics block is added to the model summary
53 OK ML Pipelines: Evaluation Binary classifica9on evaluator ROC, RP-curve, F1-plot Par99oned ranking evaluator NDCG, AUC, Post processing evaluator collects stat for metrics Train/test evaluator Cross-valida9on
54 Some thoughts on cross-validation 10+1 models No shared mutable state 90% intersec9on on input
55 Some thoughts on cross-validation 10+1 models No shared mutable state 90% intersec9on on input Single trainer u9lizes less than 100 cores cores in the cluster
56 Spark ML CrossValidator Sequen9al Cache each fold separately
57 Spark ML CrossValidator Sequen9al Cache each fold separately
58 OK ML CrossValidator Cache shared source data Train folds in parallel
59 OK ML Pipelines: Forked Estimator
60 OK ML Pipelines: Forked Estimator Model segmenta9on Mul9-class classifica9on Feature selec9on Cross valida9on Grid search work in progress
61 Cherry on the pie: Features scaling
62 Spark ML Pipelines: Scaling Two models to move to prod Two set of weights to keep
63 Spark ML Pipelines: Scaling Two models to move to prod Two set of weights to keep
64 Spark ML Pipelines: Scaling x i x i sd(x i )
65 Spark ML Pipelines: Scaling x i x i sd(x i ) 1 ( ) 1+ e a x+b
66 OK ML Pipelines: inline scaling aʹ = i a i sd(x i ) ;b' = b ʹ a x
67 OK ML Pipelines: Model Transformer
68 OK ML Pipelines: Model Transformer
69 OK ML Pipelines: Model Transformer
70 OK ML Pipelines: inline scaling
71 Even more cool stuff NLP transformers: Language detec9on Language aware analyzer URL eliminator Vectorizer N-gram extractor Trending term detec9on LDA work in progress Stat u9ls Vector stat collector Extended online summarizer Learning algorithms Matrix LBFGS DSRVGD CRR
72 How to use it Spark 2.2 (sbt): librarydependencies += "ru.odnoklassniki" %% "ok-ml-pipelines" % "0.1-spark2.2" withsources() Spark 1.6 (sbt): librarydependencies += "ru.odnoklassniki" %% "ok-ml-pipelines" % "0.1-spark1.6" withsources() Sources: hyps://github.com/odnoklassniki/ok-ml-pipelines
73 Thank you for your attention!
From click to predict and back: ML pipelines at OK. Dmitry Bugaychenko
From click to predict and back: ML pipelines at OK Dmitry Bugaychenko OK is 70 000 000+ monthly unique users OK is 800 000 000+ family links in the social graph OK is A place where people share their posi9ve
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationHPCSoC Modeling and Simulation Implications
Department Name (View Master > Edit Slide 1) HPCSoC Modeling and Simulation Implications (Sharing three concerns from an academic research user perspective using free, open tools. Solutions left to the
More informationSpecialist ICT Learning
Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.
More informationAbout the Course. Reading List. Assignments and Examina5on
Uppsala University Department of Linguis5cs and Philology About the Course Introduc5on to machine learning Focus on methods used in NLP Decision trees and nearest neighbor methods Linear models for classifica5on
More informationStages of (Batch) Machine Learning
Evalua&on Stages of (Batch) Machine Learning Given: labeled training data X, Y = {hx i,y i i} n i=1 Assumes each x i D(X ) with y i = f target (x i ) Train the model: model ß classifier.train(x, Y ) x
More informationMACHINE LEARNING TOOLBOX. Logistic regression on Sonar
MACHINE LEARNING TOOLBOX Logistic regression on Sonar Classification models Categorical (i.e. qualitative) target variable Example: will a loan default? Still a form of supervised learning Use a train/test
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationReal-Time Alarm Verifica/on with Spark Streaming and Machine Learning
Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning Ana Sima, Jan Stampfli, Kurt Stockinger Zurich University of Applied Sciences Swiss Big Data User Group Mee3ng January 23, 2017 About
More informationML4Bio Lecture #1: Introduc3on. February 24 th, 2016 Quaid Morris
ML4Bio Lecture #1: Introduc3on February 24 th, 216 Quaid Morris Course goals Prac3cal introduc3on to ML Having a basic grounding in the terminology and important concepts in ML; to permit self- study,
More informationLecture 25: Review I
Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,
More informationFlash Reliability in Produc4on: The Importance of Measurement and Analysis in Improving System Reliability
Flash Reliability in Produc4on: The Importance of Measurement and Analysis in Improving System Reliability Bianca Schroeder University of Toronto (Currently on sabbatical at Microsoft Research Redmond)
More informationCS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.
CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE
More informationCreating a Recommender System. An Elasticsearch & Apache Spark approach
Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused
More informationTutorials (M. Biehl)
Tutorials 09-11-2018 (M. Biehl) Suggestions: - work in groups (as formed for the other tutorials) - all this should work in the python environments that you have been using; but you may also switch to
More informationProgramming Systems for Big Data
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationNowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?
Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/
More informationACHIEVEMENTS FROM TRAINING
LEARN WELL TECHNOCRAFT DATA SCIENCE/ MACHINE LEARNING SYLLABUS 8TH YEAR OF ACCOMPLISHMENTS AUTHORIZED GLOBAL CERTIFICATION CENTER FOR MICROSOFT, ORACLE, IBM, AWS AND MANY MORE. 8411002339/7709292162 WWW.DW-LEARNWELL.COM
More informationMapReduce, Apache Hadoop
NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/ svoboda/courses/2016-1-ndbi040/ Lecture 2 MapReduce, Apache Hadoop Marn Svoboda svoboda@ksi.mff.cuni.cz 11. 10. 2016 Charles University
More informationMLI - An API for Distributed Machine Learning. Sarang Dev
MLI - An API for Distributed Machine Learning Sarang Dev MLI - API Simplify the development of high-performance, scalable, distributed algorithms. Targets common ML problems related to data loading, feature
More informationClassifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA
Classifica(on and Clustering with WEKA 1 Schedule: Classifica(on and Clustering with WEKA 1. Presentation of WEKA. 2. Your turn: perform classification and clustering. 2 WEKA Weka is a collec*on of machine
More informationBigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite
BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite Xinhui Tian, Shaopeng Dai, Zhihui Du, Wanling Gao, Rui Ren, Yaodong Cheng, Zhifei Zhang, Zhen Jia, Peijian Wang and Jianfeng Zhan INSTITUTE
More informationCrossing the AI Chasm. What We Learned from building Apache PredictionIO (incubating)
Crossing the AI Chasm What We Learned from building Apache PredictionIO (incubating) Simon Chan Sr. Director, Product Management, Salesforce Co-founder, PredictionIO PhD, University College London simon@salesforce.com
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationAutomation.
Automation www.austech.edu.au WHAT IS AUTOMATION? Automation testing is a technique uses an application to implement entire life cycle of the software in less time and provides efficiency and effectiveness
More informationSpark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM
Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 24, 2015 Course Information Website: www.stat.ucdavis.edu/~chohsieh/ecs289g_scalableml.html My office: Mathematical Sciences Building (MSB)
More informationDistributed Computing with Spark
Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing
More informationMachine Learning Crash Course: Part I
Machine Learning Crash Course: Part I Ariel Kleiner August 21, 2012 Machine learning exists at the intersec
More informationPrinciples of Machine Learning
Principles of Machine Learning Lab 3 Improving Machine Learning Models Overview In this lab you will explore techniques for improving and evaluating the performance of machine learning models. You will
More informationMapReduce, Apache Hadoop
Czech Technical University in Prague, Faculty of Informaon Technology MIE-PDB: Advanced Database Systems hp://www.ksi.mff.cuni.cz/~svoboda/courses/2016-2-mie-pdb/ Lecture 12 MapReduce, Apache Hadoop Marn
More informationSpark Streaming. Guido Salvaneschi
Spark Streaming Guido Salvaneschi 1 Spark Streaming Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark s batch and interactive
More informationSpark, Shark and Spark Streaming Introduction
Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References
More informationReal-time personal trainer on the SMACK Jan Anirvan Chakraborty
Real-time personal trainer on the SMACK stack @honzam399 Jan Machacek @anirvan_c Anirvan Chakraborty Automated personal trainer - muvr Suggests the sequence of exercise sessions Suggests exercises in a
More informationA Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System
A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System Ilkay Al(ntas and Daniel Crawl San Diego Supercomputer Center UC San Diego Jianwu Wang UMBC WorDS.sdsc.edu Computa3onal
More informationVisualization and text mining of patent and non-patent data
of patent and non-patent data Anton Heijs Information Solutions Delft, The Netherlands http://www.treparel.com/ ICIC conference, Nice, France, 2008 Outline Introduction Applications on patent and non-patent
More informationAs a reference, please find a version of the Machine Learning Process described in the diagram below.
PREDICTION OVERVIEW In this experiment, two of the Project PEACH datasets will be used to predict the reaction of a user to atmospheric factors. This experiment represents the first iteration of the Machine
More informationJialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat
H5Spark H5Spark: Bridging the I/O Gap between Spark and Scien9fic Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike
More informationScalability in a Real-Time Decision Platform
Scalability in a Real-Time Decision Platform Kenny Shi Manager Software Development ebay Inc. A Typical Fraudulent Lis3ng fraud detec3on architecture sync vs. async applica3on publish messaging bus request
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationData Science Bootcamp Curriculum. NYC Data Science Academy
Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More informationData Science Course Content
CHAPTER 1: INTRODUCTION TO DATA SCIENCE Data Science Course Content What is the need for Data Scientists Data Science Foundation Business Intelligence Data Analysis Data Mining Machine Learning Difference
More informationComparison of Optimization Methods for L1-regularized Logistic Regression
Comparison of Optimization Methods for L1-regularized Logistic Regression Aleksandar Jovanovich Department of Computer Science and Information Systems Youngstown State University Youngstown, OH 44555 aleksjovanovich@gmail.com
More informationSubmitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay
Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache
More informationCS145: INTRODUCTION TO DATA MINING
CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data
More informationSparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics
SparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics Min LI,, Jian Tan, Yandong Wang, Li Zhang, Valentina Salapura, Alan Bivens IBM TJ Watson Research Center * A
More informationAbout Intellipaat. About the Course. Why Take This Course?
About Intellipaat Intellipaat is a fast growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 700,000 in over
More informationCertified Data Science with Python Professional VS-1442
Certified Data Science with Python Professional VS-1442 Certified Data Science with Python Professional Certified Data Science with Python Professional Certification Code VS-1442 Data science has become
More informationAmol Deshpande, University of Maryland Lisa Hellerstein, Polytechnic University, Brooklyn
Amol Deshpande, University of Maryland Lisa Hellerstein, Polytechnic University, Brooklyn Mo>va>on: Parallel Query Processing Increasing parallelism in compu>ng Shared nothing clusters, mul> core technology,
More informationFast, Interactive, Language-Integrated Cluster Computing
Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org
More information6,000 Cameras in Time Square 210 million Cameras worldwide
SMILE!! You are on camera 75 $mes per day Average American ci$zen can be caught on camera 1:29 Camera to person ra$o World Wide 6,000 Cameras in Time Square 210 million Cameras worldwide What is the LTO
More informationEvaluation Metrics. (Classifiers) CS229 Section Anand Avati
Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,
More informationDivide & Recombine (D&R) with the DeltaRho D&R Software. D&R +δρ
1 Divide & Recombine (D&R) with the DeltaRho D&R Software D&R +δρ http://deltarho.org Meet the statistical and computational challenges of deep analysis of big data and high computational complexity of
More informationCREDIT RISK MODELING IN R. Finding the right cut-off: the strategy curve
CREDIT RISK MODELING IN R Finding the right cut-off: the strategy curve Constructing a confusion matrix > predict(log_reg_model, newdata = test_set, type = "response") 1 2 3 4 5 0.08825517 0.3502768 0.28632298
More informationBenchmarking Spark ML using BigBench. Sweta Singh TPCTC 2016
Benchmarking Spark ML using BigBench Sweta Singh singhswe@us.ibm.com TPCTC 2016 Motivation Study the performance of Machine Learning use cases on large data warehouses in context of assessing Alternate
More informationSearch Engines. Informa1on Retrieval in Prac1ce. Annota1ons by Michael L. Nelson
Search Engines Informa1on Retrieval in Prac1ce Annota1ons by Michael L. Nelson All slides Addison Wesley, 2008 Evalua1on Evalua1on is key to building effec$ve and efficient search engines measurement usually
More informationAdditional License Authorizations. For Vertica software products
Additional License Authorizations Products and suites covered Products E-LTU or E-Media available * Perpetual License Non-production use category ** Term License Non-production use category (if available)
More informationModern Stream Processing with Apache Flink
1 Modern Stream Processing with Apache Flink Till Rohrmann GOTO Berlin 2017 2 Original creators of Apache Flink da Platform 2 Open Source Apache Flink + da Application Manager 3 What changes faster? Data
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationRapid Extraction and Updating Road Network from LIDAR Data
Rapid Extraction and Updating Road Network from LIDAR Data Jiaping Zhao, Suya You, Jing Huang Computer Science Department University of Southern California October, 2011 Research Objec+ve Road extrac+on
More informationData Platforms and Pattern Mining
Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,
More informationMachine Learning in Python. Rohith Mohan GradQuant Spring 2018
Machine Learning in Python Rohith Mohan GradQuant Spring 2018 What is Machine Learning? https://twitter.com/myusuf3/status/995425049170489344 Traditional Programming Data Computer Program Output Getting
More informationBig Data on AWS. Peter-Mark Verwoerd Solutions Architect
Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing
More informationIntroduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent
Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage
More informationToday s content. Resilient Distributed Datasets(RDDs) Spark and its data model
Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,
More informationIntro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect
Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationBig Data con MATLAB. Lucas García The MathWorks, Inc. 1
Big Data con MATLAB Lucas García 2015 The MathWorks, Inc. 1 Agenda Introduction Remote Arrays in MATLAB Tall Arrays for Big Data Scaling up Summary 2 Architecture of an analytics system Data from instruments
More informationIntroduc)on to Informa)on Visualiza)on
Introduc)on to Informa)on Visualiza)on Seeing the Science with Visualiza)on Raw Data 01001101011001 11001010010101 00101010100110 11101101011011 00110010111010 Visualiza(on Applica(on Visualiza)on on
More informationWhy do we need graph processing?
Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group
More informationTutorial on Machine Learning Tools
Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationSUPERVISED LEARNING WITH SCIKIT-LEARN. How good is your model?
SUPERVISED LEARNING WITH SCIKIT-LEARN How good is your model? Classification metrics Measuring model performance with accuracy: Fraction of correctly classified samples Not always a useful metric Class
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationAbout Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals
More informationPython With Data Science
Course Overview This course covers theoretical and technical aspects of using Python in Applied Data Science projects and Data Logistics use cases. Who Should Attend Data Scientists, Software Developers,
More informationTackling Big Data Using MATLAB
Tackling Big Data Using MATLAB Alka Nair Application Engineer 2015 The MathWorks, Inc. 1 Building Machine Learning Models with Big Data Access Preprocess, Exploration & Model Development Scale up & Integrate
More informationPractical Machine Learning Agenda
Practical Machine Learning Agenda Starting From Log Management Moving To Machine Learning PunchPlatform team Thales Challenges Thanks 1 Starting From Log Management 2 Starting From Log Management Data
More informationBotGraph: Large Scale Spamming Botnet Detec5on
BotGraph: Large Scale Spamming Botnet Detec5on Yao Zhao Yinglian Xie *, Fang Yu *, Qifa Ke *, Yuan Yu *, Yan Chen and Eliot Gillum EECS Department, Northwestern University MicrosoK Research Silicon Valley
More informationPython Certification Training
About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over
More informationCAP5415-Computer Vision Lecture 13-Support Vector Machines for Computer Vision Applica=ons
CAP5415-Computer Vision Lecture 13-Support Vector Machines for Computer Vision Applica=ons Guest Lecturer: Dr. Boqing Gong Dr. Ulas Bagci bagci@ucf.edu 1 October 14 Reminders Choose your mini-projects
More informationSAS Visual Analytics 8.1: Getting Started with Analytical Models
SAS Visual Analytics 8.1: Getting Started with Analytical Models Using This Book Audience This book covers the basics of building, comparing, and exploring analytical models in SAS Visual Analytics. The
More informationCloud Analytics and Business Intelligence on AWS
Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse
More informationBeacon Catalog. Categories:
Beacon Catalog Find the Data Beacons you need to build Custom Dashboards to answer your most pressing digital marketing questions, enable you to drill down for more detailed analysis and provide the data,
More informationA Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression
Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study
More informationKafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM
Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With Spark since 2014 With EPAM since 2015 About Contacts E-mail
More informationBIG DATA TESTING: A UNIFIED VIEW
http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation
More informationCS249: ADVANCED DATA MINING
CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal
More informationSolutions. Algebra II Journal. Module 2: Regression. Exploring Other Function Models
Solutions Algebra II Journal Module 2: Regression Exploring Other Function Models This journal belongs to: 1 Algebra II Journal: Reflection 1 Before exploring these function families, let s review what
More informationUsing Alluxio to Improve the Performance and Consistency of HDFS Clusters
ARTICLE Using Alluxio to Improve the Performance and Consistency of HDFS Clusters Calvin Jia Software Engineer at Alluxio Learn how Alluxio is used in clusters with co-located compute and storage to improve
More informationCross- Valida+on & ROC curve. Anna Helena Reali Costa PCS 5024
Cross- Valida+on & ROC curve Anna Helena Reali Costa PCS 5024 Resampling Methods Involve repeatedly drawing samples from a training set and refibng a model on each sample. Used in model assessment (evalua+ng
More informationIntegrating Splunk with AWS services:
Integrating Splunk with AWS services: Using Redshi+, Elas0c Map Reduce (EMR), Amazon Machine Learning & S3 to gain ac0onable insights via predic0ve analy0cs via Splunk Patrick Shumate Solutions Architect,
More informationNew Features and Enhancements in Big Data Management 10.2
New Features and Enhancements in Big Data Management 10.2 Copyright Informatica LLC 2017. Informatica, the Informatica logo, Big Data Management, and PowerCenter are trademarks or registered trademarks
More informationAdditional License Authorizations. For HPE HAVEn and Vertica Analytics Platform software products
Additional License Authorizations For HPE HAVEn and Vertica Analytics Platform software products Products and suites covered Products E-LTU or E-Media available * Non-production use category ** HPE Vertica
More informationBenchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms
Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms Candidate Andrea Spina Advisor Prof. Sonia Bergamaschi Co-Advisor Dr. Tilmann Rabl Co-Advisor
More informationFraud Detection using Machine Learning
Fraud Detection using Machine Learning Aditya Oza - aditya19@stanford.edu Abstract Recent research has shown that machine learning techniques have been applied very effectively to the problem of payments
More informationLeveraging AI on the Cloud to transform your business. Florida Business Analytics Forum 2018 at University of South Florida
Leveraging AI on the Cloud to transform your business Florida Business Analytics Forum 2018 at University of South Florida 1 My (unusual) path to Google Neural networks at NOAA 2 DNNs solved image analysis
More informationIntroduction to Data Science
Introduction to Data Science Lab 4 Introduction to Machine Learning Overview In the previous labs, you explored a dataset containing details of lemonade sales. In this lab, you will use machine learning
More information