Big measurements and data analysis, challenges and some ideas
- Candice Dorsey
Transcription
1 Big measurements and data analysis, challenges and some ideas. University of Vaasa, March 9th 2017
2 What is big data? Big data is data sets that are too large and complex to be worked with using commonly available tools. [Snijders et al., 2012]
3 Is big data useful?
1. Soccer World Championships
2. Brexit
3. US presidential election
4. Climate change research
5. WindSoMe project (wind turbine noise)
4 Social network analysis
5 Characteristics of big data
6 Data driven science
7 WindSoMe research project HMS Ambient conditions Disturbance response Noise control Objective feedback Subjective feedback
8 Driver for Big Data: Machine Learning
9 What does the data know
The data can be used for predicting a variable y by learning a function f:

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
= f\left(
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn}
\end{bmatrix}
\right) \qquad (1)
$$

where X is the data matrix with n variables and m samples, and y is a result column vector with m values. If y is discrete, the function f is a classifier, and when y is continuous, f is a regressor. If y is known, the problem is said to be a supervised learning case; otherwise it is unsupervised. Sometimes it is useful to explore X by itself.
10 Algorithms
1. PCA: Can remove the correlation from the variables and reduce the dimensionality to make further processing more controllable
2. PLS: Multivariate regression method
3. Neural networks: For multivariate classification and regression
4. SVC: Multivariate classification and regression tool
5. NBC: Really fast to train and predict, supports training with partial data, which is very good when the whole data set cannot fit in memory at the same time. NBC in scikit-learn
6. Decision trees (white box): Ensemble classifiers with boosting or bagging (AdaBoost, random forest). See scikit-learn manual page. Can be used for feature selection as well.
7. L1 regularization to simplify the models: Lasso, LARS
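The remark that NBC supports training with partial data can be illustrated with a minimal sketch. It assumes a Gaussian naive Bayes classifier and uses only the Python standard library, so it is not the scikit-learn implementation the slide refers to. The class name `StreamingGaussianNB`, the labels and the sample values are hypothetical. Per-class means and variances are updated one chunk at a time with Welford's running-statistics algorithm, so no chunk has to be kept after it has been processed.

```python
import math
from collections import defaultdict


class StreamingGaussianNB:
    """Gaussian naive Bayes that learns from one chunk of data at a time."""

    def __init__(self):
        self.count = defaultdict(int)  # samples seen per class
        self.mean = {}                 # per-class running feature means
        self.m2 = {}                   # per-class running sums of squared deviations

    def partial_fit(self, rows, labels):
        """Update the running statistics with one chunk (Welford's algorithm)."""
        for x, y in zip(rows, labels):
            if y not in self.mean:
                self.mean[y] = [0.0] * len(x)
                self.m2[y] = [0.0] * len(x)
            self.count[y] += 1
            n = self.count[y]
            for j, value in enumerate(x):
                delta = value - self.mean[y][j]
                self.mean[y][j] += delta / n
                self.m2[y][j] += delta * (value - self.mean[y][j])

    def predict(self, x):
        """Return the class with the highest log posterior for one sample."""
        total = sum(self.count.values())
        best_label, best_score = None, -math.inf
        for y, n in self.count.items():
            score = math.log(n / total)  # log prior
            for j, value in enumerate(x):
                var = self.m2[y][j] / n + 1e-9  # small floor avoids division by zero
                score -= 0.5 * math.log(2 * math.pi * var)
                score -= (value - self.mean[y][j]) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = y, score
        return best_label


model = StreamingGaussianNB()
# Two chunks that never need to be in memory at the same time.
model.partial_fit([[1.0, 2.0], [1.2, 1.8]], ["quiet", "quiet"])
model.partial_fit([[5.0, 6.0], [5.2, 6.1], [0.9, 2.1]], ["noisy", "noisy", "quiet"])
prediction = model.predict([5.1, 5.9])
```

Because only counts, means and squared deviations are stored, the memory use depends on the number of classes and features, not on the number of samples.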
11 Tools for machine learning
1. Python + Scientific Python + scikit-learn + Pandas
   - Real object-oriented, versatile programming language, which is easy to learn
   - Good support for numerical computation
   - A lot of well-documented libraries for machine learning
   - Can work with Hadoop MapReduce
2. R
   - All new algorithms can be found in libraries
   - Best in statistics
   - Can work with Hadoop
3. MATLAB / Octave
   - Well suited for numerical computation
   - MATLAB can work with Hadoop, Spark
The interface to all of these is a programming language, with support for interactive operation. Hadoop integration should be studied further.
12 Tools of Big data
13 Challenges of big data
1. Storage
2. Data transfer
3. CPU capacity: The amount of stored data increases much faster than CPU capacity, which still follows Moore's law. [Kambatla et al., 2014]
4. Preprocessing, transformation and feature extraction: How to make processing faster by smartly calculating and selecting features.
5. Energy consumption: Energy-saving technologies develop much more slowly than data processing needs. In the future, energy saving may become an important issue in algorithm development.
14 Solutions
1. Speed: Simple algorithms, dimensionality reduction, feature selection, sample selection
2. Transfer: Distributed processing near the data: Hadoop
3. CPU capacity: Computer farms, CSC, cloud computing, supercomputers
4. Data parallelism: If the same operation is applied to many data items, the task can simply be distributed across the computer cluster. Data storage or transfer may still be the problem.
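Data parallelism can be sketched on a single machine as follows. This is only an illustration under stated assumptions: worker threads stand in for cluster nodes, the `rms` feature and the toy signal are hypothetical, and in CPython a process pool or a real cluster would be needed to gain speed on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def rms(chunk):
    """Root-mean-square value of one chunk of samples."""
    return (sum(v * v for v in chunk) / len(chunk)) ** 0.5


def split(samples, size):
    """Cut a sequence into consecutive chunks of at most `size` items."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


signal = [float(i % 4) for i in range(16)]  # 0, 1, 2, 3 repeated four times
chunks = split(signal, 4)

# The same operation is applied independently to every chunk,
# so the chunks can be handed to any number of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_result = list(pool.map(rms, chunks))

serial_result = [rms(c) for c in chunks]
```

The parallel and serial results are identical because no chunk depends on another; that independence is what makes the task simple to distribute.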
15 MapReduce
A common method for distributing calculation over the network in big data systems:
1. The central node is ordered to compute a task
2. The central node maps the task to the client nodes (Map)
3. The client nodes may in turn distribute the task hierarchically further
4. The results (key-value pairs) are returned to the parent, which combines them together (Reduce)
MapReduce is best suited to textual data and is less often applied in data-intensive science, possibly because the data sets there are complex and high-dimensional. [Buck et al., 2011]
Hadoop uses MapReduce to enable distributed computing of large data sets in their existing locations, and includes many other tools, such as the resource manager YARN and the distributed storage system HDFS.
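The Map and Reduce steps above can be sketched in a single process with the classic word count example. This is not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are hypothetical, and the grouping step that a framework performs between the two phases is written out by hand.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) key-value pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


documents = ["wind turbine noise", "wind noise", "noise"]
# Each document could be mapped on a different node; here it is done in a loop.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
```

Only the small key-value pairs travel between the phases, which is why the approach fits processing near the data.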
16 Storage
- Hard disks, DAS/NAS: The cheapest (WD Blue WD40EZRZ 4 TB hard disk, 165 e [42 e/TB]). Reliability and volume requirements can be fulfilled with RAID and logical volume solutions.
- Solid state memory: The fastest; 10x cost, 10x speed.
- Cloud storage: Flexible, slower access, can become more expensive.
- CSC workspace: Quota 5120 GB in workspace, no backups!
- Distributed file system: HDFS (still needs physical devices)
Ideal storage system? 1) Tolerance of network partitioning, 2) consistency, 3) high availability. According to Brewer's CAP theorem, you can have only two. [Brewer, 2000]
17 NoSQL storage Centralized databases cannot handle really big data sets -> use distributed ones, like NoSQL. Source: Apache Hadoop
18 CPU capacity: Other solutions
Solution: cores, price, unit price
- HPE ProLiant tower, 1x Xeon E: 250 e/core
- Rack server computer, 2x Xeon E: 187 e/core
- Amazon on-demand m4.large: 1.4 e/coreday
- Amazon on-demand m4.4xlarge: 1.4 e/coreday
- Amazon on-demand m4.16xlarge: 1.4 e/coreday
- Amazon Hadoop cluster: 10 cores, > 0.15 e/h, 0.36 e/coreday
If you can utilize more than about 150 coredays for each core of your server in its lifetime, it will be cheaper than leasing from the cloud. Of course, a server needs service and energy, and is not as flexible as on-demand cloud. It is better for the organization to purchase CPU power for shared use than for a project to purchase it for private use.
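The break-even figure of about 150 coredays can be checked from the unit prices on the slide. The sketch below assumes that only the purchase price per core and the on-demand price per coreday matter; service and energy costs, which the slide mentions, are ignored. The function name `break_even_coredays` is hypothetical.

```python
def break_even_coredays(server_price_per_core, cloud_price_per_coreday):
    """Core-days of use after which an owned core costs less than a leased one."""
    return server_price_per_core / cloud_price_per_coreday


# Unit prices from the slide: 250 e/core, 187 e/core and 1.4 e/coreday.
tower = break_even_coredays(250.0, 1.4)
rack = break_even_coredays(187.0, 1.4)
print(f"tower server: {tower:.0f} coredays, rack server: {rack:.0f} coredays")
# prints "tower server: 179 coredays, rack server: 134 coredays"
```

Both values lie on either side of the rounded figure of about 150 coredays.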
19 CSC CSC: CPU cores + GPU cluster
20 Cloud based big data systems [Hashem et al., 2015]
21 Cases
22 Case: Climate change research
MERRA Analytic Services: Meeting the Big Data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service [Schnase et al., 2017].
- NASA's climate change repositories alone are projected to grow to 350 petabytes by 2013
- Data sets themselves cannot be moved: instead, analytical operations need to migrate to where the data reside
- The ability to respond quickly to customer demands for new and often unanticipated uses for climate data requires greater agility in building and deploying applications
- Data is stored in the Hadoop HDFS distributed file system
- The data is originally stored in NetCDF format and converted to HDFS
- The interface to other users is provided through RESTful web services
23 WindSoMe research project HMS Ambient conditions Disturbance response Noise control Objective feedback Subjective feedback
24 Purpose of the project
- Measure sound emission and immission and weather conditions, and collect dweller feedback over a whole year.
- Develop suitable noise metrics to evaluate both the level and the characteristics of the sound.
- Assess dweller attitudes and opinions by questionnaires.
- Emission, immission and weather measurements are used to calculate sound propagation in all weather conditions.
- Real-time noise feedback is used for finding out why noise is annoying: level, AM, tonality, or other reasons.
- A survey is made to study background reasons and general opinions.
- Tonality, amplitude modulation and other characteristics of the sound will be studied.
- The measurement, identification and prevention of mechanical wind turbine noise will be studied.
- Disseminate research results to the public.
25 Sound data measurement infrastructure
26 Data storage infrastructure
- The data is stored in HDF5 files in a filesystem, including the metadata related to the data acquisition: sampling rate, sampling time, sampling location, etc.
- HDF5 can be easily used in Python, R, and MATLAB/Octave. It supports transparent compression of data.
- The metadata and calculated features are stored in an SQL database.
- SQLite files are used for exchanging structured data.
- Some of the data and calculated features are accessible from the WSAA web system, which also provides a RESTful API.
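How metadata and calculated features might sit in an SQL database next to the HDF5 files can be sketched with Python's built-in `sqlite3` module. The schema is an assumption made for illustration and not the project's actual one: the table names `recording` and `feature`, the column names, the file path and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used for exchange
conn.execute(
    """CREATE TABLE recording (
           id INTEGER PRIMARY KEY,
           hdf5_path TEXT NOT NULL,      -- where the raw samples live
           location TEXT NOT NULL,
           start_time TEXT NOT NULL,     -- ISO 8601 timestamp
           sampling_rate_hz REAL NOT NULL
       )"""
)
conn.execute(
    """CREATE TABLE feature (
           recording_id INTEGER REFERENCES recording(id),
           name TEXT NOT NULL,
           value REAL NOT NULL
       )"""
)
# Hypothetical example rows: one recording and two features calculated from it.
conn.execute(
    "INSERT INTO recording VALUES (1, 'station1/example.h5', 'station1', "
    "'2017-03-09T00:00:00', 44100.0)"
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?)",
    [(1, "level_dba", 41.5), (1, "am_index", 0.3)],
)
conn.commit()

# Find the raw files whose calculated level exceeds 40 dBA.
rows = conn.execute(
    """SELECT r.hdf5_path, f.value
       FROM recording AS r JOIN feature AS f ON f.recording_id = r.id
       WHERE f.name = 'level_dba' AND f.value > 40"""
).fetchall()
```

Keeping only small feature values and file paths in the database lets queries run quickly, while the bulky samples stay in the HDF5 files.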
27 Installation of measurement stations
28 WindSoMe machine learning case
Processing chain: Acquisition -> Transfer -> Preprocessing (scaling + FFT) -> Feature extraction (PCA) -> Prediction (mean shift clustering)
- FFT provides features for recognizing sound sources
- Replace PCA with Incremental PCA
- Large-window data analysis is needed to capture slow processes -> additional features, like AM-modulation index, wind shield noise index
- Can feature selection eliminate the need for a full FFT and get rid of PCA?
29 Sound pressure signal (x) [Figure: sound pressure / Pa versus time / s]
30 Potential features [Figure panels: amplitude / Pa, amplitude / dB, angle and delay / s, each versus time / s]
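Why a frequency transform provides features for recognizing sound sources can be shown with a tonal test signal. The sketch below uses a naive O(n^2) discrete Fourier transform written with the standard library; an FFT computes the same transform much faster and would be used in practice. The sampling rate, the tone frequency and the function name `dft_magnitudes` are hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude spectrum of a real signal via the naive O(n^2) DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total) / n)
    return spectrum


sampling_rate = 64   # Hz, hypothetical
tone_frequency = 8   # Hz, hypothetical tonal component
signal = [
    math.sin(2 * math.pi * tone_frequency * t / sampling_rate)
    for t in range(sampling_rate)  # one second of samples
]

magnitudes = dft_magnitudes(signal)
# The strongest bin identifies the tone; with one second of data a bin is 1 Hz wide.
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_frequency = peak_bin * sampling_rate / len(signal)
```

The tone that is hard to see in the time-domain samples becomes a single dominant value in the spectrum, which is the kind of feature a later clustering step can work with.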
31 Spectrogram (X)
32 Automated noise analysis (Clustering)
33 Result: Automated noise analysis [Figure: sound pressure level / dBA versus time of day; series: original average, fixed average]
34 Conclusions
- Digitalization provides big data sets
- Possibilities for data-driven research
- Distributed storage and computation are needed
- Fast algorithms and careful data selection help
- Big data requires serious programming
In the WindSoMe case, analysis of big measurement data could tell which features of noise annoy people, and in which weather conditions they appear. Furthermore, it can automate environmental sound measurement and analysis, making it affordable.
35 References
Brewer, E. A. (2000). Towards robust distributed systems. In PODC, volume 7.
Buck, J. B., Watkins, N., LeFevre, J., Ioannidou, K., Maltzahn, C., Polyzotis, N., and Brandt, S. (2011). SciHadoop: Array-based query processing in Hadoop. In 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., and Ullah Khan, S. (2015). The rise of big data on cloud computing: Review and open research issues. Information Systems, 47.
Kambatla, K., Kollias, G., Kumar, V., and Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7).
Schnase, J. L., Duffy, D. Q., Tamkin, G. S., Nadeau, D., Thompson, J. H., Grieg, C. M., McInerney, M. A., and Webster, W. P. (2017). MERRA Analytic Services: Meeting the Big Data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service. Computers, Environment and Urban Systems, 61, Part B.
Snijders, C., Matzat, U., and Reips, U.-D. (2012). "Big Data": Big Gaps of Knowledge in the Field of Internet Science. International Journal of Internet Science, 7(1):1-5.
More informationChallenges in HPC I/O
Challenges in HPC I/O Universität Basel Julian M. Kunkel German Climate Computing Center / Universität Hamburg 10. October 2014 Outline 1 High-Performance Computing 2 Parallel File Systems and Challenges
More informationEvent: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect
Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of
More informationTHE FUTURE OF BUSINESS DEPENDS ON SOFTWARE DEFINED STORAGE (SDS)
THE FUTURE OF BUSINESS DEPENDS ON SOFTWARE DEFINED STORAGE (SDS) How SSDs can fit into and accelerate an SDS strategy SPONSORED BY TABLE OF CONTENTS Introduction 3 An Overview of SDS 4 Achieving the Goals
More informationThe Future of Business Depends on Software Defined Storage (SDS) How SSDs can fit into and accelerate an SDS strategy
The Future of Business Depends on Software Defined Storage (SDS) Table of contents Introduction 2 An Overview of SDS 3 Achieving the Goals of SDS Hinges on Smart Hardware Decisions 5 Assessing the Role
More informationTHE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY
THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly
More informationThe Evolution of Big Data Platforms and Data Science
IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering
More informationUNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017
UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in
More informationscikit-learn (Machine Learning in Python)
scikit-learn (Machine Learning in Python) (PB13007115) 2016-07-12 (PB13007115) scikit-learn (Machine Learning in Python) 2016-07-12 1 / 29 Outline 1 Introduction 2 scikit-learn examples 3 Captcha recognize
More informationData. Big: TiB - PiB. Small: MiB - GiB. Supervised Classification Regression Recommender. Learning. Model
2 Supervised Classification Regression Recommender Data Big: TiB - PiB Learning Model Small: MiB - GiB Unsupervised Clustering Dimensionality reduction Topic modeling 3 Example Formation Examples Modeling
More informationApplying Supervised Learning
Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains
More informationMapR Enterprise Hadoop
2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS
More informationBIG DATA COURSE CONTENT
BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data
More informationSpecialist ICT Learning
Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.
More informationFusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic
WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationStream Processing on IoT Devices using Calvin Framework
Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised
More informationModernizing Business Intelligence and Analytics
Modernizing Business Intelligence and Analytics Justin Erickson Senior Director, Product Management 1 Agenda What benefits can I achieve from modernizing my analytic DB? When and how do I migrate from
More information