Accelerator Design for Big Data Processing Frameworks
|
|
- Brook Gibbs
- 6 years ago
- Views:
Transcription
1 Accelerator Design for Big Data Processing Frameworks Hiroki Matsutani Dept. of ICS, Keio University July 5th, 2017 International Forum on MPSoC for Software-Defined Hardware (MPSoC'17) 1
2 2 processing Input stream data Big data (Surveillance, Network service, SNS, UAV, IoT) Message queuing Data exchange (serialization) Learning Database layer (Polyglot persistence) DB queries Customer analysis Topic prediction chain records Geolocation query processing requires time and thus recent data cannot be reflected to the analysis result Combine batch and stream processing to make up the realtime capability
3 + Stream processing Input stream data Message queuing Stream Inference Database layer (Polyglot persistence) Realtime DB queries Big data (Surveillance, Network service, SNS, UAV, IoT) Data exchange (serialization) Learning Customer analysis Topic prediction chain records Geolocation query Nathan Marz, et.al., "Big Data: Principles and best practices of scalable realtime data systems", Manning Publications (2015). 3
4 + Stream processing Input stream data Message queuing Stream Inference Database layer (Polyglot persistence) Realtime DB queries Big data (Surveillance, Network service, SNS, UAV, IoT) I/O intensive Data exchange (serialization) Learning Customer analysis Topic prediction chain records Geolocation query Compute intensive Message queuing middleware (Apache Kafka) Serialization (Apache Thrift) Stream processing (Apache Spark Streaming) KVS / Column DB (Redis, HBase) Online machine learning (Classification, Outlier detection, Change point detection, Abnormal behavior detection) Document DB (MongoDB) Graph DB (Neo4j), graph processing processing (Apache Spark) Machine learning (Apache Spark MLlib) Tight integration of I/O and compute FPGA Four 10GbE FPGA Massive parallelism Networked GPU cluster GPUs Switch Host 4
5 Acceleration for data store Input stream data Message queuing Stream Inference Database layer (Polyglot persistence) Realtime DB queries Big data (Surveillance, Network service, SNS, UAV, IoT) I/O intensive Data exchange (serialization) Learning Customer analysis Topic prediction chain records Geolocation query Compute intensive Message queuing middleware (Apache Kafka) Serialization (Apache Thrift) Stream processing (Apache Spark Streaming) KVS / Column DB (Redis, HBase) Online machine learning (Classification, Outlier detection, Change point detection, Abnormal behavior detection) Document DB (MongoDB) Graph DB (Neo4j), graph processing processing (Apache Spark) Machine learning (Apache Spark MLlib) Tight integration of I/O and compute FPGA Four 10GbE FPGA Massive parallelism Networked GPU cluster GPUs Switch Host 5
6 Multilevel NOSQL cache: FPGA NIC [IEEE HoTI 16] Normal DB Access Result User Space In-Kernel KVS Cache (L2$) Hit Request NIC Cache (L1$) Hit Reply Add NetFPGA-10G Device Driver In-Kernel KVS (Key-Value Store) Cache 512GB NetFPGA-10G NIC HW Cache Machine Learning DRAM 288MB 10GbE x4 6
7 Multilevel NOSQL cache: FPGA NIC Multilevel NOSQL cache: User Space L1 NOSQL cache FPGA-based hardware cache L2 NOSQL cache In-kernel software cache Normal DB Access Result Tradeoffs between capacity and speed: L1 In-Kernel NOSQL cache KVS Very fast/efficient but small L2 Cache NOSQL (L2$) cache Hit Fast and large Design space exploration [IEEE HoTI 16] Request NIC Cache (L1$) Hit Reply Add NetFPGA-10G Device Driver In-Kernel KVS (Key-Value Store) Cache 512GB NetFPGA-10G NIC HW Cache Machine Learning DRAM 288MB 10GbE x4 7
8 FPGA NIC Cache for chain chain A chain of blocks each contains transactions verified and shared by all the parties 1. Bob wants to send money to Alice Bob 2. TX(Bob Alice) is represented as a block 3. The block is broadcasted to all the nodes 4. After verified, the block is added to the blockchain 8
9 FPGA NIC Cache for chain IoT devices (SPV nodes) Cannot maintain whole the blockchain data (>100GB) due to resource limitation Simple payment verification Ask full node to check whether a transaction of interest has been completed or not 1. Alice wants to verify TX(Bob Alice) Full node Alice 2. Alice contacts a full node to verify it 9
10 FPGA NIC Cache for chain IoT devices (SPV nodes) The Cannot number maintain of IoT devices whole that the blockchain join blockchain data will (>100GB) increase due to resource limitation To reduce full node accesses from SPV nodes, FPGA Ask NIC full KVS node is to used check as cache whether of a blockchain transaction [HEART 17] of interest has been completed or not Simple payment verification 1. Alice wants to verify TX(Bob Alice) Full node Alice 2. Alice contacts a full node to verify it FPGA NIC Cache 10
11 Acceleration for data processing Input stream data Message queuing Stream Inference Database layer (Polyglot persistence) Realtime DB queries Big data (Surveillance, Network service, SNS, UAV, IoT) I/O intensive Data exchange (serialization) Learning Customer analysis Topic prediction chain records Geolocation query Compute intensive Message queuing middleware (Apache Kafka) Serialization (Apache Thrift) Stream processing (Apache Spark Streaming) KVS / Column DB (Redis, HBase) Online machine learning (Classification, Outlier detection, Change point detection, Abnormal behavior detection) Document DB (MongoDB) Graph DB (Neo4j), graph processing processing (Apache Spark) Machine learning (Apache Spark MLlib) Tight integration of I/O and compute FPGA Four 10GbE FPGA Massive parallelism Networked GPU cluster GPUs Switch Host 11
12 Data processing w/ Spark+GPU User Space Array Cache (RDDs) Reduction & transformation of RDDs offloaded to GPU [HEART 16] RDD 1 Array Cache (Converted from Spark RDDs) RDD reduction (count, max, reduce, ) Value RDD 1 RDD 2 RDD 3 RDD 2 RDD-to-RDD transformation (sort, filter, map, ) RDD 3 Data are stored in RDDs (i.e., distributed shared memory) RDDs are converted to array structure & transferred to GPU 12
13 10G+10G Data processing w/ Spark+GPUs Many GPUs are directly connected to Apache Spark server via NEC ExpEther (20Gbps) GPUs 0G+10G 10/40GbE Switch PCIe card inserted in the server
14 Data processing w/ Spark+GPUs User Space Array Cache (RDDs) Reduction & transformation of RDDs offloaded to GPUs RDD 4 RDD 1 RDD 2 RDD 3 Array Cache (Converted from Spark RDDs) RDD 1 10GbE Switch RDD 3 RDD 6 RDD 2 RDD 5 [ICPADS 16] 14
15 Acceleration for data processing Input stream data Message queuing Stream Inference Database layer (Polyglot persistence) Realtime DB queries Big data (Surveillance, Network service, SNS, UAV, IoT) I/O intensive Data exchange (serialization) Learning Customer analysis Topic prediction chain records Geolocation query Compute intensive Message queuing middleware (Apache Kafka) Serialization (Apache Thrift) Stream processing (Apache Spark Streaming) KVS / Column DB (Redis, HBase) Online machine learning (Classification, Outlier detection, Change point detection, Abnormal behavior detection) Document DB (MongoDB) Graph DB (Neo4j), graph processing processing (Apache Spark) Machine learning (Apache Spark MLlib) Tight integration of I/O and compute FPGA Four 10GbE FPGA Massive parallelism Networked GPU cluster GPUs Switch Host 15
16 10Gbps outlier filtering: FPGA NIC Sensor Data Explosion User Space Outlier NetFPGA-10G Device Driver Only anomalyvalued packets are received Huge Size Small Size Machine learning algorithms Mahalanobis Distance Local Outlier Factor(LOF) K-Nearest Neighbor(KNN) NetFPGA-10G Data Mining 10GbE x4 16
17 10Gbps outlier filtering: FPGA NIC Density-based Sensor Data Explosion approach to User find Space outliers (e.g., higher LOF value when k neighbors are distant) All reference data needed for density computation Frequently-accessed reference data clusters are cached in FPGA NIC [PDP 17] Outlier NetFPGA-10G Device Driver Only anomalyvalued packets are received Huge Size Small Size Machine learning algorithms Mahalanobis Distance Local Outlier Factor(LOF) K-Nearest Neighbor(KNN) NetFPGA-10G Data Mining 10GbE x4 17
18 Spark Streaming: FPGA NIC Stream processing One-at-a-time style User Space Micro-batch Micro-batch style batch batch Spark Streaming Micro-batch style for compatibility w/ Spark Large latency (e.g, 1sec) NetFPGA-10G Device Driver Results of one-at-atime processing (e.g., filter) received NetFPGA-10G One-at-a-time Stream processing components which can be executed as one-at-a-time style are offloaded to FPGA NIC [IEEE BigData WS 16] 10GbE x4 18
19 Summary Input stream data Message queuing Stream Inference Database layer (Polyglot persistence) Realtime DB queries Big data (Surveillance, Network service, SNS, UAV, IoT) I/O intensive Data exchange (serialization) Learning Customer analysis Topic prediction chain records Geolocation query Compute intensive Message queuing middleware (Apache Kafka) Serialization (Apache Thrift) Stream processing (Apache Spark Streaming) KVS / Column DB (Redis, HBase) Online machine learning (Classification, Outlier detection, Change point detection, Abnormal behavior detection) Document DB (MongoDB) Graph DB (Neo4j), graph processing processing (Apache Spark) Machine learning (Apache Spark MLlib) Tight integration of I/O and compute FPGA Four 10GbE FPGA Massive parallelism Networked GPU cluster GPUs Switch Host 19
20 References (1/2) FPGA-based KVS accelerator Yuta Tokusashi, et.al., "A Multilevel NOSQL Cache Design Combining In-NIC and In- Kernel Caches", IEEE Hot Interconnects FPGA-based chain cache Yuma Sakakibara, et.al., "An FPGA NIC Based Hardware Caching for chain", HEART FPGA-based Spark Streaming accelerator Kohei Nakamura, et.al., "An FPGA-Based Low-Latency Network Processing for Spark Streaming", IEEE BigData Workshops
21 21 References (2/2) FPGA-based machine learning accelerator Ami Hayashi, et.al., "An FPGA-Based In-NIC Cache Approach for Lazy Learning Outlier Filtering", PDP Ami Hayashi, et.al., "A Line Rate Outlier Filtering FPGA NIC using 10GbE Interface", ACM SIGARCH CAN (2015). GPU-based acceleration of Apache Spark Yasuhiro Ohno, et.al., "Accelerating Spark RDD Operations with Local and Remote GPU Devices", ICPADS 2016.
22 Thank you! hank you for listening Acknowledgement: This work was supported by MIC SCOPE, NEDO IoT, JSPS Kaken B, and SECOM 22
An FPGA NIC Based Hardware Caching for Blockchain
An FPGA NIC Based Hardware Caching for Blockchain Yuma Sakakibara, Kohei Nakamura, and Hiroki Matsutani Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan {yuma,nakamura,matutani}@arc.ics.keio.ac.jp
More informationAn In-Kernel NOSQL Cache for Range Queries Using FPGA NIC
An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC Korechika Tamura Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan Email: tamura@arc.ics.keio.ac.jp Hiroki Matsutani Dept.
More informationAccelerating Spark RDD Operations with Local and Remote GPU Devices
Accelerating Spark RDD Operations with Local and Remote GPU Devices Yasuhiro Ohno, Shin Morishima, and Hiroki Matsutani Dept.ofICS,KeioUniversity, 3-14-1 Hiyoshi, Kohoku, Yokohama, Japan 223-8522 Email:
More informationAccelerating Blockchain Transfer System Using FPGA-Based NIC
Accelerating Blockchain Transfer System Using FPGA-Based NIC Yuma Sakakibara, Yuta Tokusashi, Shin Morishima, and Hiroki Matsutani Dept. of Information and Computer Science, Keio University 3-14-1 Hiyoshi,
More informationAn FPGA-Based In-NIC Cache Approach for Lazy Learning Outlier Filtering
An FPGA-Based In-NIC Cache Approach for Lazy Learning Outlier Filtering Ami Hayashi Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan Email: hayashi@arc.ics.keio.ac.jp Hiroki Matsutani
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationHiroki Matsutani Dept. of ICS, Keio University, Hiyoshi, Kohoku-ku, Yokohama, Japan
Design and Implementation of Hardware Cache Mechanism and NIC for Column-Oriented Databases Akihiko Hamada Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan Email: hamada@arc.ics.keio.ac.jp
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More information8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara
Week 1-B-0 Week 1-B-1 CS535 BIG DATA FAQs Slides are available on the course web Wait list Term project topics PART 0. INTRODUCTION 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer
More informationCreating a Recommender System. An Elasticsearch & Apache Spark approach
Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused
More information10 Million Smart Meter Data with Apache HBase
10 Million Smart Meter Data with Apache HBase 5/31/2017 OSS Solution Center Hitachi, Ltd. Masahiro Ito OSS Summit Japan 2017 Who am I? Masahiro Ito ( 伊藤雅博 ) Software Engineer at Hitachi, Ltd. Focus on
More informationDistributed In-GPU Data Cache for Document-Oriented Data Store via PCIe over 10Gbit Ethernet
Distributed In-GPU Data Cache for Document-Oriented Data Store via PCIe over 10Gbit Ethernet Shin Morishima 1 and Hiroki Matsutani 1,2,3 1 Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan 223-8522
More informationAccelerating Blockchain Search of Full Nodes Using GPUs
Accelerating Blockchain Search of Full Nodes Using GPUs Shin Morishima Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama, Japan Email: morisima@arc.ics.keio.ac.jp Abstract Blockchain is a
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationBig Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture
Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem
More informationData Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens
Data Science and Open Source Software Iraklis Varlamis Assistant Professor Harokopio University of Athens varlamis@hua.gr What is data science? 2 Why data science is important? More data (volume, variety,...)
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationBIG DATA COURSE CONTENT
BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data
More informationHADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)
HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationEnd-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet
Hot Interconnects 2014 End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet Green Platform Research Laboratories, NEC, Japan J. Suzuki, Y. Hayashi, M. Kan, S. Miyakawa,
More informationAbout Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals
More informationHadoop. Introduction to BIGDATA and HADOOP
Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationMaximizing heterogeneous system performance with ARM interconnect and CCIX
Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationHigh-Level Data Models on RAMCloud
High-Level Data Models on RAMCloud An early status report Jonathan Ellithorpe, Mendel Rosenblum EE & CS Departments, Stanford University Talk Outline The Idea Data models today Graph databases Experience
More informationExam Questions
Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationBigchainDB: A Scalable Blockchain Database. Trent McConaghy
BigchainDB: A Scalable Blockchain Database Trent McConaghy Communications The Elements of Computing Processing Storage Communications Modern Application Stacks Applications Processing File System: Hierarchy
More informationHDInsight > Hadoop. October 12, 2017
HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond
More informationCaribou: Intelligent Distributed Storage
: Intelligent Distributed Storage Zsolt István, David Sidler, Gustavo Alonso Systems Group, Department of Computer Science, ETH Zurich 1 Rack-scale thinking In the Cloud ToR Switch Compute Compute + Provisioning
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationBig Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data
Big Data Big Streaming Data Big Streaming Data Processing Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets How to Process Big Streaming Data Raw Data Streams Distributed
More informationCOMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING
Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha
More informationBig Data on AWS. Peter-Mark Verwoerd Solutions Architect
Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing
More informationHighly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture
A Cost Effective,, High g Performance,, Highly Scalable, Non-RDMA NVMe Fabric Bob Hansen,, VP System Architecture bob@apeirondata.com Storage Developers Conference, September 2015 Agenda 3 rd Platform
More informationAltera SDK for OpenCL
Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group
More information3D WiNoC Architectures
Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",
More informationDesigning Hybrid Data Processing Systems for Heterogeneous Servers
Designing Hybrid Data Processing Systems for Heterogeneous Servers Peter Pietzuch Large-Scale Distributed Systems (LSDS) Group Imperial College London http://lsds.doc.ic.ac.uk University
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationEXTRACT DATA IN LARGE DATABASE WITH HADOOP
International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationIntegration of Machine Learning Library in Apache Apex
Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial
More informationNewly invented and fully owned by Turbo Data Laboratories, Inc. (TDL)
Newly invented and fully owned by Turbo Data Laboratories, Inc. (TDL) 28, July, 2017 Executive Summary Universal & Designless, yet Far Faster than Legacy Technologies Big Data Technology has to do with
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationAgenda. Spark Platform Spark Core Spark Extensions Using Apache Spark
Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing
More informationPacketShader: A GPU-Accelerated Software Router
PacketShader: A GPU-Accelerated Software Router Sangjin Han In collaboration with: Keon Jang, KyoungSoo Park, Sue Moon Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab,
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationVisual Analytics Sandbox: A big data platform for processing network traffic
Visual Analytics Sandbox: A big data platform for processing network traffic Raju Gottumukkala, Ph.D. Director of Research, Informatics Research Institute Site Director, NSF Center for Visual and Decision
More informationA Building Block 3D System with Inductive-Coupling Through Chip Interfaces Hiroki Matsutani Keio University, Japan
A Building Block 3D System with Inductive-Coupling Through Chip Interfaces Hiroki Matsutani Keio University, Japan 1 Outline: 3D Wireless NoC Designs This part also explores 3D NoC architecture with inductive-coupling
More informationCERN openlab & IBM Research Workshop Trip Report
CERN openlab & IBM Research Workshop Trip Report Jakob Blomer, Javier Cervantes, Pere Mato, Radu Popescu 2018-12-03 Workshop Organization 1 full day at IBM Research Zürich ~25 participants from CERN ~10
More informationSpecialist ICT Learning
Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO Albis Pocket
More informationGetting to know. by Michelle Darling August 2013
Getting to know by Michelle Darling mdarlingcmt@gmail.com August 2013 Agenda: What is Cassandra? Installation, CQL3 Data Modelling Summary Only 15 min to cover these, so please hold questions til the end,
More information40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011
40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011 Allan Cantle President & Founder www.nallatech.com Company Overview ISI + Nallatech + Innovative
More informationScalability of web applications
Scalability of web applications CSCI 470: Web Science Keith Vertanen Copyright 2014 Scalability questions Overview What's important in order to build scalable web sites? High availability vs. load balancing
More informationLambda Architecture with Apache Spark
Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR
More informationTwitter data Analytics using Distributed Computing
Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE
More informationThe SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.
Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 HOTSPOT You install the Microsoft Hive ODBC Driver on a computer that runs Windows
More informationReal-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b
4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b 1
More informationBig data streaming: Choices for high availability and disaster recovery on Microsoft Azure. By Arnab Ganguly DataCAT
: Choices for high availability and disaster recovery on Microsoft Azure By Arnab Ganguly DataCAT March 2019 Contents Overview... 3 The challenge of a single-region architecture... 3 Configuration considerations...
More informationCompSci 516: Database Systems
CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and
More informationHow to Use NoSQL in Enterprise Java Applications
How to Use NoSQL in Enterprise Java Applications Patrick Baumgartner NoSQL Roadshow Basel 30.08.2012 Agenda Speaker Profile New Demands on Data Access New Types of Data Stores Integrating NoSQL Data Stores
More informationAnnouncements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems
Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic
More informationLightweight KV-based Distributed Store for Datacenters
Lightweight KV-based Distributed Store for Datacenters Chanwoo Chung, Jinhyung Koo*, Arvind, and Sungjin Lee Massachusetts Institute of Technology (MIT) Daegu Gyeongbuk Institute of Science & Technology
More informationAccelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads
WHITE PAPER Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads December 2014 Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents
More informationBlueDBM: An Appliance for Big Data Analytics*
BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationThe NoSQL Ecosystem. Adam Marcus MIT CSAIL
The NoSQL Ecosystem Adam Marcus MIT CSAIL marcua@csail.mit.edu / @marcua About Me Social Computing + Database Systems Easily Distracted: Wrote The NoSQL Ecosystem in The Architecture of Open Source Applications
More informationThe Hardware & Software Implications of Microservices and How Big Data Can Help
The Hardware & Software Implications of Microservices and How Big Data Can Help Christina Delimitrou Cornell University with Yu Gan, Yanqi Zhang, Shuang Chen, Neeraj Kulkarni, Ariana Bruno, Justin Hu,
More informationFast packet processing in the cloud. Dániel Géhberger Ericsson Research
Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration
More informationIMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES
IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES by PRAMOD KUMAR GUDIPATI B.E., OSMANIA UNIVERSITY (OU), INDIA, 2012 A REPORT submitted in partial fulfillment of the requirements of the
More informationThe Evolution of Big Data Platforms and Data Science
IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering
More informationOptimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink
Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline
More informationE6885 Network Science Lecture 10: Graph Database (II)
E 6885 Topics in Signal Processing -- Network Science E6885 Network Science Lecture 10: Graph Database (II) Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University November 18th, 2013 Course
More informationICALEPS 2013 Exploring No-SQL Alternatives for ALMA Monitoring System ADC
ICALEPS 2013 Exploring No-SQL Alternatives for ALMA Monitoring System Overview The current paradigm (CCL and Relational DataBase) Propose of a new monitor data system using NoSQL Monitoring Storage Requirements
More informationChapter 1, Introduction
CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from
More informationManaging IoT and Time Series Data with Amazon ElastiCache for Redis
Managing IoT and Time Series Data with ElastiCache for Redis Darin Briskman, ElastiCache Developer Outreach Michael Labib, Specialist Solutions Architect 2016, Web Services, Inc. or its Affiliates. All
More informationAnalytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation
Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationexam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0
70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to
More information4 Myths about in-memory databases busted
4 Myths about in-memory databases busted Yiftach Shoolman Co-Founder & CTO @ Redis Labs @yiftachsh, @redislabsinc Background - Redis Created by Salvatore Sanfilippo (@antirez) OSS, in-memory NoSQL k/v
More informationDatabase Solution in Cloud Computing
Database Solution in Cloud Computing CERC liji@cnic.cn Outline Cloud Computing Database Solution Our Experiences in Database Cloud Computing SaaS Software as a Service PaaS Platform as a Service IaaS Infrastructure
More informationFacilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit
Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM Join the Conversation #OpenPOWERSummit Moral of the Story OpenPOWER is the best platform to
More informationUsing the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver
Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data
More informationDistributed systems for stream processing
Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine
More informationExploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center
Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating
More informationIn-Memory Data processing using Redis Database
In-Memory Data processing using Redis Database Gurpreet Kaur Spal Department of Computer Science and Engineering Baba Banda Singh Bahadur Engineering College, Fatehgarh Sahib, Punjab, India Jatinder Kaur
More informationOracle Exadata: Strategy and Roadmap
Oracle Exadata: Strategy and Roadmap - New Technologies, Cloud, and On-Premises Juan Loaiza Senior Vice President, Database Systems Technologies, Oracle Safe Harbor Statement The following is intended
More informationP51: High Performance Networking
P51: High Performance Networking Lecture 6: Programmable network devices Dr Noa Zilberman noa.zilberman@cl.cam.ac.uk Lent 2017/18 High Throughput Interfaces Performance Limitations So far we discussed
More information