Data Acquisition. The reference Big Data stack
Transcription
Università degli Studi di Roma Tor Vergata
Dipartimento di Ingegneria Civile e Ingegneria Informatica

Data Acquisition
Corso di Sistemi e Architetture per Big Data (Systems and Architectures for Big Data), academic year 2016/17
Valeria Cardellini

The reference Big Data stack
- High-level Interfaces
- Data Processing
- Data Storage
- Resource Management
- Support / Integration

Valeria Cardellini - SABD 2016/17

Data acquisition
- How to collect data from various data sources into the storage layer?
  - Distributed file system, NoSQL database for batch analysis
- How to connect data sources to streams or in-memory processing frameworks?
  - Data stream processing frameworks for real-time analysis

Driving factors
- Source type
  - Batch data sources: files, logs, RDBMS, ...
  - Real-time data sources: sensors, IoT systems, social media feeds, stock market feeds, ...
- Velocity
  - How fast is data generated?
  - How frequently does data vary?
  - Real-time or streaming data require low latency and low overhead
- Ingestion mechanism
  - Depends on data consumers
  - Pull: pub/sub, message queue
  - Push: framework pushes data to sinks

Architecture choices
- Message queue system (MQS)
  - ActiveMQ
  - RabbitMQ
  - Amazon SQS
- Publish-subscribe system (pub-sub)
  - Kafka
  - Pulsar by Yahoo!
  - Redis
  - NATS

Initial use case
- Mainly used in data processing pipelines for data ingestion or aggregation
- Envisioned mainly to be used at the beginning or end of a data processing pipeline
- Example
  - Incoming data from various sensors
  - Ingest this data into a streaming system for real-time analytics or into a distributed file system for batch analytics

Queue message pattern
- Allows for persistent asynchronous communication
- How can a service and its consumers accommodate isolated failures and avoid unnecessarily locking resources?
- Principles
  - Loose coupling
  - Service statelessness
    - Services minimize resource consumption by deferring the management of state information when necessary

Queue message pattern
- A sends a message to B
- B issues a response message back to A

Message queue API
- Basic interface to a queue in an MQS:
  - put: nonblocking send
    - Append a message to a specified queue
  - get: blocking receive
    - Block until the specified queue is nonempty and remove the first message
    - Variations: allow searching for a specific message in the queue, e.g., using a matching pattern
  - poll: nonblocking receive
    - Check a specified queue for messages and remove the first
    - Never blocks
  - notify: nonblocking receive
    - Install a handler (callback function) to be called automatically when a message is put into the specified queue
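
The four calls above can be illustrated with a minimal in-process sketch. It is not the API of any real MQS: the class name `MessageQueue` is hypothetical, the queue lives in one Python process, and the sketch assumes that a handler installed with `notify` consumes each later message instead of leaving it in the queue.

```python
import queue


class MessageQueue:
    """Toy in-process queue exposing put / get / poll / notify."""

    def __init__(self):
        self._messages = queue.Queue()   # unbounded FIFO
        self._handler = None             # optional callback set by notify()

    def put(self, message):
        """Nonblocking send: append the message, or hand it to the handler."""
        if self._handler is not None:
            self._handler(message)       # assumption: the handler consumes the message
        else:
            self._messages.put_nowait(message)

    def get(self, timeout=None):
        """Blocking receive: wait until nonempty, then remove the first message."""
        return self._messages.get(block=True, timeout=timeout)

    def poll(self):
        """Nonblocking receive: remove the first message, or return None if empty."""
        try:
            return self._messages.get_nowait()
        except queue.Empty:
            return None

    def notify(self, handler):
        """Install a callback that is called for each later put()."""
        self._handler = handler


mq = MessageQueue()
mq.put("m1")
mq.put("m2")
first = mq.get()     # queue is nonempty, so this returns at once
second = mq.poll()
third = mq.poll()    # queue is empty now; poll never blocks

delivered = []
mq.notify(delivered.append)
mq.put("m3")         # goes straight to the handler
```

Here `get` would block on an empty queue, while `poll` returns `None` immediately; that difference is the whole distinction between the blocking and nonblocking receive.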

Publish/subscribe pattern
- A sibling of the message queue pattern that generalizes it by delivering a message to multiple consumers
  - Message queue: delivers messages to only one receiver, i.e., one-to-one communication
  - Pub/sub channel: delivers messages to multiple receivers, i.e., one-to-many communication
- Some frameworks (e.g., RabbitMQ, Kafka, NATS) support both patterns

Pub/sub API
- Calls that capture the core of any pub/sub system:
  - publish(event): to publish an event
    - Events can be of any data type supported by the given implementation language and may also contain metadata
  - subscribe(filter_expr, notify_cb, expiry) -> sub_handle: to subscribe to an event
    - Takes a filter expression, a reference to a notify callback for event delivery, and an expiry time for the subscription registration
    - Returns a subscription handle
  - unsubscribe(sub_handle)
  - notify_cb(sub_handle, event): called by the pub/sub system to deliver a matching event
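
A minimal sketch of these four calls, under stated assumptions: the class name `PubSub` is hypothetical, the filter expression is modeled as a plain Python predicate, and `expiry` is taken to be a lifetime in seconds. Real systems define their own filter languages and expiry semantics.

```python
import itertools
import time


class PubSub:
    """Toy in-process pub/sub broker."""

    def __init__(self):
        self._subs = {}                       # sub_handle -> (filter, callback, deadline)
        self._handles = itertools.count(1)    # source of fresh handles

    def subscribe(self, filter_expr, notify_cb, expiry):
        """Register a subscription that lives for `expiry` seconds; return its handle."""
        handle = next(self._handles)
        self._subs[handle] = (filter_expr, notify_cb, time.monotonic() + expiry)
        return handle

    def unsubscribe(self, sub_handle):
        """Drop a subscription; unknown handles are ignored."""
        self._subs.pop(sub_handle, None)

    def publish(self, event):
        """Deliver the event to every live subscription whose filter matches."""
        now = time.monotonic()
        for handle, (filter_expr, notify_cb, deadline) in list(self._subs.items()):
            if now >= deadline:
                del self._subs[handle]        # subscription has expired
            elif filter_expr(event):
                notify_cb(handle, event)


bus = PubSub()
temps, everything = [], []
h_temp = bus.subscribe(lambda e: e["type"] == "temp",
                       lambda h, e: temps.append(e["value"]), expiry=60)
h_all = bus.subscribe(lambda e: True,
                      lambda h, e: everything.append(e["value"]), expiry=60)

bus.publish({"type": "temp", "value": 21})   # matches both subscriptions
bus.unsubscribe(h_temp)
bus.publish({"type": "temp", "value": 22})   # only the catch-all subscription remains
```

The first event reaches both subscribers, which is the one-to-many delivery that separates a pub/sub channel from a message queue.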

Apache Kafka
- General-purpose, distributed pub/sub system
  - Can implement either the message queue or the pub/sub pattern
- Originally developed in 2010 by LinkedIn
- Written in Scala
- Horizontally scalable
- Fault-tolerant
- At-least-once delivery
- Source: Kafka: A Distributed Messaging System for Log Processing, 2011

Kafka at a glance
- Kafka maintains feeds of messages in categories called topics
- Producers: publish messages to a Kafka topic
- Consumers: subscribe to topics and process the feed of published messages
- Kafka cluster: a distributed log of data spread over servers known as brokers
  - Brokers rely on Apache ZooKeeper for coordination

Kafka: topics
- Topic: a category to which messages are published
- For each topic, the Kafka cluster maintains a partitioned log
  - Log: append-only, totally-ordered sequence of records ordered by time
- Topics are split into a pre-defined number of partitions
- Each partition is replicated with some replication factor
  - Why?
- CLI command to create a topic
    > bin/kafka-topics.sh --create --zookeeper localhost: --replication-factor 1 --partitions 1 --topic test

Kafka: partitions
- Each partition is an ordered, numbered, immutable sequence of records that is continually appended to
  - Like a commit log
- Each record is associated with a sequence ID number called the offset
- Partitions are distributed across brokers
- Each partition is replicated for fault tolerance

Kafka: partitions
- Each partition is replicated across a configurable number of brokers
- Each partition has one leader broker and 0 or more followers
- The leader handles read and write requests
  - Read from leader
  - Write to leader
- A follower replicates the leader and acts as a backup
- Each broker is a leader for some of its partitions and a follower for others, to balance the load
- ZooKeeper is used to keep the brokers consistent
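
The idea of a partition as a commit log, where the offset is simply a record's position, can be sketched in a few lines. This is a single-process illustration only; the class name `PartitionLog` is hypothetical and no replication is modeled.

```python
class PartitionLog:
    """Append-only sequence of records; a record's offset is its position."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset, as an immutable tuple."""
        return tuple(self._records[offset:offset + max_records])


log = PartitionLog()
offsets = [log.append(r) for r in ("a", "b", "c")]
tail = log.read(1)           # records from offset 1 onward
nothing = log.read(3)        # reading at the end of the log yields no records
```

Because records are never modified or removed by readers, any reader can return to an earlier offset and read the same records again.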

Kafka: producers
- Publish data to topics of their choice
- Also responsible for choosing which record to assign to which partition within the topic
  - Round-robin or partitioned by keys
- Producers = data sources
- Run the producer
    > bin/kafka-console-producer.sh --broker-list localhost: --topic test
    This is a message
    This is another message

Kafka: consumers
- Consumer Group: set of consumers sharing a common group ID
  - A Consumer Group maps to a logical subscriber
  - Each group consists of multiple consumers for scalability and fault tolerance
- Consumers use the offset to track which messages have been consumed
  - Messages can be replayed using the offset
- Run the consumer
    > bin/kafka-console-consumer.sh --bootstrap-server localhost: --topic test --from-beginning
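
The producer-side partition choice and the consumer-side offset tracking can be sketched together. This is a toy model, not the Kafka client: `Topic` and `Consumer` are hypothetical names, and CRC32 stands in for whatever hash function a real client applies to keys.

```python
import itertools
import zlib


class Topic:
    """Toy topic: a fixed number of in-memory partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._round_robin = itertools.cycle(range(num_partitions))

    def send(self, value, key=None):
        """Producer side: pick a partition, append, return (partition, offset)."""
        if key is None:
            partition = next(self._round_robin)            # spread keyless records evenly
        else:
            # stand-in hash: records with the same key always land in the same partition
            partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Toy consumer that tracks its own offset in one partition."""

    def __init__(self, topic, partition):
        self._log = topic.partitions[partition]
        self.offset = 0

    def poll(self):
        """Return all records not yet consumed and advance the offset."""
        records = self._log[self.offset:]
        self.offset += len(records)
        return records

    def seek(self, offset):
        """Move the offset, e.g. back to 0 to replay messages."""
        self.offset = offset


topic = Topic(num_partitions=2)
p1, o1 = topic.send("t=20", key="sensor-1")
p2, o2 = topic.send("t=21", key="sensor-1")   # same key, same partition, next offset

consumer = Consumer(topic, p1)
first_pass = consumer.poll()
consumer.seek(0)                               # rewind
replay = consumer.poll()
```

Keeping the offset on the consumer side is what makes replay cheap: rewinding is just a change of one integer.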

Kafka: ZooKeeper
- Kafka uses ZooKeeper to coordinate between the producers, consumers and brokers
- ZooKeeper stores metadata
  - List of brokers
  - List of consumers and their offsets
  - List of producers
- ZooKeeper runs several algorithms for coordination between consumers and brokers
  - Consumer registration algorithm
  - Consumer rebalancing algorithm
    - Allows all the consumers in a group to come to a consensus on which consumer is consuming which partitions

Kafka design choices
- Push vs. pull model for consumers
- Push model
  - Challenging for the broker to deal with diverse consumers, as it controls the rate at which data is transferred
  - Need to decide whether to send a message immediately or accumulate more data and send
- Pull model
  - If the broker has no data, the consumer may end up busy-waiting for data to arrive
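
One way consumers in a group can agree on who reads which partitions is for each of them to run the same deterministic function over the same sorted lists. The sketch below illustrates that idea with a range-style split; it is an assumption-laden illustration, not Kafka's actual rebalancing code, and `assign_partitions` is a hypothetical name.

```python
def assign_partitions(partitions, consumers):
    """Split sorted partitions into contiguous ranges, one range per sorted consumer."""
    partitions = sorted(partitions)
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        count = base + (1 if index < extra else 0)   # first `extra` consumers get one more
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


two = assign_partitions(range(5), ["c2", "c1"])
three = assign_partitions(range(5), ["c1", "c2", "c3"])   # a consumer joined the group
```

Since the result depends only on the set of partitions and the set of group members, every consumer that sees the same membership computes the same assignment without further message exchange.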

Kafka: ordering guarantees
- Messages sent by a producer to a particular topic partition are appended in the order they are sent
- A consumer instance sees records in the order they are stored in the log
- Strong guarantees about ordering within a partition
  - Total order over messages within a partition, not between different partitions in a topic
- Per-partition ordering combined with the ability to partition data by key is sufficient for most applications

Kafka: fault tolerance
- Replicates partitions for fault tolerance
- Kafka makes a message available for consumption only after all the followers acknowledge a successful write to the leader
  - Implies that a message may not be immediately available for consumption
- Kafka retains messages for a configured period of time
  - Messages can be replayed in the event that a consumer fails
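
The rule that a message becomes visible only after every follower has acknowledged it can be modeled with a few counters. This is a simplified sketch of the rule as stated above, not Kafka's replication protocol; `ReplicatedPartition` and its methods are hypothetical names.

```python
class ReplicatedPartition:
    """Toy leader log that exposes only records acknowledged by all followers."""

    def __init__(self, num_followers):
        self.leader_log = []
        self.follower_acked = [0] * num_followers   # records each follower has confirmed

    def append(self, record):
        """Write to the leader; the record is not yet visible to consumers."""
        self.leader_log.append(record)
        return len(self.leader_log) - 1

    def ack(self, follower, upto):
        """A follower reports that it holds all records with offset < upto."""
        self.follower_acked[follower] = max(self.follower_acked[follower], upto)

    def committed(self):
        """Number of records held by the leader and by every follower."""
        return min([len(self.leader_log)] + self.follower_acked)

    def read(self, offset):
        """Consumers see committed records only."""
        return self.leader_log[offset:self.committed()]


part = ReplicatedPartition(num_followers=2)
part.append("a")
part.append("b")
before_acks = part.read(0)     # nothing acknowledged yet

part.ack(0, 2)                 # follower 0 has both records
part.ack(1, 1)                 # follower 1 has only the first
partly_acked = part.read(0)

part.ack(1, 2)
fully_acked = part.read(0)
```

The slowest follower determines what consumers can read, which is why a freshly written message may not be available immediately.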

Kafka: limitations
- Kafka follows the active-backup pattern, with the notion of a leader partition replica and follower partition replicas
- Kafka only writes to the filesystem page cache
  - Reduced durability
- DistributedLog from Twitter claims to solve these issues

Kafka APIs
- Four core APIs
- Producer API: allows an app to publish streams of records to one or more Kafka topics
- Consumer API: allows an app to subscribe to one or more topics and process the stream of records produced to them
- Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems, so as to move large collections of data into and out of Kafka

Kafka APIs
- Streams API: allows an app to act as a stream processor, transforming an input stream from one or more topics into an output stream to one or more output topics
- Can use Kafka Streams to process data in pipelines consisting of multiple stages

Client library
- JVM internal client
- Plus a rich ecosystem of clients, among which:
  - Sarama: Go library for Kafka
  - Python library for Kafka
  - NodeJS client
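
The notion of a multi-stage pipeline that turns an input stream into an output stream can be illustrated without Kafka at all. The sketch below uses plain Python generators over an in-memory list; it does not use the Kafka Streams API, and the stage names and the record format `sensor,value` are assumptions made for the example.

```python
def parse(lines):
    """Stage 1: turn 'sensor,value' strings into (sensor, float) pairs."""
    for line in lines:
        sensor, value = line.split(",")
        yield sensor, float(value)


def only_above(readings, threshold):
    """Stage 2: keep readings whose value exceeds the threshold."""
    for sensor, value in readings:
        if value > threshold:
            yield sensor, value


def to_alert(readings):
    """Stage 3: format each remaining reading as an alert record."""
    for sensor, value in readings:
        yield f"ALERT {sensor} {value:.1f}"


input_topic = ["s1,20.5", "s2,31.0", "s1,35.5"]   # stands in for an input topic
output_topic = list(to_alert(only_above(parse(input_topic), threshold=30.0)))
```

Each stage consumes the previous stage's output record by record, so the stages compose the same way whether the input is a short list or an unbounded stream.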

LinkedIn

Netflix
- Netflix uses Kafka for data collection and buffering so that the data can be used by downstream systems

Uber
- Uber uses Kafka for real-time business-driven decisions

Rome Tor CINI Smart City Challenge 17
- By M. Adriani, D. Magnanimi, M. Ponza, F. Rossi

Realtime data Facebook
- Data originating in mobile and web products is fed into Scribe, a distributed data transport system
- The realtime stream processing systems write to Scribe
- Source: Realtime Data Processing at Facebook, SIGMOD

Scribe
- Transport mechanism for sending data to batch and real-time systems at Facebook
- Persistent, distributed messaging system for collecting, aggregating and delivering high volumes of log data with a few seconds of latency and high throughput
- Data is organized by category
  - Category = distinct stream of data
  - All data is written to or read from a specific category
  - Multiple buckets per Scribe category
  - Scribe bucket = basic processing unit for stream processing systems
- Scribe provides data durability by storing data in HDFS
  - Scribe messages are stored, and streams can be replayed by the same or different receivers for up to a few days

Messaging queues
- Can be used for push-pull messaging
  - Producers push data to the queue
  - Consumers pull data from the queue
- Message queue systems based on protocols:
  - RabbitMQ
    - Implements AMQP and relies on a broker-based architecture
  - ZeroMQ
    - High-throughput and lightweight messaging library
    - No persistence
  - Amazon SQS

Data collection systems
- Allow collecting, aggregating and moving data
- From various sources (server logs, social media, streaming sensor data, ...)
- To a data store (distributed file system, NoSQL data store)

Apache Flume
- Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- Robust and fault tolerant, with tunable reliability mechanisms and failover and recovery mechanisms
- Suitable for online analytics

Flume architecture

Flume data flows
- Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination
- Supports multiplexing the event flow to one or more destinations
- Multiple built-in sources and sinks (e.g., Avro)

Flume reliability
- Events are staged in a channel on each agent
- Events are then delivered to the next agent or to the terminal repository (e.g., HDFS) in the flow
- Events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository
- Transactional approach to guarantee the reliable delivery of events
  - Sources and sinks wrap the storage and retrieval of events in a transaction provided by the channel
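
The hand-off rule above (remove an event from a channel only after the next hop has stored it) can be sketched as follows. This is an illustration of the rule, not Flume's transaction API; `Channel`, `forward` and `put_next` are hypothetical names, and the channels are plain in-memory lists.

```python
class Channel:
    """Toy channel: an in-memory list of staged events."""

    def __init__(self, events=()):
        self.events = list(events)


def forward(source_channel, put_next, batch_size=2):
    """Deliver one batch; delete it from the source only if put_next succeeded."""
    batch = source_channel.events[:batch_size]    # take a copy, do not remove yet
    if not batch:
        return 0
    put_next(batch)                               # if this raises, nothing is removed
    del source_channel.events[:len(batch)]        # commit: next hop has the events
    return len(batch)


def failing_store(batch):
    """Stands in for a next hop that is currently unavailable."""
    raise OSError("next hop unavailable")


agent_a = Channel(["e1", "e2", "e3"])
agent_b = Channel()

moved = forward(agent_a, agent_b.events.extend)   # e1, e2 reach agent_b

try:
    forward(agent_a, failing_store)               # delivery fails
except OSError:
    pass
still_staged = list(agent_a.events)               # e3 was not lost
```

A failed delivery leaves the batch staged in the source channel, so a later retry can send it again.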

Apache Sqoop
- Efficient tool to import bulk data from structured data stores such as RDBMS into Hadoop HDFS, HBase or Hive
- Also to export data from HDFS to RDBMS

Amazon IoT
- Cloud service for collecting data from IoT devices into the AWS cloud

References
- Kreps et al., Kafka: a Distributed Messaging System for Log Processing, NetDB
- Apache Kafka documentation
- Chen et al., Realtime Data Processing at Facebook, SIGMOD
- Apache Flume documentation
- S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop - Second Edition
Managing IoT and Time Series Data with ElastiCache for Redis Darin Briskman, ElastiCache Developer Outreach Michael Labib, Specialist Solutions Architect 2016, Web Services, Inc. or its Affiliates. All
More informationThe Stream Processor as a Database. Ufuk
The Stream Processor as a Database Ufuk Celebi @iamuce Realtime Counts and Aggregates The (Classic) Use Case 2 (Real-)Time Series Statistics Stream of Events Real-time Statistics 3 The Architecture collect
More informationDistributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang
A lightweight, pluggable, and scalable ingestion service for real-time data ABSTRACT This paper provides the motivation, implementation details, and evaluation of a lightweight distributed extract-transform-load
More information<Insert Picture Here> QCon: London 2009 Data Grid Design Patterns
QCon: London 2009 Data Grid Design Patterns Brian Oliver Global Solutions Architect brian.oliver@oracle.com Oracle Coherence Oracle Fusion Middleware Product Management Agenda Traditional
More informationrkafka rkafka is a package created to expose functionalities provided by Apache Kafka in the R layer. Version 1.1
rkafka rkafka is a package created to expose functionalities provided by Apache Kafka in the R layer. Version 1.1 Wednesday 28 th June, 2017 rkafka Shruti Gupta Wednesday 28 th June, 2017 Contents 1 Introduction
More informationBig Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka
Course Curriculum: Your 10 Module Learning Plan Big Data and Hadoop About Edureka Edureka is a leading e-learning platform providing live instructor-led interactive online training. We cater to professionals
More informationIntroduction to Data Intensive Computing
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Introduction to Data Intensive Computing Corso di Sistemi Distribuiti e Cloud Computing A.A. 2017/18
More informationHadoop Ecosystem. Why an ecosystem
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Hadoop Ecosystem Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini Why an
More informationStreaming Log Analytics with Kafka
Streaming Log Analytics with Kafka Kresten Krab Thorup, Humio CTO Log Everything, Answer Anything, In Real-Time. Why this talk? Humio is a Log Analytics system Designed to run on-prem High volume, real
More informationApache Kafka a system optimized for writing. Bernhard Hopfenmüller. 23. Oktober 2018
Apache Kafka...... a system optimized for writing Bernhard Hopfenmüller 23. Oktober 2018 whoami Bernhard Hopfenmüller IT Consultant @ ATIX AG IRC: Fobhep github.com/fobhep whoarewe The Linux & Open Source
More informationMicroservices, Messaging and Science Gateways. Review microservices for science gateways and then discuss messaging systems.
Microservices, Messaging and Science Gateways Review microservices for science gateways and then discuss messaging systems. Micro- Services Distributed Systems DevOps The Gateway Octopus Diagram Browser
More informationDiving into Open Source Messaging: What Is Kafka?
Diving into Open Source Messaging: What Is Kafka? The world of messaging middleware has changed dramatically over the last 30 years. But in truth the world of communication has changed dramatically as
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationiway iway Big Data Integrator New Features Bulletin and Release Notes Version DN
iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,
More informationStreaming Integration and Intelligence For Automating Time Sensitive Events
Streaming Integration and Intelligence For Automating Time Sensitive Events Ted Fish Director Sales, Midwest ted@striim.com 312-330-4929 Striim Executive Summary Delivering Data for Time Sensitive Processes
More informationBig data streaming: Choices for high availability and disaster recovery on Microsoft Azure. By Arnab Ganguly DataCAT
: Choices for high availability and disaster recovery on Microsoft Azure By Arnab Ganguly DataCAT March 2019 Contents Overview... 3 The challenge of a single-region architecture... 3 Configuration considerations...
More informationEvolution of an Apache Spark Architecture for Processing Game Data
Evolution of an Apache Spark Architecture for Processing Game Data Nick Afshartous WB Analytics Platform May 17 th 2017 May 17 th, 2017 About Me nafshartous@wbgames.com WB Analytics Core Platform Lead
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationEsper EQC. Horizontal Scale-Out for Complex Event Processing
Esper EQC Horizontal Scale-Out for Complex Event Processing Esper EQC - Introduction Esper query container (EQC) is the horizontal scale-out architecture for Complex Event Processing with Esper and EsperHA
More informationBuilding LinkedIn s Real-time Data Pipeline. Jay Kreps
Building LinkedIn s Real-time Data Pipeline Jay Kreps What is a data pipeline? What data is there? Database data Activity data Page Views, Ad Impressions, etc Messaging JMS, AMQP, etc Application and
More informationSecurity and Performance advances with Oracle Big Data SQL
Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,
More informationIMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES
IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES by PRAMOD KUMAR GUDIPATI B.E., OSMANIA UNIVERSITY (OU), INDIA, 2012 A REPORT submitted in partial fulfillment of the requirements of the
More informationIEMS 5780 / IERG 4080 Building and Deploying Scalable Machine Learning Services
IEMS 5780 / IERG 4080 Building and Deploying Scalable Machine Learning Services Lecture 11 - Asynchronous Tasks and Message Queues Albert Au Yeung 22nd November, 2018 1 / 53 Asynchronous Tasks 2 / 53 Client
More informationPNUTS: Yahoo! s Hosted Data Serving Platform. Reading Review by: Alex Degtiar (adegtiar) /30/2013
PNUTS: Yahoo! s Hosted Data Serving Platform Reading Review by: Alex Degtiar (adegtiar) 15-799 9/30/2013 What is PNUTS? Yahoo s NoSQL database Motivated by web applications Massively parallel Geographically
More informationOracle NoSQL Database Enterprise Edition, Version 18.1
Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationHDInsight > Hadoop. October 12, 2017
HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond
More information