14th Iran Media Technology Conference. by H. Shah-Hosseini. 12 Dec Gathered & presented by H. Shah-Hosseini 1


2 Topics
Big data: big data and its four V's: volume, velocity, variety, and veracity; another two V's for big data: valence and value
Data science: data science and its five P's
Data science process: acquire, prepare, analyze, report, act
More on analysis (data mining): classification, regression, clustering, association analysis (rules), graph analytics
Hadoop: the Hadoop Distributed File System, Hadoop YARN, and Hadoop MapReduce


4 Big data: Volume

5 Big data: Volume (2)
Volume is related to the size and exponential growth of data. Every minute: 204 million emails are sent; on Facebook, 200,000 photos are uploaded and 1.8 million likes are given; on YouTube, there are 1.3 million video views and 72 hours of video uploads.
Challenges: storage, access, and processing

6 Big data: Velocity

7 Big data: Velocity (2)
Velocity: the speed at which data are created, and the need for speed in storing and analyzing data.
For big data, we need real-time action: late decisions lead to missed opportunities, losing customers visiting your online store, or loss of lives in healthcare or disasters.
Thus, real-time processing is preferred over batch processing.

8 Big data: Variety

9 Big data: Variety (2)
Variety is related to the complexity of data structure. Axes of data variety:
Structural variety: formats and models
Media variety: the medium in which data get delivered
Semantic variety: how to interpret and operate on the data
Availability variations: real-time? intermittent?
We can have variety even in a single email: sender, receiver, date, ...: well-structured; body of the text: text; attachments: multimedia; who-sends-to-whom: network; a current email referencing a past one: semantics; real-time?: availability.

10 Big data: Veracity

11 Big data: Veracity (2)
Veracity refers to quality:
Accuracy of data: data can be noisy, imprecise, biased, or full of uncertainty
Reliability of the data source: where the data come from, or how they were generated, is also a factor
Example: ordinary citizens who volunteer to report when they or someone in their family are experiencing symptoms of ILI. Flu Near You, a system run by the HealthMap initiative cofounded by Brownstein at Boston Children's Hospital, was launched in 2011 and now has 46,000 participants, covering 70,000 people.

12 Big data: Characteristics
The 4+2 V's of big data:
Valence: refers to the connectedness of big data, i.e., how interconnected the data are. As there are more and more connections among the data, the complexity of the analysis increases.
Value: the benefit we get from big data


14 Data science, and its five components (five P's)
Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured (from Wikipedia). It is a continuation of data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
Five P's of data science:
People: the data scientists, as a team
Purpose: the big data challenge; for example, the rate of spread and direction of a wildfire

15 Data scientist skills: Top five
Top five skills needed for a data scientist (the first two are technical):
1) Programming: the ability to analyze large datasets, and to create tools to do better data science
2) Quantitative analysis: experimental design and analysis; modeling of complex economic or growth systems (churn models); machine learning
3) Product intuition: generating hypotheses, defining metrics, debugging analyses
4) Communication: communicating insights, data visualization and presentation, general communication
5) Teamwork: being selfless, constant iteration, and sharing knowledge with others

16 Data science process
The data science process includes five steps:
1) Acquire: identify datasets and retrieve them
2) Prepare: composed of two sub-steps: explore and preprocess
3) Analyze: select analytical techniques and build models
4) Report: evaluate the analytical results and create reports
5) Act: apply the results
The five steps can be repeated as the original purpose demands.

17 Step 1: Acquire
Determine what data are available and acquire them. For this purpose, we should identify suitable data and make use of all data relevant to our problem for analysis.
Data come from many different sources, structured or unstructured, with different velocities, and different technologies are needed to access these data.

18 Step 1: Acquire: example: traditional databases
We use SQL and query browsers to acquire data from these databases. Here, the data are structured.
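As a sketch of this step, the snippet below builds a tiny weather-readings table in an in-memory SQLite database and acquires structured data from it with an ordinary SQL query. The table and column names are invented for illustration.

```python
import sqlite3

# Hypothetical weather-station table, created in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temp REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("S1", 21.5), ("S1", 23.0), ("S2", 19.0)])

# Structured data are acquired with a plain SQL query.
rows = conn.execute(
    "SELECT station, AVG(temp) FROM readings GROUP BY station"
).fetchall()
```

In a real deployment the connection string would point at a production database, but the acquire step itself is the same SQL query.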

19 Step 1: Acquire: example: files (such as text files and Excel spreadsheets)
We often use scripting languages to acquire data from files.

20 Step 1: Acquire: example: from websites
Webpages use a variety of W3C formats and services: formats include XML and HTML, in which webpages are written. Websites also host web services that give programmatic access to their data.

21 Step 1: Acquire: example: NoSQL storage
NoSQL storage systems are used to manage a variety of data types, as well as big data. In these storage systems, data are not stored as rows and columns. NoSQL systems provide APIs to access their data, and most also provide web services (such as REST) to interface with their data.

22 Step 1: Acquire: a use case: wildfire
Sensor data from weather stations have been stored in relational databases, so we can use SQL to access these data, which may be used to model the fire.
Real-time weather-station data arrive via a WebSocket service; these data are processed and compared to the patterns found by our model to assess the situation.
Tweets can be retrieved via hashtags related to any fire occurring near the region of interest; sentiment analysis of these tweets measures how people feel (fear, anger, or indifference to the fire), which in turn may measure the urgency of the fire.
A similar scenario can be designed for earthquakes.

23 Step 2a: Exploring data
The first step after acquiring data is to explore them (understand the data). In the explore step, we look for things such as correlations, outliers, and general trends; without this step, we cannot use the data effectively.
Correlation graphs show the dependencies between variables in the data.
Graphing the general trend shows whether there is a consistent direction in which the variables are moving, such as sales prices going up or down.
An outlier is a data point that is distant from the other data points. Outliers must be detected and handled.

24 Step 2a: Exploring data (2)
We may also use statistics to describe our data with numerical values. These numbers give us an idea of the nature of our data. For example, a negative value in the range of an age field indicates that something is wrong in our data.
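A minimal sketch of this sanity check, using an invented survey sample: computing the range of an age field immediately exposes an impossible value.

```python
ages = [34, 29, -1, 41, 38]      # hypothetical survey data
lo, hi = min(ages), max(ages)    # the range of the field

# A negative minimum in an age field signals a data-quality problem.
suspicious = lo < 0
```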

25 Step 2a: Exploring data (3)
Visualization provides a quick look at the data in this preliminary analysis step. For example, heat maps give a quick look at where the hot spots are, and histograms show the distribution of the data and may reveal an unusual spread.

26 Step 2b: Preprocess
Raw data are never in the format we need. In the preprocess step, we have to clean the data and then transform them to make them suitable for analysis.
Real data are messy. Data quality issues include:
Inconsistent values
Missing values
Duplicate records
Invalid data (such as a postal code with too many digits)
Outliers (values very different from the rest of the data)
We need to detect and correct these quality issues.
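The detection side of these quality issues can be sketched with a small scan over invented records; the field names and the rule "age must be non-negative" are assumptions for illustration.

```python
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # duplicate record
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": -5},    # invalid value
]

seen, duplicates, missing, invalid = set(), [], [], []
for r in records:
    key = (r["id"], r["age"])
    if key in seen:                 # exact repeat of an earlier record
        duplicates.append(r)
    seen.add(key)
    if r["age"] is None:            # missing value
        missing.append(r)
    elif r["age"] < 0:              # violates a domain rule
        invalid.append(r)
```

Correcting the issues (imputation, deduplication, filtering) then depends on domain knowledge, as the next slide notes.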

27 Step 2b: Preprocess: addressing data quality issues: cleaning the data
To handle incomplete or incorrect data, we need domain knowledge, such as knowledge of the application, how the data were collected, the users of the application, etc.

28 Step 2b: Preprocess: data munging
Here, we manipulate the cleaned data into the format needed for the analysis. Other names: data wrangling, data preprocessing.
Some operations that may be used in this preprocess step: feature selection, scaling, dimensionality reduction, transformation, manipulation.
Data preparation is very important for meaningful analysis.

29 Step 2b: Preprocess: data munging: scaling
Scaling means changing the range of values so that they fall within a specified range. Scaling prevents large values from dominating the results of the analysis.
Scale to [0, 1] by: x_new = (x - min(x)) / (max(x) - min(x))
Or make the data zero-mean and unit-variance.
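Both scalings on this slide can be written in a few lines; this is a plain-Python sketch of the two formulas, not a production implementation.

```python
def minmax_scale(xs):
    """Map values into [0, 1] via x_new = (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Rescale to zero mean and unit variance (z-scores)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / var ** 0.5 for x in xs]

scaled = minmax_scale([2, 4, 6])   # -> [0.0, 0.5, 1.0]
z = standardize([2, 4, 6])         # mean 0, variance 1
```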

30 Step 2b: Preprocess: data munging: transformation
We can also transform data to make them better suited for analysis. For example, we may use transformations to reduce noise or variability in the data.
Aggregation (an averaging filter) is such a transformation, reducing detail and variability. For example, daily sales figures have many irregular changes; aggregating them into weekly or monthly figures results in smoother data.
Point: such transformations remove detail from the data, so care must be taken if detail is needed for an application.
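The daily-to-weekly aggregation described above can be sketched directly; the sales figures are invented for illustration.

```python
# Hypothetical daily sales; aggregating to weekly means smooths the series.
daily = [100, 130, 90, 120, 110, 80, 140,   # week 1
         105, 125, 95, 115, 100, 90, 135]   # week 2

# One mean per 7-day window: the irregular daily changes average out.
weekly = [sum(daily[i:i + 7]) / 7 for i in range(0, len(daily), 7)]
```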

31 Step 2b: Preprocess: data munging: transformation: denoising
An example of denoising with a neural-network-based autoencoder using Keras and Python.

32 Step 2b: Preprocess: data munging: feature selection
Feature selection is the process of selecting a subset of relevant features (variables, predictors) that are useful for building a good predictor (model). Feature selection can be used for:
Removing irrelevant or redundant features, which makes the analysis easier
Combining features
Creating new features
For example, if two features are highly correlated, one of them can be removed without negatively affecting the analysis.
Feature selection algorithms broadly fall into three categories: filter, wrapper, and embedded models.
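The correlation rule above is an instance of the filter category; here is a plain-Python sketch with invented features, where `f2` is roughly a multiple of `f1` and the 0.95 threshold is an assumed cutoff.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical features: f2 is (almost) a copy of f1, so it is redundant.
f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [2.1, 4.0, 6.2, 8.1]   # roughly 2 * f1
f3 = [5.0, 1.0, 4.0, 2.0]

# Simple filter rule: drop one of any pair correlated above 0.95.
drop_f2 = abs(pearson(f1, f2)) > 0.95
```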

33 Step 2b: Preprocess: data munging: feature selection: wrapper approach
Wrapper feature selection, implemented in KNIME using a Naive Bayes classifier.

34 Step 2b: Preprocess: data munging: dimensionality reduction
Dimensionality reduction is useful when we have a large number of dimensions (features) for each record in the dataset. It involves finding a smaller subset of dimensions that captures most of the variation in the data. By doing this, we remove irrelevant features and reduce the number of features, which leads to simpler analysis. It can also be used for data compression.
Example: a cat image represented by 1, 2, and 5 components instead of 100 pixels; MDS projecting 3D data to 2D for visualization.

35 Step 2b: Preprocess: data munging: transformation into feature space
Example of using PCA (Principal Component Analysis) to transform data into eigenspace:
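A minimal NumPy sketch of that transformation, assuming NumPy is available; the 2-D points are a small illustrative dataset, and in practice one would use a library routine such as scikit-learn's PCA.

```python
import numpy as np

# Hypothetical 2-D points stretched along one direction.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)            # center the data
cov = np.cov(Xc, rowvar=False)     # covariance matrix
vals, vecs = np.linalg.eigh(cov)   # eigen-decomposition (symmetric matrix)
order = np.argsort(vals)[::-1]     # sort by explained variance, descending
vals, vecs = vals[order], vecs[:, order]

scores = Xc @ vecs                 # the data expressed in eigenspace
explained = vals[0] / vals.sum()   # share of variance on the first component
```

Keeping only the first column of `scores` is the 2D-to-1D compression the slide alludes to.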

36 Step 2b: Preprocess: data munging: data manipulation
Raw data often have to be manipulated into the correct format for analysis. For example, from samples recording daily changes in stock prices, we may be interested in the price changes of a particular market segment, such as real estate or healthcare, which has to be extracted from the data. This requires determining which stocks belong to which market segment, grouping them together, and perhaps computing the mean, range, or standard deviation for each group.
(In the figure, each block shows a record in the dataset.)

37 Step 3: Analyze
We build a model from the (input) data; the model generates the output data. Since there are different types of problems, we have different types of techniques for analysis, such as:
Classification
Regression
Clustering
Association analysis (rules)
Graph analytics (graph mining)
Recommendation systems
Model building: input data -> analysis technique -> model -> output data

38 Step 3: Analyze: Classification
Classification: predicting the category of the input data. If we have only two categories, we call it binary classification. (For handwritten digits, how many categories do we have?)
Example: a spam filter for emails has two classes, spam vs. non-spam, which makes it a binary classification problem.
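A toy sketch of such a binary classifier: flag an email as spam if it contains any word from a blacklist. The blacklist words are invented for illustration; a real spam filter would learn its decision rule from labeled data.

```python
def spam_classifier(email):
    """Toy binary classifier: 'spam' if any blacklisted word appears."""
    blacklist = {"winner", "free", "prize"}   # hypothetical trigger words
    return "spam" if blacklist & set(email.lower().split()) else "nonspam"
```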

39 Step 3: Analyze: Classification (2)
Example: deep learning for image classification, using a deep learning model trained on ImageNet:

40 Step 3: Analyze: Regression
Regression is when we have to predict a numeric value instead of a category; for example, predicting the price of a stock, gold, or oil, or approximating a function by its data points. Example below: linear regression.
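Simple linear regression has a closed-form least-squares solution; this sketch fits y = a*x + b to a few invented points.

```python
# Least-squares fit of y = a*x + b to hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance(x, y) / variance(x); intercept from the means.
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx

def predict(x):
    return a * x + b
```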

41 Step 3: Analyze: Clustering
Clustering: here, the goal is to organize similar items into groups; the figure shows a clustering with three clusters. Example: customer segmentation.
Using DBSCAN for clustering with KNIME: gray points are considered noise by DBSCAN.
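As a sketch of the idea (using k-means rather than DBSCAN, and 1-D points for brevity): assign each point to its nearest center, then move each center to the mean of its assigned points, and repeat. The data and starting centers are invented.

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: alternate nearest-center assignment and
    center re-estimation for a fixed number of iterations."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

# Two obvious groups near 1 and near 10.
points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = kmeans_1d(points, [0.0, 5.0])
```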

42 Step 3: Analyze: Association
Association analysis: the goal is to find rules that capture associations between items. An example is market-basket analysis, in which we want to discover which items frequently come together in baskets.
The form of an association rule is i -> j, where i is a set of items, i = {i1, i2, ..., ik}, and j is an item. The implication of this rule is that if all of the items in i appear in some basket, then j is likely to appear in that basket as well.
Point: frequent itemsets are obtained in order to get association rules.

43 Step 3: Analyze: Association (2)
Example: consider eight baskets over the itemset {b, c, m, p, j}:
B1 = {m,c,b}  B2 = {m,p,j}  B3 = {m,b}  B4 = {c,j}
B5 = {m,p,b}  B6 = {m,c,b,j}  B7 = {c,b,j}  B8 = {b,c}
If i is a set of items, the support of i is the number of baskets for which i is a subset, and confidence(i -> j) = support(i ∪ {j}) / support(i).
The association rule {m, b} -> c has confidence = 2/4 = 50%: {m,b,c} appears in B1 and B6, while {m,b} appears in B1, B3, B5, and B6.
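The support and confidence on this slide can be computed directly from the eight baskets:

```python
baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]

def support(itemset):
    """Number of baskets that contain every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(i, j):
    """confidence(i -> j) = support(i with j added) / support(i)."""
    return support(i | {j}) / support(i)

conf = confidence({"m", "b"}, "c")   # 2 / 4 = 0.5
```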

44 Step 3: Analyze: graph analytics
When data can be transformed into a graph, we may use the graph structure to find connections between entities. For example, graph analytics can be used to explore the spread of a disease or an epidemic by analyzing hospital or doctors' records, or by analyzing social networks related to a specific region.
Example: community detection

45 Step 3: Analyze: node importance: PageRank, degree centrality
PageRank scores have been normalized to sum to 100; the importance of nodes is visualized by their size. More in-links lead to more importance.
Degree centrality for the karate club graph:
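A minimal sketch of PageRank by power iteration on an invented four-page link graph (damping factor 0.85, a common default); in practice a graph library would be used.

```python
# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d, n = 0.85, len(pages)          # damping factor, number of pages

rank = {p: 1.0 / n for p in pages}
for _ in range(50):              # power iteration until (near) convergence
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        for q in outs:           # p passes rank evenly to its out-links
            new[q] += d * rank[p] / len(outs)
    rank = new
```

Page C receives links from A, B, and D, so it ends up with the highest rank, matching the slide's point that more in-links mean more importance.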

46 Step 3: Analyze: recommendation systems
There are different approaches to recommendation systems:
Content-based
Collaborative filtering
Latent factors
Example: book recommendation

47 Step 3: Analyze: modelling
Modelling includes selecting the technique, building the model, and validating the model.
Validation depends on the technique used. For example, we may apply the model to new data samples that it has not seen before; for classification, we compare the predicted values with the correct values in the test set.

48 Step 4: Reporting
Reporting communicates your insights and should be tailored to the audience. The first thing to do is to determine what to present, by answering the following questions:
What are the main results (the punchline)?
What added value do these results provide?
How do the results compare to the success criteria determined at the beginning of the project?
The results may be puzzling, or counter to what you were hoping to find; you must report them too.

49 Step 4: Reporting: visualization tools
Some widely used visualization tools are Python, KNIME, R, and Tableau (the first three are open source):

50 Step 5: Act: turning insights into action
Determine what actions should be taken. For example:
Is there something in the process that should be changed to remove bottlenecks?
Are there data that should be added to the application to make it more accurate?
Should we segment our population into more well-defined groups?
How should the actions be implemented?
What should be added to your process, and how should it be automated?
Stakeholders need to be identified and involved.

51 Step 5: Act: evaluation
We need to assess the impact of the action by monitoring and measuring its effect on the process or the application, which finally leads to an evaluation. The evaluation determines the next steps: should we revisit some data? We also need to determine real-time actions and automate them.


53 Hadoop: What is it?
Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. Hadoop was created by Doug Cutting and Mike Cafarella in 2005; Cutting named the project after his son's toy elephant.
Some of Hadoop's main features:
Moving computation to the data instead of moving the data to the computation
Scalability
Reliability
A new kind of analysis: simple algorithms on large data
Hadoop works on the basic assumption that hardware failures are common; these failures are taken care of by the Hadoop framework.

54 Hadoop is layered
A layered example: Storm, Spark, and Flink can be used for real-time and in-memory processing.

55 Hadoop: HDFS
HDFS: a distributed, scalable, and portable file system written in Java for the Hadoop framework, derived from the Google File System. It is intended for large files and batch inserts (write once, read many times).

56 From Hadoop 1.0 to Hadoop 2.0: YARN
YARN was born to do resource management separately from data processing. YARN schedules applications in order to prioritize tasks and maintains big data analytics systems. As one part of a greater architecture, YARN aggregates and sorts data to conduct specific queries for data retrieval. It helps allocate resources to particular applications and manages other kinds of resource monitoring tasks.

57 Hadoop Ecosystem
The Hadoop ecosystem refers to the various components of the Apache Hadoop software library: a set of tools and accessories that address particular needs in processing big data. In other words, a set of different modules interacting together forms the Hadoop ecosystem.
Question: how do we figure out this zoo?

58 Hadoop Zoo: Examples
Facebook's stack:

59 Hadoop Zoo: Examples (2)
Yahoo's stack:

60 Hadoop Zoo: Examples (3)
LinkedIn's stack:

61 Hadoop Zoo: Examples (4)
Cloudera's stack:

62 Hadoop's major components: Sqoop
Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

63 Hadoop's major components: HBase
HBase is a column-oriented database management system:
A key-value store, based on Google's BigTable
Can hold extremely large data, with a dynamic data model; it is not a relational DBMS
Supports both batch-style computations using MapReduce and point queries (random reads)
Consistent performance of reads/writes to data used by Hadoop applications
Allows the data store to be aggregated or processed using MapReduce functionality
A data platform for analytics and machine learning
Bulk storage of logs, documents, real-time activity feeds, and raw imported data

64 Hadoop's major components: Pig, Hive, Oozie
Pig: high-level programming on top of Hadoop MapReduce; expresses data analysis problems as data flows; originally developed at Yahoo
Hive: data warehouse software that facilitates querying and managing large datasets residing in distributed storage; provides a mechanism to project structure onto these data and query them using a SQL-like language called HiveQL
Oozie: a workflow scheduler system to manage Apache Hadoop jobs

65 Hadoop's major components: ZooKeeper
ZooKeeper provides operational services for a Hadoop cluster: maintaining configuration information, naming services, providing distributed synchronization, and providing group services.

66 HDFS Architecture: Summary
A single NameNode: a master server that manages the file system namespace and regulates access to files by clients.
Multiple DataNodes: typically one per node in the cluster. A DataNode's functions:
Managing storage
Serving read/write requests from clients
Block creation, deletion, and replication, based on instructions from the NameNode

67 HDFS: Block size
The default block size is 64 MB, which is good for large files. For example, a 10 GB file will be broken into 10 x 1024 / 64 = 160 blocks.
Why a small block size is not good:
NameNode memory usage: every block is represented as an object
Number of map tasks: data are typically processed one block at a time
Network load: the number of checks with DataNodes is proportional to the number of blocks
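The block-count arithmetic on this slide can be checked directly:

```python
import math

GB, MB = 1024 ** 3, 1024 ** 2
file_size = 10 * GB                          # a 10 GB file
block_size = 64 * MB                         # default HDFS block size (64 MB)
blocks = math.ceil(file_size / block_size)   # 10 * 1024 / 64 = 160 blocks
```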

68 Map and Reduce: an example
Problem definition: we have a huge text document, and we want to count the number of times each distinct word appears in it. Some applications: analyzing web server logs, statistics for query terms in search engines.
We need to define two functions:
Map: scan each line of the file and extract something we care about (keys)
Group by key: sort and shuffle, which is handled by Hadoop
Reduce: aggregate, summarize, filter, or transform

69 Wordcount: a serial code
A serial code: 1) get a word, 2) look the word up in a table, 3) add 1 to its count.
But how would you count all the words in all the Star Wars scripts and books and blogs and so on? Solution: the Map/Reduce strategy.

70 Wordcount: Mapper
Let <word, 1> be the <key, value> pair, and let Hadoop do the hard work. The Mapper:
Loop until done:
Get word
Emit <word, 1>

71 Wordcount: Shuffling and sorting
This step is done by Hadoop:

72 Wordcount: the Reducer
Loop over the key-value pairs:
Get the next <word, value>
If <word> is the same as the previous word, add <value> to count; else emit the previous <word, count> and reset count
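The whole wordcount pipeline on these slides can be sketched in plain Python, with the sort standing in for the shuffle-and-sort phase that Hadoop performs between the Mapper and the Reducer (the input lines are invented):

```python
from itertools import groupby

def mapper(line):
    """The Mapper: emit a <word, 1> pair for each word in the line."""
    for word in line.split():
        yield (word, 1)

def wordcount(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort()                       # shuffle & sort (done by Hadoop)
    # The Reducer: sum the values of each run of identical keys.
    return {word: sum(v for _, v in group)
            for word, group in groupby(pairs, key=lambda kv: kv[0])}

counts = wordcount(["the force the force", "use the force"])
```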

73 Wordcount, summary: Map/Reduce

74 Wordcount: summary: Example
Point: shuffling is done by a hash function in Hadoop.

75 Map/Reduce: in parallel
Partitioning, sorting, grouping, etc. are done by Hadoop. The system uses a default partition function: hash(key) mod #reducers.

76 Refinement to Map/Reduce: use combiners
A combiner combines the values of all keys produced by a single mapper (a single node). Often a Map produces many pairs with the same key: <key, value1>, <key, value2>, <key, value3>, ... We can aggregate these pairs with a combiner (similar to a reducer):
combine(<key, [value1, value2, ...]>) -> <key, value_final>, where value_final = value1 + value2 + ...
Much less data then needs to be shuffled or copied.
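A sketch of the saving, treating each input line as one mapper's output and using `Counter` as a local combiner (the lines are invented):

```python
from collections import Counter

def map_with_combiner(lines):
    """Each mapper pre-aggregates its own <word, 1> pairs locally,
    so far fewer pairs are shuffled across the network."""
    for line in lines:               # pretend each line is one mapper
        yield Counter(line.split())  # combined <word, count> pairs

combined = list(map_with_combiner(["a b a a", "b b a"]))
shuffled_pairs = sum(len(c) for c in combined)   # pairs actually shuffled
total = sum(combined, Counter())                 # reducer-side merge
```

Without the combiner, 7 `<word, 1>` pairs would be shuffled; with it, only 4 `<word, count>` pairs are.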

77 End of the session on Big Data


File Inclusion Vulnerability Analysis using Hadoop and Navie Bayes Classifier File Inclusion Vulnerability Analysis using Hadoop and Navie Bayes Classifier [1] Vidya Muraleedharan [2] Dr.KSatheesh Kumar [3] Ashok Babu [1] M.Tech Student, School of Computer Sciences, Mahatma Gandhi

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Big Data Analytics. Description:

Big Data Analytics. Description: Big Data Analytics Description: With the advance of IT storage, pcoressing, computation, and sensing technologies, Big Data has become a novel norm of life. Only until recently, computers are able to capture

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Data Platforms and Pattern Mining

Data Platforms and Pattern Mining Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,

More information

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Chapter 3. Foundations of Business Intelligence: Databases and Information Management Chapter 3 Foundations of Business Intelligence: Databases and Information Management THE DATA HIERARCHY TRADITIONAL FILE PROCESSING Organizing Data in a Traditional File Environment Problems with the traditional

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition What s the BIG deal?! 2011 2011 2008 2010 2012 What s the BIG deal?! (Gartner Hype Cycle) What s the

More information

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management Management Information Systems Review Questions Chapter 6 Foundations of Business Intelligence: Databases and Information Management 1) The traditional file environment does not typically have a problem

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

<Insert Picture Here> Introduction to Big Data Technology

<Insert Picture Here> Introduction to Big Data Technology Introduction to Big Data Technology The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6701 - INFORMATION MANAGEMENT Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation: 2013

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Chapter 6 VIDEO CASES

Chapter 6 VIDEO CASES Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK

REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK 1 Dr.R.Kousalya, 2 T.Sindhupriya 1 Research Supervisor, Professor & Head, Department of Computer Applications, Dr.N.G.P Arts and Science College, Coimbatore

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Department of Information Technology, St. Joseph s College (Autonomous), Trichy, TamilNadu, India

Department of Information Technology, St. Joseph s College (Autonomous), Trichy, TamilNadu, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 5 ISSN : 2456-3307 A Survey on Big Data and Hadoop Ecosystem Components

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Prototyping Data Intensive Apps: TrendingTopics.org

Prototyping Data Intensive Apps: TrendingTopics.org Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench Abstract Implementing a Hadoop-based system for processing big data and doing analytics is a topic which has been

More information

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : About Quality Thought We are

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 1 OBJECTIVES ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 2 WHAT

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,... Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

More information

Oracle Big Data Science

Oracle Big Data Science Oracle Big Data Science Tim Vlamis and Dan Vlamis Vlamis Software Solutions 816-781-2880 www.vlamis.com @VlamisSoftware Vlamis Software Solutions Vlamis Software founded in 1992 in Kansas City, Missouri

More information

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Big Data Specialized Studies

Big Data Specialized Studies Information Technologies Programs Big Data Specialized Studies Accelerate Your Career extension.uci.edu/bigdata Offered in partnership with University of California, Irvine Extension s professional certificate

More information

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10 Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com : Getting Started Guide Copyright 2012, 2014 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing,

More information

New Approaches to Big Data Processing and Analytics

New Approaches to Big Data Processing and Analytics New Approaches to Big Data Processing and Analytics Contributing authors: David Floyer, David Vellante Original publication date: February 12, 2013 There are number of approaches to processing and analyzing

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Tackling Big Data Using MATLAB

Tackling Big Data Using MATLAB Tackling Big Data Using MATLAB Alka Nair Application Engineer 2015 The MathWorks, Inc. 1 Building Machine Learning Models with Big Data Access Preprocess, Exploration & Model Development Scale up & Integrate

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Online Bill Processing System for Public Sectors in Big Data

Online Bill Processing System for Public Sectors in Big Data IJIRST International Journal for Innovative Research in Science & Technology Volume 4 Issue 10 March 2018 ISSN (online): 2349-6010 Online Bill Processing System for Public Sectors in Big Data H. Anwer

More information