14th Iran Media Technology Conference. by H. Shah-Hosseini. 12 Dec Gathered & presented by H. Shah-Hosseini 1


2 Topics
Big data: big data and its four V's: volume, velocity, variety, and veracity; another two V's for big data: valence and value
Data science: data science and its five P's
Data science process: acquire, prepare, analyze, report, act
More on analysis (data mining): classification, regression, clustering, association analysis (rules), graph analytics
Hadoop: the Hadoop Distributed File System, Hadoop YARN, and Hadoop MapReduce


4 Big data: Volume

5 Big data: Volume (2)
Volume is related to the size and exponential growth of data. Every minute: 204 million emails are sent; on Facebook, 200,000 photos are uploaded and 1.8 million likes are given; on YouTube, there are 1.3 million video views and 72 hours of video uploads.
Challenges: storage, access, and processing

6 Big data: Velocity

7 Big data: Velocity (2)
Velocity: the speed at which data are created, and the need for speed in storing and analyzing data.
For big data, we need real-time action: late decisions lead to missed opportunities, losing customers visiting your online store, or loss of lives in healthcare or disasters.
Thus, real-time processing is preferred over batch processing.

8 Big data: Variety

9 Big data: Variety (2)
Variety is related to the complexity of data structure. Axes of data variety:
Structural variety: formats and models
Media variety: the medium in which data get delivered
Semantic variety: how to interpret and operate on the data
Availability variations: real-time? intermittent?
We can have variety even in a single email: sender, receiver, date, ...: well-structured; body of the text: text; attachments: multimedia; who-sends-to-whom: network; a current email referencing a past one: semantics; real-time?: availability.

10 Big data: Veracity

11 Big data: Veracity (2)
Veracity refers to quality:
Accuracy of data: data can be noisy, imprecise, biased, or full of uncertainty
Reliability of the data source: where the data come from, or how they were generated, is also a factor
Example: ordinary citizens who volunteer to report when they or someone in their family are experiencing symptoms of ILI. Flu Near You, a system run by the HealthMap initiative cofounded by Brownstein at Boston Children's Hospital, was launched in 2011 and now has 46,000 participants, covering 70,000 people.

12 Big data: Characteristics
The 4+2 V's of big data:
Valence: refers to the connectedness of big data, i.e., how interconnected the data are. As there are more and more connections among the data, the complexity of the analysis increases.
Value: the benefit we get from big data


14 Data science, and its five components (five P's)
Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured (from Wikipedia). It is a continuation of data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
Five P's of data science:
People: the data scientists, as a team
Purpose: the big data challenge; for example, the rate of spread and direction of a wildfire

15 Data scientist skills: Top five
Top five skills needed for a data scientist (the first two are technical):
1) Programming: the ability to analyze large datasets, and to create tools to do better data science
2) Quantitative analysis: experimental design and analysis; modeling of complex economic or growth systems (churn models); machine learning
3) Product intuition: generating hypotheses, defining metrics, debugging analyses
4) Communication: communicating insights, data visualization and presentation, general communication
5) Teamwork: being selfless, constant iteration, and sharing knowledge with others

16 Data science process
The data science process includes five steps:
1) Acquire: identify datasets and retrieve them
2) Prepare: composed of two sub-steps: explore and preprocess
3) Analyze: select analytical techniques and build models
4) Report: evaluate the analytical results and create reports
5) Act: apply the results
The five steps can be repeated as the original purpose demands.

17 Step 1: Acquire
Determine what data are available and acquire them. For this purpose, we should identify suitable data and make use of all data relevant to our problem for analysis.
Data come from many different sources, structured or unstructured, with different velocities, and different technologies are needed to access these data.

18 Step 1: Acquire: example: traditional databases
We use SQL and query browsers to acquire data from these databases. Here, the data are structured.
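As a sketch of this step, the snippet below builds a tiny weather-readings table in an in-memory SQLite database and acquires structured data from it with an ordinary SQL query. The table and column names are invented for illustration.

```python
import sqlite3

# Hypothetical weather-station table, created in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temp REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("S1", 21.5), ("S1", 23.0), ("S2", 19.0)])

# Structured data are acquired with a plain SQL query.
rows = conn.execute(
    "SELECT station, AVG(temp) FROM readings GROUP BY station"
).fetchall()
```

In a real deployment the connection string would point at a production database, but the acquire step itself is the same SQL query.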

19 Step 1: Acquire: example: files (such as text files and Excel spreadsheets)
We often use scripting languages to acquire data from files.

20 Step 1: Acquire: example: from websites
Webpages use a variety of W3C formats and services: formats include XML and HTML, in which webpages are written. Websites also host web services that give programmatic access to their data.

21 Step 1: Acquire: example: NoSQL storage
NoSQL storage systems are used to manage a variety of data types, as well as big data. In these storage systems, data are not stored as rows and columns. NoSQL systems provide APIs to access their data, and most also provide web services (such as REST) to interface with their data.

22 Step 1: Acquire: a use case: wildfire
Sensor data from weather stations have been stored in relational databases, so we can use SQL to access these data, which may be used to model the fire.
Real-time weather-station data arrive via a WebSocket service; these data are processed and compared to the patterns found by our model to assess the situation.
Tweets can be retrieved via hashtags related to any fire occurring near the region of interest; sentiment analysis of these tweets measures how people feel (fear, anger, or indifference to the fire), which in turn may measure the urgency of the fire.
A similar scenario can be designed for earthquakes.

23 Step 2a: Exploring data
The first step after acquiring data is to explore them (understand the data). In the explore step, we look for things such as correlations, outliers, and general trends; without this step, we cannot use the data effectively.
Correlation graphs show the dependencies between variables in the data.
Graphing the general trend shows whether there is a consistent direction in which the variables are moving, such as sales prices going up or down.
An outlier is a data point that is distant from the other data points. Outliers must be detected and handled.

24 Step 2a: Exploring data (2)
We may also use statistics to describe our data with numerical values. These numbers give us an idea of the nature of our data. For example, a negative value in the range of an age field indicates that something is wrong in our data.
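A minimal sketch of this sanity check, using an invented survey sample: computing the range of an age field immediately exposes an impossible value.

```python
ages = [34, 29, -1, 41, 38]      # hypothetical survey data
lo, hi = min(ages), max(ages)    # the range of the field

# A negative minimum in an age field signals a data-quality problem.
suspicious = lo < 0
```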

25 Step 2a: Exploring data (3)
Visualization provides a quick look at the data in this preliminary analysis step. For example, heat maps give a quick look at where the hot spots are, and histograms show the distribution of the data and may reveal an unusual spread.

26 Step 2b: Preprocess
Raw data are never in the format we need. In the preprocess step, we have to clean the data and then transform them to make them suitable for analysis.
Real data are messy. Data quality issues include:
Inconsistent values
Missing values
Duplicate records
Invalid data (such as a postal code with too many digits)
Outliers (values very different from the rest of the data)
We need to detect and correct these quality issues.
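The detection side of these quality issues can be sketched with a small scan over invented records; the field names and the rule "age must be non-negative" are assumptions for illustration.

```python
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # duplicate record
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": -5},    # invalid value
]

seen, duplicates, missing, invalid = set(), [], [], []
for r in records:
    key = (r["id"], r["age"])
    if key in seen:                 # exact repeat of an earlier record
        duplicates.append(r)
    seen.add(key)
    if r["age"] is None:            # missing value
        missing.append(r)
    elif r["age"] < 0:              # violates a domain rule
        invalid.append(r)
```

Correcting the issues (imputation, deduplication, filtering) then depends on domain knowledge, as the next slide notes.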

27 Step 2b: Preprocess: addressing data quality issues: cleaning the data
To handle incomplete or incorrect data, we need domain knowledge, such as knowledge of the application, how the data were collected, the users of the application, etc.

28 Step 2b: Preprocess: data munging
Here, we manipulate the cleaned data into the format needed for the analysis. Other names: data wrangling, data preprocessing.
Some operations that may be used in this preprocess step: feature selection, scaling, dimensionality reduction, transformation, manipulation.
Data preparation is very important for meaningful analysis.

29 Step 2b: Preprocess: data munging: scaling
Scaling means changing the range of values so that they fall within a specified range. Scaling prevents large values from dominating the results of the analysis.
Scale to [0, 1] by: x_new = (x - min(x)) / (max(x) - min(x))
Or make the data zero-mean and unit-variance.
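Both scalings on this slide can be written in a few lines; this is a plain-Python sketch of the two formulas, not a production implementation.

```python
def minmax_scale(xs):
    """Map values into [0, 1] via x_new = (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Rescale to zero mean and unit variance (z-scores)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / var ** 0.5 for x in xs]

scaled = minmax_scale([2, 4, 6])   # -> [0.0, 0.5, 1.0]
z = standardize([2, 4, 6])         # mean 0, variance 1
```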

30 Step 2b: Preprocess: data munging: transformation
We can also transform data to make them better suited for analysis. For example, we may use transformations to reduce noise or variability in the data.
Aggregation (an averaging filter) is such a transformation, reducing detail and variability. For example, daily sales figures have many irregular changes; aggregating them into weekly or monthly figures results in smoother data.
Point: such transformations remove detail from the data, so care must be taken if detail is needed for an application.
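The daily-to-weekly aggregation described above can be sketched directly; the sales figures are invented for illustration.

```python
# Hypothetical daily sales; aggregating to weekly means smooths the series.
daily = [100, 130, 90, 120, 110, 80, 140,   # week 1
         105, 125, 95, 115, 100, 90, 135]   # week 2

# One mean per 7-day window: the irregular daily changes average out.
weekly = [sum(daily[i:i + 7]) / 7 for i in range(0, len(daily), 7)]
```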

31 Step 2b: Preprocess: data munging: transformation: denoising
An example of denoising with a neural-network-based autoencoder using Keras and Python.

32 Step 2b: Preprocess: data munging: feature selection
Feature selection is the process of selecting a subset of relevant features (variables, predictors) that are useful for building a good predictor (model). Feature selection can be used for:
Removing irrelevant or redundant features, which makes the analysis easier
Combining features
Creating new features
For example, if two features are highly correlated, one of them can be removed without negatively affecting the analysis.
Feature selection algorithms broadly fall into three categories: filter, wrapper, and embedded models.
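The correlation rule above is an instance of the filter category; here is a plain-Python sketch with invented features, where `f2` is roughly a multiple of `f1` and the 0.95 threshold is an assumed cutoff.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical features: f2 is (almost) a copy of f1, so it is redundant.
f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [2.1, 4.0, 6.2, 8.1]   # roughly 2 * f1
f3 = [5.0, 1.0, 4.0, 2.0]

# Simple filter rule: drop one of any pair correlated above 0.95.
drop_f2 = abs(pearson(f1, f2)) > 0.95
```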

33 Step 2b: Preprocess: data munging: feature selection: wrapper approach
Wrapper feature selection, implemented in KNIME using a Naive Bayes classifier.

34 Step 2b: Preprocess: data munging: dimensionality reduction
Dimensionality reduction is useful when we have a large number of dimensions (features) for each record in the dataset. It involves finding a smaller subset of dimensions that captures most of the variation in the data. By doing this, we remove irrelevant features and reduce the number of features, which leads to simpler analysis. It can also be used for data compression.
Example: a cat image represented by 1, 2, and 5 components instead of 100 pixels; MDS projecting 3D data to 2D for visualization.

35 Step 2b: Preprocess: data munging: transformation into feature space
Example of using PCA (Principal Component Analysis) to transform data into eigenspace:
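A minimal NumPy sketch of that transformation, assuming NumPy is available; the 2-D points are a small illustrative dataset, and in practice one would use a library routine such as scikit-learn's PCA.

```python
import numpy as np

# Hypothetical 2-D points stretched along one direction.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)            # center the data
cov = np.cov(Xc, rowvar=False)     # covariance matrix
vals, vecs = np.linalg.eigh(cov)   # eigen-decomposition (symmetric matrix)
order = np.argsort(vals)[::-1]     # sort by explained variance, descending
vals, vecs = vals[order], vecs[:, order]

scores = Xc @ vecs                 # the data expressed in eigenspace
explained = vals[0] / vals.sum()   # share of variance on the first component
```

Keeping only the first column of `scores` is the 2D-to-1D compression the slide alludes to.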

36 Step 2b: Preprocess: data munging: data manipulation
Raw data often have to be manipulated into the correct format for analysis. For example, from samples recording daily changes in stock prices, we may be interested in the price changes of a particular market segment, such as real estate or healthcare, which has to be extracted from the data. This requires determining which stocks belong to which market segment, grouping them together, and perhaps computing the mean, range, or standard deviation for each group.
(In the figure, each block shows a record in the dataset.)

37 Step 3: Analyze
We build a model from the (input) data; the model generates the output data. Since there are different types of problems, we have different types of techniques for analysis, such as:
Classification
Regression
Clustering
Association analysis (rules)
Graph analytics (graph mining)
Recommendation systems
Model building: input data -> analysis technique -> model -> output data

38 Step 3: Analyze: Classification
Classification: predicting the category of the input data. If we have only two categories, we call it binary classification. (For handwritten digits, how many categories do we have?)
Example: a spam filter for emails has two classes, spam vs. non-spam, which makes it a binary classification problem.
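A toy sketch of such a binary classifier: flag an email as spam if it contains any word from a blacklist. The blacklist words are invented for illustration; a real spam filter would learn its decision rule from labeled data.

```python
def spam_classifier(email):
    """Toy binary classifier: 'spam' if any blacklisted word appears."""
    blacklist = {"winner", "free", "prize"}   # hypothetical trigger words
    return "spam" if blacklist & set(email.lower().split()) else "nonspam"
```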

39 Step 3: Analyze: Classification (2)
Example: deep learning for image classification, using a deep learning model trained on ImageNet:

40 Step 3: Analyze: Regression
Regression is when we have to predict a numeric value instead of a category; for example, predicting the price of a stock, gold, or oil, or approximating a function by its data points. Example below: linear regression.
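Simple linear regression has a closed-form least-squares solution; this sketch fits y = a*x + b to a few invented points.

```python
# Least-squares fit of y = a*x + b to hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance(x, y) / variance(x); intercept from the means.
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx

def predict(x):
    return a * x + b
```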

41 Step 3: Analyze: Clustering
Clustering: here, the goal is to organize similar items into groups; the figure shows a clustering with three clusters. Example: customer segmentation.
Using DBSCAN for clustering with KNIME: gray points are considered noise by DBSCAN.
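As a sketch of the idea (using k-means rather than DBSCAN, and 1-D points for brevity): assign each point to its nearest center, then move each center to the mean of its assigned points, and repeat. The data and starting centers are invented.

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: alternate nearest-center assignment and
    center re-estimation for a fixed number of iterations."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

# Two obvious groups near 1 and near 10.
points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = kmeans_1d(points, [0.0, 5.0])
```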

42 Step 3: Analyze: Association
Association analysis: the goal is to find rules that capture associations between items. An example is market-basket analysis, in which we want to discover which items frequently come together in baskets.
The form of an association rule is i -> j, where i is a set of items, i = {i1, i2, ..., ik}, and j is an item. The implication of this rule is that if all of the items in i appear in some basket, then j is likely to appear in that basket as well.
Point: frequent itemsets are obtained in order to get association rules.

43 Step 3: Analyze: Association (2)
Example: consider eight baskets over the itemset {b, c, m, p, j}:
B1 = {m,c,b}  B2 = {m,p,j}  B3 = {m,b}  B4 = {c,j}
B5 = {m,p,b}  B6 = {m,c,b,j}  B7 = {c,b,j}  B8 = {b,c}
If i is a set of items, the support of i is the number of baskets for which i is a subset, and confidence(i -> j) = support(i ∪ {j}) / support(i).
The association rule {m, b} -> c has confidence = 2/4 = 50%: {m,b,c} appears in B1 and B6, while {m,b} appears in B1, B3, B5, and B6.
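The support and confidence on this slide can be computed directly from the eight baskets:

```python
baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]

def support(itemset):
    """Number of baskets that contain every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(i, j):
    """confidence(i -> j) = support(i with j added) / support(i)."""
    return support(i | {j}) / support(i)

conf = confidence({"m", "b"}, "c")   # 2 / 4 = 0.5
```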

44 Step 3: Analyze: graph analytics
When data can be transformed into a graph, we may use the graph structure to find connections between entities. For example, graph analytics can be used to explore the spread of a disease or an epidemic by analyzing hospital or doctors' records, or by analyzing social networks related to a specific region.
Example: community detection

45 Step 3: Analyze: node importance: PageRank, degree centrality
PageRank scores have been normalized to sum to 100; the importance of nodes is visualized by their size. More in-links lead to more importance.
Degree centrality for the karate club graph:
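A minimal sketch of PageRank by power iteration on an invented four-page link graph (damping factor 0.85, a common default); in practice a graph library would be used.

```python
# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d, n = 0.85, len(pages)          # damping factor, number of pages

rank = {p: 1.0 / n for p in pages}
for _ in range(50):              # power iteration until (near) convergence
    new = {p: (1 - d) / n for p in pages}
    for p, outs in links.items():
        for q in outs:           # p passes rank evenly to its out-links
            new[q] += d * rank[p] / len(outs)
    rank = new
```

Page C receives links from A, B, and D, so it ends up with the highest rank, matching the slide's point that more in-links mean more importance.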

46 Step 3: Analyze: recommendation systems
There are different approaches to recommendation systems:
Content-based
Collaborative filtering
Latent factors
Example: book recommendation

47 Step 3: Analyze: modelling
Modelling includes selecting the technique, building the model, and validating the model.
Validation depends on the technique used. For example, we may apply the model to new data samples that it has not seen before; for classification, we compare the predicted values with the correct values in the test set.

48 Step 4: Reporting
Reporting communicates your insights and should be tailored to the audience. The first thing to do is to determine what to present, by answering the following questions:
What are the main results (the punchline)?
What added value do these results provide?
How do the results compare to the success criteria determined at the beginning of the project?
The results may be puzzling, or counter to what you were hoping to find; you must report them too.

49 Step 4: Reporting: visualization tools
Some widely used visualization tools are Python, KNIME, R, and Tableau (the first three are open source):

50 Step 5: Act: turning insights into action
Determine what actions should be taken. For example:
Is there something in the process that should be changed to remove bottlenecks?
Are there data that should be added to the application to make it more accurate?
Should we segment our population into more well-defined groups?
How should the actions be implemented?
What should be added to your process, and how should it be automated?
Stakeholders need to be identified and involved.

51 Step 5: Act: evaluation
We need to assess the impact of the action by monitoring and measuring its effect on the process or the application, which finally leads to an evaluation. The evaluation determines the next steps: should we revisit some data? We also need to determine real-time actions and automate them.


53 Hadoop: What is it?
Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. Hadoop was created by Doug Cutting and Mike Cafarella in 2005; Cutting named the project after his son's toy elephant.
Some of Hadoop's main features:
Moving computation to the data instead of moving the data to the computation
Scalability
Reliability
A new kind of analysis: simple algorithms on large data
Hadoop works on the basic assumption that hardware failures are common; these failures are taken care of by the Hadoop framework.

54 Hadoop is layered
A layered example: Storm, Spark, and Flink can be used for real-time and in-memory processing.

55 Hadoop: HDFS
HDFS: a distributed, scalable, and portable file system written in Java for the Hadoop framework, derived from the Google File System. It is intended for large files and batch inserts (write once, read many times).

56 From Hadoop 1.0 to Hadoop 2.0: YARN
YARN was born to do resource management separately from data processing. YARN schedules applications in order to prioritize tasks and maintains big data analytics systems. As one part of a greater architecture, YARN aggregates and sorts data to conduct specific queries for data retrieval. It helps allocate resources to particular applications and manages other kinds of resource monitoring tasks.

57 Hadoop Ecosystem
The Hadoop ecosystem refers to the various components of the Apache Hadoop software library: a set of tools and accessories that address particular needs in processing big data. In other words, a set of different modules interacting together forms the Hadoop ecosystem.
Question: how do we figure out this zoo?

58 Hadoop Zoo: Examples
Facebook's stack:

59 Hadoop Zoo: Examples (2)
Yahoo's stack:

60 Hadoop Zoo: Examples (3)
LinkedIn's stack:

61 Hadoop Zoo: Examples (4)
Cloudera's stack:

62 Hadoop's major components: Sqoop
Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

63 Hadoop's major components: HBase
HBase is a column-oriented database management system:
A key-value store, based on Google's BigTable
Can hold extremely large data, with a dynamic data model; it is not a relational DBMS
Supports both batch-style computations using MapReduce and point queries (random reads)
Consistent performance of reads/writes to data used by Hadoop applications
Allows the data store to be aggregated or processed using MapReduce functionality
A data platform for analytics and machine learning
Bulk storage of logs, documents, real-time activity feeds, and raw imported data

64 Hadoop's major components: Pig, Hive, Oozie
Pig: high-level programming on top of Hadoop MapReduce; expresses data analysis problems as data flows; originally developed at Yahoo
Hive: data warehouse software that facilitates querying and managing large datasets residing in distributed storage; provides a mechanism to project structure onto these data and query them using a SQL-like language called HiveQL
Oozie: a workflow scheduler system to manage Apache Hadoop jobs

65 Hadoop's major components: ZooKeeper
ZooKeeper provides operational services for a Hadoop cluster: maintaining configuration information, naming services, providing distributed synchronization, and providing group services.

66 HDFS Architecture: Summary
A single NameNode: a master server that manages the file system namespace and regulates access to files by clients.
Multiple DataNodes: typically one per node in the cluster. A DataNode's functions:
Managing storage
Serving read/write requests from clients
Block creation, deletion, and replication, based on instructions from the NameNode

67 HDFS: Block size
The default block size is 64 MB, which is good for large files. For example, a 10 GB file will be broken into 10 x 1024 / 64 = 160 blocks.
Why a small block size is not good:
NameNode memory usage: every block is represented as an object
Number of map tasks: data are typically processed one block at a time
Network load: the number of checks with DataNodes is proportional to the number of blocks
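The block-count arithmetic on this slide can be checked directly:

```python
import math

GB, MB = 1024 ** 3, 1024 ** 2
file_size = 10 * GB                          # a 10 GB file
block_size = 64 * MB                         # default HDFS block size (64 MB)
blocks = math.ceil(file_size / block_size)   # 10 * 1024 / 64 = 160 blocks
```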

68 Map and Reduce: an example
Problem definition: we have a huge text document, and we want to count the number of times each distinct word appears in it. Some applications: analyzing web server logs, statistics for query terms in search engines.
We need to define two functions:
Map: scan each line of the file and extract something we care about (keys)
Group by key: sort and shuffle, which is handled by Hadoop
Reduce: aggregate, summarize, filter, or transform

69 Wordcount: a serial code
A serial code: 1) get a word, 2) look the word up in a table, 3) add 1 to its count.
But how would you count all the words in all the Star Wars scripts and books and blogs and so on? Solution: the Map/Reduce strategy.

70 Wordcount: Mapper
Let <word, 1> be the <key, value> pair, and let Hadoop do the hard work. The Mapper:
Loop until done:
Get word
Emit <word, 1>

71 Wordcount: Shuffling and sorting
This step is done by Hadoop:

72 Wordcount: the Reducer
Loop over the key-value pairs:
Get the next <word, value>
If <word> is the same as the previous word, add <value> to count; else emit the previous <word, count> and reset count
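The whole wordcount pipeline on these slides can be sketched in plain Python, with the sort standing in for the shuffle-and-sort phase that Hadoop performs between the Mapper and the Reducer (the input lines are invented):

```python
from itertools import groupby

def mapper(line):
    """The Mapper: emit a <word, 1> pair for each word in the line."""
    for word in line.split():
        yield (word, 1)

def wordcount(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort()                       # shuffle & sort (done by Hadoop)
    # The Reducer: sum the values of each run of identical keys.
    return {word: sum(v for _, v in group)
            for word, group in groupby(pairs, key=lambda kv: kv[0])}

counts = wordcount(["the force the force", "use the force"])
```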

73 Wordcount, summary: Map/Reduce

74 Wordcount: summary: Example
Point: shuffling is done by a hash function in Hadoop.

75 Map/Reduce: in parallel
Partitioning, sorting, grouping, etc. are done by Hadoop. The system uses a default partition function: hash(key) mod #reducers.

76 Refinement to Map/Reduce: use combiners
A combiner combines the values of all keys produced by a single mapper (a single node). Often a Map produces many pairs with the same key: <key, value1>, <key, value2>, <key, value3>, ... We can aggregate these pairs with a combiner (similar to a reducer):
combine(<key, [value1, value2, ...]>) -> <key, value_final>, where value_final = value1 + value2 + ...
Much less data then needs to be shuffled or copied.
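A sketch of the saving, treating each input line as one mapper's output and using `Counter` as a local combiner (the lines are invented):

```python
from collections import Counter

def map_with_combiner(lines):
    """Each mapper pre-aggregates its own <word, 1> pairs locally,
    so far fewer pairs are shuffled across the network."""
    for line in lines:               # pretend each line is one mapper
        yield Counter(line.split())  # combined <word, count> pairs

combined = list(map_with_combiner(["a b a a", "b b a"]))
shuffled_pairs = sum(len(c) for c in combined)   # pairs actually shuffled
total = sum(combined, Counter())                 # reducer-side merge
```

Without the combiner, 7 `<word, 1>` pairs would be shuffled; with it, only 4 `<word, count>` pairs are.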

77 End of the session on Big Data


File Inclusion Vulnerability Analysis using Hadoop and Navie Bayes Classifier File Inclusion Vulnerability Analysis using Hadoop and Navie Bayes Classifier [1] Vidya Muraleedharan [2] Dr.KSatheesh Kumar [3] Ashok Babu [1] M.Tech Student, School of Computer Sciences, Mahatma Gandhi

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Big Data Analytics. Description:

Big Data Analytics. Description: Big Data Analytics Description: With the advance of IT storage, pcoressing, computation, and sensing technologies, Big Data has become a novel norm of life. Only until recently, computers are able to capture

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Data Platforms and Pattern Mining

Data Platforms and Pattern Mining Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,

More information

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Chapter 3. Foundations of Business Intelligence: Databases and Information Management Chapter 3 Foundations of Business Intelligence: Databases and Information Management THE DATA HIERARCHY TRADITIONAL FILE PROCESSING Organizing Data in a Traditional File Environment Problems with the traditional

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition What s the BIG deal?! 2011 2011 2008 2010 2012 What s the BIG deal?! (Gartner Hype Cycle) What s the

More information

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management Management Information Systems Review Questions Chapter 6 Foundations of Business Intelligence: Databases and Information Management 1) The traditional file environment does not typically have a problem

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

<Insert Picture Here> Introduction to Big Data Technology

<Insert Picture Here> Introduction to Big Data Technology Introduction to Big Data Technology The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and

More information

Hadoop Online Training

Hadoop Online Training Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the

More information

Microsoft Big Data and Hadoop

Microsoft Big Data and Hadoop Microsoft Big Data and Hadoop Lara Rubbelke @sqlgal Cindy Gross @sqlcindy 2 The world of data is changing The 4Vs of Big Data http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data 3 Common

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6701 - INFORMATION MANAGEMENT Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation: 2013

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Chapter 6 VIDEO CASES

Chapter 6 VIDEO CASES Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK

REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK 1 Dr.R.Kousalya, 2 T.Sindhupriya 1 Research Supervisor, Professor & Head, Department of Computer Applications, Dr.N.G.P Arts and Science College, Coimbatore

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Department of Information Technology, St. Joseph s College (Autonomous), Trichy, TamilNadu, India

Department of Information Technology, St. Joseph s College (Autonomous), Trichy, TamilNadu, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 5 ISSN : 2456-3307 A Survey on Big Data and Hadoop Ecosystem Components

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Top 25 Big Data Interview Questions And Answers

Top 25 Big Data Interview Questions And Answers Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Prototyping Data Intensive Apps: TrendingTopics.org

Prototyping Data Intensive Apps: TrendingTopics.org Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench Abstract Implementing a Hadoop-based system for processing big data and doing analytics is a topic which has been

More information

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn.?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : About Quality Thought We are

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 1 OBJECTIVES ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29, 2016 2 WHAT

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,... Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

More information

Oracle Big Data Science

Oracle Big Data Science Oracle Big Data Science Tim Vlamis and Dan Vlamis Vlamis Software Solutions 816-781-2880 www.vlamis.com @VlamisSoftware Vlamis Software Solutions Vlamis Software founded in 1992 in Kansas City, Missouri

More information

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi Journal of Energy and Power Engineering 10 (2016) 405-410 doi: 10.17265/1934-8975/2016.07.004 D DAVID PUBLISHING Shirin Abbasi Computer Department, Islamic Azad University-Tehran Center Branch, Tehran

More information

Big Data Specialized Studies

Big Data Specialized Studies Information Technologies Programs Big Data Specialized Studies Accelerate Your Career extension.uci.edu/bigdata Offered in partnership with University of California, Irvine Extension s professional certificate

More information

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10 Scalable Web Programming CS193S - Jan Jannink - 2/25/10 Weekly Syllabus 1.Scalability: (Jan.) 2.Agile Practices 3.Ecology/Mashups 4.Browser/Client 7.Analytics 8.Cloud/Map-Reduce 9.Published APIs: (Mar.)*

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com : Getting Started Guide Copyright 2012, 2014 Hortonworks, Inc. Some rights reserved. The, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing,

More information

New Approaches to Big Data Processing and Analytics

New Approaches to Big Data Processing and Analytics New Approaches to Big Data Processing and Analytics Contributing authors: David Floyer, David Vellante Original publication date: February 12, 2013 There are number of approaches to processing and analyzing

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Tackling Big Data Using MATLAB

Tackling Big Data Using MATLAB Tackling Big Data Using MATLAB Alka Nair Application Engineer 2015 The MathWorks, Inc. 1 Building Machine Learning Models with Big Data Access Preprocess, Exploration & Model Development Scale up & Integrate

More information

Introduction to HDFS and MapReduce

Introduction to HDFS and MapReduce Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

Online Bill Processing System for Public Sectors in Big Data

Online Bill Processing System for Public Sectors in Big Data IJIRST International Journal for Innovative Research in Science & Technology Volume 4 Issue 10 March 2018 ISSN (online): 2349-6010 Online Bill Processing System for Public Sectors in Big Data H. Anwer

More information