Delegated Access for Hadoop Clusters in the Cloud


Delegated Access for Hadoop Clusters in the Cloud David Nuñez, Isaac Agudo, and Javier Lopez Network, Information and Computer Security Laboratory (NICS Lab) Universidad de Málaga, Spain Email: dnunez@lcc.uma.es IEEE CloudCom 2014 Singapore

1. Introduction Motivation Scenario Proposal 2. The Hadoop Framework 3. Proxy Re-Encryption Overview Access Control based on Proxy Re-Encryption 4. DASHR: Delegated Access System for Hadoop based on Re-Encryption 5. Experimental results Experimental setting Results 6. Conclusions

Introduction Big Data: the use of vast amounts of data whose processing and maintenance are virtually impossible from the traditional perspective of information management. Security and privacy challenges: in some cases the stored data is sensitive or personal, and malicious agents (insiders and outsiders) can profit by selling or exploiting it. Security is usually delegated to access control enforcement layers, which are implemented on top of the actual data stores; technical staff (e.g., system administrators) are often able to bypass these traditional access control systems and read data at will.

Apache Hadoop Apache Hadoop stands out as the most prominent framework for processing big datasets: it enables the storage and processing of large-scale datasets by clusters of machines. Hadoop's strategy is to divide the workload into parts and spread them throughout the cluster. Hadoop was not designed with security in mind; however, it is widely used by organizations that have strong security requirements regarding data protection.

Goal Stronger safeguards based on cryptography. The Cloud Security Alliance, on security challenges for Big Data: "[...] sensitive data must be protected through the use of cryptography and granular access control." Motivating scenario: Big Data Analytics as a Service.

Motivating Scenario: Big Data Analytics as a Service Big Data Analytics is a new opportunity for organizations to transform the way they market services and products through the analysis of massive amounts of data. Small and medium-sized companies are often not capable of acquiring and maintaining the infrastructure needed to run Big Data Analytics on-premises. The Cloud is a natural solution to this problem, in particular for small organizations: access to on-demand high-end clusters for analyzing massive amounts of data (e.g., Hadoop on Google Cloud). It makes even more sense when the organizations are already operating in the cloud, so analytics can be performed where the data is located.

Motivating Scenario: Big Data Analytics as a Service There are several risks, such as the ones that stem from a multi-tenant environment: jobs and data from different tenants are kept together in the same cluster in the cloud, which can be unsafe given the weak security measures provided by Hadoop. The use of encryption for protecting data at rest can decrease the risks associated with data disclosures in such a scenario. Our proposed solution fits in well with the outsourcing of Big Data processing to the cloud, since information can be stored in encrypted form in external servers in the cloud and processed only if access has been delegated.

Proposal DASHR: Delegated Access System for Hadoop based on Re-Encryption Cryptographically-enforced access control system Based on Proxy Re-Encryption Data remains encrypted in the filesystem until it is needed for processing Experimental results show that the overhead is reasonable

Hadoop operation [Diagram: the JobTracker distributes the workload and coordinates the TaskTrackers; input splits from the data store (e.g., HDFS) are processed by Map tasks on the TaskTrackers during the Map phase, followed by the Reduce phase.]

Proxy Re-Encryption: Overview A Proxy Re-Encryption scheme is a public-key encryption scheme that permits a proxy to transform ciphertexts under Alice's public key into ciphertexts under Bob's public key. To make this transformation possible, the proxy needs a re-encryption key rk_{A→B}, generated by the delegating entity. Proxy Re-Encryption therefore enables delegation of decryption rights.
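
To make the idea concrete, here is a minimal sketch of an ElGamal-style PRE scheme in the spirit of BBS98; the paper itself uses a pairing-free scheme by Weng et al. over NIST P-256, so this is only an illustration of the concept, with deliberately tiny, insecure group parameters and hypothetical helper names (Python 3.8+ assumed for pow(x, -1, q)):

import secrets

# Toy group: safe prime p = 2q + 1; g = 4 generates the subgroup of prime
# order q. Deliberately tiny and insecure, for illustration only.
p, q, g = 227, 113, 4

def keygen():
    sk = secrets.randbelow(q - 1) + 1
    return sk, pow(g, sk, p)                   # (sk, pk = g^sk)

def encrypt(pk, m):
    r = secrets.randbelow(q - 1) + 1
    return pow(pk, r, p), (m * pow(g, r, p)) % p   # (pk^r, m * g^r)

def rekey(sk_a, sk_b):
    # rk_{A->B} = sk_B * sk_A^-1 (mod q). Note the transitivity that DASHR
    # exploits: rk_{A->B} * rk_{B->C} = rk_{A->C} (mod q).
    return (sk_b * pow(sk_a, -1, q)) % q

def reencrypt(rk, ct):
    c1, c2 = ct
    return pow(c1, rk, p), c2                  # (pk_A^r)^rk = pk_B^r

def decrypt(sk, ct):
    c1, c2 = ct
    g_r = pow(c1, pow(sk, -1, q), p)           # (pk^r)^(1/sk) = g^r
    return (c2 * pow(g_r, -1, p)) % p

sk_a, pk_a = keygen()
sk_b, pk_b = keygen()
ct_a = encrypt(pk_a, 42)                       # ciphertext under Alice's key
ct_b = reencrypt(rekey(sk_a, sk_b), ct_a)      # proxy transforms it for Bob
assert decrypt(sk_b, ct_b) == 42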

Access Control based on Proxy Re-Encryption Three entities: The data owner, with public and private keys The delegatee, with public and private keys The proxy entity, which permits access through re-encryption

Creating an Encrypted Lockbox [Diagram: the data is encrypted under a symmetric data key, and the data key is in turn encrypted under the data owner's public key using PRE; together, the two ciphertexts form the encrypted lockbox.]

Delegating an Encrypted Lockbox [Diagram: using the re-encryption key, the proxy re-encrypts the lockbox, transforming the encapsulated data key so that the delegatee can decrypt it.]

Opening an Encrypted Lockbox [Diagram: the delegatee recovers the data key with its private key using PRE decryption, and then recovers the data with the data key using symmetric decryption.]
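
Continuing the toy PRE sketch above (it reuses keygen, encrypt, rekey, reencrypt, decrypt), the three lockbox slides condense into a few lines; a SHA-256 keystream stands in for the real symmetric cipher, and the data key is a small integer only because the toy group is tiny:

import hashlib, secrets

def keystream_xor(key_int, data):
    # Toy XOR stream cipher (a stand-in for AES); the same call both
    # encrypts and decrypts.
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(
            key_int.to_bytes(4, "big") + counter.to_bytes(4, "big")
        ).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def create_lockbox(pk_owner, block):
    data_key = secrets.randbelow(200) + 2                # toy data key
    return keystream_xor(data_key, block), encrypt(pk_owner, data_key)

def open_lockbox(sk, lockbox):
    enc_block, enc_key = lockbox
    return keystream_xor(decrypt(sk, enc_key), enc_block)

sk_do, pk_do = keygen()                                  # data owner
sk_tt, pk_tt = keygen()                                  # delegatee (TaskTracker)
enc_block, enc_key = create_lockbox(pk_do, b"one HDFS block, in miniature")
relocked = (enc_block, reencrypt(rekey(sk_do, sk_tt), enc_key))  # proxy step
assert open_lockbox(sk_tt, relocked) == b"one HDFS block, in miniature"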

DASHR DASHR: Delegated Access System for Hadoop based on Re-Encryption. Data is stored encrypted in the cluster, and the owner can delegate access rights to the computing cluster for processing. The data lifecycle is composed of three phases: 1. Production phase: during this phase, data is generated by different data sources and stored encrypted under the owner's public key for later processing. 2. Delegation phase: the data owner produces the necessary master re-encryption key for initiating the delegation process. 3. Consumption phase: this phase occurs each time a user of the Hadoop cluster submits a job; it is in this phase that encrypted data is read by the worker nodes of the cluster. At the beginning of this phase, re-encryption keys for each job are generated.

DASHR: Production phase Generation of data by different sources. Data is split into blocks by the filesystem (e.g., HDFS). Each block is an encrypted lockbox, which contains the encrypted data and an encrypted data key, produced using the public key of the data owner, pk_DO.

DASHR: Delegation phase The dataset owner produces a master re-encryption key mrk_DO to allow the delegation of access to the encrypted data. The master re-encryption key is used to derive re-encryption keys in the next phase. The delegation phase is performed only once for each computing cluster.

DASHR: Delegation phase This phase involves the interaction of three entities: 1. Dataset Owner (DO), with a pair of public and secret keys (pk_DO, sk_DO), the former used to encrypt generated data for consumption. 2. Delegation Manager (DM), with keys (pk_DM, sk_DM), which belongs to the security domain of the data owner, so it is assumed trusted. It can be either local or external to the computing cluster; if it is external, the data owner can control the issuing of re-encryption keys during the consumption phase. 3. Re-Encryption Key Generation Center (RKGC), which is local to the cluster and is responsible for generating all the re-encryption keys needed for access delegation during the consumption phase.

DASHR: Delegation phase These three entities follow a simple three-party protocol, so no secret keys are shared. The value t used during this protocol is simply a random value that blinds the secret key. At the end of the protocol, the RKGC possesses the master re-encryption key mrk_DO, which will later be used for generating the rest of the re-encryption keys in the consumption phase, making use of the transitive property of the proxy re-encryption scheme.

DASHR: Delegation Protocol [Protocol diagram: 1. the Data Owner (sk_DO) sends the random blinding value t to the Re-Encryption Key Generation Center; 2. the Data Owner sends t·sk_DO^{-1} to the Delegation Manager (sk_DM); 3. the Delegation Manager sends t·sk_DM·sk_DO^{-1} to the RKGC, which unblinds it with t to obtain mrk_DO = sk_DM·sk_DO^{-1}.]
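
Read this way, the protocol is just a few modular multiplications. A sketch continuing the toy group above (q, keygen from the PRE sketch; function and message names are hypothetical), following the message numbering of the diagram:

import secrets

def delegation_protocol(sk_do, sk_dm):
    t = secrets.randbelow(q - 1) + 1           # DO picks the blinding value t
    msg1 = t                                   # 1. DO -> RKGC: t
    msg2 = (t * pow(sk_do, -1, q)) % q         # 2. DO -> DM: t * sk_DO^-1
    msg3 = (msg2 * sk_dm) % q                  # 3. DM -> RKGC: t * sk_DM * sk_DO^-1
    return (msg3 * pow(msg1, -1, q)) % q       # RKGC unblinds with t

sk_do, pk_do = keygen()
sk_dm, pk_dm = keygen()
mrk_do = delegation_protocol(sk_do, sk_dm)
assert mrk_do == (sk_dm * pow(sk_do, -1, q)) % q   # mrk_DO = sk_DM * sk_DO^-1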

DASHR: Consumption phase This phase is performed each time a user submits a job to the Hadoop cluster. A pair of public and private keys (pk_TT, sk_TT) for the TaskTrackers is initialized in this step, to be used later during encryption and decryption. The Delegation Manager, the Re-Encryption Key Generation Center, the JobTracker and the TaskTrackers interact in order to generate the re-encryption key rk_{DO→TT}. The final re-encryption key rk_{DO→TT} is held by the JobTracker, which will be the one performing re-encryptions. This process can be repeated in case more TaskTracker keys are in place.

DASHR: Re-Encryption Key Generation Protocol [Protocol diagram: 1. the RKGC (holding mrk_DO) sends the random blinding value u to the Delegation Manager (sk_DM); 2.-3. the Delegation Manager returns u·sk_DM^{-1}, which the JobTracker forwards to the TaskTracker (sk_TT); 4.-5. the TaskTracker replies with u·rk_{DM→TT}, forwarded back through the JobTracker to the RKGC; 6. the RKGC unblinds with u, composes the result with mrk_DO, and delivers rk_{DO→TT} to the JobTracker.]
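
The same blinding trick again; a sketch continuing the code above (mrk_do and sk_dm come from the delegation step, and the message flow follows my reading of the diagram). By transitivity, mrk_DO · rk_{DM→TT} = rk_{DO→TT}:

import secrets

def rekey_generation(mrk_do, sk_dm, sk_tt):
    u = secrets.randbelow(q - 1) + 1            # 1. RKGC picks blinding u
    msg23 = (u * pow(sk_dm, -1, q)) % q         # 2.-3. DM -> JobTracker -> TT
    msg45 = (msg23 * sk_tt) % q                 # 4.-5. TT -> JobTracker -> RKGC: u * rk_{DM->TT}
    return (msg45 * pow(u, -1, q) * mrk_do) % q # 6. unblind, compose with mrk_DO

sk_tt, pk_tt = keygen()
rk_do_tt = rekey_generation(mrk_do, sk_dm, sk_tt)
assert rk_do_tt == (sk_tt * pow(sk_do, -1, q)) % q  # rk_{DO->TT} = sk_TT * sk_DO^-1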

DASHR: Consumption phase [Diagram: encrypted splits (blocks) are read from the data store as encrypted lockboxes; the JobTracker re-encrypts each lockbox for the TaskTrackers; during the Map and Reduce phases, the TaskTrackers decrypt their inputs, process them, and encrypt their outputs, so data appears as plaintext only inside the workers.]

Experimental setting Focused on the main part of the consumption phase, where the processing of the data occurs. From the Hadoop perspective, the other phases are offline processes, since they are not related to Hadoop's flow. Environment: virtualized environment on a rack of IBM BladeCenter HS23 servers connected through 10-gigabit Ethernet, running VMware ESXi 5.1.0. Each of the blade servers is equipped with two quad-core Intel(R) Xeon(R) E5-2680 CPUs @ 2.70 GHz. Cluster of 17 VMs (1 master node and 16 slave nodes). Each of the VMs has two logical cores and 4 GB of RAM, running a modified version of Hadoop 1.2.1.

Experimental setting: Cryptographic details Proxy Re-Encryption scheme from Weng et al., implemented using the NIST P-256 curve, which provides 128 bits of security and is therefore appropriate for encapsulating 128-bit symmetric keys. AES-128-CBC for symmetric encryption. We make use of the built-in AES support included in some Intel processors through the AES-NI instruction set.
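
For reference, a minimal standalone AES-128-CBC encryption of a block using the pyca/cryptography package; the paper's actual implementation lives inside a modified Hadoop and relies on AES-NI, so this sketch only mirrors the symmetric layer:

import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_block(data_key: bytes, block: bytes) -> bytes:
    iv = os.urandom(16)                              # fresh IV per block
    padder = padding.PKCS7(128).padder()
    padded = padder.update(block) + padder.finalize()
    enc = Cipher(algorithms.AES(data_key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()

def decrypt_block(data_key: bytes, ct: bytes) -> bytes:
    iv, body = ct[:16], ct[16:]
    dec = Cipher(algorithms.AES(data_key), modes.CBC(iv)).decryptor()
    padded = dec.update(body) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()

key = os.urandom(16)          # 128-bit data key, the size P-256 encapsulates
ct = encrypt_block(key, b"contents of one HDFS block")
assert decrypt_block(key, ct) == b"contents of one HDFS block"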

Experiment Execution of the WordCount benchmark, one of the sample programs included in Hadoop: a simple application that counts the occurrences of words over a set of files. In our experiment, the job input was a set of 1800 encrypted files of 64 MB each. In total, the input contains almost 30 billion words and occupies approximately 112.5 GB. The size of each input file is slightly below 64 MB, in order to fit HDFS blocks.
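
The benchmark itself is the Java WordCount example bundled with Hadoop; for readers unfamiliar with it, here is an equivalent Hadoop-Streaming-style sketch in Python showing what the workload computes:

import sys
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive sorted by key, as the Hadoop shuffle guarantees.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

pairs = sorted(mapper(["to be or not to be"]))
print(dict(reducer(pairs)))   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}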

Results Two runs over the same input: the first one using a clean version of Hadoop, the second one using our modified version. The total running times of the experiment were 1932.09 and 1960.74 seconds, respectively. That is, a difference of 28.65 seconds, which represents a relative overhead of about 1.48%.

Time costs

Operation                                Time (ms)
Block Encryption (AES-128-CBC, 64 MB)    212.62
Block Decryption (AES-128-CBC, 64 MB)    116.81
Lockbox Encryption (PRE scheme)          17.84
Lockbox Re-Encryption (PRE scheme)       17.59
Lockbox Decryption (PRE scheme)          11.66

Conclusions DASHR is a cryptographically-enforced access control system for Hadoop, based on proxy re-encryption. Stored data is always encrypted, and encryption keys do not need to be shared between different data sources. The use of proxy re-encryption makes it possible to delegate access to the stored data to the computing cluster. Experimental results show that the overhead produced by the encryption and decryption operations is manageable. Our proposal fits best in applications with an intensive map phase, since the relative overhead introduced by the use of cryptography is then smaller.

Thank you!