Databricks Enterprise Security Guide


Databricks is committed to building a platform where data scientists, data engineers, and data analysts can trust that their data is secure. By implementing industry-wide best practices and building upon the many security-related features provided by AWS, Databricks addresses the most commonly required security controls, highlighted in this document. This document describes the Databricks deployment architecture in detail, illustrating how security is addressed throughout.

Contents
- Deployment Model
- Compliance Program
- Defense in Depth
- Customer Data
- Databricks Access to Customer Environment
- Employee Access
- Data Governance
- Data Flow & Encryption
- Customer Credentials Management
- Backups
- Application
- Authentication and Authorization - End User Access Control
- Role-based Access Controls (ACL)
- Change Management & Secure Coding
- Host
- Hardening Standards
- Vulnerability Management
- Network Security
- Network Isolation
- Spark Cluster Network Isolation
- VPC Isolation of Customer's Service in Databricks Account
- Security Groups & Network ACLs
- No Public IPs
- Monitoring
- Physical Security
- Infrastructure
- Office
- Logging and Monitoring
- Policies & Procedures

Deployment Model

The Databricks Enterprise offering is a single-tenant deployment.

Data plane: Spark clusters are deployed in a customer AWS account. Customer datasets are stored in customer-owned and customer-managed storage (e.g. AWS S3, RDBMS, NoSQL).

Control plane: Runs in the Databricks account, in a VPC dedicated to a single customer.

[Architecture diagram: a Databricks dedicated VPC (home workspace with notebooks, tables, and jobs, plus central services; SOC 2 Type 2 attested as of 3/17) communicates over TLS with the customer-controlled VPCs that run the Spark clusters. Databricks admin access is audited (refer to Audited Controls), the control plane reaches the customer account through a cross-account IAM role for API access, connectivity to customer data sources is the customer's choice (e.g. VPN gateway), and KMS encryption is controlled by the customer. The design provides end-to-end encryption and integrity protection.]

Key properties of this deployment model:
- Zero maintenance
- Single-tenant VPC isolation of the control plane
- Secured internal communication
- Secured access and authorization
- Encrypted customer state
- Isolated AWS accounts
- Apache Spark cluster network isolation
- Smarter cost controls

Compliance Program

Databricks engages an independent CPA firm to perform annual and semi-annual audits. We currently hold:
- A SOC 2 Type 2 attestation. The SOC 2 report covers the design and operating effectiveness of controls that meet the trust services criteria for security, availability, and confidentiality.
- An attestation of HIPAA compliance.

Additionally, Databricks engages an independent third-party organization, NCC Group (formerly iSEC Partners), to conduct annual code reviews and penetration tests.

Defense in Depth

Databricks follows a Defense in Depth approach in order to address security as a whole. This comprehensive strategy spans technology, policies, and procedures, as well as promoting a security-first culture. Databricks Defense in Depth covers the following layers: Customer Data; Application; Host; Network Security; Physical Security; Logging and Monitoring; and Policies and Procedures.

Customer Data

CUSTOMER DATASETS
Databricks is built to work with a customer's existing data. It does not provide a persistent storage layer in and of itself; instead, it is designed to leverage Spark's excellent support for various preexisting data sources and data formats, and it provides additional optimizations where applicable. Databricks customers most often utilize AWS Simple Storage Service (S3), but can also access a number of other sources (e.g. RDBMS, NoSQL, CSV uploads). A wide range of data formats is supported, including CSV, Parquet, JSON, and Hadoop formats (e.g. SequenceFile, Avro). All sources and formats are accessible using whatever client authentication mechanisms are required for the given source.

CUSTOMER METADATA
Customer metadata, including customer queries, the outputs of those queries, and web user accounts, is stored in Databricks AWS RDS and encrypted with AWS KMS. Databricks provides customers with the option to use their own encryption keys (AWS KMS) to secure data at rest.

SECURED INTERFACES TO SPARK CLUSTERS
Spark clusters are ultimately responsible for accessing and processing data in the Databricks environment, and access to Spark clusters occurs primarily through the web frontend interface. Access to frontend services requires authenticated identities and is encrypted through SSL. Commands are pushed from the frontend to the Spark cluster through an SSL-encrypted connection that utilizes certificate-based authentication.

VPC PEERING TO ADDITIONAL CUSTOMER VPC
Network access from the Databricks Spark clusters to any additional customer data sources can be conveniently enabled through VPC peering between the Spark clusters' VPC and the external VPC. In lieu of VPC peering, standard network routing or VPN configurations can be used.

Databricks Access to Customer Environment

PROGRAMMATIC
Privileged Databricks services have the ability to monitor and update customer deployments. Our monitoring agent can make metadata-only black-box checks against the customer environment, such as listing clusters or jobs, to ensure that the respective services are healthy and producing valid data. Additionally, we make EC2 describe calls to verify the health of the AWS resources. Our update agent can provision new EC2 instances in the customer environment and request that existing instances pull new artifacts from the Databricks artifact repository and self-update.
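As a minimal, hypothetical sketch of what this looks like from the user's side, the notebook snippet below reads customer-owned S3 data through these secured interfaces. The bucket and paths are illustrative, and credentials are assumed to come from an IAM role attached to the cluster rather than keys embedded in code.

```python
# Minimal sketch: reading customer-owned S3 data from a Databricks notebook.
# Bucket and paths are hypothetical. In Databricks notebooks, `spark` is
# predefined; getOrCreate() makes the sketch runnable elsewhere too.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read CSV data from S3 (the s3a connector talks to S3 over HTTPS).
events = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://example-customer-bucket/raw/events.csv"))

# Read a Parquet dataset from the same hypothetical bucket.
users = spark.read.parquet("s3a://example-customer-bucket/curated/users/")

# Join and aggregate; the processing runs on the Spark cluster inside the
# customer's own AWS account.
events.join(users, "user_id").groupBy("country").count().show()
```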

Employee Access

Databricks has developed a proprietary system, Genie, for requesting, approving, revoking, and logging access to customer data. As a general practice, Databricks employees do not access customer data unless specifically requested by a customer (e.g. to troubleshoot). Such requests must be documented in a Zendesk ticket and include the customer's consent for Databricks to access their environment. Following receipt of a Zendesk ticket, a Databricks engineer reviews the reported issue and, if needed, submits a request to Genie for access to the customer environment in order to address it. Genie, upon successful validation of the ticket number and customer consent, approves the engineer's access to the customer environment. Such access is approved for a specified period of time, after which the access permission is automatically revoked. Genie can approve access only for a limited group of engineers, which is reviewed and revalidated quarterly. All access to a customer environment by Databricks personnel, including any actions taken, is logged and available for customers to review as part of the Databricks service audit logs.

Data Governance

Customer data is stored in Amazon S3, and Databricks designates the physical region in which each customer's data and servers are located. Amazon S3 data objects are replicated only within the regional cluster where the data is stored; they are not replicated to data center clusters in other regions. For example, by default, all Databricks customers in the EU will have their cloud data, logs, databases, and cluster management stored in the AWS data center in the EU, and that data will not be transferred to data centers outside the EU.

Data Flow & Encryption

This section details the data flow: where a user's data enters Databricks, how it moves through the system, and where it is stored, with the particular goal of showing that the data is always encrypted in transit and at rest.
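As a hypothetical illustration of this regional pinning, a customer can create and verify an S3 bucket constrained to an EU region; the bucket name below is a placeholder.

```python
# Hypothetical sketch: confirming that an S3 bucket (and hence the data
# Databricks reads from it) lives in the intended region.
import boto3

s3 = boto3.client("s3")

# Create a bucket pinned to the EU (Ireland) region; the name is hypothetical.
s3.create_bucket(
    Bucket="example-customer-bucket-eu",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Verify where an existing bucket's data resides.
region = s3.get_bucket_location(Bucket="example-customer-bucket-eu")
print(region["LocationConstraint"])  # -> "eu-west-1"
```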

CUSTOMER DATA ENTERS DATABRICKS THROUGH TWO MECHANISMS:
1. Data sources that are accessed through Databricks
2. User-entered data (typically credentials)

[Data flow diagram: customer data and customer input flow into (i) the Databricks-owned instances running Databricks Services and (ii) the customer-owned Worker instances, on which the customer-owned Container Processes and the Databricks-owned Data Daemon reside. The numbered steps below correspond to the flows among these components, their local EBS volumes, RDS, the Root Bucket, and the S3/Kinesis log pipeline. Lines indicate where data is in transit and disks indicate where data lies at rest; orange marks input to the system (customer data) and green marks Databricks-owned components, where customer data initially does not reside.]

1. Customer data stored in customer-owned data sources (e.g. S3, Redshift, RDS) is read directly by the container. The customer is responsible for using encrypted connections; Databricks-provided defaults always use encryption for S3 access. The Data Daemon (which always uses the S3 Root Bucket) always uses HTTPS to talk to S3.
2. Data input by the customer to Databricks services (or secrets which may give access to customer data) always travels over HTTPS, either through a browser session or through our API, which requires TLS 1.1 or 1.2. For AWS-related calls, customers are recommended to use IAM roles.
3. Communication between the Databricks Service (control plane) and the Container Process (data plane) occurs over an RPC mechanism that uses TLS 1.2 and mutual client/server authentication.
4. Communication between the Container Process and the Data Daemon is not encrypted, but the two are colocated on the same physical instance, and iptables rules prevent other containers from observing the traffic.
5. Spark transfers data between executors in order to perform distributed operations. This data is not encrypted and travels between physical instances within the same VPC.
6. Databricks Services, the Data Daemon, and the Container Process write logs to their local EBS volumes. Encryption depends on the configuration of the EBS volume (see below).
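To make step 2 concrete, here is a hedged sketch of calling the Databricks REST API over HTTPS; the workspace URL and token are hypothetical placeholders, and the TLS comment reflects the version requirement stated above.

```python
# Hedged sketch of step 2: user input reaches Databricks services only over
# HTTPS. The domain and token below are hypothetical placeholders.
import requests

DOMAIN = "https://example.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "dapiXXXXXXXXXXXXXXXX"                   # hypothetical API token

# The https:// scheme means the request is encrypted with TLS; per the text
# above, the service requires clients to negotiate TLS 1.1 or 1.2.
resp = requests.get(
    f"{DOMAIN}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])
```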

7. The Container Process and Data Daemon additionally write customer data to their local EBS volumes for caching; the encryption story is the same as in step 6. Local disks are used for logs and data caching. When Amazon launches a new instance, the bootstrap disk can be either a copy of a local disk image stored in S3 or an EBS volume snapshot. Our AMIs are based on EBS volumes. The bootstrap EBS volume snapshot may be encrypted with KMS, but then the AMI cannot be shared directly with other accounts. As a result of this stipulation, our current solution for encrypted EBS volumes is somewhat nuanced:
   i) Instances running in our account (Databricks Services) boot from an encrypted EBS volume and are therefore encrypted using KMS.
   ii) Instances running in the customer account do not use an encrypted EBS volume on boot; instead, we request additional data EBS volumes encrypted with KMS and put all container data on those disks.
8. The Databricks Services and, in some configurations, the Container Process share an RDS instance in which they store user-input data (including access keys) as well as the results of customer queries. The instance uses a per-customer KMS key to encrypt its EBS volumes and backups. The database is also backed up to S3, where it is KMS-encrypted using the same key.
9. Databricks Services and the Data Daemon store certain data (namely, mount point metadata), which may contain customer data, in the Databricks Root Bucket. Customer-input secret keys are encrypted with SSE-S3.
10. Log data is uploaded to the Databricks log pipeline via Kinesis. Logs at rest are encrypted with AWS KMS, and logs in flight are encrypted with TLS 1.2.
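The following is a hedged sketch of the pattern in step 7(ii): attaching an additional KMS-encrypted EBS data volume to an instance in the customer account. The region, instance ID, key alias, and device name are hypothetical.

```python
# Hedged sketch of step 7(ii): an additional KMS-encrypted EBS data volume
# for an instance in the customer account. All identifiers are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

volume = ec2.create_volume(
    AvailabilityZone="us-west-2a",
    Size=100,                                  # GiB
    VolumeType="gp2",
    Encrypted=True,                            # encrypt the volume at rest...
    KmsKeyId="alias/example-databricks-key",   # ...with a specific KMS key
)

ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Container data is then placed on this encrypted volume rather than the
# unencrypted boot volume.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```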

Customer Credentials Management

Data input by the customer to Databricks services (or secrets which may give access to customer data) always travels over HTTPS, either through a browser session or through our API, which requires TLS 1.1 or 1.2. Customer AWS credentials are stored with client-side encryption in a private, secured S3 bucket. The key used to encrypt the credentials is itself stored encrypted in a separate private, secured S3 bucket. The stored credentials are accessed only by our automated deployment process; no Databricks personnel have direct access to the credentials.

Backups

Databricks performs automated scheduled backups of metadata and systems every 24 hours. The backups are stored in AWS RDS with access restricted to authorized employees. Backup and recovery procedures are tested annually.
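Purely as an illustration of the S3 encryption mechanisms mentioned here and in step 9 above (not the actual Databricks credential-store implementation), the sketch below stores an already client-side-encrypted blob with SSE-S3; the bucket and object key are hypothetical.

```python
# Illustrative sketch only: storing a small secret in S3 with server-side
# encryption (SSE-S3). Bucket and key names are hypothetical, and this is
# not the actual Databricks credential store.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-private-secrets-bucket",
    Key="customer-123/credentials.enc",
    Body=b"<ciphertext produced by client-side encryption>",
    ServerSideEncryption="AES256",  # SSE-S3 managed keys
)

# Confirm the object is flagged as server-side encrypted.
head = s3.head_object(
    Bucket="example-private-secrets-bucket",
    Key="customer-123/credentials.enc",
)
print(head["ServerSideEncryption"])  # -> "AES256"
```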

Application

Authentication and Authorization - End User Access Control

SSO
Databricks provides Single Sign-On (SSO) to enable a customer to authenticate its employees using the customer's own identity provider. As long as the identity provider supports the SAML 2.0 protocol (e.g. Okta, Google for Work, OneLogin, Ping Identity, Microsoft Windows Active Directory), a customer can use Databricks SSO to integrate with it and sign in.

Databricks provides several ways to control access to both data and clusters inside of Databricks.

Role-based Access Controls (ACL)

CLUSTERS: IAM ROLES
An IAM role is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. IAM roles allow you to access your data from Databricks clusters without having to embed your AWS keys in notebooks.

CLUSTER ACL
There are two configurable types of permissions for Cluster Access Control:
- Individual Cluster Permissions: controls a user's ability to attach notebooks to a cluster, as well as to restart, resize, or terminate clusters.
- Cluster Creation Permissions: controls a user's ability to create clusters.

WORKSPACE ACL
Workspace ACLs provide control over who can view, edit, and run notebooks. You can assign five permission levels to notebooks and folders: No Permissions, Read (view cells, comment), Run (run commands, attach/detach notebooks), Edit (edit cells), and Manage (change permissions).

NOTEBOOK ACL
All notebooks within a folder inherit all permission settings of that folder. For example, if you give a user Run permission on a folder, that user will have Run permission on all notebooks in that folder.

LIBRARIES AND JOBS
All users can view libraries. To control who can attach libraries to clusters, use Cluster Access Control. A user can create jobs only from notebooks for which they have Read permission, and users can view a notebook job's run results only if they have Read permission on that notebook. If a user deletes a notebook, only admins can view its runs.
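As a hedged sketch of the IAM role pattern (not Databricks' exact provisioning flow), the snippet below creates a role in the customer account that cluster EC2 instances can assume to read a bucket without embedded keys; all names, ARNs, and IDs are hypothetical.

```python
# Hedged sketch: an IAM role that cluster EC2 instances can assume to read
# S3 data without embedded keys. All names and ARNs are hypothetical.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # EC2 instances (the cluster nodes) may assume this role through an
        # instance profile.
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-customer-bucket",
            "arn:aws:s3:::example-customer-bucket/*",
        ],
    }],
}

iam.create_role(
    RoleName="example-databricks-s3-read",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="example-databricks-s3-read",
    PolicyName="s3-read",
    PolicyDocument=json.dumps(s3_read_policy),
)
```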

Change Management & Secure Coding

Databricks has a formal change management process in place. All changes must be authorized, tested, approved, and documented. Databricks has implemented a secure development lifecycle (SDL) to ensure that security best practices are an integral part of development. The SDL covers formal design reviews by the security team, threat modeling, automated and manual peer code review, and penetration testing by a leading security firm. Additionally, all developers receive secure coding training as part of their onboarding.

Host

Databricks has formal host hardening and vulnerability management processes in place.

Hardening Standards

All hosts run the latest version of the Ubuntu operating system and are hardened according to Center for Internet Security (CIS) benchmarks. In summary, the hardening standards cover the following:
- Changing all vendor-supplied defaults and eliminating unnecessary default accounts.
- Enabling only necessary services, protocols, daemons, etc., as required for the function of the system.
- Implementing additional security features for any required services.
- Configuring system security parameters to prevent misuse.
- Removing all unnecessary functionality, such as scripts, drivers, features, subsystems, file systems, and unnecessary web servers.

Vulnerability Management

PATCHING & UPDATES
All hosts are patched periodically for security updates and critical patch fixes. All patches are authorized, tested, and approved in accordance with the Databricks change management process. Zero-day exploits are patched as soon as possible after testing.

SCANNING
All hosts are scanned periodically for vulnerabilities with Nessus. All security vulnerabilities are investigated by the security team and remediated according to the Databricks security incident remediation SLA:
- Critical: immediately
- High: within five days
- Medium: within 60 days
- Low: based on business requirements

Network Security

Network Isolation

Databricks is deployed in a customer AWS account. We recommend that a customer use a separate AWS account for deploying the Databricks service, because the IAM role required for running the service could theoretically affect other services within the account.

Spark Cluster Network Isolation

Spark deployments are firewalled by default and isolated from each other. Access to these clusters is limited to the Databricks frontend by default, but can also be opened up by adding an Elastic IP address (Databricks provides sample notebooks for performing this operation).

VPC Isolation of Customer's Service in Databricks Account

Databricks operates and maintains the web frontend and cluster management resources on behalf of the customer, but isolates those resources from other customer deployments by deploying them within a dedicated VPC. The VPC uses dynamic IPs in the range 10.54.0.0/16.

Security Groups & Network ACLs

A Databricks deployment utilizes multiple AWS security groups to control and protect ingress and egress traffic. External-facing resources, such as the Databricks web portal instance, use a security group that exposes port 443, allowing users to log in; the login to the web portal via port 443 is secured by SSL encryption. No other ports are exposed externally on the Databricks webapp instance. Other instances, such as the Databricks cluster manager instance and Spark workers, do not expose any external-facing ports; the AWS security groups attached to these instances allow only internal traffic between instances. In addition to security groups, a Databricks deployment utilizes network ACLs to control inbound and outbound traffic at the subnet level.

No Public IPs

The Databricks customer success team can enable a feature flag that removes public IP addresses from the worker instances, and can whitelist the IP addresses allowed to access the Databricks web portal.

Monitoring

All network activity is logged and monitored. Databricks leverages AWS VPC Flow Logs to capture information about the IP traffic going to and from network interfaces, as well as AWS CloudTrail logs to capture all API calls made in a Databricks AWS account. The log data is retained for a minimum of 365 days, and access to the logs is restricted to prevent tampering.
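A hedged sketch of the ingress posture described above: creating a security group for an externally facing web portal instance that admits only HTTPS on port 443. The VPC ID and names are hypothetical.

```python
# Hedged sketch: a security group for an externally facing web portal
# instance that exposes only port 443. VPC ID and names are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

sg = ec2.create_security_group(
    GroupName="example-webapp-sg",
    Description="Web portal: HTTPS only",
    VpcId="vpc-0123456789abcdef0",
)

# Allow inbound HTTPS (443) only; no other external ports are opened.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```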

Physical Security

Infrastructure

Databricks is hosted on AWS. AWS data centers are frequently audited and comply with a comprehensive set of frameworks, including ISO 27001, SOC 1, SOC 2, SOC 3, and PCI DSS. AWS physical data centers are located at undisclosed sites and enforce stringent physical access controls, including biometric access controls, twenty-four-hour armed guards, and video surveillance, to ensure that no unauthorized access is permitted.

Office

Databricks implements physical controls in its office, including badge readers, a staffed reception desk, visitor sign-in, and a clean desk policy.

Logging and Monitoring

Databricks provides comprehensive end-to-end audit logs of user activity on the platform, allowing enterprises to monitor detailed usage patterns of Databricks as the business requires. The audit logs cover Accounts, Notebooks, Clusters, DBFS, Genie, Jobs, SQL Permissions, Customer SSH Access, and Tables. Once audit logging is enabled for your account, Databricks automatically ships the audit logs, in human-readable format, to the customer-designated delivery location every 24 hours; the logs become available within 72 hours of activation. Databricks encrypts audit logs using Amazon S3 server-side encryption.
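As a hedged sketch of consuming these delivered logs (the delivery path and field names such as serviceName and actionName are assumptions for illustration), a customer might explore them with Spark:

```python
# Hedged sketch: exploring delivered audit logs with Spark in a Databricks
# notebook (`spark` is predefined there). The bucket path and the field
# names (serviceName, actionName, timestamp) are assumptions.
logs = spark.read.json("s3a://example-audit-log-bucket/audit-logs/")

# Count events per service to see which parts of the platform are in use.
logs.groupBy("serviceName").count().orderBy("count", ascending=False).show()

# Review activity on the Genie (support access) service, if any.
logs.filter(logs.serviceName == "genie") \
    .select("timestamp", "actionName") \
    .show(truncate=False)
```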

Policies & Procedures

Databricks has implemented a number of policies and procedures aimed at enforcing security best practices. The policy and procedure documents are accessible to all employees, reviewed and updated at least annually, and communicated to all employees upon hire and periodically thereafter. The suite of security policies includes the following:
- Data Classification: defines levels of data sensitivity (public, private, sensitive, confidential, secret) and describes acceptable methods for storage, access, and sharing.
- Access Management: describes procedures for provisioning and deprovisioning access, periodic access reviews, and password and MFA requirements.
- Acceptable Use: describes acceptable and unacceptable use as well as enforcement.
- Security Training: outlines types of security training per function (engineering vs. general), frequency, and delivery methods.
- Incident Response: describes the incident response process, responsibilities, and SLAs.
- Risk Management: describes the risk management methodology and frequency of assessment.
- Threat Modeling: describes the threat modeling methodology and tools.
- Performance Monitoring: defines system performance KPIs and describes the escalation process.
- Hardening Standards: describes system hardening standards and process.

Databricks has a dedicated security team focused on product security, corporate security, and security operations, as well as privacy, risk, and compliance.

Secure Your Enterprise Workload Today

Hundreds of organizations have deployed the Databricks virtual analytics platform to improve the productivity of their data teams, power their production Spark applications, and securely democratize data access. Databricks is available in Amazon Web Services globally, including the AWS GovCloud (US) region. Contact Databricks for a personalized demo, or register to try Databricks for free.