WHITEPAPER
2015 BGI Online. All rights reserved.
Version: Draft v3, April 2015

Security Overview of the BGI Online Platform

Data security is an important aspect of computing in general. We put extra effort into data security on BGI Online for two reasons. First, BGI Online is a platform for genomic data, which concerns individuals (for example, patients) and deserves the highest possible level of privacy protection. Second, BGI Online is an online system hosted on the Amazon Web Services (AWS) cloud, where not only important genomic data but also the analytic pipelines of different organizations and users co-exist. As a result, BGI Online is designed and built to stringent security and privacy requirements. This paper highlights the security measures designed into the BGI Online platform to ensure both low-level security and permission control over the users of the platform.

Our Approach

The security design of BGI Online is divided into two levels: the infrastructure level and the business logic level. At the infrastructure level, the general security measures used by the cloud computing industry have been incorporated, including data encryption, authentication, API rate limiting, VPC protection, firewall protection and vulnerability protection. At the business logic level, the primary concern is how to support the collaborative nature of genomic cloud users while providing easy-to-manage yet well-protected business logic that ensures a secure workflow on the system. To this end, BGI Online has several tailor-made design concepts, including the de-identification of objects, fine-grained access control, and a sharing mechanism for files.
Note that while a few regulatory frameworks govern the security and privacy of genomic data (namely the US Health Insurance Portability and Accountability Act (HIPAA), the Clinical Laboratory Improvement Amendments (CLIA), and ISO/IEC 27001:2013), they do not provide concrete requirements, guidelines or regulations for handling genomic data on cloud platforms. Nevertheless, the principles and underlying spirit of these frameworks were observed and followed in the design and development of BGI Online.
Infrastructure Level Security

1. Encryption

BGI Online ensures that all data handled by the system is encrypted both in transit and at rest. For data in transit, no data connection in the system allows plain data to be transferred. For connections with an encryption option (e.g. HTTPS), we enforce the use of the encrypted option. For connections that do not come with a built-in encryption option, BGI Online has developed and uses an encrypted version of them.

BGI Online has four different types of at-rest storage:

1) The ephemeral storage used by the EC2 computation instances.
2) The Tier-1 cache, which comprises EC2 instances with multiple ephemeral disks.
3) AWS Simple Storage Service (S3).
4) AWS Glacier.

Figure 1 shows the data flow in BGI Online. BGI Online implements encryption for all of these types of at-rest storage. By default, data is uploaded to the Tier-1 cache, encrypted using the industry-standard AES-256 algorithm, and at the same time synchronized to an encrypted S3 bucket using S3 server-side encryption. During computation, all data and temporary disk volumes are encrypted using AES-256. Infrequently accessed data is removed from the Tier-1 cache, or further moved to Glacier for archival, where it is also encrypted with AES-256 on the server side. All data transfer is done through encrypted SSL/TLS channels.
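The rule that encrypted connections are always enforced can be illustrated with a minimal client-side sketch. It uses only Python's standard `ssl` module and is not taken from the BGI Online code base. The function name `make_strict_tls_context` and the choice of TLS 1.2 as the minimum version are assumptions made for illustration only.

```python
import ssl


def make_strict_tls_context() -> ssl.SSLContext:
    """Build a client TLS context that rejects unverified or outdated connections."""
    # create_default_context() turns on certificate and host name verification.
    context = ssl.create_default_context()
    # Assumption: TLS 1.2 is used as the floor; the paper does not name a version.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


strict_context = make_strict_tls_context()
```

A context built this way could be passed to `http.client.HTTPSConnection` or `urllib.request.urlopen`, so that a connection that cannot be encrypted and verified fails instead of silently falling back to plain text.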
(1) The user logs on to BGI-Online.
(2) BGI-Online creates a temporary access token.
(3) Using the token, data is uploaded to the Engine and de-identified. Keys to restore the data are stored in the Metadata database. De-identified data is stored in the Encrypted tier-1 cache and the S3 bucket synchronously.
(4) Once the user starts a computation, BGI-Online calculates the optimal execution plan. Final results are uploaded to the Encrypted tier-1 cache.
(5) Infrequently accessed data is removed from the Encrypted tier-1 cache,
(6) or further archived in the Encrypted Glacier Vault and removed from S3.

Figure 1 Data flow in BGI Online

When data is no longer used in a particular place (e.g. on a computing node), or when an authorized user decides to remove data from BGI Online, the data is wiped according to the U.S. Department of Energy M205.1-2 Standard to ensure that it is safely erased. The standard uses three wiping passes:

Pass 1-2: overwrite the data with pseudo-random values
Pass 3: overwrite the data with a zero-filled pattern

2. Authentication

Authentication to the AWS instances follows the best practices advocated by Amazon and requires a strong RSA key. This ensures that the infrastructure is well protected.

At the system level, BGI Online users authenticate on the platform with a user name and a secure password. A time-limited temporary token is generated upon successful authentication. This token is kept secure and can be used to access the system only for a short period of time, which limits the chance of an attacker gaining access to the system by brute-force trial and error.
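The idea of a time-limited temporary token can be sketched as follows. This is a minimal illustration using an HMAC-signed expiry time, not a description of how BGI Online actually builds its tokens. The token layout, the names `issue_token` and `verify_token`, the in-memory `SECRET_KEY` and the 15-minute lifetime are all hypothetical.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

SECRET_KEY = secrets.token_bytes(32)  # hypothetical server-side signing key
TOKEN_LIFETIME_SECONDS = 900          # assumption: the paper gives no lifetime


def _sign(payload: str) -> str:
    """Return the hex HMAC-SHA256 signature of the payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id: str, now: Optional[float] = None) -> str:
    """Issue 'user_id:expiry:signature' after a successful login."""
    issued_at = time.time() if now is None else now
    payload = f"{user_id}:{int(issued_at) + TOKEN_LIFETIME_SECONDS}"
    return f"{payload}:{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> bool:
    """Accept a token only if its signature is valid and it has not expired."""
    try:
        user_id, expiry_text, signature = token.rsplit(":", 2)
        expiry = int(expiry_text)
    except ValueError:
        return False  # malformed token
    expected = _sign(f"{user_id}:{expiry_text}")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current < expiry
```

Because every token carries its own expiry and signature, a stolen or guessed token is only useful within a short window, and a tampered expiry time invalidates the signature.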
3. API Rate Limit

BGI Online implements a rate limit for accessing the system. All access to the system, including front-end web page operations, goes through the API. Each user has a maximum rate at which he or she can call BGI Online's system API. This limits the ability of malicious users to tamper with the system using denial-of-service attacks.

4. VPC Protection

All Amazon Web Services (AWS) EC2 instances used by BGI Online run within an Amazon Virtual Private Cloud (VPC). The VPC provides a logically isolated section of the AWS Cloud, in which BGI Online launches its AWS resources in a virtual network that it defines itself. Within the VPC, BGI Online uses its own IP address range, subnets, route tables and network gateways.

5. Firewall Protection

Amazon EC2 provides security groups for BGI Online computing resources. A security group acts as a virtual firewall that controls the network traffic flowing in and out of those resources. Every instance launched in BGI Online is associated with only the security groups it needs. Using the rules in a security group, fine-grained control can be applied to the traffic allowed to or from other instances.

6. Vulnerability Protection

BGI Online is developed using a number of third-party open source libraries and software. Like all software, these libraries may have vulnerabilities that are discovered over time. The BGI Online team conducts regular, proactive vulnerability assessments. Whenever potential risks are identified, remedies are applied immediately to keep the system well protected. In addition, AWS provides vulnerability checking for its users. The security guidelines and advice provided by AWS are actively followed to strengthen the vulnerability protection of the system.

Business Logic Level Security

1.
De-identification of objects

All entities in the BGI Online system are represented by a UUID, a 128-bit value that guarantees practical uniqueness. Holding a UUID does not reveal any details of the entity it identifies. For example, getting hold of a file's UUID does not give the holder any information about the file, such as its name, metadata, owner, creation date or the project it belongs to. Likewise, getting hold of a project's UUID does not give the holder any information about what the project is about. Although the number of possible UUID values is finite, BGI Online uses only an extremely sparse subset of them. It is therefore practically impossible for users to obtain information about other users or other projects in the system by guessing UUIDs or by deducing information from the UUIDs they hold.
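The sparseness argument above can be made concrete with a short sketch. The paper does not state which UUID version BGI Online uses. The example below assumes random (version 4) UUIDs, which carry 122 random bits, and the entity count of one billion is a made-up figure for illustration. The function names are hypothetical.

```python
import uuid

RANDOM_BITS_IN_UUID4 = 122  # 128 bits minus 6 fixed version and variant bits


def new_entity_id() -> uuid.UUID:
    """Generate a random identifier that embeds no name, owner or timestamp."""
    return uuid.uuid4()  # assumption: version 4 UUIDs


def guess_hit_probability(entity_count: int) -> float:
    """Chance that one uniformly random guess matches any existing entity."""
    return entity_count / 2 ** RANDOM_BITS_IN_UUID4


file_id = new_entity_id()
project_id = new_entity_id()
probability = guess_hit_probability(10 ** 9)  # hypothetical one billion entities
```

Under these assumptions, even with a billion entities in the system, a single random guess succeeds with a probability on the order of 10^-28, which is why guessing identifiers is not a practical attack.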
2. Fine-Grained Access Control

Access controls on the BGI Online Platform are very fine-grained. Six permission types (Admin, Upload, View, Modify, Run and Share) are set on a per-user, per-project basis. Files are grouped under projects, and each project member can have his or her own permissions on the files. As a result, different members can be assigned different privileges that allow them to access just enough information for their work. This also applies to the sharing of data, which can only be performed within the platform itself unless a user has the Share permission to download a file.

3. Sharing of Files on BGI Online Platform

A file can be shared by link only, which prevents additional copies from being made. Access to a link-only shared file can be revoked immediately by unsharing or deleting the file. Shared files fall into two kinds: publicly shared files, such as 1000 Genomes Project data, and privately shared files, for which there should be exactly one recipient. To account for both, BGI-Online implements two sharing methods:

Public: shared files can be viewed, linked or copied (if allowed) by all projects.
Private (hand-shaking): the sharer shares a file to a Project ID (Recipient) provided by the recipient. The recipient needs to enter the Project ID (Sharer) that owns the shared file to link or copy the shared file.
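The private (hand-shaking) method can be sketched as a check that both sides name each other. This is a minimal in-memory illustration under stated assumptions, not the platform's implementation. The `ShareRegistry` class, its method names and the use of plain strings as identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ShareRegistry:
    """Hypothetical in-memory record of private (hand-shaking) shares."""

    # Maps file ID -> (sharer project ID, recipient project ID).
    _offers: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def share_privately(self, file_id: str, sharer_project: str,
                        recipient_project: str) -> None:
        """Sharer side: offer the file to exactly one recipient project."""
        self._offers[file_id] = (sharer_project, recipient_project)

    def unshare(self, file_id: str) -> None:
        """Revoke the offer; later link attempts are refused."""
        self._offers.pop(file_id, None)

    def can_link(self, file_id: str, claimed_sharer_project: str,
                 requesting_project: str) -> bool:
        """Recipient side: succeed only if both project IDs match the offer."""
        return self._offers.get(file_id) == (claimed_sharer_project,
                                             requesting_project)


registry = ShareRegistry()
registry.share_privately("file-1", "project-A", "project-B")
allowed = registry.can_link("file-1", "project-A", "project-B")
```

In this sketch the link succeeds only when the sharer has named the recipient project and the recipient has named the sharer project, so knowing just one of the two identifiers is not enough to reach the file.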