CIS 602-01: Computational Reproducibility
Containers
Dr. David Koop

Virtual Machines
Software Abstraction
- Behaves like hardware
- Encapsulates all OS and application state
Virtualization Layer
- Extra level of indirection
- Decouples hardware, OS
- Enforces isolation
- Multiplexes physical hardware across VMs
[via E. de Lara]

Virtualization Properties
Isolation
- Fault isolation
- Performance isolation
Encapsulation
- Cleanly capture all VM state
- Enables VM snapshots, clones
Portability
- Independent of physical hardware
- Enables migration of live, running VMs
Interposition
- Transformations on instructions, memory, I/O
- Enables transparent resource overcommitment, encryption, compression, replication

Types of Virtualization
Native/Bare metal (Type 1)
- Higher performance
- ESX, Xen, Hyper-V
Hosted (Type 2)
- Easier to install
- Leverages the host's device drivers
- VMware Workstation, Parallels
[http://itechthoughts.wordpress.com/tag/full-virtualization/ via E. de Lara]

Virtual Machine Uses
Software testing: test multiple configurations on one computer
Migration: if a server fails, move the virtual machine elsewhere
Cross-environment work: Windows on Linux
Enterprise support: upgrade via image
Education: concentrate on math/programming rather than installation
Custom prototypes: try-before-you-buy
[B. Howe, 2014]

Approaches to disseminating software
[Figure: approaches (raw code and data, extensive documentation, controlled environments, virtual machines) placed on low-to-high axes of effort required by the experimenter, effort required by those who only reproduce the experiments, and effort required by those who reuse and extend the results.]
[B. Howe, 2014]

Improving Reproducibility
Capturing more variables
Fewer constraints on research methods
On-Demand Backups
Virtual Machines as Citable Publications
Code, Data, Environment + Resources
Automatic Upgrades
Competitive, Elastic Pricing
Reproducibility for Complex Architectures
Unfettered Collaborative Experiments
Data-intensive Computing
Cost Sharing
A Foundation for Single-Payer Funding
Compatibility with Other Approaches
[B. Howe, 2014]

Remaining Challenges
Cost
Culture
Provenance
Reuse
[B. Howe, 2014]

Non-challenges
Security
Licensing
Vendor Lock-In and Long-Term Preservation
[B. Howe, 2014]

Ocean Appliance Example
Ship the entire machine instead of trying to configure an existing machine with all of the new software
Easier, cheaper, and safer to build the box in the lab and hand it out for free than to work with the ship's admin to get our software running
Modern analog: it is easier to build and distribute a virtual appliance than to support installation of your software
[B. Howe, 2014]

Virtualization and the Cloud
Virtualization = Code + Data + Environment
Cloud = Virtualization + Resources + Services
Cloud allows on-demand resources, centralized maintenance, and pricing driven by supply and demand
Computation is done near the data (many datasets cannot be moved around via FTP because of their size and transfer costs)
[B. Howe, 2014]

Project
Find some papers that you may be interested in reproducing
Do a survey of the material that is available for each paper:
- Code? Is the code under version control?
- Data? Is it clear how to process or understand the data? Is there metadata?
- Virtual machine or container? Does the hardware/software that deals with these still work?
- Provenance? Do we have a record of the steps taken in producing a result? How complete is it?

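One way to keep the survey answers comparable across papers is to record them in a small structure. The sketch below is only an illustration: the class name `PaperSurvey` and all of its field names are hypothetical and are not part of the project specification.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class PaperSurvey:
    """Hypothetical record of the reproducibility material found for one paper."""
    title: str
    has_code: bool = False
    code_under_version_control: bool = False
    has_data: bool = False
    data_has_metadata: bool = False
    has_vm_or_container: bool = False
    vm_or_container_still_runs: bool = False
    has_provenance: bool = False
    notes: list[str] = field(default_factory=list)  # free-form observations

    def missing(self) -> list[str]:
        """Return the names of the yes/no survey items answered 'no'."""
        return [name for name, value in asdict(self).items()
                if isinstance(value, bool) and not value]


# Example: a paper that ships code and data but nothing else.
survey = PaperSurvey(title="Example paper", has_code=True, has_data=True)
survey.notes.append("Provenance completeness still needs a manual check.")
print(survey.missing())
```

Questions such as "How complete is it?" do not reduce to yes/no, which is why the sketch keeps a free-form `notes` list next to the boolean fields.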
Project
If you are interested in a topic that aligns with reproducibility, please email me or talk to me about your ideas
- For example, if you are working on a research project that could incorporate reproducibility
Formal specification online: http://www.cis.umassd.edu/~dkoop/cis602/project.html
Due Monday, November 7

Introduction to Docker
Docker, Inc.

The Problem Matrix
[Docker, Inc., 2016]

The Solution: Containers
[Docker, Inc., 2016]

Containers vs. Virtual Machines
[Docker, Inc., 2016]

Containers vs. Virtual Machines
[D. Merkel, 2014]

Related: Package Management & Deployment
Examples:
- Anaconda for Python
- Gems for Ruby
- apt-get, yum, etc. for Linux distributions

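Package managers help reproducibility mainly by making the installed versions explicit. As a minimal sketch of that idea, assuming a Python environment, the standard-library `importlib.metadata` module can list what is installed so the versions can be recorded next to the results. The helper names `installed_packages` and `format_pins` are hypothetical.

```python
from importlib.metadata import distributions


def installed_packages() -> dict[str, str]:
    """Map each installed distribution name to its version string."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return packages


def format_pins(packages: dict[str, str]) -> list[str]:
    """Render 'name==version' lines, sorted case-insensitively by name."""
    ordered = sorted(packages.items(), key=lambda item: item[0].lower())
    return [f"{name}=={version}" for name, version in ordered]


if __name__ == "__main__":
    # Print a snapshot of the current environment, one pin per line.
    print("\n".join(format_pins(installed_packages())))
```

A listing like this records only Python packages. System libraries, the OS, and hardware remain uncaptured, which is the gap that virtual machines and containers aim to cover.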
Containers and Reproducibility
What are the benefits of containers over virtual machines with respect to reproducibility?
Do containers address all of the problems we are concerned with?
What issues remain?