DMTN-003: Description of v1.0 of the Alert Production Simulator
Release 1.0

Stephen Pietrowicz
Contents

1 Systems
2 Software packages
3 Workflow
  3.1 Status
4 Components
  4.1 OCS
  4.2 Base DMCS
  4.3 Replicator Central Manager
  4.4 Replicator Execution Node
  4.5 Replicator Node Process
  4.6 Replicator Job
  4.7 Distributors
  4.8 Archive DMCS
  4.9 Worker Master Node
  4.10 Worker Execution Node
  4.11 Worker Job
This document describes the current state of the Alert Production Simulator, which was written to test the workflow described in LDM-230, revision 1.2, dated Oct 10. The system consists of twenty-five virtual machines simulating over two hundred machines, reproducing the workflow required for one thread of processing. Please note that image archiving, science image processing, the catch-up archiver, and the EFD replicator are not implemented in this simulator.
CHAPTER 1 Systems

The AP simulator runs on VMware virtual machines in the following configuration:

lsst-base - main Base DMCS system
lsst-base2 - failover Base DMCS system
lsst-rep - replicator HTCondor central manager
lsst-rep1 - HTCondor replicator job execution node, running 11 slots
lsst-rep2 - HTCondor replicator job execution node, running 11 slots
lsst-dist - distributor node, running 22 distributor node processes
lsst-work - worker HTCondor central manager
lsst-run1 through lsst-run14 (14 virtual machines) - each configured with 13 slots
lsst-run15 - configured with 7 slots
lsst-archive - main Archive DMCS system
lsst-archive2 - failover Archive DMCS system
lsst-ocs - LSST OCS system - sends simulated DDS events (startintegration, nextvisit, and startreadout); also runs a simulated file server for camera readout
CHAPTER 2 Software packages

The software packages:

ctrl_ap
htcondor (this is not currently integrated in the LSST stack, but contains the Python software installed on the systems lsst-base and lsst-base2)

must be set up on each of the following machines: lsst-base, lsst-base2, lsst-archive, lsst-archive2, lsst-rep, lsst-dist, lsst-work, and lsst-ocs.

After setup, execute the following commands on the following machines:

lsst-base - basedmcs.py - starts the base DMCS master (via config)
lsst-base2 - basedmcs.py - starts the base DMCS failover (via config)
lsst-archive - archivedmcs.py - starts the Archive DMCS master (via config)
lsst-archive2 - archivedmcs.py - starts the Archive DMCS failover (via config)
lsst-rep - launchreplicators.sh - starts the replicator node processes on lsst-rep1 and lsst-rep2 (22 processes, 11 on each system)
lsst-dist - launchdistributors.sh - starts the distributor node processes on lsst-dist (22 processes)
lsst-ocs - ocsfilenode.py - starts a server that delivers files to replicator jobs that request them

There are two HTCondor pools. The Replicator pool is set up with lsst-rep as the central manager, and lsst-rep1 and lsst-rep2 containing 11 HTCondor slots each. The Worker pool is set up with lsst-work as the central manager, and lsst-run1 through lsst-run15 running 189 HTCondor slots in total.
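The slot totals above can be checked with a short sketch. The host names and per-host slot counts are taken directly from the text; the dictionaries are purely illustrative and do not correspond to any real configuration file.

```python
# Slot layout as described in the text: two replicator execution nodes with
# 11 slots each, and fifteen worker nodes (14 x 13 slots plus one with 7).
replicator_pool = {"lsst-rep1": 11, "lsst-rep2": 11}

worker_pool = {f"lsst-run{i}": 13 for i in range(1, 15)}  # lsst-run1..lsst-run14
worker_pool["lsst-run15"] = 7

print(sum(replicator_pool.values()))  # 22 replicator slots, one per raft
print(sum(worker_pool.values()))      # 189 worker slots, one per science CCD
```

The totals match the 22 rafts (including wavefront) and 189 science CCDs handled by one thread of processing.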
CHAPTER 3 Workflow

After all the services are configured, an external command is used to send commands as if they had been sent from the OCS, and images are pulled from a simulated CDS service through the entire workflow to the worker jobs.

Events are sent from the OCS system to the Base site and trigger specific operations. There are three event types that the OCS system sends: nextvisit, startintegration, and startreadout. The nextvisit event starts worker jobs on the HTCondor Worker pool. The startintegration event starts replicator jobs on the HTCondor Replicator pool. The startreadout event triggers the replicator jobs to make a CDS library call to read rafts. Replicator jobs send each raft to its paired distributor. Each distributor breaks up the raft into CCDs. Worker jobs rendezvous with the distributors and process each CCD individually.

This entire process is described in greater detail below:

1. A nextvisit event is sent from the OCS system, lsst-ocs.
2. The message is received by the Base DMCS system, lsst-base.
3. The Base DMCS system submits 189 worker jobs (with the information about the CCD image they're looking for) to the worker central manager, lsst-work, which controls the worker pool.
4. The worker jobs begin to execute on slots available on the HTCondor worker cluster.
5. The worker jobs subscribe to the OCS startreadout event.
6. The worker jobs contact the Archive DMCS to ask which distributor has the image they're looking for.
7. A startintegration event is sent from the OCS system.
8. The Base DMCS system submits 21 replicator jobs (containing the visit id and exposure sequence number of the raft they're looking for) to the replicator central manager, lsst-rep, which controls the replicator pool.
9. The replicator jobs begin to execute on the HTCondor replicator cluster.
10. The replicator jobs send the visit id, exposure sequence number, and the raft to the replicator node process, which passes it to the paired distributor.
11. The distributor sends a message to the Archive DMCS telling it which CCDs it will be handling for this raft.
12. The workers waiting for the distributor information from the Archive DMCS are told which distributors to rendezvous with when that information arrives.
13. The workers connect to the distributor with their image information, and block until the image is received.
14. A startreadout event is sent from the OCS system.
15. The replicator jobs receive the startreadout event from the OCS system.
16. The replicator jobs contact the OCS system to read the cross-talk corrected image.
17. The replicator jobs send the image data to the distributor node processes and exit.
18. The distributor node processes break up the raft images into CCD images.
19. The distributors send the images to the worker jobs.
20. Steps 5 through 14 are repeated for a second exposure, and the workflow then moves on to Step 21.
21. The worker jobs process the images they receive and exit.

3.1 Status

Components transmit their activity as status via ctrl_events messages. Each status message includes the source of the message, the activity type, and an optional free-form message, in addition to other standard event data (publish time, host, etc.). Each component in the system uses a subset of activity types. When any of these activities occurs, a status message is sent. Other programs have been written to capture this status information, and we've used it to drive external animations of the overall AP simulator activity. An example of this can be seen here.
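A status message like the ones described above can be sketched as a plain dictionary. The field names follow the list in the text (source, activity type, optional free-form message, publish time, host), but the actual ctrl_events payload layout is not specified in this document, so treat this as an illustrative stand-in rather than the real API.

```python
import socket
import time

def make_status_event(source, activity, message=""):
    """Build a status payload of the kind components publish via ctrl_events.

    Field names are taken from the description in section 3.1; the real
    ctrl_events message structure may differ.
    """
    return {
        "source": source,        # e.g. "basedmcs" or "distributor-3"
        "activity": activity,    # e.g. "job_submitted", "image_received"
        "message": message,      # optional free-form text
        "publish_time": time.time(),
        "host": socket.gethostname(),
    }

event = make_status_event("basedmcs", "job_submitted", "189 worker jobs")
print(sorted(event))
```

A monitoring program subscribing to these events only needs the source and activity fields to drive an animation; the free-form message is for humans.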
CHAPTER 4 Components

4.1 OCS

The OCS system, lsst-ocs, is used to run a simulated CDS server that delivers images to Replicator Jobs. The OCS events (nextvisit, startintegration, startreadout) are sent from this system.

We use the commands ocstransmitter.py and automate.py to send OCS events to the Base DMCS. The ocstransmitter.py command sends the specified event for each command invocation. The automate.py command can send groups of events at a specific cadence. We use the ocsfile.py server process as the simulated CDS server that delivers images.

Notes: We created the ocsfile.py server process as a substitute for the CDS library call that will be made by the replicator jobs to retrieve images from the CDS. We do not know what mechanisms will be used to serve or deliver the images, other than that it is through a library call. The OCS commands are sent via the ctrl_events ActiveMQ message system, not through the DDS system.

4.2 Base DMCS

The Base DMCS receives messages from the OCS and controls the job submission to the replicator cluster and the worker cluster.

Two Base DMCS processes are started, one on the primary system and one on a secondary system. These can be started in any order. On startup, the processes contact each other to negotiate the role each has, either main or failover. Both processes subscribe to and receive messages from the simulated OCS, but only the process currently designated as main acts on the messages. A heartbeat thread is maintained by both processes. If the failover process detects that the main process is no longer alive, its role switches to main. When the process that had been designated as main returns, its role is reassigned to failover.

The following describes the actions of the Base DMCS in the main role. OCS messages are implemented as DM software events, since the OCS DDS library was not available when the simulator was written. The Base DMCS process subscribes to one topic, and receives all events on that topic.
It responds only to the startintegration and nextvisit events, and submits jobs to the appropriate HTCondor pool. These jobs are submitted via the Python API that the HTCondor software provides.

On receipt of the nextvisit event, the Base DMCS submits 189 Worker jobs and 4 Wavefront jobs to the Worker central manager. Each job is given the visit id, number of exposures to be taken, boresight pointing, filter id, and CCD id. The jobs are placeholders for the real scientific code, and are described below.

On receipt of the startintegration event, the Base DMCS submits 22 replicator jobs, one for each raft, and a single job for the wavefront sensors. The replicator jobs are described below.

Notes: Both the replicator jobs and worker jobs are submitted to their respective central managers, and have no mechanism for monitoring their progress. In data challenges, work was submitted via HTCondor's DAGMan, which provided a mechanism for automatically resubmitting failed jobs a configured number of times. Furthermore, it provided a resubmission DAG for jobs that still failed after that set number of times. HTCondor itself does not resubmit failed jobs automatically; it will, however, resubmit a job if the HTCondor job slot in which it was running has an error of some kind. None of this is the desired behavior. We need to monitor job progress, success, and failure. Given the time constraints, resubmitting a job on a software or hardware failure is not feasible. Using DAGMan to submit files would require us to keep track of the resubmit files, which seems like overkill. We need a mechanism that logs success and the reason for failure so that we can take appropriate action.

4.3 Replicator Central Manager

The HTCondor central manager for the Replicator pool is configured on lsst-rep. This system acts as the HTCondor central manager for two VMs, lsst-rep1 and lsst-rep2, which are configured with 2 CPUs and 4 GB of memory each. HTCondor is configured on lsst-rep1 and lsst-rep2 to have 11 job slots. This central manager node accepts job submissions from lsst-base, and runs those jobs on lsst-rep1 and lsst-rep2.

4.4 Replicator Execution Node

A replicator node is part of the HTCondor pool, which is controlled by the central manager. Each node accepts Replicator Jobs scheduled by the central manager. There are 22 replicator nodes, one for each raft (including wavefront).

Notes: Due to capacity limitations at the time the simulator was written, this was simulated across 2 VMs. Each VM was configured with 2 CPUs and 4 GB of memory. HTCondor will ordinarily make the number of job slots equal to the number of CPUs, but we overrode this to configure 11 slots for each VM.

We noted varying startup times from the time a job was submitted to the central manager. The pool was configured to retain ownership of the slot, which increased the speed at which jobs were matched to slots.
In general, the startup was very quick, but there were times when we noted startup times of between 15 and 30 seconds. It was difficult to determine exactly why this occurred, but given the limited capacity of the VMs themselves, we believe this is a contributing factor.

4.5 Replicator Node Process

The Replicator Node Process receives messages and data from the replicator job, and transmits this information to its paired Distributor Node process.

Twenty-two replicator node processes are started, eleven each on lsst-rep1 and lsst-rep2. The processes are started with a paired distributor address and port to connect to. The distributor node process does not have to be running at the time the replicator node process starts, because the replicator node process will continue to attempt a connection until it is successful. Once successful, a heartbeat thread is started which sends messages to the paired distributor node heartbeat thread at regular intervals in order to monitor the health of the connection. If this connection fails, the process attempts to connect to its paired distributor until it succeeds.

A network server port is also opened for connections from a Replicator Job process. When the Replicator Job process connects, it sends the Replicator Node process the visit id, exposure id, and raft id it was assigned. The Replicator Node process sends this information to the paired distributor. The Replicator Node process then waits to receive the crosstalk-corrected image location from the replicator job process. When this is received, the file is transferred to the paired distributor. The connection to the Replicator Job process is closed (since the job exits at this point), and the cycle repeats.

Notes: The replicator node process only handles transmitting to the paired distributor. It needs to handle the case where the connection to the distributor is down or interrupted.
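The connect-until-successful behavior of the replicator node process can be sketched as follows. The socket details are omitted; `fake_connect` is a hypothetical stand-in for the real connection attempt to the paired distributor.

```python
import time

def connect_with_retry(connect, retry_delay=1.0, max_attempts=None):
    """Keep attempting to connect to the paired distributor until successful,
    as the replicator node process does at startup and after a dropped
    connection.  `connect` is any callable that returns a connection object
    or raises ConnectionError."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect(), attempts
        except ConnectionError:
            if max_attempts is not None and attempts >= max_attempts:
                raise
            time.sleep(retry_delay)

# Simulate a distributor that is down for the first two attempts.
state = {"failures_left": 2}
def fake_connect():
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise ConnectionError("distributor not up yet")
    return "connection"

conn, attempts = connect_with_retry(fake_connect, retry_delay=0.0)
print(conn, attempts)  # connection 3
```

In the real process a heartbeat thread would then run over the established connection, and a failed heartbeat would re-enter this retry loop.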
The replicator node process starts knowing which distributor process to connect to via a command line argument. We should look into using a configuration manager which the replicator node process could contact to retrieve the host/port of the distributor it should connect to. This would make the system more robust if nodes go down and need to be replaced.

4.6 Replicator Job

Replicator Jobs are submitted by the Base DMCS process to be executed on the Replicator Node HTCondor pool. Each job is given a raft id, visit id, and exposure sequence id. This information is transmitted to the Replicator Node Process on the same host the job is running on. The replicator job then makes a method call to retrieve that particular raft. The replicator job retrieves the raft, sends a message with the location of the data to the Replicator Node Process, and exits.

Notes: The process of starting a new replicator job for every exposure seems to be quite a bit of overkill just to transfer one image to the node, and then on to the distributor. One of the issues that we've been trying to mitigate is the startup time for the job. Generally this is pretty quick, but we've seen some latency in the start of the process when submitted through HTCondor. I don't think it makes sense to have a process be started and stopped through HTCondor, and to depend on it starting at such a rapid pace. We should explore having Replicator Node Processes transfer the files and use Redis to do the raft assignments, eliminating the Replicator Jobs completely.

4.7 Distributors

The machine lsst-dist runs twenty-two distributor processes. These processes receive images from their paired replicators and split the images into nine CCDs. Each CCD is later retrieved by Worker Jobs running on nodes in the HTCondor Worker pool.

The distributor is started with a parameter specifying which port to use as its incoming port. Before the distributor sends messages, it sets up several things.
First, it starts a thread for incoming Archive DMCS requests. Such a request triggers a dump of all CCD identification data to the Archive DMCS so the Archive DMCS can replenish its cache for worker requests. This is done in case the Archive DMCS goes offline and loses its cached information. Next, the Distributor sends its own identification information to the Archive DMCS. When the Archive DMCS receives this information, it removes all previous information the Distributor sent it. The Distributor does not maintain a cache of its information.

At this point, the distributor can receive connections from worker jobs and its paired replicator. Once the replicator contacts the distributor, the network connection is maintained throughout its lifetime. If the connection is dropped for some reason, the distributor goes back and waits for the replicator to reconnect.

At this point, messages from the Replicator Node Process can be received, generally in pairs. The first type of message serves as a notice to the Distributor with information about the raft it is about to receive. The Distributor can at this point send that information to the Archive DMCS. (The Archive DMCS informs any workers waiting for Distributor information for a particular CCD of the Distributor's location.) The second type of message that can be received from the Replicator Node Process is the data itself. Header information in the transmission describes the raft data being sent, and a length for the data payload. The data is read by the Distributor, and split into 9 CCDs.

Workers contact the Distributors, and request that a CCD be transmitted. If the CCD is not yet available, the worker blocks until it is received by the distributor. The waiting workers get the image once the CCD is received. Once the worker receives the image, the connection to the distributor is closed.

Notes: The distributor/replicator pairing maintains a continuous connection until one side is brought down, or an error is detected.
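The raft-to-CCD split performed by the distributor might look like the sketch below. The even byte-slicing is a stand-in for real image geometry, and the raft/CCD identifiers are hypothetical; only the shape of the operation (one raft payload in, nine keyed CCD chunks out) comes from the text.

```python
def split_raft(raft_id, raft_data, ccds_per_raft=9):
    """Split a raft image payload into per-CCD chunks, as the distributor
    does on receipt of data from its paired replicator node process.

    Returns a mapping from (raft_id, ccd_index) to the CCD's byte slice.
    """
    n = len(raft_data) // ccds_per_raft
    return {
        (raft_id, ccd): raft_data[ccd * n:(ccd + 1) * n]
        for ccd in range(ccds_per_raft)
    }

ccds = split_raft("R22", b"x" * 9000)
print(len(ccds))               # 9
print(len(ccds[("R22", 0)]))   # 1000
```

Each keyed chunk is what a worker job would later request from the distributor by CCD id.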
The location of the distributor is specified on invocation; it might be better to have something like Redis keep this information. It might also be good to keep distributor/CCD location information in Redis, eliminating the pairing software that currently exists in the Archive DMCS.

4.8 Archive DMCS

The Archive DMCS process is a rendezvous point between Worker Jobs and Distributors.

On startup, the Archive DMCS sends a message to all distributors asking for the image metadata that they currently hold. The Archive DMCS process opens a port on which Worker Jobs can contact it, and subscribes to an event topic on which it can get advisory messages from the Distributors.

Distributors send two types of messages. The first notifies the Archive DMCS that the Distributor is starting. This clears all entries in the Archive DMCS for that Distributor, since the Distributors do not have an image cache and this may be a completely new Distributor with no previous knowledge of images. The second is an information event that tells the Archive DMCS which images it has available.

On acceptance of a connection from a worker, a thread is spawned and a lookup for the requested Distributor is performed. If the Distributor is not found, the worker thread blocks until that data arrives, or until a TTL times out. When the Distributor receives information from its paired Replicator about the raft image it will handle, events are sent to the Archive DMCS containing the information for all the CCDs in the raft. When the Archive DMCS receives this information from the Distributor, it adds those entries to its cache, and notifies the waiting workers that new data is available. If a Worker Job receives the information it was looking for, it disconnects from the Archive DMCS and contacts the Distributor, which will send the job the CCD image it requested.

There is a passive Archive DMCS that shadows the active Archive DMCS, and will respond to requests if the main Archive DMCS fails.
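The blocking lookup with a TTL described above can be sketched with a condition variable. The CCD identifier and distributor address below are hypothetical; the pattern (register on distributor events, block workers until the entry appears or the TTL expires) is what the text describes.

```python
import threading

class DistributorRegistry:
    """Sketch of the Archive DMCS rendezvous cache: workers block until the
    distributor holding their CCD is registered, or until a TTL expires."""

    def __init__(self):
        self._cache = {}
        self._cond = threading.Condition()

    def register(self, ccd_id, distributor_addr):
        # Called when a distributor announces which CCDs it will handle.
        with self._cond:
            self._cache[ccd_id] = distributor_addr
            self._cond.notify_all()

    def lookup(self, ccd_id, ttl=5.0):
        # Called from a worker's thread; blocks until the entry arrives or
        # the TTL times out, returning None on timeout.
        with self._cond:
            self._cond.wait_for(lambda: ccd_id in self._cache, timeout=ttl)
            return self._cache.get(ccd_id)

registry = DistributorRegistry()
# A distributor announcement arrives shortly after the worker starts waiting.
threading.Timer(0.1, registry.register, ("R22:S11", "lsst-dist:8022")).start()
print(registry.lookup("R22:S11", ttl=2.0))  # lsst-dist:8022
```

Clearing all of a distributor's entries on its startup message would be a simple extra method that drops matching cache keys under the same lock.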
Both Worker Jobs and Distributors are configured with the active and failover host/port and can respond to a failed connection (or unreachable host) appropriately.

Notes: There are a number of issues which were solved in creating the Archive DMCS, including timeouts of TTL counters for threads, cleanup on workers that disconnected while waiting, and caching of information that multiple threads were requesting. Some issues, such as expiring data deemed out of date (since no worker would request that old data from a distributor after a certain amount of time), were not addressed.

While a duplicate Archive DMCS was built and can act as a failover shadow, this implementation does not seem ideal because of the way the workers and distributors need to be configured for failover. Additionally, a better mechanism than having Worker Job threads connect and wait for data from the Archive DMCS seems possible via DM Event Services.

Since this portion of the simulator was written, we've found that there are some open source packages, such as Redis and ZooKeeper, that seem to duplicate the functionality we've written and address the issues listed above. Some small bit of code may have to be written to clear the data cache on Distributor startup. Both packages have Python interfaces. These are in widespread use in other projects and companies. This seems to be worth investigating.

4.9 Worker Master Node

The HTCondor central manager for the Worker pool is configured on lsst-work. This system acts as the central manager for fifteen VMs, lsst-run1 through lsst-run15, which are configured with 2 CPUs and 4 GB of memory each. HTCondor is configured on lsst-run1 through lsst-run14 to have 13 job slots each, and lsst-run15 is configured with 7 slots. This central manager accepts job submissions from lsst-base, and runs those jobs on lsst-run1 through lsst-run15.
4.10 Worker Execution Node

A worker node is part of the HTCondor pool, which is controlled by the central manager. Each node accepts Worker Jobs scheduled by the central manager. There are 189 worker nodes, one for each CCD.

Notes: Due to capacity limitations at the time the simulator was written, this was all simulated across 15 VMs. Each VM was configured with 2 CPUs and 4 GB of memory. HTCondor will ordinarily make the number of job slots equal to the number of CPUs, but we overrode this to configure 13 slots for each of the first 14 VMs, and 7 for the last one.

As with the replicator jobs, we noted varying startup times from the time a job was submitted to the central manager. The pool was configured to retain ownership of the slot, which increased the speed at which jobs were matched to slots. In general, the startup was very quick, but there were times when we noted startup times of between 15 and 30 seconds. It was difficult to determine exactly why this occurred, but given the limited capacity of the VMs themselves, we believe this is a contributing factor.

4.11 Worker Job

The Worker Job starts with the CCD id, visit id, raft id, boresight, filter id, and the number of exposures to consume. A job termination thread is started; if the timer in the thread expires, the job is terminated. The job contacts the Archive DMCS with the CCD information and blocks until it receives the distributor location for that CCD. Once the worker retrieves the distributor location, it contacts that distributor and asks for that CCD. If the distributor has no information about the CCD, the worker returns to the Archive DMCS and requests the information again. Once all exposures for that CCD are retrieved, the worker job sleeps for a short time to simulate processing. When this is completed, the worker job exits.
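The worker job's termination guard can be sketched as below. `fetch_exposures` is a hypothetical stand-in for the rendezvous with the Archive DMCS and the distributor; the point of the sketch is the timer that bounds how long the job will wait.

```python
import threading

def run_worker_job(fetch_exposures, timeout):
    """Sketch of the worker job's termination thread: if the exposures do
    not arrive before `timeout` seconds, the job gives up, as described in
    section 4.11.  `fetch_exposures` stands in for the blocking retrieval
    of CCD images from the distributor."""
    done = threading.Event()
    result = {}

    def work():
        result["images"] = fetch_exposures()
        done.set()

    threading.Thread(target=work, daemon=True).start()
    if done.wait(timeout):
        return "processed %d exposures" % len(result["images"])
    return "terminated: timer expired"

print(run_worker_job(lambda: ["exp1", "exp2"], timeout=2.0))
```

A real implementation would also need to clean up the open connection to the distributor or Archive DMCS when the timer fires, which is one of the cleanup issues noted in section 4.8.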
More informationLecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci, Charles Zheng.
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci,
More informationChapter 32 VSRP Commands
Chapter 32 VSRP Commands activate Activates a VSRP VRID. NOTE: This command is equivalent to the enable command. ProCurveRS(config)# vlan 200 ProCurveRS(config-vlan-200)# tag ethernet 1/1 to 1/8 ProCurveRS(config-vlan-200)#
More informationBluetooth Serial Port Adapter Optimization
Tomas Henriksson 2008-01-15 cbproduct-0701-03 (7) 1 (15) Bluetooth Serial Port Adapter Optimization For the third version connectblue serial port adapter products, there are some additional AT commands
More informationInternal Server Architectures
Chapter3 Page 29 Friday, January 26, 2001 2:41 PM Chapter CHAPTER 3 Internal Server Architectures Often, it is important to understand how software works internally in order to fully understand why it
More informationRecall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers
Replicated s, RAFT COS 8: Distributed Systems Lecture 8 Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable service Goal #: Servers should behave just like
More informationCloud Computing. Up until now
Cloud Computing Lecture 4 and 5 Grid: 2012-2013 Introduction. Up until now Definition of Cloud Computing. Grid Computing: Schedulers: Condor SGE 1 Summary Core Grid: Toolkit Condor-G Grid: Conceptual Architecture
More informationBOSCO Architecture. Derek Weitzel University of Nebraska Lincoln
BOSCO Architecture Derek Weitzel University of Nebraska Lincoln Goals We want an easy to use method for users to do computational research It should be easy to install, use, and maintain It should be simple
More informationIPv6 PIM. Based on the forwarding mechanism, IPv6 PIM falls into two modes:
Overview Protocol Independent Multicast for IPv6 () provides IPv6 multicast forwarding by leveraging static routes or IPv6 unicast routing tables generated by any IPv6 unicast routing protocol, such as
More informationProcess Description and Control. Chapter 3
Process Description and Control 1 Chapter 3 2 Processes Working definition: An instance of a program Processes are among the most important abstractions in an OS all the running software on a computer,
More informationGFS-python: A Simplified GFS Implementation in Python
GFS-python: A Simplified GFS Implementation in Python Andy Strohman ABSTRACT GFS-python is distributed network filesystem written entirely in python. There are no dependencies other than Python s standard
More informationOptimal Algorithm. Replace page that will not be used for longest period of time Used for measuring how well your algorithm performs
Optimal Algorithm Replace page that will not be used for longest period of time Used for measuring how well your algorithm performs page 1 Least Recently Used (LRU) Algorithm Reference string: 1, 2, 3,
More informationBatches and Commands. Overview CHAPTER
CHAPTER 4 This chapter provides an overview of batches and the commands contained in the batch. This chapter has the following sections: Overview, page 4-1 Batch Rules, page 4-2 Identifying a Batch, page
More informationFollowing are a few basic questions that cover the essentials of OS:
Operating Systems Following are a few basic questions that cover the essentials of OS: 1. Explain the concept of Reentrancy. It is a useful, memory-saving technique for multiprogrammed timesharing systems.
More informationUniversity of Waterloo. CS251 Final Examination. Spring 2008
University of Waterloo CS251 Final Examination Spring 2008 Student Name: Student ID Number: Unix Userid: Course Abbreviation: CS452 Course Title: Real-time Programming Time and Date of Examination: 13.00
More informationEvaluation of Long-Held HTTP Polling for PHP/MySQL Architecture
Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture David Cutting University of East Anglia Purplepixie Systems David.Cutting@uea.ac.uk dcutting@purplepixie.org Abstract. When a web client
More informationCreate High Performance, Massively Scalable Messaging Solutions with Apache ActiveBlaze
Create High Performance, Massively Scalable Messaging Solutions with Apache ActiveBlaze Rob Davies Director of Open Source Product Development, Progress: FuseSource - http://fusesource.com/ Rob Davies
More informationStarting the Avalanche:
Starting the Avalanche: Application DoS In Microservice Architectures Scott Behrens Jeremy Heffner Introductions Scott Behrens Netflix senior application security engineer Breaking and building for 8+
More informationScaling DreamFactory
Scaling DreamFactory This white paper is designed to provide information to enterprise customers about how to scale a DreamFactory Instance. The sections below talk about horizontal, vertical, and cloud
More informationExtend PB for high availability. PB high availability via 2PC. Recall: Primary-Backup. Putting it all together for SMR:
Putting it all together for SMR: Two-Phase Commit, Leader Election RAFT COS 8: Distributed Systems Lecture Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable
More informationBroker Clusters. Cluster Models
4 CHAPTER 4 Broker Clusters Cluster Models Message Queue supports the use of broker clusters: groups of brokers working together to provide message delivery services to clients. Clusters enable a Message
More informationSnapManager 7.2 for Microsoft Exchange Server Administration Guide
SnapManager 7.2 for Microsoft Exchange Server Administration Guide June 2017 215-10626_B0 doccomments@netapp.com Table of Contents 3 Contents Product overview... 8 Backing up and verifying your databases...
More informationECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017
ECE 550D Fundamentals of Computer Systems and Engineering Fall 2017 The Operating System (OS) Prof. John Board Duke University Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke)
More informationP2 Recitation. Raft: A Consensus Algorithm for Replicated Logs
P2 Recitation Raft: A Consensus Algorithm for Replicated Logs Presented by Zeleena Kearney and Tushar Agarwal Diego Ongaro and John Ousterhout Stanford University Presentation adapted from the original
More informationWHY BUILDING SECURITY SYSTEMS NEED CONTINUOUS AVAILABILITY
WHY BUILDING SECURITY SYSTEMS NEED CONTINUOUS AVAILABILITY White Paper 2 Why Building Security Systems Need Continuous Availability Always On Is the Only Option. If All Systems Go Down, How Can You React
More informationRunning the Setup Web UI
CHAPTER 2 The Cisco Cisco Network Registrar setup interview in the web user interface (UI) takes you through a series of consecutive pages to set up a basic configuration. For an introduction, configuration
More informationCS 537: Introduction to Operating Systems Fall 2015: Midterm Exam #4 Tuesday, December 15 th 11:00 12:15. Advanced Topics: Distributed File Systems
CS 537: Introduction to Operating Systems Fall 2015: Midterm Exam #4 Tuesday, December 15 th 11:00 12:15 Advanced Topics: Distributed File Systems SOLUTIONS This exam is closed book, closed notes. All
More informationServers & Developers. Julian Nadeau Production Engineer
Servers & Developers Julian Nadeau Production Engineer Provisioning & Orchestration of Servers Setting a server up Packer - one server at a time Chef - all servers at once Containerization What are Containers?
More informationSupplement #56 RSE EXTENSIONS FOR WDSC 5.X
84 Elm Street Peterborough, NH 03458 USA 1-800-545-9485 (010)1-603-924-8818 FAX (010)1-603-924-8508 Website: http://www.softlanding.com Email: techsupport@softlanding.com RSE EXTENSIONS FOR WDSC 5.X Supplement
More informationRelease Notes for Patches for the MapR Release
Release Notes for Patches for the MapR 5.0.0 Release Release Notes for the December 2016 Patch Released 12/09/2016 These release notes describe the fixes that are included in this patch. Packages Server
More informationUMP Alert Engine. Status. Requirements
UMP Alert Engine Status Requirements Goal Terms Proposed Design High Level Diagram Alert Engine Topology Stream Receiver Stream Router Policy Evaluator Alert Publisher Alert Topology Detail Diagram Alert
More informationHealthcare IT A Monitoring Primer
Healthcare IT A Monitoring Primer Published: February 2019 PAGE 1 OF 13 Contents Introduction... 3 The Healthcare IT Environment.... 4 Traditional IT... 4 Healthcare Systems.... 4 Healthcare Data Format
More informationParallel Python using the Multiprocess(ing) Package
Parallel Python using the Multiprocess(ing) Package K. 1 1 Department of Mathematics 2018 Caveats My understanding of Parallel Python is not mature, so anything said here is somewhat questionable. There
More informationGrid Compute Resources and Grid Job Management
Grid Compute Resources and Job Management March 24-25, 2007 Grid Job Management 1 Job and compute resource management! This module is about running jobs on remote compute resources March 24-25, 2007 Grid
More informationConfiguring OpenFlow 1
Contents Configuring OpenFlow 1 Overview 1 OpenFlow switch 1 OpenFlow port 1 OpenFlow instance 2 OpenFlow flow table 3 Group table 5 Meter table 5 OpenFlow channel 6 Protocols and standards 7 Configuration
More informationFrom eventual to strong consistency. Primary-Backup Replication. Primary-Backup Replication. Replication State Machines via Primary-Backup
From eventual to strong consistency Replication s via - Eventual consistency Multi-master: Any node can accept operation Asynchronously, nodes synchronize state COS 418: Distributed Systems Lecture 10
More informationConfiguring Box-to-Box Redundancy
CHAPTER 3 This chapter describes how to configure redundancy between two identically configured Cisco Content Services Switches (CSSs). Information in this chapter applies to all CSS models, except where
More informationCS455: Introduction to Distributed Systems [Spring 2019] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] The House of Heap and Stacks Stacks clean up after themselves But over deep recursions they fret The cheerful heap has nary a care Harboring memory
More informationManaging Switch Stacks
Finding Feature Information, page 1 Prerequisites for Switch Stacks, page 1 Restrictions for Switch Stacks, page 2 Information About Switch Stacks, page 2 How to Configure a Switch Stack, page 14 Troubleshooting
More informationCS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] Frequently asked questions from the previous class survey
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [THREADS] Shrideep Pallickara Computer Science Colorado State University L6.1 Frequently asked questions from the previous class survey L6.2 SLIDES CREATED BY:
More informationGR Reference Models. GR Reference Models. Without Session Replication
, page 1 Advantages and Disadvantages of GR Models, page 6 SPR/Balance Considerations, page 7 Data Synchronization, page 8 CPS GR Dimensions, page 9 Network Diagrams, page 12 The CPS solution stores session
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More informationExample File Systems Using Replication CS 188 Distributed Systems February 10, 2015
Example File Systems Using Replication CS 188 Distributed Systems February 10, 2015 Page 1 Example Replicated File Systems NFS Coda Ficus Page 2 NFS Originally NFS did not have any replication capability
More informationATS Summit: CARP Plugin. Eric Schwartz
ATS Summit: CARP Plugin Eric Schwartz Outline CARP Overview CARP Plugin Implementation Yahoo! Insights CARP vs. Hierarchical Caching Other CARP Plugin Features Blacklist/Whitelist Pre- vs. Post-remap Modes
More informationRunning the Setup Web UI
The Cisco Prime IP Express setup interview in the web UI takes you through a series of consecutive pages to set up a basic configuration. For an introduction and details on the basic navigation for the
More informationLixia Zhang M. I. T. Laboratory for Computer Science December 1985
Network Working Group Request for Comments: 969 David D. Clark Mark L. Lambert Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 1. STATUS OF THIS MEMO This RFC suggests a proposed protocol
More informationECS High Availability Design
ECS High Availability Design March 2018 A Dell EMC white paper Revisions Date Mar 2018 Aug 2017 July 2017 Description Version 1.2 - Updated to include ECS version 3.2 content Version 1.1 - Updated to include
More informationCLUSTERING. What is Clustering?
What is Clustering? CLUSTERING A cluster is a group of independent computer systems, referred to as nodes, working together as a unified computing resource. A cluster provides a single name for clients
More informationEthereal Exercise 2 (Part B): Link Control Protocol
Course: Semester: ELE437 Introduction Ethereal Exercise 2 (Part B): Link Control Protocol In this half of Exercise 2, you will look through a more complete capture of a dial-up connection being established.
More informationCS 167 Final Exam Solutions
CS 167 Final Exam Solutions Spring 2018 Do all questions. 1. [20%] This question concerns a system employing a single (single-core) processor running a Unix-like operating system, in which interrupts are
More informationFuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc
Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,
More informationThreads. CS3026 Operating Systems Lecture 06
Threads CS3026 Operating Systems Lecture 06 Multithreading Multithreading is the ability of an operating system to support multiple threads of execution within a single process Processes have at least
More informationGrid Compute Resources and Job Management
Grid Compute Resources and Job Management How do we access the grid? Command line with tools that you'll use Specialised applications Ex: Write a program to process images that sends data to run on the
More informationScience User Interface and Tools: Status. David R. Ciardi & Xiuqin Wu On Behalf of the SUIT Team at IPAC
Science User Interface and Tools: Status David R. Ciardi & Xiuqin Wu On Behalf of the SUIT Team at IPAC 1 Building the SUIT Vision Given the speed with which web technologies evolve, SUIT was intentionally
More informationConcept of Operations for the LSST Data Facility Services
Large Synoptic Survey Telescope (LSST) Concept of Operations for the LSST Data Facility Services D. Petravick and M. Butler and M. Gelman LDM-230 Latest Revision: 2018-07-17 This LSST document has been
More informationMaintaining a Clean CRM Database
Maintaining a Clean CRM Database Contents How messy is your CRM Database? 3 Managing duplicate records Arrest duplicates before they can be created 4 Merge duplicate records 4 Validating CRM data 6 Automatic
More informationCarbonite Availability. Technical overview
Carbonite Availability Technical overview Table of contents Executive summary The availability imperative...3 True real-time replication More efficient and better protection... 4 Robust protection Reliably
More informationTwo phase commit protocol. Two phase commit protocol. Recall: Linearizability (Strong Consistency) Consensus
Recall: Linearizability (Strong Consistency) Consensus COS 518: Advanced Computer Systems Lecture 4 Provide behavior of a single copy of object: Read should urn the most recent write Subsequent reads should
More informationTo do. Consensus and related problems. q Failure. q Raft
Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the
More informationRequest for Comments: 851 Obsoletes RFC: 802. The ARPANET 1822L Host Access Protocol RFC 851. Andrew G. Malis ARPANET Mail:
Request for Comments: 851 Obsoletes RFC: 802 The ARPANET 1822L Host Access Protocol Andrew G. Malis ARPANET Mail: malis@bbn-unix Bolt Beranek and Newman Inc. 50 Moulton St. Cambridge, MA 02238 April 1983
More informationSurveillance Dell EMC Storage with LENSEC Perspective VMS
Surveillance Dell EMC Storage with LENSEC Perspective VMS Configuration Guide H14767 REV 1.1 Copyright 2016-2017 Dell Inc. or its subsidiaries. All rights reserved. Published March 2016 Dell believes the
More informationMonitoring Operator Guide. Access Control Manager Software Version
Monitoring Operator Guide Access Control Manager Software Version 5.12.0 2016-2018, Avigilon Corporation. All rights reserved. AVIGILON, the AVIGILON logo, ACCESS CONTROL MANAGER, ACM, and ACM VERIFY are
More information