
IEKP-KA/2016-XXX

Meta-Monitoring for performance optimization of a computing cluster for data-intensive analysis
(Meta-Monitoring zur Leistungsoptimierung eines Rechner-Clusters für datenintensive Analysen)

Bachelor thesis by Sebastian Brommer

Fakultät für Physik, Institut für Experimentelle Kernphysik (IEKP)
Reviewer: Prof. Dr. Günther Quast
Second reviewer: Dr. Manuel Giffels

"One of the great mistakes is to judge policies and programs by their intentions rather than their results." GARRY TRUDEAU

Contents

1 Introduction
2 Essentials
  2.1 LHC
  2.2 Compact Muon Solenoid
  2.3 Computing
    2.3.1 High Energy Physics Computing Jobs
    2.3.2 CMS Computing Model
    2.3.3 Local IEKP Computing
3 The HappyFace framework
  3.1 Monitoring
  3.2 General Concept
  3.3 Implementation
    3.3.1 HappyFace Core
    3.3.2 Database
    3.3.3 Modules
4 Local HappyFace Instance
  4.1 Collector
    4.1.1 Data Sources
    4.1.2 Additional Components
  4.2 HappyFace Instance
  4.3 Batch System Modules
    4.3.1 Job Status
    4.3.2 Site Status
    4.3.3 History
  4.4 Cache Modules
    4.4.1 Cache Details and Cache Summary
    4.4.2 Cache HitMiss
    4.4.3 Cache Life Time
    4.4.4 Cache Distribution
5 Conclusion and Outlook
A Appendix
  A.1 Configuration Keys
  A.2 Additional Plots and Tables
List of Figures
Listings
Bibliography

1 Introduction

The generation and processing of enormous amounts of data is one of the biggest challenges of modern society. IBM estimates that 2500 PB of data are generated every day [1]. Naturally, data-intensive analysis is and will be a crucial part of industry, economy and scientific research projects. Large computing clusters are required to process immense amounts of data in a short period of time, as is done, for example, to provide accurate real-time search results.

Nowadays, computing infrastructure is oftentimes very complex. New and innovative technologies are tested and implemented on a daily basis, and many solutions are designed to improve specialized use cases. This rapid development improves the efficiency of data analysis on both the software and the hardware side. On the other hand, every new product has to be tested, documented and maintained. It is not uncommon that newly developed frameworks only offer inadequate means for monitoring, testing and optimizing their performance in day-to-day operation. The development focuses on solving the actual problem rather than on easy monitoring solutions. Many projects offer interfaces to obtain the necessary monitoring information but lack easy access and visualization.

High Energy Physics is a perfect example of a highly data-driven field of scientific research [2]. Institutes like the Institute of Experimental Nuclear Physics (IEKP) at the Karlsruhe Institute of Technology (KIT) operate their own computing clusters to perform scientific analyses. The meta-monitoring framework HappyFace [3] was developed at IEKP to offer a new monitoring solution for modern computing systems. HappyFace provides an individually customized monitoring solution concerning virtually any aspect of a computing cluster. It unifies visualization and data handling, while the actual data acquisition can be freely customized to suit the needs of developers, operators and supervisors.

In the course of this thesis, a new meta-monitoring instance was set up in order to monitor the computing cluster at the IEKP. This cluster consists of several complex services and has grown over several years. It includes a batch system [4] with access to several computing resources, such as external cloud resources and a large desktop cluster. Furthermore, modern hardware for high-throughput data analysis [5, 6] was added recently.

Chapter 2 provides an introduction to the application of computing clusters in High Energy Physics and the associated analysis tasks. Moreover, it introduces the computing structure at IEKP and its components as one of many possible applications of HappyFace. Chapter 3 provides a rough overview of the HappyFace framework and its features. In Chapter 4, the main features and design ideas behind the new HappyFace instance are explained in detail. Finally, Chapter 5 gives a quick outlook on further developments before concluding the thesis with a short summary.

2 Essentials

2.1 LHC

The Large Hadron Collider [7] is a ring accelerator and collider at CERN (European Organization for Nuclear Research) [8] near Geneva. To date, it is the biggest accelerator built by mankind and provides the highest particle energy ever artificially created on Earth. With its recent upgrade, the LHC now collides two proton beams at a center-of-mass energy of 13 TeV, up from 8 TeV before. The accelerator contains two separate beam tubes to accelerate the two proton beams in opposite directions. The tunnel in which the LHC is located has a total length of 26.7 km and was originally built for the LEP experiment between 1984 and 1989. Since 2015, the LHC has been capable of running at a luminosity of 10^34 cm^-2 s^-1 for proton-proton collisions and 10^27 cm^-2 s^-1 for heavy ions. With the Run-2 upgrade, the beam crossing interval was reduced to 25 ns, resulting in a crossing frequency of 40 MHz.

Four major experiments are part of the LHC complex: ALICE [9], ATLAS [10], CMS [11] and LHCb [12]. ALICE (A Large Ion Collider Experiment) was designed to analyze the QCD (Quantum Chromodynamics) sector of the Standard Model by studying the collision of heavy nuclei. LHCb is an experiment designed to study heavy-flavor physics. Its goal is to find indirect evidence of new physics, for example by measuring CP violation and rare decays of B mesons. ATLAS (A Toroidal LHC ApparatuS) and CMS (Compact Muon Solenoid) are the two biggest detectors. They are designed as multi-purpose detectors to cover a wide range of physics topics. Recent successes of ATLAS and CMS were the discovery of the Higgs boson [13] and the increase in precision of several other Standard Model parameters.

2.2 Compact Muon Solenoid

Like ATLAS, the CMS detector is built in a barrel shape around the beam pipe, as shown in Figure 2.1. In total, the detector is 21.6 m long and has a diameter of 14.6 m. Like all modern detectors, CMS is made up of several sub-detectors, each serving a dedicated role for particle identification and measurement.

Figure 2.1: Overview of the CMS detector [14]

The innermost part is a silicon tracker that is made up of silicon pixels and silicon microstrips. This inner tracker provides a spatial resolution of 15 µm to 20 µm and consists of a barrel region and two end caps. This allows a very precise reconstruction of the innermost particle tracks originating from the collision point.

The calorimeters are placed around this silicon tracker. CMS has an electromagnetic as well as a hadronic calorimeter. The electromagnetic calorimeter consists of more than 60000 lead tungstate crystals. This material has a very high density and a short radiation length, which allows the calorimeter to be very compact while still containing the full energy deposition. The hadronic calorimeter is positioned around the electromagnetic calorimeter so that no particle energy is lost. A combination of brass absorber and plastic scintillator is used to determine the energy of hadrons.

A superconducting magnet is located around the calorimeters and the inner silicon tracker. The magnet is designed to ensure a homogeneous magnetic field of 3.8 T in the inner part of the detector. The magnet is operated at a temperature of 1.8 K and is the largest superconducting magnet ever built [15]. Its magnetic field is essential for the measurement of the charge and the transverse momentum p_T of particles in the silicon tracker and the calorimeters.

The muon system is built around the superconducting magnet and in between the layers of the magnet's return yoke. As the name indicates, CMS was built to measure and detect muons with high precision. The muon system is able to identify and track muons: any charged particle that is able to pass the two calorimeters and is detected by the muon system is most likely a muon. Since the muon system is the outermost part of the detector, it is also the biggest part of the detector. It uses gas-filled chambers to track particles, which lowers the cost of the detector.

Since 40 million bunch crossings take place inside the detector every second, it is essential to use very efficient triggers to filter out uninteresting events. CMS uses a two-step system: a Level-1 Trigger (L1) and a High-Level Trigger (HLT). The Level-1 Trigger is made of programmable electronics and is fully automated. The L1 combines information from the calorimeters and the muon system to make a very fast decision on whether an event is potentially of interest. The HLT is a software system that filters events using basic event reconstruction to look for characteristic physical signatures. It executes a quick analysis using algorithms designed to be very fast rather than exact. In the end, a few hundred events per second are selected, resulting in a data flow of roughly 600 MB/s [16] for Run 2.

2.3 Computing

In order to make use of the data produced by the detector, a great variety of software and computing resources is needed to enable a successful analysis of the measured data.

2.3.1 High Energy Physics Computing Jobs

There are three main computing tasks in High Energy Physics: theoretical predictions of events have to be calculated, the detector data has to be reconstructed and a physical analysis has to be made.

Monte Carlo Simulations

Monte Carlo simulations are essential for comparing measured data with theoretical predictions. Monte Carlo datasets are composed of a simulation of particle collisions as well as a full simulation of the behavior of the resulting particles in the detector.

In 2012, the CMS collaboration simulated more than 6.5 billion events [cmssimulation]. These simulations become more demanding with a higher center-of-mass energy: more processes are possible, and more processes occur in addition to the hard interaction process. Although recent software improvements have increased the speed of Monte Carlo simulations (by up to 50 %) [17], the need for simulated events is still difficult to satisfy. Improving the simulation process will continue to be one of the major challenges for the computing infrastructure of High Energy Physics. Monte Carlo simulations are CPU-intensive, but their Input/Output (I/O) load is very low, since they produce data themselves and only need simulation parameters as input. Overall, they are the most time-consuming computing task in High Energy Physics.

Data Reconstruction

In this process, the signals recorded by the detector are reconstructed into physical objects. Signals from the tracker allow a reconstruction of particle tracks, and particle energies are calculated from the calorimeter data. By putting all this data together, the initial interaction process that took place can be reconstructed and displayed, as shown in Figure 2.2.

Figure 2.2: Reconstruction of an event in CMS showing the largest jet pair event observed so far. The di-jet system has a total mass of 6.14 TeV [18].

The recent upgrade of the LHC leads to an increase of complexity concerning this task, as more events have to be reconstructed and the amount of pile-up processes (soft scattering processes that take place during the same bunch crossing) grows due to the increase of the center-of-mass energy. The reconstruction is an I/O-intensive task, since raw detector data is needed to create a much more compact physical dataset for further analysis.

The reconstructed datasets are often reduced even further to exclude variables and parameters that are not relevant for a certain physical analysis and would only take up valuable disk space. This process is called skimming and is very useful for saving resources and time, since the datasets being used get smaller.

Physical analysis

The physical analysis itself is very diverse. The requirements differ depending on the type and size of the datasets that are used in the analysis. In general, physical analysis tasks are I/O-intensive, as they use different datasets from reconstructed and simulated events.

2.3.2 CMS Computing Model

In order to provide solutions to these tasks, the LHC uses a specialized computing model, which is realized in the Worldwide LHC Computing Grid (WLCG). The WLCG is the world's largest computing grid, connecting more than 170 computing facilities in more than 40 countries. "The mission of the WLCG project is to provide global computing resources to store, distribute and analyse the 30 Petabytes of data annually generated by the LHC at CERN [...]" [2].

Instead of using a few large resources, such as supercomputers, the WLCG relies on sharing the overall load between many computing centers. This approach is much more dynamic. In HEP, a high output over a long period of time is favored over a high output per minute or second [19]. The WLCG approach allows a high overall output whilst letting everybody who wants to be part of the community contribute, regardless of the size of the resources they provide. In order to ensure a well-organized structure, the WLCG uses a hierarchical model of resources:

1. The detector data enters the grid at the Tier0 computing resource at CERN. Here the data is aggregated, stored on tape and distributed further.
2. Tier1 centers are also responsible for safekeeping and data reconstruction.
3. Tier2 centers' tasks are simulation and processing of the data.

4. Tier3 centers are utilized for physical analysis by the users of the WLCG. They are only loosely connected to the grid.

This model was designed to provide the CMS detector measurements to as many people as possible. It enables smaller universities and institutes to do successful research without the need for excessive computing resources of their own. A more detailed description of the CMS computing model can be found in [20, 21].

2.3.3 Local IEKP Computing

At the IEKP, several local computing resources are available and are listed in the following. The available resources are comparable to a Tier3 center of the WLCG. New systems are implemented and tested here, for example a cache-based analysis middleware solution (HPDA) [5]. The local resources are joined together via the batch system HTCondor.

Processing Resources

A number of different local resources are usable at IEKP. The most obvious ones are the desktop PCs. About 60 desktop machines are available, with different specifications depending on the age of the machine. Specifications range from old dual-core PCs with 2 GB of RAM to new ones with modern i7 processors, SSDs and enough RAM to support regular desktop work as well as job execution from the batch system simultaneously.

Another resource is the EKPSG/EKPSM cluster. It adds a distributed caching mechanism to the existing workflow of a batch system (more in the Cache System paragraph below). The cluster consists of 5 machines, each providing 32 cores and 64 GB of RAM as well as SSDs for caching and hard drives for storage.

IEKP owns several file servers to store skims, datasets and analysis results. Furthermore, IEKP hosts several service machines. These servers have multiple purposes, such as hosting general infrastructure (firewall, web space, ...) or specialized services like HTCondor. Their specifications differ from server to server. All these services are connected to the backbone of the SCC (Steinbuch Centre for Computing), which provides the IT management at KIT.

Batch System

HTCondor is a high-throughput distributed batch computing system [4]. The project has existed since 1984 and is developed and maintained at the University of Wisconsin-Madison. It is available for every common operating system. Its main goal is to allow an easy management of computing resources whilst staying flexible and stable even for big computing systems.

Figure 2.3: A representation of the HTCondor setup at IEKP. Users submit jobs to the HTCondor manager, which handles the execution of jobs and returns the results, including the location of the result data, to the user.

The basic features of HTCondor include a job management mechanism, scheduling and priority policies as well as resource management and monitoring. In the default workflow, a user submits a job to HTCondor, and HTCondor chooses where and when to run the job, supervises the process and informs the user after the job is completed. The strength of HTCondor compared to other batch systems is its ability to utilize the available processing power of a cluster in an efficient and effective way, for example by detecting idle workstations and using otherwise wasted computing resources. HTCondor is able to adapt without requiring the resources to be permanently available. HTCondor can even move jobs away from certain machines without having to restart the whole job. Resource requests (jobs) and resource offers (machines) get matched based on a policy that is defined by the user using the ClassAds language [22]. Figure 2.3 shows that HTCondor is a central component of the day-to-day work setup at IEKP.

Cache System

The High-Performance Data Analysis concept (HPDA) is being developed at KIT and addresses the issue that high-I/O analyses become inefficient due to insufficient bandwidth. The idea behind HPDA is to combine the strengths of dedicated storage, where all files are saved on one file server and every node accesses the files from there, and integrated storage, where every node has its own file storage. A system of shared caches is used to fulfill this task.

In order to work efficiently, HPDA includes an API (application programming interface) to exchange information with the batch system (HTCondor).

In addition, a caching algorithm is implemented to determine whether a file is worth caching and, if the cache is full, which files are going to be deleted in order to keep the cache as full as possible. A coordinator component is responsible for managing all the worker nodes and their caches [6]. More information can be found in the original design paper [5].
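The caching policy itself is not spelled out in code in this thesis; the following is a minimal illustrative sketch of a score-based eviction step of the kind described above and in Chapter 4.4 (file scores increase with use, and the lowest-scored files are removed when space is needed). The class and function names are hypothetical and not part of the HPDA code.

    from dataclasses import dataclass

    @dataclass
    class CacheEntry:
        path: str
        size: int      # file size in bytes
        score: float   # importance of the file; increased on use, decaying otherwise

    def evict_until_fits(cache, capacity, new_size):
        """Remove the lowest-scored files until a new file of size new_size fits."""
        used = sum(entry.size for entry in cache)
        # Evict in order of increasing score, i.e. least important files first.
        for entry in sorted(cache, key=lambda e: e.score):
            if used + new_size <= capacity:
                break
            cache.remove(entry)
            used -= entry.size
        return cache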

3 The HappyFace framework

3.1 Monitoring

Monitoring is a crucial part of modern computing infrastructure. Monitoring means "Supervising activities in progress to ensure they are on-course and on-schedule in meeting the objectives and performance targets" [23]. The number of cloud-based services has increased rapidly in the last years, and so has the complexity of the infrastructures behind these services. To properly operate and manage such complex infrastructures, effective and efficient monitoring is constantly needed [24], especially for every modern computing system used in High Energy Physics.

Monitoring is the key tool for providing information on hardware and software performance as well as user-based statistics. It allows a performance and workload analysis on top of an activity and availability analysis. This is crucial when testing new systems such as the HPDA framework. Monitoring data can further be used to plan future investments, identify problems and optimize the performance of the monitored computing systems.

3.2 General Concept

The idea behind HappyFace is to create a meta-monitoring framework instead of a traditional monitoring framework. Rather than collecting raw data where it originates (e.g. on every node in a computing network), HappyFace is designed to aggregate data from existing monitoring services and provide this information in a more compact way. In addition, an interpretation of the data is done in HappyFace. HappyFace collects data from different data sources and displays them on one website, no matter how different the data sources are. With the aggregated data stored in a database, HappyFace's history function makes it possible to show the previous state of the monitored resources.

New data gets acquired at a regular time interval to ensure up-to-date data. The modular structure of HappyFace is a flexible approach and allows for a highly customizable monitoring service, including results of custom tests and plots. As this is only a small overview, more information can be found in the original design paper [3].

3.3 Implementation

The main design idea is to create a monitoring framework that is as simple as possible. The original version was implemented in Perl in 2008 [25]. The current version 3.0 is written in Python and can be found at [26]. HappyFace has several components: the core, a database and the actual modules. Figure 3.1 illustrates the workflow of HappyFace.

Figure 3.1: Diagram displaying the workflow of the HappyFace framework from the data sources to the final output on the web page. Adapted from [3].

3.3.1 HappyFace Core

The HappyFace core is the heart of any HappyFace instance. It is responsible for providing the basic features of HappyFace such as a module template, a web page template, access to the database, a download service to load data into HappyFace and logging. HappyFace contains two main Python scripts: acquire.py executes all module Python scripts to fill the database with the newest data; render.py is executed every time a user accesses the web page and generates an output web page using the module HTML templates and the data stored in the database.

3.3.2 Database

The database is the central place where all information is stored. The default installation of HappyFace uses an SQLite database, but a setup with a PostgreSQL database is also possible. Every module obtains its own data table in the database. Every time a new acquire.py cycle gets triggered, a new row of data is added to the table. Modules may also feature an arbitrary number of subtables. These subtables can be used to add additional information to a module, e.g. a detailed overview of all jobs in a batch system. Subtables are linked to the main table of the module in order to maintain a logical structure.

All parameters stored in the database can be accessed via the plotgenerator. The plotgenerator is a powerful tool which allows users to receive detailed information on every parameter available in HappyFace. As demonstrated in Figure 3.2, every value stored in the database can be plotted in a custom-chosen time frame. The values do not have to belong to the same module; every value available in the database can be plotted.

3.3.3 Modules

Modules define the content of a HappyFace instance. Every module can be designed individually. Each module has its own configuration file in which an arbitrary number of configuration parameters, such as the location of the data source, are definable. The module code itself is responsible for processing information and storing the desired values in the database. Plots are generated during this step using the matplotlib library and saved as Portable Network Graphics files in a static archive. To reduce the size of the database, only the location of a plot is stored in the database, rather than the plot itself.

Every module consists of two files: a Python script, which collects, processes and stores the data in a database table, and an HTML template, in which the style of the module output, such as tables, plots and interactive elements, is defined. This is done via the MAKO template library. MAKO combines the HTML template and the data from the database to provide the full HTML code that is used for the website.

Modules may be organized into freely chosen categories. Per category, an arbitrary number of modules can be featured. Each module has a status parameter which is used to determine the quality of a module. The status ranges from OK to Critical (green, yellow and red color). The status of a whole category is determined by the statuses of the modules it contains and is displayed in the navigation bar. The status calculation is fully up to the module developer. The design may differ from module to module. The fact that every module has an individual Python script allows nearly every monitoring idea to be realized.
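To make this storage pattern more concrete, the following is a minimal sketch of the acquire step of a module: one new row per cycle in the module's main table, per-job details in a subtable and only the path of a generated plot stored in the database. It is illustrative only, written with plain sqlite3, and does not use the real HappyFace module API; all table and column names are assumptions.

    import json
    import sqlite3
    import time

    def acquire_cycle(db_path, source_json):
        """One acquire cycle: read collector output, add one main row plus subtable rows."""
        with open(source_json) as f:
            jobs = json.load(f)
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS jobstatus "
                    "(run_id INTEGER PRIMARY KEY, timestamp REAL, njobs INTEGER, plot TEXT)")
        con.execute("CREATE TABLE IF NOT EXISTS jobstatus_details "
                    "(run_id INTEGER, job_id TEXT, status INTEGER)")
        # Only the location of the plot is stored; the PNG itself lives in a static archive.
        plot_path = "archive/jobstatus_%d.png" % int(time.time())
        cur = con.execute("INSERT INTO jobstatus (timestamp, njobs, plot) VALUES (?, ?, ?)",
                          (time.time(), len(jobs), plot_path))
        run_id = cur.lastrowid
        con.executemany("INSERT INTO jobstatus_details VALUES (?, ?, ?)",
                        [(run_id, job_id, int(job["Status"])) for job_id, job in jobs.items()])
        con.commit()
        con.close()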

Figure 3.2: Plot generated by the plotgenerator tool of HappyFace, showing the number of claimed slots and running jobs. The curves are almost identical, which indicates that the system is running fine. As the acquire.py cycle is executed once every 15 minutes, a data point is taken every 15 minutes.

4 Local HappyFace Instance

In order to monitor the local resources at IEKP, a dedicated monitoring system was needed. The HappyFace framework proved to be the best solution for this task, as many different parameters from various data sources can be monitored without problems. To fulfill this task, the ekplocal project consists of two separate parts running on two different machines: the collector and the actual HappyFace instance. The project is available at [27].

4.1 Collector

The first component is the collector. It is a small software tool, written in Python, that currently runs on ekpcms6. Its purpose is to collect data from various sources, put it into a process-friendly format and transfer the data to the ekphappyface machine, where the HappyFace instance is located. The collector is independent of HappyFace, but its output data format is in line with the HappyFace design to allow easy interaction. The collector reformats information from the data sources and combines it to create several new metrics that were not available before.

4.1.1 Data Sources

Four different data sources are used at the moment: the batch system (HTCondor), the cache worker nodes (ekpsg/ekpsm), the cache coordinator and Ganglia. Each data source behaves differently and requires a different extraction and processing method.

Figure 4.1: Scheme displaying the workflow as well as the data flow of the ekplocal project. Input data is collected by the collector running on ekpcms6. After that, the data is transferred to ekphappyface, where the HappyFace core processes the data, stores it in a database, generates the plots and displays the output web page.

Batch System

HTCondor provides an interface which returns information on the command line. Users are able to specify the parameters to be displayed as well as to constrain certain parameters. A complete documentation of the available HTCondor commands can be found at [28]. The collector uses three different HTCondor API calls:

1. condor_q: Provides information on currently running jobs and active users (used in module Job Status)
2. condor_status: Provides information on currently active machines and site usage (used in module Site Status)
3. condor_history: Provides information on jobs that have left the batch system (used in module History).
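As an illustration of how the collector wraps such a command line call, the sketch below shells out to condor_q and writes the result as JSON for upload. It is not the actual collector code: the selection of attributes and the use of condor_q's -af (autoformat) option are assumptions, whereas the real collector assembles its JSON output with -format expressions as shown in Listing 4.1.

    import json
    import subprocess

    # ClassAd attributes to extract; the real collector uses a different selection.
    ATTRIBUTES = ["ClusterId", "ProcId", "JobStatus", "Owner", "QDate", "JobStartDate"]

    def collect_condor_q(outfile="condor_q.json"):
        """Query condor_q and store one JSON record per job, keyed by cluster.proc."""
        cmd = ["condor_q", "-af"] + ATTRIBUTES   # -af prints one line per job
        output = subprocess.check_output(cmd, universal_newlines=True)
        jobs = {}
        for line in output.splitlines():
            values = line.split()
            if len(values) != len(ATTRIBUTES):
                continue
            record = dict(zip(ATTRIBUTES, values))
            jobs["%s.%s" % (record["ClusterId"], record["ProcId"])] = record
        with open(outfile, "w") as f:
            json.dump(jobs, f, indent=2)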

4.1 Collector 1 echo { $( condor_status - format "%s" Name -format :{" State ":"%s" State - format, " Activity ":"%s" Activity -format, " LoadAvg ":" %s"}, LoadAvg ) "}" > condor_status. json Listing 4.1: Example of the condor_status command used for generating data for the Site Status Module The command 4.1 generates a file called condor_status.json. This file is an almost valid JSON file (the syntax gets corrected in the module code). The actual API call is done by the condor_status command. The echo command is used to minimize the corrections that have to be done later on in the HappyFace instance. Cache Worker Nodes The cache worker nodes provide an interface responding to HTML calls. The API returns a summary status of all worker nodes. In addition it is also able to return a list of every cached file, its size, age and cache parameter to determine the importance of a file (used in Module Cache Details and Cache Summary). The collector accumulates this information from every worker node and merges into two different JSON files. One contains the summary data and one the detailed information on every file present across all caches as seen in Listing 4.2. 1 {" ekpsg02 ": { 2 "/ storage /a/ cmetzlaff / htda / benchmark6_1 / kappa_ DoubleMu_ Run2015D_ Sep2015_ 13TeV_ 1037. root ": { 3 " allocated ": 1451913018. 084321, 4 " maintained ": 1451912653. 3989639, 5 " score ": 19.4008, 6 " size ": 86581620 7 },[...] 8 },[...]} Listing 4.2: The collector output from worker nodes. The JSON file contains the name of the worker node, a full file name, file size, internal caching score and the point in time a file allocated and maintained last (further explained in Chapter 4.4). The time is given in UNIX time and the file size in B. Cache Coordinator The Cache Coordinator interface responds in the fashion of the cache worker nodes API. At the moment, two different pieces of information are extracted from the coordinator, the first being the lifetime of files in the cache. The coordinator retains 19

The coordinator retains the point in time a file was deleted from the cache and how long this file stayed cached (used in module Cache Life Time). The second statistic is whether a job used cached files and ran on the right worker node (used in module Cache HitMiss).

Ganglia

Ganglia is a monitoring system for high-performance computing systems such as the setup used at IEKP [29]. Ganglia gathers information on usage parameters such as CPU load or network usage for every machine in the network. Machines can also be organized in groups to give an overview, e.g. of all desktop machines. Ganglia plots the measured parameters on the fly and offers a web front end for easy access to the monitored data. The plots are extracted with a bash script using wget.

4.1.2 Additional Components

The upload component of the collector uses the Python package requests. Files are uploaded into a dedicated folder on ekphappyface. The security authentication is handled via the Apache server running on ekphappyface. The collector logs error messages such as unavailable worker nodes or failing uploads of files. A cronjob triggers a collection cycle every fifteen minutes. The average duration of a cycle is shown in Figure 4.2.

Figure 4.2: Plot showing the distribution of the collector run time. In total, 1979 collection cycles were completed between 10 Dec 2015 and 04 Jan 2016. The first and last bin are overflow bins. An unexpected error in the network connection of ekpcms6 caused run times larger than 120 s. Generally, a full cycle lasts less than 60 s.
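The upload step can be pictured roughly as follows. This is a minimal sketch only: the target URL, the form field name and the use of HTTP basic authentication are assumptions, since the thesis only states that the requests package is used and that authentication is handled by the Apache server on ekphappyface.

    import logging
    import requests

    UPLOAD_URL = "https://ekphappyface.example/upload/"   # hypothetical endpoint

    def upload_file(path, user, password):
        """Upload one collector output file; log and report failures."""
        try:
            with open(path, "rb") as f:
                response = requests.post(UPLOAD_URL, files={"file": f},
                                         auth=(user, password), timeout=60)
            response.raise_for_status()
            return True
        except (requests.RequestException, OSError) as err:
            # Failing uploads are only logged, as described above.
            logging.error("upload of %s failed: %s", path, err)
            return False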

4.2 HappyFace Instance

The HappyFace instance is located on the ekphappyface machine. At the moment, the instance contains five different categories: Batch System Modules, Cache Modules and three categories showing Ganglia plots of the ekpsg/ekpsm nodes, the file server and the overall cluster. The acquire.py process is triggered four times per hour to roughly match the collector timing.

4.3 Batch System Modules

Three Batch System Modules monitor the current status of the batch system. They consist of

1. a module that monitors the current status of jobs,
2. a module that monitors the current site usage,
3. a module that displays the batch system usage during the last day.

Jobs are user-submitted computing tasks (Chapter 2.3.1). Slots, on the other hand, are the most finely grained representation of the computing resources on which a job is able to run. Slots are provided by the different computing resources connected to the batch system.

Jobs

HTCondor enforces a certain job workflow. At first, a job is submitted to the batch system and waits until a fitting slot is found. Then the job is executed based on the submitted configuration parameters and, once finished, its status code and output information are returned to the user. In order to categorize jobs, HTCondor relies on status codes. The status codes shown in Table 4.1 represent the state of a job at a given point in time.

Table 4.1: Job status codes used in HTCondor. Codes 5 and 6 can appear, though it is quite unlikely.

code | definition | explanation
1    | idle       | Jobs that are either in the queue or paused
2    | running    | Jobs that are currently being executed in a slot
3    | removed    | Jobs that were manually removed from HTCondor
4    | completed  | Jobs that successfully finished their task
5    | held       | Jobs that are on hold
6    | transfer   | Jobs that are returning their output

Jobs are very individual: their run time can vary between 10 minutes and 24 hours. Nearly all jobs are single-core jobs, so they use only one CPU, but it is also possible to process multi-core jobs.

Slots

Sites are built from several multi-core machines. Each CPU core in a site is assumed to be able to run one CPU-intensive task.

To represent this structure, a site offers slots to the batch system, where each slot represents one CPU core and is able to run one job at a time. Most of the jobs use one core and 2 GB of RAM. For example, a machine with 16 cores is able to host and run a total of 16 jobs at the same time; therefore, the machine has 16 individual slots listed in HTCondor. The current status of a slot is also represented via a status code in HTCondor. There are only two common slot statuses: either a slot is claimed, or it is unclaimed. In theory, HTCondor differentiates between a state and an activity of a slot, but experience shows that all cases other than those two are negligible. In practice, a slot is either claimed and runs a job, or it is unclaimed and idle.

4.3.1 Job Status

The first module is the Job Status Module. It provides an overview of the jobs that are currently processed by HTCondor. The module uses the parameters listed in Listing 4.3.

{ "ekpcms6.physik.uni-karlsruhe.de#296427.0#1451282723":   // job ID
  { "RAM": "8",                   // RAM requested by the job in gigabytes
    "Status": "1",                // current status of the job
    "Cpu_1": "0.0",               // user CPU time used
    "LastJobStatus": "0",         // status of the job before the present status
    "HostName": "undefined",      // name of the machine the job is running on
    "User": "sieber@physik.uni-karlsruhe.de",   // user ID
    "RequestedCPUs": "1",         // number of cores requested by the job
    "QueueDate": "1451282723",    // point in time the job was submitted
    "JobStartDate": "undefined",  // point in time the job started
    "Cpu_2": "0.0"                // system CPU time used
  }
}

Listing 4.3: Example of the data extracted for a single job

The main component of this module is a jobs-per-user plot, as shown in Figure 4.3. To create the plot, jobs are sorted by user name and their current status. All common status codes given in Table 4.1 are included. "Removed" and "completed" jobs can show up if the collector collects data from condor_q during a management cycle of HTCondor. These jobs do not show up in the next iteration, as they are no longer considered as currently processed by HTCondor. "Queued" jobs have a present Status of 1 and a LastJobStatus of 0; they have not started running yet and are waiting to be started. The queue time of a job can be calculated via

    queue time = current time - QueueDate.    (4.1)
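As a small illustration, the sketch below computes the queue time of Eq. (4.1) and the job efficiency defined in Eq. (4.2) below from a single job record as in Listing 4.3. The handling of the "undefined" placeholder is an assumption; the actual module code may treat such cases differently.

    import time

    def queue_time(job, now=None):
        """Queue time according to Eq. (4.1): current time minus QueueDate."""
        now = time.time() if now is None else now
        return now - float(job["QueueDate"])

    def efficiency(job, now=None):
        """Efficiency according to Eq. (4.2): CPU time divided by wall time since start."""
        if job.get("JobStartDate", "undefined") == "undefined":
            return 0.0   # job has not started yet
        now = time.time() if now is None else now
        wall_time = now - float(job["JobStartDate"])
        cpu_time = float(job["Cpu_1"]) + float(job["Cpu_2"])
        return cpu_time / wall_time if wall_time > 0 else 0.0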

Figure 4.3: Plot shown in the Job Status Module. This plot was created on 17 Dec 2015, 15:09. In total, there were 98 jobs "running" and 50 jobs "queued".

It is also possible to determine the efficiency of jobs. The efficiency is defined as

    efficiency = (Cpu_1 + Cpu_2) / (current time - JobStartDate).    (4.2)

The efficiency ranges from one, a perfect job that uses the CPU all the time, to zero, no CPU usage at all. The parameter must be interpreted with care: jobs that have just started may have a low efficiency reported because of timing issues between HTCondor, the collector and HappyFace. The CPU time is not updated in real time by HTCondor, which may result in a considerably worse efficiency being reported although the job is running just fine. This effect disappears for jobs with longer run times.

The actual RAM usage of a job is difficult to track. HTCondor provides several parameters concerning the RAM usage, but none of these seem to represent the genuine usage, as it is difficult to consistently describe RAM usage. The requested RAM is the most accurate value HappyFace provides concerning this issue, although the real RAM usage is most likely lower than this value.

The module provides a detailed table that can be found in Appendix Table A.1. It displays the plotted data as well as the efficiency per user and the sites a certain user utilizes at the moment. The HappyFace color code is used to indicate critical values: if the efficiency of a user's "running" jobs is too low, the table column is marked in red.

The status of the module is influenced by multiple factors. The parameters used for the calculation of the status can be configured via the configuration parameters of the module. If the module status is not OK, the module returns an error message which explains why. The configuration parameters are listed in Appendix A.1.

4.3.2 Site Status

The Site Status Module is designed similarly to the Job Status Module. It displays the details of the different computing resources available to the batch system and their current status. The module provides a visual overview as shown in Figure 4.4.

Figure 4.4: Plot shown in the Site Status Module. This plot was created on 08 Jan 2016, 12:09 and displays how many slots are running per site and their status.

It identifies every slot that is online and is able to determine the number of actually active machines. In order to shorten site names, the module requires a list of site names as a configuration key. Using these, the module is able to match slot names like slot1@ekpcloudc9577fff-72a1-4bc8-9ee4-dd476a689bd2.ekpcloud to the site ekpcloud. This configuration parameter ensures that the sites are identified correctly, regardless of the naming convention used in HTCondor. Additional configuration keys used for the status calculation can be found in Appendix A.1.

A detailed table displays the plotted data, as well as the average load on claimed and unclaimed slots.

It is very easy to identify malfunctioning slots, as a claimed slot should have a high average load whilst an unclaimed slot should have an average load close to zero. The average load of slots that recently changed their status can be unrepresentative, similar to the efficiency of jobs. The HTCondor documentation does not explain which time window is used to calculate this value. Colors are used to quickly identify sites with bad load values.

Another feature is the HTCondor version list. This table shows which site uses which HTCondor client version. HTCondor client versions may differ from site to site, but they are all compatible with each other.

4.3.3 History

The History Module is the most complex module concerning the batch system. It combines information from condor_q and condor_history and allows a full overview of the batch system usage during the last day to be created. The output of this module consists of two plots: one shows the status of all jobs during the last day (Figure 4.5a), the other one shows the site usage during the last day (Figure 4.5b).

Figure 4.5: Two plots used in the History Module: (a) job history plot, (b) site history plot. They were generated on 12 Jan 2016 and represent the usage of the IEKP resources in the last 24 hours.

The collector merges information from condor_history and condor_q into one data file. Essentially, six parameters are needed for the creation of the job history plot (all points in time rounded to hours):

1. QueueDate: Point in time a job entered the batch system
2. JobStartDate: Point in time a job started "running", thus leaving the queue
3. CompletionDate: Point in time a job was completed. If a job is still "running", this value is zero.

4. EnteredCurrentStatus: Point in time a job entered its current status (if a job is completed, CompletionDate and EnteredCurrentStatus are identical)
5. JobStatus: The current/latest status of a job
6. LastJobStatus: The second to last status of a job

By utilizing these parameters, the module is able to create an accurate timetable of every job. This is done by assuming that jobs always have a queue time and finish successfully during their first run, as the HTCondor interface does not return more than the last two status codes. This is an accurate description for most of the jobs. Figure 4.6 shows all possible combinations of job statuses.

Figure 4.6: Diagram showing all possible status configurations of a job and how likely they are. Some of them, like "removed" followed by another "removed", do not make a lot of sense and most likely appeared because of bugged jobs or failing machines. The most common jobs are the ones that finish in the normal fashion, so their last status is "finished", the second to last status is "running" and the completion date is not zero (2nd row). Another example are "queued" jobs: they have no completion date, the second to last status is not given and the last status is "idle" (5th row).

At first, the completion date of a job is checked. This allows a separation between finished and currently active jobs. The two available status codes of a job and their corresponding timestamps then allow the reconstruction of a job's timetable.

Lists with 25 entries, each representing one hour of the last day, for every possible job status are used to combine the timetable data of each individual job. The lists get filled based on which status a job is in at a given point in time, as shown in the example in Figure 4.7.

Figure 4.7: Sequence diagram of an example job. The job entered the batch system 20 hours ago, stayed in the queue for one hour, ran for one hour and then finished. The list entries plot_data_queued[4], plot_data_running[5] and plot_data_finished[6] are increased by one (index 0 represents 24 hours ago, index 1 represents 23 hours ago, etc.).

In the end, the plot is generated by plotting the status lists. The plot in Figure 4.5b utilizes the HostName parameter of every job, matches it with the given site names as in the Site Status Module and plots the values.
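The hour-bucket bookkeeping sketched in Figure 4.7 can be written roughly as follows. This is a minimal illustration, not the module code: it assumes numeric UNIX timestamps within the last 24 hours, and the treatment of edge cases (jobs older than one day, jobs without a start time) is an assumption. The three lists correspond to plot_data_queued, plot_data_running and plot_data_finished in Figure 4.7.

    HOUR = 3600

    def hour_index(timestamp, now):
        """Map a timestamp to a bin index between 0 (24 hours ago) and 24 (now)."""
        return max(0, min(24, 24 - int((now - timestamp) // HOUR)))

    def fill_history(job, now, queued, running, finished):
        q_idx = hour_index(job["QueueDate"], now)
        # A job that has not started yet counts as queued until "now" (index 24).
        s_idx = hour_index(job["JobStartDate"], now) if job.get("JobStartDate") else 24
        # A still running job has CompletionDate zero and counts as running until now.
        e_idx = hour_index(job["CompletionDate"], now) if job.get("CompletionDate") else 24
        for i in range(q_idx, s_idx):
            queued[i] += 1        # hours spent waiting in the queue
        for i in range(s_idx, e_idx):
            running[i] += 1       # hours spent running
        if job.get("CompletionDate"):
            finished[e_idx] += 1  # bin in which the job finished (cf. Figure 4.7)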

4.4 Cache Modules 4.4 Cache Modules The Cache Modules monitor the status of the HPDA framework. They consist of 1. an overview module, 2. a module that monitors the efficiency of job distribution among the different machines, 3. a module that monitors the efficiency of dataset distribution among the different machines, 4. a module that monitors the lifetime of cached files. In theory, files that are used frequently get cached and are available as a local copy whereas files that are needed infrequently get stored on a remote file server. Caching a file means that it gets stored on a Solid-State-Drive (SSD) connected to a worker node. The framework must now detect, which files are required by a job in order to send the job to the right worker node. Otherwise the job would have no access to the cached file and there would not be any benefit in using a cache system at all. Two components are responsible for keeping the cache up to date. An allocation algorithm determines the theoretical distribution of files on the caches. This process is triggered every 5 to 10 minutes. The maintain algorithm is responsible for enforcing this distribution and checking the actual file status. This is done in a longer time interval. Score The Score is a parameter used to indicate how important a file is. Every time a job uses a certain file, the score of this file is increased. File scores also decrease, if files are not used over a longer period of time. When the cache is full the score determines which files have to be deleted and which files take up the free space. 4.4.1 Cache Details and Cache Summary These two modules use data from the Cache worker nodes. A summary table gives an overview of the overall status of the cache, how many files are cached, how much space is available and used and how many machines are available in the cluster. The details module features a table that shows how many files are stored on every machine. The plots generated for this module show size distribution, maintain and allocation time distributions and the distribution of score parameter of every file. The plots can be found in Appendix Figure A.1 29

4.4.2 Cache HitMiss

The Cache HitMiss Module is a fine indicator of the overall performance of the HPDA framework and the caching algorithm. As explained before, the full benefit of the HPDA setup is achieved if all files needed are cached and a job runs on the right machine. These two conditions are represented by the cachehit rate and the locality rate. The Coordinator calculates these two rates for every job.

The cachehit rate indicates how many of the files needed were cached. The value ranges from one, all files cached, to zero, no files cached. The locality rate represents how many files were cached on the machine the job ran on. A locality rate of one means all files were cached on the machine the job ran on; a locality rate of zero means no files were available in the cache connected to the executing machine. If the files are distributed over multiple caches or not all files are cached, values between one and zero are also possible.

This data is represented in a 2D scatter plot, as seen in Figure 4.8. Since the locality rate depends on the cachehit rate, only values located in the bottom triangle of the plot are possible: it is not possible to have a locality rate of 1 if only 50 % of the files used are actually in the cache. The data for this plot is provided by the Cache Coordinator, as well as the point in time a job was executed, to allow a constraint on the period of time.

Figure 4.8: Plot representing the locality rate and cachehit rate of jobs between 15 Dec 2015 and 22 Dec 2015. Jobs with a cachehit rate and locality rate of one are ideal.
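The two rates are computed by the Coordinator itself; the sketch below is only an illustrative re-derivation of their definitions for a single job, given the list of files the job read, the cache contents per worker node and the node the job ran on. The data structures are hypothetical.

    # Illustrative computation of the cachehit rate and locality rate for one job.
    # `caches` maps worker-node names to the set of file paths cached there.
    def cache_rates(files, caches, node):
        if not files:
            return 0.0, 0.0
        cached_anywhere = sum(1 for f in files
                              if any(f in cached for cached in caches.values()))
        cached_locally = sum(1 for f in files if f in caches.get(node, set()))
        cachehit_rate = cached_anywhere / len(files)
        locality_rate = cached_locally / len(files)
        # locality_rate <= cachehit_rate, hence only the lower triangle in Figure 4.8.
        return cachehit_rate, locality_rate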

4.4.3 Cache Life Time

This module has a simple purpose: it displays the lifetime of files in the cache. As previously mentioned, the basic functionality of a cache includes the possibility to determine which files can be deleted from the cache in order to free up space for new files. The Cache Coordinator provides the information on how long a deleted file stayed in the cache and when it was deleted. This information is displayed in a histogram, as shown in Appendix Figure A.2.

4.4.4 Cache Distribution

Cache organization and the distribution of files are crucial parts of a cache-based computing system. In HPDA, files are grouped into so-called datasets. The different datasets emerge from the file structure the users use: files in the same folder belong to the same dataset. These datasets normally consist of multiple separate files; the file numbers range from one to several thousand files per dataset. In HPDA, an algorithm is implemented to distribute the files of one dataset equally among all machines in the cluster. In an ideal scenario, every machine caches an equally sized part of every dataset. This distribution is done by file size, not by file count.

{"ekpsg02": {
    "/storage/a/cmetzlaff/htda/benchmark6_1": {
        "file_count": 40,
        "size": 61440}, [...]
    "ds_count": 14,
    "error_count": 0,
    "status": "Aquisition successful"},
 "ekpsg04": {
    "/storage/a/cmetzlaff/htda/benchmark6_1": {
        "file_count": 40,
        "size": 61440}, [...]
    "ds_count": 14,
    "error_count": 0,
    "status": "Aquisition successful"}, [...]}

Listing 4.4: Example of the data used to determine the distribution of a dataset among the machines. In this case, the dataset benchmark6_1 is stored on ekpsg02 and ekpsg04 with equal file size and equal number of files on both. This would be an ideal distribution.

The data is extracted from every worker node in the HPDA cluster, filtered and combined into one JSON file by the collector instance. The data provided by the collector is shown in Listing 4.4.

In order to test the distribution algorithm, the Cache Distribution Module calculates a metric to determine how optimal a dataset distribution is:

    metric = sum_{k=0}^{n} ( (optimal_size - actual_size(k)) / dataset_total_size )^2.    (4.3)

Here, n is the total number of machines in the cluster. The optimal_size is calculated via

    optimal_size = dataset_total_size / n,    (4.4)

and actual_size(k) is the aggregated size parameter of machine k. The value gets normalized in order to ensure metric values between zero and one:

    metric_norm = metric / (1 - 1/n).    (4.5)

In order to visualize this metric, a 2D scatter plot is used. Figure 4.9 shows the metric plotted against the number of files in a dataset. Some metric values are not possible due to the fact that datasets with fewer files than there are machines in the cluster cannot be distributed equally over all machines, since files are not split.

Figure 4.9: Plot showing the dataset distribution on 12 Jan 2016. There is a total of 25 datasets cached, of which 2 were not completely read. In addition, one machine was not active, so the actual distribution may differ.
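A minimal sketch of this metric calculation is given below; it implements Eqs. (4.3) to (4.5) for a single dataset, taking the per-machine aggregated sizes as input. The real module reads these sizes from the collector JSON shown in Listing 4.4 and handles special cases (such as inactive machines) that are omitted here.

    def distribution_metric(sizes_per_machine):
        """Normalized distribution metric: 0 = perfectly even, 1 = everything on one machine."""
        n = len(sizes_per_machine)
        total = sum(sizes_per_machine)
        if n < 2 or total == 0:
            return 0.0
        optimal = total / n                                    # Eq. (4.4)
        metric = sum(((optimal - actual) / total) ** 2         # Eq. (4.3)
                     for actual in sizes_per_machine)
        return metric / (1.0 - 1.0 / n)                        # Eq. (4.5)

    # Example: the ideal distribution from Listing 4.4, [61440, 61440], gives 0.0,
    # while putting everything on one machine, [122880, 0], gives 1.0.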

5 Conclusion and Outlook

The ekplocal HappyFace meta-monitoring instance, which was set up during this thesis, monitors the local computing setup at IEKP. This includes the HPDA caching system and the local batch system HTCondor. Multiple modules were specifically designed and developed to monitor this complex infrastructure. The new instance adds monitoring components to key infrastructures for day-to-day work at IEKP. A custom collector was implemented to aggregate data from different data sources and to provide presorted sets of data for the HappyFace modules. The different modules allow a quick overview of key system parameters.

The ekplocal project still offers room for improvements and extensions. Several features can be added to make this monitoring instance even more powerful. As the monitored resources evolve, new features and unexpected correlations will emerge, so adaptations of the modules might be required. Additional monitoring targets such as the cloud manager ROCED [30] can easily be added due to the modular design of HappyFace. ROCED is being developed at IEKP and handles the allocation of cloud resources based on current demand. It is able to request, start and shut down virtual machines from external cloud resources and to embed them into the current cluster setup. ROCED features an interface which can be used to access useful data and to implement several HappyFace modules.

In its current state, the ekplocal instance is running without major problems and provides a status report with newly aggregated data every 15 minutes (a snapshot of the web output is shown in Figure 5.1). It provides satisfactory status information about the computing instance and grants a quick overview of the overall health of the computing resources at IEKP. The new monitoring instance will help to improve the general performance of the computing resources at IEKP and to ensure a smooth analysis process in the future.

Figure 5.1: Snapshot of the ekplocal HappyFace instance. It shows the Job Status Module and some detailed information on the current status. The navigation bar with the status of the different categories is shown at the top. This snapshot was taken on 16 Feb 2016.

A Appendix

A.1 Configuration Keys

Default Keys

Every module has a set of default configuration keys. These are:

[module_name]
module = PythonScriptName
name = Name on frontend
description = description on frontend
instruction = instruction on frontend
type = rated
weight = 1.0
# source URL
sourceurl = "link to source file"
# size of the plot in y (if the module features a plot)
plotsize_y = 5
# size of the plot in x
plotsize_x = 8.9

Listing A.1: Default configuration keys.

Depending on the module, a number of additional configuration keys are used.

Job Status

# status parameters
# the minimum amount of jobs required to determine a status
jobs_min = 50
# maximum time, in hours, jobs may stay queued for the status
qtime_max = 48
# ratio between running and idle jobs for the status
running_idle_ratio = 0.2
# how many long-queued jobs for the status
qtime_max_jobs = 20
# minimal efficiency for the status
min_efficency = 0.7
# different sites - input a python list with strings
sites = ["gridka", "ekpcms6", "ekpcloud", "ekpsg", "ekpsm", "bwforcluster"]
# additional plotting parameters
# distance between the biggest bar and the right end of the plot in %
plot_right_margin = 0.1
# width of the bars in the plot
plot_width = 0.6
# how many bars the plot shows at least before scaling bigger
min_plotsize = 3
# x-value above which a log scale is used in the plot
log_limit = 500

Listing A.2: Additional Job Status configuration keys.

jobs_min: The minimum number of jobs required to determine a status.

qtime_max and qtime_max_jobs: If more than qtime_max_jobs jobs have been queued for longer than qtime_max without starting, the module status turns critical.

running_idle_ratio: The minimum ratio between running and idle jobs. If the ratio is below the given parameter, the status turns critical.

min_efficiency: If the average efficiency is below this value, the module status turns critical.

Site Status

# status parameters
# how many slots per machine should be running
machine_slot_min = 2
# how many slots must be running to determine a status
slots_min = 20
# limit for the claimed/unclaimed ratio
claimed_unclaimed_ratio = 0.3
# weak slots have a load below this value
weak_threshold = 0.5
# different sites - input a python list with strings
sites = ["gridka", "ekpcms6", "ekpcloud", "ekpsg", "ekpsm", "bwforcluster"]
# additional plotting parameters
# how many bars the plot shows at least before scaling bigger
min_plotsize = 3
# x-value above which a log scale is used in the plot
log_limit = 500
# width of the bars in the plot
plot_width = 0.3
# distance between the biggest bar and the right end of the plot in %
plot_right_margin = 0.1

Listing A.3: Additional Site Status configuration keys.

machine_slots_min: How many slots should be claimed per machine.

slots_min: The minimum number of slots required to determine a status.

claimed_unclaimed_ratio: The minimum ratio between claimed and unclaimed slots. If the ratio is below the given parameter, the status turns critical.

weak_threshold: Claimed slots with a load below this value are considered weak slots.

History

# number of hours in the plot (the maximum is 24, given by the constraint on the condor_history command)
plotrange = 24
# width of the bars in the plot
plot_width = 1

Listing A.4: Additional History Module configuration keys.

Cache Modules

The cache modules use histograms to display data. Therefore, an additional configuration parameter is provided.

# number of bins in the histograms
nbins = 50

Listing A.5: Additional cache module configuration keys.
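How the modules actually read these keys is not shown in this appendix; as an illustration only, the following sketch parses such a section with Python's standard configparser module and converts the sites entry, which is written as a Python-style list of strings, with ast.literal_eval. The section name and the embedded example text are assumptions for this sketch.

import ast
import configparser

# hypothetical excerpt of a module configuration (cf. Listing A.2)
cfg_text = """
[job_status]
jobs_min = 50
qtime_max = 48
min_efficency = 0.7
sites = ["gridka", "ekpcms6", "ekpcloud", "ekpsg", "ekpsm", "bwforcluster"]
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)

jobs_min = config.getint("job_status", "jobs_min")
qtime_max = config.getfloat("job_status", "qtime_max")
# key name spelled as it appears in Listing A.2
min_efficiency = config.getfloat("job_status", "min_efficency")
# the sites key holds a Python-style list of strings, hence ast.literal_eval
sites = ast.literal_eval(config.get("job_status", "sites"))

print(jobs_min, qtime_max, min_efficiency, sites)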

A.2 Additional Plots and Tables

Figure A.1: This plot shows the distributions used in the Cache Details Module. The plot was created on 10 Jan 2016 at 15:24. As expected, the allocation time plot shows peaks for every machine. The other parameters differ depending on file size and machine.

Table A.1: Detailed table to Figure 4.3.

User           cheidecker@physik.uni-karlsruhe.de    gfleig@physik.uni-karlsruhe.de
Host           Undefined, ekpsg, ekpsm               ekpsg, ekpsm
Queued Jobs    50                                    0
Idle Jobs      0                                     0
Running Jobs   34                                    64
Removed Jobs   0                                     0
Cores used     34                                    64
RAM used       2.9 GB                                1.2 GB
Efficiency     0.0                                   0.01

Figure A.2: The cache lifetime plot shows data from Sept 2015. After September, no files were automatically deleted from the cache, so this older data has to suffice.

Figure A.3: Snapshot of the ekplocal HappyFace instance. It shows the Caching Details Module.