LHCb on-line / off-line computing

Off-line computing
LHCb on-line / off-line computing - Domenico Galli, Bologna - INFN CSN1, Lecce.
- We plan the LHCb-Italy off-line computing resources to be as centralized as possible: put as much computing power as possible into the CNAF Tier-1,
  - to minimize the system-administration manpower;
  - to optimize the exploitation of the resources.
- "Distributed" for us means distributed among CNAF and the other European Regional Centres.
- Possible drawback: a strong dependence on the CNAF resource sharing.
- The improvement brought by setting up Tier-3 centres in the major INFN sites, for parallel ntuple analysis, should be evaluated later.

2003 Activities
- In 2003 LHCb-Italy contributed to DC03 (the production of the MC samples for the TDR): 47 Mevt produced in about 60 days.
  - 32 Mevt minimum bias, 10 Mevt inclusive b, and 5 signal samples of 5 to 100 kevt each.
  - 8 computing centres involved.

2003 Activities (II)
- The Italian contribution to DC03 was obtained with limited resources (about 40 kSI2k of 1 GHz PIII CPUs).
- Larger contributions (Karlsruhe, DE; Imperial College, UK) came from the huge, dynamically allocated resources of those centres.
- DIRAC, the LHCb distributed MC production system, was used to produce 366 jobs; 85% of them ran outside CERN, with a 92% mean efficiency.
- Italian contribution: 5% (below the target share).

2003 Activities (III)
- DC03 was also used to validate the LHCb distributed analysis model:
  - distribution to the Tier-1 centres of the signal and background MC samples stored at CERN during the production;
  - the samples were pre-reduced according to kinematical or trigger criteria;
  - the selection algorithms for the specific decay channels (~30) were then executed.

2003 Activities (IV)
- To analyse high-statistics data samples, the PVFS distributed file system was used: ~100 MB/s of aggregate I/O over 1000Base-T Ethernet connections (to be compared with the ~5 MB/s of a typical 100Base-T NAS).
- The events were classified by means of the tagging algorithms.
- LHCb-Italy contributed to the implementation of the selection algorithms for B decays into 2 charged pions/kaons.

2003 Activities (V)
- The analysis work by LHCb-Italy was included in the Reoptimized Detector Design and Performance TDR (2-hadron channels + tagging).
- 3 LHCb internal notes were written:
  - CERN-LHCb/2003-123, Bologna group, "Selection of B/Bs -> h+h- decays at LHCb";
  - CERN-LHCb/2003-124, Bologna group, "CP sensitivity with B/Bs -> h+h- decays at LHCb";
  - CERN-LHCb/2003-115, Milano group, "LHCb flavour tagging performance".

Software Roadmap
- (figure-only slide; no text recovered from the transcription)

DC04 (April-June 2004): Physics Goals
- Demonstrate the performance of the HLTs (needed for the computing TDR): large minimum-bias sample + signal.
- Improve the B/S estimates of the optimisation TDR: large bb sample + signal.
- Physics improvements to the generators.

DC04: Computing Goals
- Main goal: gather the information to be used for writing the LHCb computing TDR.
- Robustness test of the LHCb software and production system, using software as realistic as possible in terms of performance.
- Test of the LHCb distributed computing model, including distributed analyses.
- Incorporation of the LCG application-area software into the LHCb production environment.
- Use of LCG resources as a substantial fraction of the production capacity.

DC04 Production Scenario
- Generate (Gauss, SIM output): 5 million minimum-bias events, 5 million inclusive b-decay events, 2 million exclusive b decays in the channels of interest.
- Digitize (Boole, DIGI output): all events, applying the L0+L1 trigger decision.
- Reconstruct (Brunel, DST output): the minimum-bias and inclusive b-decay events passing the L0 and L1 triggers, and the entire exclusive b-decay sample.
- Store: SIM+DIGI+DST of all reconstructed events.

Goal: Robustness Test of the LHCb Software and Production System
- First use of the simulation program Gauss, based on Geant4.
- Introduction of the new digitisation program, Boole, with HLTEvent as output.
- Robustness of the reconstruction program, Brunel, including any new tuning or other available improvements (not including mis-alignment/calibration).
- Pre-selection of events based on physics criteria (DaVinci), a.k.a. "stripping": performed by the production system after the reconstruction, producing multiple DST output streams.
- Further development of the production tools (DIRAC etc.), e.g. integration of the stripping, book-keeping improvements, monitoring improvements.

Goal: Test of the LHCb Computing Model
- Distributed data production: as in 2003, it will be run on all available production sites, including LCG, controlled by the production manager at CERN in close collaboration with the LHCb production site managers.
- Distributed data sets:
  - CERN: the complete DST (copied from the production centres) and the master copies of the pre-selections (stripped DST);
  - Tier-1: a complete replica of the pre-selections, the master copy of the DST produced at the associated sites, and the master (unique!) copy of the SIM+DIGI produced at the associated sites.
- Distributed analysis.

Goal: Incorporation of the LCG Software
- Gaudi will be updated to use the POOL (hybrid persistency implementation) mechanism and certain SEAL (general framework services) services, e.g. the plug-in manager.
- All the applications will use the new Gaudi; this should be ~transparent, but must be commissioned.
- N.B.: POOL provides the existing functionality of ROOT I/O and more (e.g. location-independent event collections), but it is incompatible with the existing TDR data; we may need to convert them if we want just one data format.

Needed Resources for DC04
- The CPU requirement is about 10 times what was needed for DC03.
- Current resource estimates indicate that DC04 will last 3 months (assuming that Gauss is twice slower than SICBMC); it is currently planned for April-June.
- Resources requested from the Bologna Tier-1 for DC04: CPU power 200 kSI2k (500 1 GHz PIII CPUs); disk: 5 TB; tape: 5 TB.
- GOAL: use of LCG resources as a substantial fraction of the production capacity; we can hope for up to 50%.
- Storage requirements: 6 TB at CERN for the complete DST; 9 TB distributed among the Tier-1s for the locally produced SIM+DIGI+DST; up to 10 TB per Tier-1 for the pre-selected DSTs.

Tier-1 Growth in the Next Years
- (table: requested CPU power [kSI2k], disk [TB] and tape [TB] per year; the figures were not recovered in this transcription)

Online Computing
- LHCb-Italy is involved in the online group, to design the L1/HLT trigger farm.
- Sezione di Bologna: G. Avoni, A. Carbone, U. Marconi, G. Peco, M. Piccinini, V. Vagnoni.
- Sezione di Milano: T. Bellunato, L. Carbone, P. Dini.
- Sezione di Ferrara: A. Gianoli.

Online Computing (II)
- Lots of changes since the Online TDR:
  - the Network Processors have been abandoned;
  - the Level-1 DAQ is now included;
  - Gigabit Ethernet now comes directly from the readout boards;
  - destination assignment by the TFC (Timing and Fast Control).
- The main ideas stay the same:
  - a large Gigabit Ethernet Local Area Network connecting the detector sources to the destinations;
  - a simple (push) protocol, with no event-manager (a minimal sketch of such a push transfer is given after the DAQ architecture diagram below);
  - commodity components wherever possible;
  - everything controlled, configured and monitored by the ECS (Experiment Control System).

DAQ Architecture
- (diagram: the front-end electronics boards feed, through a multiplexing layer, a Gigabit Ethernet readout network; an L1-decision sorter, the CPU farm, the storage system and the TFC system complete the picture; Level-1 traffic, HLT traffic and mixed traffic are shown separately; the link counts and aggregate rates were not recovered in this transcription)
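To make the "simple (push) protocol, no event-manager" idea above concrete, here is a minimal user-space sketch in C: each source wraps an event fragment in a small header and pushes it over UDP to whatever destination has been assigned for that event, without any acknowledgement. The frag_hdr layout, the destination table and the modulo assignment rule are hypothetical illustrations, not the actual LHCb readout protocol.

    /* Minimal sketch of a "push" event-fragment sender over UDP.
     * The header layout (frag_hdr) and the destination-assignment rule
     * are hypothetical illustrations, not the LHCb protocol. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct frag_hdr {              /* hypothetical fragment header        */
        uint32_t event_id;         /* event number, assigned upstream     */
        uint16_t source_id;        /* which readout board sent this       */
        uint16_t payload_len;      /* payload bytes following the header  */
    };

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        /* Destinations (farm nodes) that events could be assigned to;
           the addresses are placeholders. */
        const char *dest_ip[2] = { "192.168.1.10", "192.168.1.11" };

        unsigned char buf[sizeof(struct frag_hdr) + 1024];

        for (uint32_t ev = 0; ev < 1000; ev++) {
            struct frag_hdr h = { htonl(ev), htons(7), htons(1024) };
            memcpy(buf, &h, sizeof h);
            memset(buf + sizeof h, 0xAB, 1024);   /* fake detector data */

            /* "Destination assignment": here simply event_id modulo the
               number of nodes; no central event-manager is consulted. */
            struct sockaddr_in dst;
            memset(&dst, 0, sizeof dst);
            dst.sin_family = AF_INET;
            dst.sin_port = htons(5000);
            inet_pton(AF_INET, dest_ip[ev % 2], &dst.sin_addr);

            /* Push and forget: no acknowledgement, no retransmission. */
            sendto(s, buf, sizeof h + 1024, 0,
                   (struct sockaddr *)&dst, sizeof dst);
        }
        close(s);
        return 0;
    }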

Following the Data-Flow
- (diagram: the same architecture followed along the data path of a single event; the example candidate, a B -> ΦKs decay, is accepted by L0, L1 and the HLT and is finally written to the storage system; the link counts and rates were not recovered in this transcription)

Design Studies
- Items under study:
  - physical farm implementation (choice of the cases, cooling, etc.);
  - farm management (bootstrap procedure, monitoring);
  - Subfarm Controllers (event-builders, load-balancing queues);
  - Ethernet switches;
  - integration with TFC and ECS;
  - system simulation.
- LHCb-Italy is involved in the farm management, in the Subfarm Controllers and in their communication with the Subfarm Nodes.

Tests in Bologna
- To begin the activity in Bologna we started (August 2003) from scratch, trying to transfer data over 1000Base-T (Gigabit Ethernet on copper cables) from PC to PC and to measure the performance.
- Since we plan to use an unreliable protocol (raw Ethernet, raw IP or UDP), because reliable ones (like TCP, which retransmits unacknowledged data) introduce unpredictable latency, we need to benchmark data loss together with throughput and latency (a sketch of such a sequence-numbered sender is given below).

Tests in Bologna (II): Previous Results
- The IEEE 802.3 standard specifies, for 100 m long Cat 5e cables, a BER (Bit Error Rate) < 10^-10.
- Previous measurements, performed by A. Barczyk, B. Jost and N. Neufeld using Network Processors (not real PCs) and 100 m long Cat 5e cables, showed a BER < 10^-14.
- Recent measurements (presented by A. Barczyk at Zürich, 8.9.2003), performed using PCs, gave a frame drop rate of O(10^-6).
- A lot of data (too much for L1!) gets lost inside the kernel network-stack implementation of the PCs.
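The "self-made basic sender" used for this kind of loss measurement (described in the benchmark-software slide further on) can be sketched as follows: every datagram carries a sequence number in its first bytes, so that the receiving end can count exactly how many datagrams never arrived. The datagram size, the count and the command-line interface are illustrative choices, not the original Bologna code.

    /* Sketch of a sequence-numbered UDP sender for loss measurements.
     * Not the original benchmark program; size, count and destination
     * are placeholders. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define DGRAM_SIZE 4096          /* bytes of UDP payload per datagram */
    #define N_DGRAMS   10000000UL    /* datagrams per run                 */

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s dest_ip port\n", argv[0]); return 1; }

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        struct sockaddr_in dst;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons((uint16_t)atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &dst.sin_addr);

        unsigned char buf[DGRAM_SIZE] = {0};

        for (uint64_t seq = 0; seq < N_DGRAMS; seq++) {
            /* The first 8 bytes of the payload carry the sequence number,
               so the receiver can detect every missing datagram
               (identical byte order on both hosts is assumed here). */
            memcpy(buf, &seq, sizeof seq);

            if (sendto(s, buf, sizeof buf, 0,
                       (struct sockaddr *)&dst, sizeof dst) < 0)
                perror("sendto");   /* e.g. ENOBUFS if the tx queue is full */
        }
        close(s);
        return 0;
    }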

Tests in Bologna (III)
- Transferring data over 1000Base-T Ethernet is not as trivial as it was for 100Base-TX Ethernet:
  - a new bus (PCI-X) and new chipsets (e.g. Intel E7xxx, 875P) have been designed to sustain the Gigabit NIC data flow (the old PCI bus and chipsets do not have enough bandwidth to drive a Gigabit NIC at Gigabit rate);
  - the Linux kernel implementation of the network stack has been rewritten twice since kernel 2.4 to support Gigabit data flow (the networking code is a sizeable fraction of the kernel source);
  - the last modification implies a change of the kernel-to-driver interface (the network drivers must be rewritten);
  - the standard Linux Red Hat 9 setup uses the back-compatibility path and loses packets;
  - not many people are interested in achieving very low packet loss (except for video streaming);
  - a DATATAG group is also working on packet losses (M. Rio, T. Kelly, M. Goutelle, R. Hughes-Jones, J.P. Martin-Flatin, "A map of the networking code in Linux Kernel 2.4.20", draft 8, 8 August 2003).

Tests in Bologna: Results Summary
- The throughput was always higher than expected (957 Mb/s of IP payload measured), while data loss was our main concern.
- We have understood, first (at least) within the LHCb collaboration, how to send IP datagrams at Gigabit/s rate from Linux to Linux over 1000Base-T Ethernet without datagram loss (4 datagrams lost per 2x10^10 datagrams sent).
- This required using the appropriate software:
  - a NAPI kernel (2.4.20 or later);
  - NAPI-enabled drivers (for the Intel e1000 driver, a recompilation with a special flag set was needed);
  - kernel-parameter tuning (buffer and queue lengths);
  - 1000Base-T flow control enabled on the NICs.

Test-bed
- 2 PCs with 3 1000Base-T interfaces each:
  - motherboard SuperMicro X5DPL-iGM, dual Pentium IV Xeon 2.4 GHz, 1 GB ECC RAM;
  - chipset Intel E7501, 400/533 MHz FSB (front side bus);
  - bus controller hub Intel P64H2 (2 x PCI-X, 64 bit, 66/100/133 MHz);
  - on-board Ethernet controller Intel 82545EM: 1 x 1000Base-T interface (supports jumbo frames);
  - plugged-in PCI-X Ethernet card: Intel PRO/1000 MT Dual Port Server Adapter, Ethernet controller Intel 82546EB: 2 x 1000Base-T interfaces (supports jumbo frames).
- 1000Base-T 8-port switch: HP ProCurve 6108:
  - 16 Gbps backplane, non-blocking architecture;
  - latency: < 25 µs (LIFO, 64-byte packets);
  - throughput: 9 million pps (64-byte packets);
  - switching capacity: 16 Gbps.
- Cat 6e cables: max 500 MHz (cf. the 125 MHz signalling of 1000Base-T).

Test-bed (II)
- The two PCs, lhcbcn1 and lhcbcn2, are connected through the 1000Base-T switch, which also provides the uplink to the public network.
- echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
  is used so that each host answers ARP on (and therefore receives the packets of) only one interface per network.
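A matching receiver, including the application-level side of the "kernel-parameter tuning (buffer and queue lengths)" quoted in the results summary above: it asks for a large SO_RCVBUF (which Linux doubles and caps at net.core.rmem_max, adjustable via /proc/sys/net/core/rmem_max) and counts the gaps in the sequence numbers. The sizes are examples, not the values actually used in the tests.

    /* Sketch of a loss-counting UDP receiver with an enlarged socket buffer.
     * Buffer size is an example; raise net.core.rmem_max accordingly or
     * the setsockopt() request is silently capped. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define DGRAM_SIZE 4096

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s port\n", argv[0]); return 1; }

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        int rcvbuf = 1 << 20;                    /* ask for 1 MB; Linux doubles it */
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf);

        struct sockaddr_in me;
        memset(&me, 0, sizeof me);
        me.sin_family = AF_INET;
        me.sin_addr.s_addr = htonl(INADDR_ANY);
        me.sin_port = htons((uint16_t)atoi(argv[1]));
        if (bind(s, (struct sockaddr *)&me, sizeof me) < 0) { perror("bind"); return 1; }

        unsigned char buf[DGRAM_SIZE];
        uint64_t expected = 0, received = 0, lost = 0;

        for (;;) {                               /* stop with Ctrl-C */
            ssize_t n = recv(s, buf, sizeof buf, 0);
            if (n < (ssize_t)sizeof(uint64_t)) continue;

            uint64_t seq;
            memcpy(&seq, buf, sizeof seq);       /* sequence number from the sender */
            if (seq > expected) lost += seq - expected;  /* gap = missing datagrams */
            expected = seq + 1;                  /* reordering not handled: a late  */
            received++;                          /* datagram stays counted as lost  */

            if (received % 1000000 == 0)
                printf("received %llu, lost %llu\n",
                       (unsigned long long)received, (unsigned long long)lost);
        }
    }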

Test-bed (III)
- (figure: block diagram of the SuperMicro X5DPL-iGM motherboard, Intel E7501 chipset)
- The chipset internal bandwidth is guaranteed to be at least 64 Gb/s.

Benchmark Software
- We used two benchmark programs:
  - Netperf 2.2pl4 (UDP_STREAM test);
  - self-made basic sender and receiver programs, using UDP and raw IP.
- We discovered a bug in netperf on the Linux platform: the Linux calls setsockopt(SO_SNDBUF) and setsockopt(SO_RCVBUF) set the buffer size to twice the requested size, while the Linux calls getsockopt(SO_SNDBUF) and getsockopt(SO_RCVBUF) return the actual buffer size; therefore, when netperf iterates to achieve the requested precision in the results, it doubles the buffer size at each iteration, because it uses the same variable for both system calls.

Benchmark Environment
- SMP kernel 2.4.20.
- Gigabit Ethernet driver: e1000, version 5.0.43-k (as shipped with Red Hat 9) and version 5.2.16 recompiled with the NAPI flag enabled.
- System disconnected from the public network; runlevel 3 (X stopped); daemons stopped (crond, atd, sendmail, etc.).
- Flow control on (on both the NICs and the switch).
- Number of descriptors allocated for the driver rings: 256 and 4096.
- IP send and receive buffer sizes: various values (each request is doubled by the kernel, see the netperf note above).
- Tx queue length: default and increased values.
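The Linux buffer-doubling behaviour behind the netperf bug is easy to verify with a few lines of C: after setsockopt() requests N bytes, getsockopt() reports 2N, so a program that feeds the value it reads back into the next setsockopt() call keeps doubling its buffer. A minimal demonstration (not netperf itself):

    /* Demonstrates that Linux setsockopt(SO_SNDBUF) stores twice the
     * requested value, which getsockopt() then reports back. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        int asked = 32768;                 /* request 32 kB (well below wmem_max) */

        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &asked, sizeof asked);

        int got = 0;
        socklen_t len = sizeof got;
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &got, &len);

        /* On Linux this prints "asked 32768, kernel reports 65536":
           feeding 'got' back into the next setsockopt(), as netperf did,
           doubles the buffer at every iteration. */
        printf("asked %d, kernel reports %d\n", asked, got);
        close(s);
        return 0;
    }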

First Results: Linux Red Hat 9, Kernel 2.4.20, Default Setup, no Tuning
- The first benchmark results on datagram loss showed big fluctuations which, in principle, can be due to packet-queue resets, other processes, interrupts, soft-irqs, broadcast network traffic, etc.
- The resulting distribution of the lost-datagram fraction is multi-modal.
- Mean loss: about 1 datagram lost every 2,000 datagrams sent (a mean lost fraction around 5x10^-4). Too much for the LHCb L1!
- (histograms: lost-datagram fraction for 26 fixed-size runs under the default conditions - flow control on, uplink off, Tx-ring and Rx-ring of 256 descriptors, default send/receive buffers and Tx queue length, e1000 driver without NAPI; the distributions show several well-separated peaks)
- We think that the peak behaviour is due to kernel queue resets (all the packets in a queue are silently dropped when the queue is full).

Changes in the Linux Network Stack Implementation
- 2.2: netlink, bottom halves, HFC (hardware flow control).
  - As little computation as possible is done in interrupt context (interrupts disabled): part of the processing is deferred from the interrupt handler to bottom halves, executed at a later time (with interrupts enabled).
  - HFC (to prevent interrupt livelock): when the backlog queue is completely full, interrupts are disabled until the backlog queue has been emptied.
  - Bottom-half execution is strictly serialized among the CPUs: only one packet at a time can enter the system.
- 2.4: softnet, softirqs.
  - Softirqs are software threads that replace the bottom halves, allowing parallelism on SMP machines.
- 2.4.20 (N.B.: a back-port from the 2.5 development series): NAPI (New API), an interrupt-mitigation technology (a mixture of interrupt and polling mechanisms).

Interrupt Livelock
- Given the incoming interrupt rate, the IP processing thread never gets a chance to remove any packets from the system.
- So many interrupts come into the system that no useful work is done: packets go all the way to being queued, but are then dropped because the backlog queue is full.
- System resources are used extensively, but no useful work is accomplished.
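The "dropped because the backlog queue is full" failure mode described above can be watched from user space: since the softnet rewrite the kernel exposes per-CPU counters in /proc/net/softnet_stat, whose second (hexadecimal) column counts packets dropped for lack of room in the backlog queue. A small reader, assuming that standard field layout:

    /* Prints, for each CPU, the softnet counters: total processed packets
     * and packets dropped because the backlog queue was full (fields 1 and
     * 2 of /proc/net/softnet_stat, hexadecimal, cumulative since boot). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/net/softnet_stat", "r");
        if (!f) { perror("/proc/net/softnet_stat"); return 1; }

        char line[512];
        int cpu = 0;
        while (fgets(line, sizeof line, f)) {
            unsigned int total, dropped;
            /* The first two hex fields are enough for this check. */
            if (sscanf(line, "%x %x", &total, &dropped) == 2)
                printf("CPU%d: processed %u, dropped (backlog full) %u\n",
                       cpu, total, dropped);
            cpu++;
        }
        fclose(f);
        return 0;
    }

Watching the dropped column grow during a run helps distinguish backlog-queue losses from losses on the wire or in the NIC.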

NAPI (New API)
- NAPI is an interrupt-mitigation mechanism built from a mixture of interrupt and polling:
  - polling is useful under heavy load, but it introduces more latency under light load and abuses the CPU by polling devices that have no packet to offer;
  - interrupts improve the latency under light load, but they make the system vulnerable to livelock as the interrupt load exceeds the MLFFR (Maximum Loss Free Forwarding Rate).

Packet Reception in Linux Kernel 2.4.19 (softnet) and 2.4.20 (NAPI)
- (diagram) Softnet (kernel 2.4.19): the NIC DMA engine writes each frame into the kernel memory pointed to by an rx-ring descriptor and raises an interrupt; the interrupt handler (interrupts disabled) allocates an sk_buff, fetches the data and calls netif_rx(), which pushes the packet onto the per-CPU backlog queue (if the queue is full, it is emptied completely, to avoid interrupt livelock: HFC) and schedules a softirq via cpu_raise_softirq(); the softirq handler net_rx_action() (kernel-thread context, interrupts enabled) later pops the packets from the backlog queue and passes them to ip_rcv() for further processing.
- (diagram) NAPI (kernel 2.4.20): the interrupt handler only adds the device to the poll_list via netif_rx_schedule() and schedules the softirq; net_rx_action() then polls the device, reading the packets directly from its rx-ring and handing them to ip_rcv() for further processing.

NAPI (II)
- Under low load, before the MLFFR is reached, the system converges towards an interrupt-driven system: the packets/interrupt ratio is lower and the latency is reduced.
- Under heavy load, the system takes its time to poll the registered devices, and interrupts are allowed only as fast as the system can process them: the packets/interrupt ratio is higher and the latency is increased.

NAPI (III)
- NAPI changes the driver-to-kernel interface: all network drivers should be rewritten.
- In order to accommodate devices that are not NAPI-aware, the old interface (the backlog queue) is still available for the old drivers (back-compatibility).
- The backlog queues, when used in back-compatibility mode, are polled just like any other device.
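One simple way to see NAPI's interrupt mitigation from user space is to count how many interrupts the receiving NIC raises per second while a stream is coming in, and compare that with the known datagram rate of the benchmark: with the old driver the two rates are comparable, with NAPI the interrupt rate collapses. The rough sampler below parses /proc/interrupts and assumes that the interface name (e.g. eth1) appears on its IRQ line, as it does for the e1000 driver; it is a diagnostic aid, not part of the original test suite.

    /* Samples /proc/interrupts twice, one second apart, and prints the
     * interrupt rate of the line containing the given device name. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Sum the per-CPU counters of the /proc/interrupts line matching 'name'. */
    static unsigned long irq_count(const char *name)
    {
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) { perror("/proc/interrupts"); exit(1); }

        char line[1024];
        unsigned long sum = 0;
        while (fgets(line, sizeof line, f)) {
            if (!strstr(line, name))
                continue;
            char *p = strchr(line, ':');     /* skip the "NN:" IRQ number */
            if (!p) continue;
            p++;
            /* The following whitespace-separated decimal numbers are the
               per-CPU counts; stop at the first non-numeric token. */
            for (;;) {
                char *end;
                unsigned long v = strtoul(p, &end, 10);
                if (end == p) break;
                sum += v;
                p = end;
            }
            break;
        }
        fclose(f);
        return sum;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s eth1\n", argv[0]); return 1; }
        unsigned long a = irq_count(argv[1]);
        sleep(1);
        unsigned long b = irq_count(argv[1]);
        printf("%s: %lu interrupts/s\n", argv[1], b - a);
        return 0;
    }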

True NAPI vs. Back-Compatibility Mode
- (diagram) NAPI kernel with a NAPI-aware driver: the interrupt handler disables the NIC rx interrupts and adds the device itself to the poll_list via netif_rx_schedule(); the softirq handler net_rx_action() then polls the device, which reads the packets directly from its rx-ring and passes them to ip_rcv(); interrupts are re-enabled only once the ring has been drained.
- (diagram) NAPI kernel with an old (not NAPI-aware) driver: the interrupt handler still allocates the sk_buff and calls netif_rx(), which pushes the packet onto the per-CPU backlog queue; the backlog queue is then added to the poll_list and polled by net_rx_action() just like a device.

The Intel e1000 Driver
- Even in the latest version of the e1000 driver (5.2.16), NAPI is turned off by default (to allow the driver to be used also with kernels up to 2.4.19).
- To enable NAPI, the e1000 5.2.16 driver must be recompiled with the option:
  make CFLAGS_EXTRA=-DCONFIG_E1000_NAPI

Best Results
- Maximum transfer rate (UDP, 4096-byte datagrams): 957 Mb/s.
- Mean datagram loss fraction (at 957 Mb/s): 2x10^-10 (4 datagrams lost for 2x10^10 4k-datagrams sent), corresponding to a BER of 6.2x10^-15 (with the Cat 6e cables) if the data loss is entirely due to hardware CRC errors.

To Be Tested to Improve Further
- Kernel 2.5: fully preemptive (real time); sysenter & sysexit (instead of int 0x80) for the context switch following system calls (3-4 times faster); asynchronous datagram receiving.
- Jumbo frames: Ethernet frames whose MTU (Maximum Transmission Unit) is 9000 instead of 1500, giving less IP datagram fragmentation.
- Kernel Mode Linux (KML): a technology that enables the execution of ordinary user-space programs inside kernel space; protection by software (as in Java bytecode) instead of protection by hardware; system calls become function calls (32 times faster than int 0x80, 36 times faster than sysenter/sysexit).
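Jumbo frames, listed above among the things still to be tested, only require raising the interface MTU (provided the NIC and its driver support it, as the 82545EM/82546EB controllers of the test-bed do). Programmatically this is the standard SIOCSIFMTU ioctl, sketched below; the interface name and MTU value are command-line arguments, and the same effect is normally obtained with ifconfig or ip.

    /* Raises the MTU of a network interface (e.g. to 9000 bytes for jumbo
     * frames) with the standard SIOCSIFMTU ioctl. Must be run as root. */
    #include <net/if.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s ifname mtu\n", argv[0]); return 1; }

        int s = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
        if (s < 0) { perror("socket"); return 1; }

        struct ifreq ifr;
        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
        ifr.ifr_mtu = atoi(argv[2]);              /* e.g. 9000; the driver rejects
                                                     values it cannot support */
        if (ioctl(s, SIOCSIFMTU, &ifr) < 0) {
            perror("SIOCSIFMTU");                 /* e.g. EINVAL if jumbo frames
                                                     are not available */
            return 1;
        }
        printf("%s MTU set to %d\n", ifr.ifr_name, ifr.ifr_mtu);
        close(s);
        return 0;
    }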

Milestones
- 8/2004 - Streaming benchmarks:
  - maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet with a loopback cable;
  - test of the switch performance (streaming throughput, latency and packet loss, using standard frames and jumbo frames);
  - maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet for 2 or 3 simultaneous connections on the same PC;
  - test of event building (receive 2 message streams and send the joined message stream); a sketch of such a join is given at the end of this section.
- 12/2004 - SFC (Sub-Farm Controller) to nodes communication:
  - definition of the SFC-to-nodes communication protocol;
  - definition of the queueing and scheduling mechanisms;
  - first implementation of the queueing/scheduling procedures (possibly zero-copy).

Milestones (II)
- OS tests (if the performance needs to be improved further): kernel Linux 2.5.53; KML (Kernel Mode Linux).
- Design and test of the bootstrap procedures:
  - measurement of the failure rate of the simultaneous boot of a cluster of PCs, using PXE/DHCP and TFTP;
  - test of node switch-on/off and power cycling using ASF;
  - design of the bootstrap system (ratio of nodes to proxy servers and servers, software alignment among the servers).
- Definition of the requirements for the trigger software: error trapping, timeouts.
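For the event-building milestone (receive 2 message streams and send the joined message stream), a schematic user-space sketch is given below, under strong simplifying assumptions: both sources send their fragments in order and without loss, and prefix them with the same 32-bit event number, so the builder just reads one fragment from each socket, checks that the numbers match and sends out the concatenation. Ports, sizes and the header convention are illustrative; a real Sub-Farm Controller would need per-source queues, time-outs and load balancing.

    /* Schematic 2-stream event builder: reads one fragment per source,
     * checks the leading 32-bit event number, concatenates the payloads
     * and pushes the joined message to the destination. Assumes in-order,
     * loss-free streams; error checking is kept minimal on purpose. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    #define FRAG_MAX 4096

    static int bind_udp(uint16_t port)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in a;
        memset(&a, 0, sizeof a);
        a.sin_family = AF_INET;
        a.sin_addr.s_addr = htonl(INADDR_ANY);
        a.sin_port = htons(port);
        bind(s, (struct sockaddr *)&a, sizeof a);
        return s;
    }

    int main(void)
    {
        int in1 = bind_udp(5001), in2 = bind_udp(5002);  /* the two input streams */
        int out = socket(AF_INET, SOCK_DGRAM, 0);        /* joined output stream  */

        struct sockaddr_in dst;                          /* destination node      */
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(6000);
        inet_pton(AF_INET, "192.168.1.20", &dst.sin_addr);

        unsigned char f1[FRAG_MAX], f2[FRAG_MAX], joined[2 * FRAG_MAX];

        for (;;) {
            ssize_t n1 = recv(in1, f1, sizeof f1, 0);    /* fragment from source 1 */
            ssize_t n2 = recv(in2, f2, sizeof f2, 0);    /* fragment from source 2 */
            if (n1 < 4 || n2 < 4) continue;

            uint32_t ev1, ev2;                           /* leading event numbers  */
            memcpy(&ev1, f1, 4);
            memcpy(&ev2, f2, 4);
            if (ev1 != ev2) {                            /* streams out of step    */
                fprintf(stderr, "event mismatch: %u vs %u\n", ntohl(ev1), ntohl(ev2));
                continue;
            }

            memcpy(joined, f1, (size_t)n1);              /* header + payload 1     */
            memcpy(joined + n1, f2 + 4, (size_t)(n2 - 4)); /* payload 2, no header */
            sendto(out, joined, (size_t)(n1 + n2 - 4), 0,
                   (struct sockaddr *)&dst, sizeof dst);
        }
    }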
