High-density Grid storage system optimization at ASGC. Shu-Ting Liao, ASGC Operation Team, ISGC 2011
Outline: Introduction to the ASGC Grid storage system; Storage status and issues in 2010; Storage optimization; Summary
Introduction: ASGC provides data-intensive storage services for both the WLCG Tier-1 (Taiwan-LCG2) and Tier-2 (TW-FTT) centers through CASTOR and DPM. More than 120 disk servers supported ~5.8 PB of inbound and outbound data for the ATLAS and CMS experiments during 2010 data taking. Optimizing capacity and storage efficiency in a complex storage architecture, within the limited space of the data center, is a challenging task.
T1 Disk Utilization: average T1 storage utilization 54.34%. Mar. 2011: newly procured 2 PB of disk brought online.
T1 Tape Utilization: average T1 tape utilization 27.51%. May 2010: tape HA with a second robot was implemented. Feb. 2011: reached 4 PB of tape capacity.
ATLAS Transfers at ASGC, from 2010-01-01 to 2010-12-31: 1.73 PB flowed into the ASGC T1; 65 TB flowed into the ASGC T2.
CMS Transfers at ASGC, from 2010-01-01 to 2010-12-31: 2.77 PB flowed in/out of the ASGC T1; 753 TB flowed in/out of the ASGC T2.
Storage issues in 2010: over 50% of operational issues came from the storage system. Disk server configuration: bottlenecks on disk servers with few CPU cores, little memory, and limited bandwidth. With one blade server connected to each array, costs are high for blade servers, array controllers, rack space, and power consumption.
Storage System Optimization. Storage system upgrade: disk server IBM HS22 blade (CPU: Intel Xeon L5640 2.27 GHz, memory: 24 GB, 10GbE); backend arrays: dual controller plus JBOD x 3, 6Gb SAS, 4U 24-bay x 4, 8Gb/s x 2 FC host. Storage optimization: RAID5; multipathing and LUN affinity; tuned storage controller, OS (I/O scheduler, TCP buffers), and storage middleware (CASTOR and DPM). New configuration: 10GbE with 8Gb Fibre DAS; old configuration: 1/2 GbE with 4Gb Fibre DAS.
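As a rough sketch of the OS-level tuning mentioned above (the slides do not give the actual values, so the device name and settings below are assumptions for a RHEL/SL5-era disk server):

    # I/O scheduler and read-ahead on the array block device (sdb is a placeholder)
    echo deadline > /sys/block/sdb/queue/scheduler
    blockdev --setra 8192 /dev/sdb
    # TCP buffer limits for high-throughput GridFTP/WAN transfers (illustrative values)
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"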
Disk Server Performance (10GbE, 8Gb Fibre DAS). Observations while testing with ATLAS data transfers: peak network ~950 MB/s, peak array IO ~1,210 MB/s. Completed 4,916 file transfers with 6.66 TB of data flow in a few hours, with 2 errors (1 SRM_FAILURE, 1 globus_xio). A single server can handle more than 200 concurrent GridFTP transfers.
Storage Optimization Case I: Controller
Write Test: each client copies a different file to the disk server using rfcp (RFIO); 35 clients start copying simultaneously.
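A minimal sketch of how such a write test could be driven from the clients; the actual harness is not shown in the slides, and the host name, target path, and file size below are placeholders:

    # 35 parallel rfcp (RFIO) writes, one distinct file per client
    for i in $(seq 1 35); do
        dd if=/dev/urandom of=/tmp/testfile_$i bs=1M count=1
        rfcp /tmp/testfile_$i diskserver.example.org:/pool/test/testfile_$i &
    done
    wait    # block until all 35 copies have finished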
[Plot: low disk throughput. 1 MB x 35 write test, single controller; network vs. disk throughput (MB/s) and CPU system/wait/idle (%) over time. Disk throughput stays low for the whole run.]
[Plot: improving disk throughput. Same test, single controller; better disk throughput after updating the controller firmware.]
Storage Optimization Case II: Multipath
[Plot: turbulent disk throughput. 1 MB x 35 write test, dual controller; network vs. disk throughput and CPU utilization over time. Disk throughput fluctuates heavily.]
[Plot: improving disk throughput. 1 MB x 35 write test, dual controller; consistent network and disk throughput after switching from the Linux default multipathing to VTrak multipathing.]
Multipath note: use VTrak multipathing, set up ALUA, and enable LUN affinity.
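The slides use the Promise VTrak multipath driver, whose configuration is not shown; for comparison only, an ALUA-aware setup with the stock Linux dm-multipath would look roughly like the following (vendor/product strings and values are assumptions, and the exact syntax depends on the multipath-tools version):

    # /etc/multipath.conf (illustrative; not the VTrak driver actually used)
    devices {
        device {
            vendor                "Promise"
            product               "VTrak"
            path_grouping_policy  group_by_prio   # keep I/O on the controller that owns the LUN
            prio                  alua            # derive path priorities from ALUA port groups
            hardware_handler      "1 alua"
            failback              immediate
            no_path_retry         12
        }
    }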
Storage Optimization Case III: CASTOR and DPM
[Plot: CASTOR shows poor efficiency when writing small files. 1 KB files x 35 clients; disk/network throughput (KB/s) and CPU utilization over time. Transfers stall while the next job is being scheduled.]
[Plot: CASTOR with LSF tuned. 1 KB files x 35 clients; transfers become smoother after tuning the CASTOR LSF scheduler.]
LSF tuning note: in lsb.queues, adjust NEW_JOB_SCHED_DELAY and CHUNK_JOB_SIZE.
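The slides only name the two lsb.queues parameters; the queue stanza below is a sketch with assumed values, showing where they would be set for the CASTOR transfer queue:

    # lsb.queues (illustrative values only)
    Begin Queue
    QUEUE_NAME          = castor_default
    NEW_JOB_SCHED_DELAY = 0    # start a scheduling session as soon as new jobs arrive
    CHUNK_JOB_SIZE      = 10   # dispatch short jobs in chunks of 10 to cut scheduling overhead
    DESCRIPTION         = CASTOR disk-to-disk transfer queue
    End Queue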
[Plots: DPM tuning, 1 KB x 35. DPNS threads 20: duration 1m 22s; DPNS threads 50: duration 44s; DPNS threads 80: duration 47s.]
[Plots: DPM tuning, 1 MB x 35. DPNS threads 20: duration 1m 32s; DPNS threads 50: duration 1m 18s; DPNS threads 80: duration 1m 18s.]
DPM tuning note: increase the DPNS thread count in /etc/sysconfig/dpnsdaemon.
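A minimal sketch of the corresponding change, assuming the standard DPM sysconfig variable for the DPNS thread pool (the value ASGC finally adopted is not stated in the slides):

    # /etc/sysconfig/dpnsdaemon (illustrative)
    NB_THREADS=50                    # DPNS name-server threads; the default is 20
    # then restart the name server:  service dpnsdaemon restart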
[Plots: DPM vs. CASTOR, 1 KB x 35. CASTOR duration: 25s; DPM duration: 44s (disk/network throughput and CPU utilization).]
[Plots: DPM vs. CASTOR, 1 MB x 35. DPM duration: 1m 32s; CASTOR duration: 24s (network/disk throughput and CPU utilization).]
[Plots: DPM vs. CASTOR, 1 MB x 35. DPM duration: 1m 51s; CASTOR duration: 2m 21s (network/disk throughput and CPU utilization).]
[Plots: DPM vs. CASTOR, 1 GB x 35. DPM duration: 20m 40s; CASTOR duration: 24m 49s (network/disk throughput, memory used, and CPU utilization).]
Summary and Next Steps: the optimized configuration on recent hardware was able to reach close to full-speed performance. Improvements from the new configuration: better throughput (~950 MB/s peak) seen during testing; rack space and power savings (a 7U blade chassis can now serve more than 2 PB; roughly 2,100 W vs. 2,900 W); cost savings by reducing the number of controllers and blade servers. The new configuration with 2.3 PB went into production on Mar. 17; monitoring continues to ensure performance. Next step: reconfigure and optimize the old disk arrays.
Thank You for Your Attention!