RAS Enhancement Activities for Mission-Critical Linux Systems

Similar documents
Enhancement Activities on the Current Upstream Kernel for Mission-Critical Systems

Seiji Aguchi. Development Status of Troubleshooting Features, Tracing, Message Logging in Linux Kernel 5/20/2014

SystemTap for Enterprise

Adding Inter-event Capabilities to the Linux Trace Event Subsystem

Debugging Kernel without Debugger

Analyzing Kernel Behavior by SystemTap

2006/7/22. NTT Data Intellilink Corporation Fernando Luis Vázquez Cao. Copyright(C)2006 NTT Data Intellilink Corporation

Maintaining Linux Long Term & Adding Specific Features in Telecom Systems. Keika Kobayashi NEC Communication Systems Sep 29, Japan2010

System Wide Tracing User Need

The control of I/O devices is a major concern for OS designers

kdump: usage and internals

Ftrace - What s new. Since my last explanation of ftrace (from v3.18) Steven Rostedt 25/10/ VMware Inc. All rights reserved.

Verification & Validation of Open Source

Low-Overhead Ring-Buffer of Kernel Tracing in a Virtualization System

Proposal of Live Dump

March 10, Linux Live Patching. Adrien schischi Schildknecht. Why? Who? How? When? (consistency model) Conclusion

RAS and Memory Error Reporting with perf. Robert Richter 2nd CERN Advanced Performance Tuning workshop November 21, 2013

Testing real-time Linux: What to test and how.

Intro to Segmentation Fault Handling in Linux. By Khanh Ngo-Duy

The Ephemeral Smoking Gun

Evaluation of Real-time Performance in Embedded Linux. Hiraku Toyooka, Hitachi. LinuxCon Europe Hitachi, Ltd All rights reserved.

Operating System System Call & Debugging Technique

CSCE Introduction to Computer Systems Spring 2019

Improving Linux Development with better tools. Andi Kleen. Oct 2013 Intel Corporation

Operating Systems (234123) Spring 2017 (Homework Wet 1) Homework 1 Wet

Kernel Internals. Course Duration: 5 days. Pre-Requisites : Course Objective: Course Outline

Improving Linux development with better tools

Validating Core Parallel Software

Overhead Evaluation about Kprobes and Djprobe (Direct Jump Probe)

Applying User-level Drivers on

CSE 4/521 Introduction to Operating Systems. Lecture 24 I/O Systems (Overview, Application I/O Interface, Kernel I/O Subsystem) Summer 2018

How to get realistic C-states latency and residency? Vincent Guittot

Decoding Those Inscrutable RCU CPU Stall Warnings

Using kgdb and the kgdb Internals

The Note/1 Driver Design

EECS 3221 Operating System Fundamentals

EECS 3221 Operating System Fundamentals

Embedded Linux Conference 2010

Appendix A GLOSSARY. SYS-ED/ Computer Education Techniques, Inc.

Scalability Efforts for Kprobes

CSE 120. Overview. July 27, Day 8 Input/Output. Instructor: Neil Rhodes. Hardware. Hardware. Hardware

No Crash Dump? No Problem! Light-weight remote kernel crash reporting for settop boxes

Ftrace Profiling. Presenter: Steven Rostedt Red Hat

VMware vrealize operations Management Pack FOR. PostgreSQL. User Guide

LinuxCon North America 2016 Investigating System Performance for DevOps Using Kernel Tracing

The State of Kernel Debugging Technology. Jason Wessel - Product Architect for WR Linux Core Runtime - Kernel.org KDB/KGDB Maintainer

I/O Management Software. Chapter 5

Linux Security Summit Europe 2018

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Junfeng Yang, Can Sar, Dawson Engler Stanford University

Application Fault Tolerance Using Continuous Checkpoint/Restart

LINUX TRACE TOOLS. Understanding the deep roots of new-age kernel instrumentation

Inferring Temporal Behaviours Through Kernel Tracing

Decoding Those Inscrutable RCU CPU Stall Warnings

Driver Tracing Interface

Groking the Linux SPI Subsystem FOSDEM Matt Porter

The Device Driver Interface. Input/Output Devices. System Call Interface. Device Management Organization

CS5460/6460: Operating Systems. Lecture 24: Device drivers. Anton Burtsev April, 2014

Lecture 2: Architectural Support for OSes

CS2506 Quick Revision

EzBench, a tool to help you benchmark and bisect the Graphics Stack s performance

Hardware Latencies How to flush them out (A use case) Steven Rostedt Red Hat

20-EECE-4029 Operating Systems Fall, 2015 John Franco

Operating System Structure

CEC 450 Real-Time Systems

Understanding Real Time Linux. Alex Shi

1 Virtualization Recap

Memory Management. Disclaimer: some slides are adopted from book authors slides with permission 1

1. Background. 2. Demand Paging

Linux Kernel Exploitation. Where no user has gone before

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition

Analysis of User Level Device Driver usability in embedded application

File Systems. Before We Begin. So Far, We Have Considered. Motivation for File Systems. CSE 120: Principles of Operating Systems.

SCSI overview. SCSI domain consists of devices and an SDS

Kexec: Soft-Reboot and Crash-Dump Analysis for Linux and Xen

Operating System Structure

Adventures In Real-Time Performance Tuning, Part 2

Spring 2017 :: CSE 506. Device Programming. Nima Honarmand

Dynamic Tracing and Instrumentation

LinuxCon 2010 Tracing Mini-Summit

MARUTHI SCHOOL OF BANKING (MSB)

CSE 153 Design of Operating Systems

Chapter 13: I/O Systems

CSE 153 Design of Operating Systems Fall 18

Keeping Up With The Linux Kernel. Marc Dionne AFS and Kerberos Workshop Pittsburgh

CSE 120: Principles of Operating Systems. Lecture 10. File Systems. February 22, Prof. Joe Pasquale

Validating Core Parallel Software

Low overhead virtual machines tracing in a cloud infrastructure

9/19/18. COS 318: Operating Systems. Overview. Important Times. Hardware of A Typical Computer. Today CPU. I/O bus. Network

Release Notes CCURDSCC (WC-AD3224-DS)

ESE 333 Real-Time Operating Systems 163 Review Deadlocks (Cont.) ffl Methods for handling deadlocks 3. Deadlock prevention Negating one of four condit

Tolerating Malicious Drivers in Linux. Silas Boyd-Wickizer and Nickolai Zeldovich

Virtual File System -Uniform interface for the OS to see different file systems.

The Kernel Report. (Plumbers 2010 edition) Jonathan Corbet LWN.net

SYSTEM CALL IMPLEMENTATION. CS124 Operating Systems Fall , Lecture 14

File Systems: Consistency Issues

Introduction and Overview

Memory Management: The process by which memory is shared, allocated, and released. Not applicable to cache memory.

Memory Management. To improve CPU utilization in a multiprogramming environment we need multiple programs in main memory at the same time.

Linux Madwifi. Tung Le, July 2010

Chapter 12: I/O Systems

Transcription:

RAS Enhancement Activities for MissionCritical Linux Systems Hitachi Ltd. Yoshihiro YUNOMAE

01 MissionCritical Systems We apply Linux to missioncritical systems. Banking systems/carrier backend systems/train management systems and more People(consumers/providers) expect stable operation for longterm use. Don't frequently change the system configuration Changing the system introduces the risk for illegal operation. "RAS" requirements are needed. 1

02 RAS Reliability To identify problems before release e.g. Bug fixing, Testing Availability To continue the operation even if a problem occurs e.g. HA cluster system Serviceability To find out the root cause of the problem certainly in order to be able to solve it permanently e.g. Logging, Tracing, Memory dump Do the systems satisfy these requirements in current upstream kernel? Will talk about 'R' and 'S' 2

Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 3

Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 4

11 Memory dump deadlock introduction We get memory dump via Kdump when serious problems, which induce panic or oops, occur. Kdump Kernel crash dumping feature based on Kexec 1. In 1 st kernel, kernel panic occurs. 2. Execute crash_kexec() in panic() and save the memory 3. Boot 2 nd kernel (capture kernel) and copy /proc/vmcore When kernel panic occurs via NMI, Kdump operation sometimes stops before booting 2 nd kernel. 1 st kernel panic 2 nd kernel capture /proc/vmcore crash_kexec() boot disk NW media 5

12 Memory dump deadlock reason The cause of the stop is deadlock on ioapic_lock in NMI context. panic()>crash_kexec()>>disable_ioapic() > >ioapic_read_entry() ioapic_read_entry() { raw_spin_lock_irqsave(&ioapic_lock, flags); eu.entry = ioapic_read_entry(apic, pin); raw_spin_unlock_irqstore(&ioapic_lock, flags); } The scenario is 1. Get ioapic_lock for rebalancing IRQ (irq_set_affinity) 2. Inject NMI while locking ioapic_lock 3. Panic caused by NMI occurs 4. Try to execute Kdump 5. Deadlock in ioapic_read_entry() 6

13 Memory dump deadlock fixing Fixed this problem by initializing ioapic_lock before disable_io_apic(): native_machine_crash_shutdown() { #ifdef CONFIG_X86_IO_APIC + /* Prevent crash_kexec() from deadlocking on ioapic_lock. */ + ioapic_zap_locks(); disable_io_apic(); #endif } This problem has been already fixed in current kernel. (from kernel3.11) 7

Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 8

21 Serial RX interrupt frequency introduction Serial devices are mainly used not only for embedded systems but also for missioncritical systems. Maintenance Sensor feedback Serial communication has the specialty that it is resistant for noise. For a control system(one of missioncritical systems), longdistance communication is needed. If we have a sensor which sends small data packages each time and must control a device based on the sensor feedback, the RX interrupt should be triggered for each packages. Sensor 1small data pkg Server 2RX int. dev dev App 4feedback 3read 9

22 Serial RX interrupt frequency problem A test requirement of a system was that the serial communication time between send and receive has to be within 3msec. When we measured the time on 16550A, it took 10msec. It did not change even if the receiver application was operated as a real time application. We analyzed this by using event tracer of ftrace. Hard IRQ of the serial device interrupts once each 10msec, so this is caused by a HW specification or the device driver. timestamp[sec] ~10msec <idle>0 [001] 2689.160668: irq_handler_entry: irq=4 name=serial <idle>0 [001] 2689.170653: irq_handler_entry: irq=4 name=serial <idle>0 [001] 2689.180634: irq_handler_entry: irq=4 name=serial <idle>0 [001] 2689.190620: irq_handler_entry: irq=4 name=serial 10

23 Serial RX interrupt frequency reason HW spec of 16550A 16bytes FIFO buffer Flow Control Register(FCR) 2bit register Changeable RX interrupt trigger of 1, 4, 8, or 14 bytes for the FIFO buffer (0b00=1byte, 0b01=4bytes, 0b10=8bytes, 0b11=14bytes) In Linux, the trigger is hardcoded as 8bytes. [PORT_16550A] = {.name = "16550A",.fifo_size = 16,.tx_loadsz = 16,.fcr = UART_FCR_ENABLE_FIFO UART_FCR_R_TRIG_10,.flags = UART_CAP_FIFO, }, For 9600baud, an interrupt per 10msec is consistent. 8bytes trigger (start + octet + stop * 2 + parity) / 9600(baud) = 1/800 (sec/byte) = 1.25(msec/byte) 1bit 8bit 1bit * 2 1bit 1.25(msec/byte) * 8(byte) = 10msec 11

24 Serial RX interrupt frequency temporary fixing Changed FCR as a test [PORT_16550A] = {.name = "16550A",.fifo_size = 16,.tx_loadsz = 16,.fcr = UART_FCR_ENABLE_FIFO UART_FCR_R_TRIG_00,.flags = UART_CAP_FIFO, }, 1byte trigger Result The interrupt frequency is once each 1.25msec. timestamp[sec] <idle>0 [001] 3216.436959: irq_handler_entry: irq=4 name=serial <idle>0 [001] 3216.438209: irq_handler_entry: irq=4 name=serial <idle>0 [001] 3216.439454: irq_handler_entry: irq=4 name=serial <idle>0 [001] 3216.440706: irq_handler_entry: irq=4 name=serial 1.25msec We need a configurable RX interrupt trigger. 12

25 Serial RX interrupt frequency tunable patch Added new I/F to the serial driver Tunable RX interrupt trigger frequency High frequency(1byte trigger) low latency Low frequency(14byte trigger) low CPU overhead Usability problems of 1 st /2 nd patch: Using ioctl(2) (c.f. using echo command is better) Interrupt frequency can be changed only after opening serial fd Cannot read FCR value (FCR is a writeonly register) Change the ioctl(2) to sysfs I/F after discussion in a Linux community Set the interrupt trigger byte (if val is invalid, nearest lower val is set.) # echo 1 > rx_trig_byte /* 1byte trigger*/ User can read/write the trigger any time. The driver keeps FCR value if user changes interrupt trigger. This new feature will be able to be used from kernel3.17. 13

Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 14

31 PIDprocess name table in ftrace introduction ftrace is inkernel tracer for debugging and analyzing problems. ftrace records PIDs and process names in order to specify process context of an operation. If process name is indicated in a trace result, a user can understand who executed the program by doing grep with the process name. In the trace file, process names are sometimes output as <>: namepid <...>2625 [002]... 209630.888186: sys_write(fd: 8, buf: 7fd0ef836968, count: 8) <...>2625 [002]... 209630.888186: sys_enter: NR 1 (8, 7fd0ef836968, 8, 20, 0, a41) <...>2625 [002]... 209630.888186: kfree: call_site=ffffffff810e410c ptr= (null) <...>2625 [002]... 209630.888187: sys_write > 0x8 15

32 PIDprocess name table in ftrace reason ftrace has saved_cmdlines file storing the PIDprocess name mapping table. It stores the list of 128 processes that hit a tracepoint. # cat saved_cmdlines 13 ksoftirqd/1 10009 python 1718 gnomepanel 500 jbd2/sda58 If the number of processes that hit a tracepoint exceeds 128, the oldest process name is overwritten. How does ftrace manage this table? 16

33 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } 1. Get a pid member in trace_entry structure of each trace event 17

34 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } pid=1045 <PID> <map> 0 1 2 3 1045 1046 32767 32768 map_pid_to_cmdline[] 3 120 32 2. Get map# from map_pid_to_cmdline[]. The size of the array is PID_MAX_DEFAULT+1. 18

35 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } pid=1045 <PID> <map> 0 1 2 3 1045 1046 32767 32768 map_pid_to_cmdline[] 3 3. Get process name from saved_cmdlines[][]. saved_cmdlines 120 can hold 128 process names. 32 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 sleep 126 cat 127 kworker/0:1 19

36 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } pid=1046 <PID> <map> 0 1045 1046 32767 32768 map_pid_to_cmdline[] 1 2 3 Get process name of PID=1046, but 3 120 No map#, so it cannot find process name. The process name is shown as <...>. 32 20

37 PIDprocess name table in ftrace current Store map information in saved_cmdlines 0 1 2 3 4 5 126 127 map_cmdline_to_pid[] <idx> <PID> 123 582 456 780 838 1045 1321 4049 Record PID by rotation <PID> <map> 0 0 rcu_bh Managing the number of process names stored 1 in saved_cmdlines[][] 1 awk 2 bash 1045 1046 32767 32768 map_pid_to_cmdline[] 2 3 3 if ksoftirqd/0 (PID=1046) hits tracepoint, 120 32 saved_cmdlines[][] <map> <process name> 3 sleep 126 cat 127 kworker/0:1 21

38 PIDprocess name table in ftrace current Store map information in saved_cmdlines 0 1 2 3 4 5 126 127 map_cmdline_to_pid[] <idx> <PID> 123 582 1 456 Overwrite this element 2 from 1045 to 1046 780 3 838 1046 1321 4049 <PID> <map> 0 1045 1046 32767 32768 map_pid_to_cmdline[] 3 120 32 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 sleep 126 cat 127 kworker/0:1 22

39 PIDprocess name table in ftrace current Store map information in saved_cmdlines 0 1 2 3 4 5 126 127 map_cmdline_to_pid[] <idx> <PID> 123 582 456 780 838 1046 1321 4049 <PID> <map> 0 1 2 Delete old map# in order to avoid double 3 mapping 1045 1046 32767 32768 map_pid_to_cmdline[] 120 32 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 sleep 126 cat 127 kworker/0:1 23

310 PIDprocess name table in ftrace current Store map information in saved_cmdlines 0 1 2 3 4 5 126 127 map_cmdline_to_pid[] <idx> <PID> 123 582 456 780 838 1046 1321 4049 <PID> <map> 0 1 2 3 Overwrite the process name in map#3 1045 1046 32767 32768 map_pid_to_cmdline[] 3 120 32 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 ksoftirqd/0 126 cat 127 kworker/0:1 24

310 PIDprocess name table in ftrace current Store map information in saved_cmdlines 0 1 2 3 4 5 126 127 map_cmdline_to_pid[] <idx> <PID> 123 582 456 780 838 1046 1321 4049 <PID> <map> 0 1 2 3 1045 1046 Develop 32767 tunable 120 I/F 32 32768 map_pid_to_cmdline[] 3 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 ksoftirqd/0 126 cat 127 kworker/0:1 25

311 PIDprocess name table in ftrace tunable patch We added the changeable I/F 'saved_cmdlines_size' to expand the max number of saved process names. Read/write saved_cmdlines_size For write, all saved_cmdlines information are cleared. Max size: PID_MAX_DEFAULT(32768) If we set 32768, all process names can be stored. # cat saved_cmdlines_size 128 /* defalut value*/ // Switch to new saved_cmdlines buffers # echo 1024 > saved_cmdlines_size # cat saved_cmdlines_size 1024 /* Store 1024 process names */ This new feature can be used from kernel3.16. 26

Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 27

41 printk fragmentation problem introduction printk message outputs error logging or debugging information in kernel. We handle automatically printk messages in user space in order to detect that the system has became unstable. We want the kernel to output printk as expected. printk messages are sometimes mixed with similar messages. It is difficult to automatically handle an event from mixed messages. mixed kernel error messages in SCSI layer Which process does this [110781.736171] sd 2:0:0:0: [sdb] message belong to? [110781.736170] sd 3:0:0:0: [sdc] Unhandled sense code [110781.736172] sd 3:0:0:0: [sdc] [110781.736175] Result: hostbyte=did_ok driverbyte=driver_sense [110781.736177] sd 3:0:0:0: [sdc] [110781.736178] Sense Key : Medium Error [current] [110781.736187] Sense Key : Recovered Error [110781.736189] [current] 28

42 printk fragmentation problem introduction Mixed messages can occur when multiple printk() are executed at the same time. <CPU0> <CPU1> printk("sd 2:0:0:0: [sdb] n"); break into printk("sd 3:0:0:0: [sdc] n"); printk("sense Key : Medium Error n"); printk("sense Key : Recovered Error n"); [110781.736171] sd 2:0:0:0: [sdb] [110781.736177] sd 3:0:0:0: [sdc] [110781.736178] Sense Key : Medium Error [current] [110781.736187] Sense Key : Recovered Error 29

43 printk fragmentation problem Solution How to solve 1. Store all continuous messages in local buffer as temporary, and execute printk This idea is rejected by SCSI community. https://lkml.org/lkml/2014/5/20/742 To store continuous messages, we need big buffer. This can induce buffer overflow for deep nesting. Of course, memory allocation is invalid. 2. Add information necessary to merge all fragmented printk messages This idea is also rejected. https://lkml.org/lkml/2014/5/19/285 The community said this problem should be fixed for each subsystem. 3. Use traceevents of ftrace to output atomically only for SCSI layer This is not for all printk messages. 30

44 printk fragmentation problem Key idea traceevents can be atomically stored to ring buffer. Kernel preemption is disabled. A ring buffer a CPU We don't need to concern about mixed traceevent. Use trace_seq_printf() for traceevent Add event information using not only macros but functions scsitrace.c has already used this, but it does not have error messages. We added new three traceevents for SCSI error messages. scsi_show_result: output driverbyte, hostbyte scsi_print_sense: output sense key with asc and ascq scsi_print_command: output SCSI command 31

45 printk fragmentation problem Result A result of dmesg in current kernel [ 6379.535874] sd 2:0:0:0: [sda] [ 6379.538376] Result: hostbyte=did_ok driverbyte=driver_sense [ 6379.542083] sd 2:0:0:0: [sda] [ 6379.544556] Sense Key : Medium Error [current] [ 6379.549988] sd 2:0:0:0: [sda] [ 6379.552408] Add. Sense: Unrecovered read error [ 6379.574040] sd 3:0:0:0: [sdb] [ 6379.576576] Result: hostbyte=did_ok driverbyte=driver_sense [ 6379.580299] sd 3:0:0:0: [sdb] [ 6379.582727] Sense Key : Medium Error [current] A result of ftrace with our patch scsi_show_result: [sda] result=(driver=driver_sense host=did_ok) scsi_print_sense: [sda] Sense Key (Medium Error [current]) Add. Sense (Unrecovered read error) scsi_show_result: [sdb] result=(driver=driver_sense host=did_ok) scsi_print_sense: [sdb] Sense Key (Medium Error [current]) Add. Sense (Unrecovered read error) atomic atomic 32

46 printk fragmentation problem current status Current patch https://lkml.org/lkml/2014/8/8/221 Any comments are welcome! 33

5 Summary We are doing community activities for realizing Linux which satisfies RAS requirements for missioncritical systems. Bug fixing (avoid the deadlock on Kdump) Add features (tunable serial RX trigger, tunable saved_cmdlines, and SCSI traceevents) These activities can be used for not only missioncritical systems but also other systems. For example, fragmented printk is a big problem for a support division of system integrators. 34

Any questions? 35

Legal statements Linux is a registered trademark of Linus Torvalds. All other trademarks and copyrights are the property of their respective owners. 37