Enhancement Activities on the Current Upstream Kernel for Mission-Critical Systems

Size: px

Start display at page:

Download "Enhancement Activities on the Current Upstream Kernel for Mission-Critical Systems"

Amberly Hutchinson
6 years ago
Views:

1 Enhancement Activities on the Current Upstream Kernel for MissionCritical Systems Hitachi Ltd. Yoshihiro YUNOMAE

2 01 MissionCritical Systems We apply Linux to missioncritical systems. Banking systems/carrier backend systems/train management systems and more People(consumers/providers) expect stable operation for longterm use. Don't frequently change the system configuration Changing the system introduces the risk for illegal operation. "RAS" requirements are needed. 1

3 02 RAS Reliability To identify problems before release e.g. Bug fixing, Testing Availability To continue the operation even if a problem occurs e.g. HA cluster system Serviceability To find out the root cause of the problem certainly in order to be able to solve it permanently e.g. Logging, Tracing, Memory dump Do the systems satisfy these requirements in current upstream kernel? Will talk about 'R' and 'S' 2

4 Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 3

5 Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 4

Execute crash_kexec() in panic() and save the memory 3.

6 11 Memory dump deadlock introduction We get memory dump via Kdump when serious problems, which induce panic or oops, occur. Kdump Kernel crash dumping feature based on Kexec 1. In 1 st kernel, kernel panic occurs. 2. Execute crash_kexec() in panic() and save the memory 3. Boot 2 nd kernel (capture kernel) and copy /proc/vmcore When kernel panic occurs via NMI, Kdump operation sometimes stops before booting 2 nd kernel. 1 st kernel panic 2 nd kernel capture /proc/vmcore crash_kexec() boot disk NW media 5

7 12 Memory dump deadlock reason The cause of the stop is deadlock on ioapic_lock in NMI context. panic()>crash_kexec()>>disable_ioapic() > >ioapic_read_entry() ioapic_read_entry() { raw_spin_lock_irqsave(&ioapic_lock, flags); eu.entry = ioapic_read_entry(apic, pin); raw_spin_unlock_irqstore(&ioapic_lock, flags); } The scenario is 1. Get ioapic_lock for rebalancing IRQ (irq_set_affinity) 2. Inject NMI while locking ioapic_lock 3. Panic caused by NMI occurs 4. Try to execute Kdump 5. Deadlock in ioapic_read_entry() 6

8 13 Memory dump deadlock fixing Fixed this problem by initializing ioapic_lock before disable_io_apic(): native_machine_crash_shutdown() { #ifdef CONFIG_X86_IO_APIC + /* Prevent crash_kexec() from deadlocking on ioapic_lock. */ + ioapic_zap_locks(); disable_io_apic(); #endif } This problem has been already fixed in current kernel. (from kernel3.11) 7

9 Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 8

10 21 Serial RX interrupt frequency introduction Serial devices are mainly used not only for embedded systems but also for missioncritical systems. Maintenance Sensor feedback Serial communication has the specialty that it is resistant for noise. For a control system(one of missioncritical systems), longdistance communication is needed. If we have a sensor which sends small data packages each time and must control a device based on the sensor feedback, the RX interrupt should be triggered for each packages. Sensor 1small data pkg Server 2RX int. dev dev App 4feedback 3read 9

11 22 Serial RX interrupt frequency problem A test requirement of a system was that the serial communication time between send and receive has to be within 3msec. When we measured the time on 16550A, it took 10msec. It did not change even if the receiver application was operated as a real time application. We analyzed this by using event tracer of ftrace. Hard IRQ of the serial device interrupts once each 10msec, so this is caused by a HW specification or the device driver. timestamp[sec] ~10msec <idle>0 [001] : irq_handler_entry: irq=4 name=serial <idle>0 [001] : irq_handler_entry: irq=4 name=serial <idle>0 [001] : irq_handler_entry: irq=4 name=serial <idle>0 [001] : irq_handler_entry: irq=4 name=serial 10

12 23 Serial RX interrupt frequency reason HW spec of 16550A 16bytes FIFO buffer Flow Control Register(FCR) 2bit register Changeable RX interrupt trigger of 1, 4, 8, or 14 bytes for the FIFO buffer (0b00=1byte, 0b01=4bytes, 0b10=8bytes, 0b11=14bytes) In Linux, the trigger is hardcoded as 8bytes. [PORT_16550A] = {.name = "16550A",.fifo_size = 16,.tx_loadsz = 16,.fcr = UART_FCR_ENABLE_FIFO UART_FCR_R_TRIG_10,.flags = UART_CAP_FIFO, }, For 9600baud, an interrupt per 10msec is consistent. 8bytes trigger (start + octet + stop * 2 + parity) / 9600(baud) = 1/800 (sec/byte) = 1.25(msec/byte) 1bit 8bit 1bit * 2 1bit 1.25(msec/byte) * 8(byte) = 10msec 11

13 24 Serial RX interrupt frequency temporary fixing Changed FCR as a test [PORT_16550A] = {.name = "16550A",.fifo_size = 16,.tx_loadsz = 16,.fcr = UART_FCR_ENABLE_FIFO UART_FCR_R_TRIG_00,.flags = UART_CAP_FIFO, }, 1byte trigger Result The interrupt frequency is once each 1.25msec. timestamp[sec] <idle>0 [001] : irq_handler_entry: irq=4 name=serial <idle>0 [001] : irq_handler_entry: irq=4 name=serial <idle>0 [001] : irq_handler_entry: irq=4 name=serial <idle>0 [001] : irq_handler_entry: irq=4 name=serial 1.25msec We need a configurable RX interrupt trigger. 12

14 25 Serial RX interrupt frequency tunable patch Added new I/F to the serial driver Tunable RX interrupt trigger frequency High frequency(1byte trigger) low latency Low frequency(14byte trigger) low CPU overhead Usability problems of 1 st /2 nd patch: Using ioctl(2) (c.f. using echo command is better) Interrupt frequency can be changed only after opening serial fd Cannot read FCR value (FCR is a writeonly register) Change the ioctl(2) to sysfs I/F after discussion in a Linux community Set the interrupt trigger byte (if val is invalid, nearest lower val is set.) # echo 1 > rx_int_trig /* 1byte trigger*/ User can read/write the trigger any time. The driver keeps FCR value if user changes interrupt trigger. 13

15 26 Serial RX interrupt frequency current status This new feature is under discussion. Sysfs I/F may be changed. But this feature (tunable RX trigger) will be merged soon. Current patch [PATCH V6] serial/uart/8250: Add tunable RX interrupt trigger I/F of FIFO buffers 14

16 Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 15

17 31 PIDprocess name table in ftrace introduction ftrace is inkernel tracer for debugging and analyzing problems. ftrace records PIDs and process names in order to specify process context of an operation. If process name is indicated in a trace result, a user can understand who executed the program by doing grep with the process name. In the trace file, process names are sometimes output as <>: namepid <...>2625 [002] : sys_write(fd: 8, buf: 7fd0ef836968, count: 8) <...>2625 [002] : sys_enter: NR 1 (8, 7fd0ef836968, 8, 20, 0, a41) <...>2625 [002] : kfree: call_site=ffffffff810e410c ptr= (null) <...>2625 [002] : sys_write > 0x8 16

18 32 PIDprocess name table in ftrace reason ftrace has saved_cmdlines file storing the PIDprocess name mapping table. It stores the list of 128 processes that hit a tracepoint. # cat saved_cmdlines 13 ksoftirqd/ python 1718 gnomepanel 500 jbd2/sda58 If the number of processes that hit a tracepoint exceeds 128, the oldest process name is overwritten. How does ftrace manage this table? 17

19 33 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } 1. Get a pid member in trace_entry structure of each trace event 18

20 34 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } pid=1045 <PID> <map> map_pid_to_cmdline[] Get map# from map_pid_to_cmdline[]. The size of the array is PID_MAX_DEFAULT+1. 19

21 35 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } pid=1045 <PID> <map> map_pid_to_cmdline[] 3 3. Get process name from saved_cmdlines[][]. saved_cmdlines 120 can hold 128 process names. 32 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 sleep 126 cat 127 kworker/0:1 20

22 36 PIDprocess name table in ftrace current Read trace file (get process name from PID) struct trace_entry{ int pid; } pid=1046 <PID> <map> map_pid_to_cmdline[] Get process name of PID=1046, but No map#, so it cannot find process name. The process name is shown as <...>

23 37 PIDprocess name table in ftrace current Store map information in saved_cmdlines map_cmdline_to_pid[] <idx> <PID> Record PID by rotation <PID> <map> 0 0 rcu_bh Managing the number of process names stored 1 in saved_cmdlines[][] 1 awk 2 bash map_pid_to_cmdline[] if ksoftirqd/0 (PID=1046) hits tracepoint, saved_cmdlines[][] <map> <process name> 3 sleep 126 cat 127 kworker/0:1 22

24 38 PIDprocess name table in ftrace current Store map information in saved_cmdlines map_cmdline_to_pid[] <idx> <PID> Overwrite this element 2 from 1045 to <PID> <map> map_pid_to_cmdline[] saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 sleep 126 cat 127 kworker/0:1 23

25 39 PIDprocess name table in ftrace current Store map information in saved_cmdlines map_cmdline_to_pid[] <idx> <PID> <PID> <map> Delete old map# in order to avoid double 3 mapping map_pid_to_cmdline[] saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 sleep 126 cat 127 kworker/0:1 24

26 310 PIDprocess name table in ftrace current Store map information in saved_cmdlines map_cmdline_to_pid[] <idx> <PID> <PID> <map> Overwrite the process name in map# map_pid_to_cmdline[] saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 ksoftirqd/0 126 cat 127 kworker/0:1 25

27 310 PIDprocess name table in ftrace current Store map information in saved_cmdlines map_cmdline_to_pid[] <idx> <PID> <PID> <map> Develop tunable 120 I/F map_pid_to_cmdline[] 3 saved_cmdlines[][] <map> <process name> 0 rcu_bh 1 awk 2 bash 3 ksoftirqd/0 126 cat 127 kworker/0:1 26

28 311 PIDprocess name table in ftrace tunable patch We added the changeable I/F 'nr_saved_cmdlines' to expand the max number of saved process names. Read/write nr_saved_cmdlines For write, all saved_cmdlines information are cleared. Max size: PID_MAX_DEFAULT(32768) If we set 32768, all process names can be stored. # cat nr_saved_cmdlines 128 /* defalut value*/ # echo 1024 > nr_saved_cmdlines /* Switch to new saved_cmdlines buffers*/ # cat nr_saved_cmdlines 1024 /* Store 1024 process names */ 27

29 312 PIDprocess name table in ftrace current status Current patch Any comments are welcome! [PATCH V2 0/2] ftrace: Introduce the new I/F "nr_saved_cmdlines" 28

30 Activities 1. Fix a deadlock problem on NMI dump (R) 2. Improve data reception latency on serial devices (R) 3. Save names of more processes in ftrace (S) 4. Solve the printk message fragmentation problem (S) 29

31 41 printk fragmentation problem introduction printk message outputs error logging or debugging information in kernel. We handle automatically printk messages in user space in order to detect that the system has became unstable. We want the kernel to output printk as expected. printk messages are sometimes mixed with similar messages. It is difficult to automatically handle an event from mixed messages. mixed kernel messages Which process does this message belong to? [ ] sd 2:0:0:0: [sdb] [ ] sd 3:0:0:0: [sdc] Unhandled sense code [ ] sd 3:0:0:0: [sdc] [ ] Result: hostbyte=did_ok driverbyte=driver_sense [ ] sd 3:0:0:0: [sdc] [ ] Sense Key : Medium Error [current] [ ] Sense Key : Recovered Error [ ] [current] 30

32 42 printk fragmentation problem introduction Mixed messages can occur when multiple printk() are executed at the same time. <CPU0> <CPU1> printk("sd 2:0:0:0: [sdb] n"); break into printk("sd 3:0:0:0: [sdc] n"); printk("sense Key : Medium Error n"); printk("sense Key : Recovered Error n"); [ ] sd 2:0:0:0: [sdb] [ ] sd 3:0:0:0: [sdc] [ ] Sense Key : Medium Error [current] [ ] Sense Key : Recovered Error We named this problem "the owner problem". 31

33 43 printk fragmentation problem introduction Continuous printk messages with KERN_CONT can be also fragmented. [ ] sd 2:0:0:0: [sdb] Divide [ ] sd 3:0:0:0: [sdc] CDB: [ ] Read(10): [ ] Add. Sense: Failure prediction threshold exceeded [ ] 00 [ ] e We named this problem "the reconstruction problem". To solve this problem, we must solve the owner problem first. Owner information is needed. reconstruction [ ] sd 2:0:0:0: [sdb] [ ] sd 3:0:0:0: [sdc] CDB: [ ] Read(10): e [ ] Add. Sense: Failure prediction threshold exceeded 32

34 44 printk fragmentation problem Solution How to solve 1. Store all continuous messages in local buffer as temporary, and execute printk There is some related work about error messages in SCSI layer. Acceptance activities take a lot of time 1018(pr_cont) + 555(KERN_CONT) (just printk("") without " n")@3.15rc1+ problem not solved fundamentally Always have to take care about this problem 2. Add information necessary to merge all fragmented printk messages What kind of information can we merge those by? 33

35 45 printk fragmentation problem Key idea What kind of information can we merge those by? Add context information (especially, PID and IRQ context) to each message Solve the owner problem A tool can separate each process's message from mixed messages PID: Different CPU's process or not / Kernel preemption IRQ: In interrupt (hard/soft/nmi) context or not Add fragment flag Solves the reconstruction problem A tool can rebuild fragmented messages Don't add the flag for messages with ' n' This message itself is not fragmented. 34

36 45 printk fragmentation problem Key idea What kind of information can we merge those by? Add context information (especially, PID and IRQ context) to each message Solve the owner problem A tool can separate each process's message from mixed messages PID: Different CPU's process or not / Kernel preemption IRQ: In interrupt (hard/soft/nmi) context or not Add fragment flag But, don't Solves want the reconstruction to change dmesg problem and syslog messages. A How tool can do rebuild we fragmented tackle this messages problem? Don't add the flag for messages with ' n' This message itself is not fragmented. 35

37 46 printk fragmentation problem No msg change Key idea: Use the header information of /dev/kmsg /dev/kmsg is user I/F for accessing the printk buffer. /dev/kmsg has header and text fields separated by a ';'. 6,808, ,;Switched to clocksource tsc header text Header field is separated by a ','. <prefix>,<sequence#>,<timestamp>,<fragment flag>;<text> We can add any commaseparated flags before a ';' according to the documentation. Future extensions might add more comma separated values before the terminating ';'. Documentation/ABI/testing/devkmsg We add context information and flag to the header of /dev/kmsg. 36

38 47 printk fragmentation problem approach(1) Our apporach (1) Add process context information (PID and interrupt context) We can understand relationship of mixed messages. This does not indicate where the message is fragmented. Test: Run 2 kernel threads which output same 3 printk messages. pr_info("a1"); printk("a2"); printk("a3 n"); current result 6,3321, ,;A1 6,3322, ,c;A1 4,3323, ,+;A2 4,3324, ,+;A3 4,3325, ,c;A2 4,3326, ,+;A3 normally <header>;a1a2a3 <header>;a1a2a3... Who output this message? Who does the message belong to? 37

39 48 printk fragmentation problem approach(1) Our apporach (1) Add process context information (PID and interrupt context) We can understand relationship of mixed messages. This does not indicate where the message is fragmented. Test: Run 2 kernel threads which output same 3 printk messages. pr_info("a1"); printk("a2"); printk("a3 n"); normally <header>;a1a2a3 <header>;a1a2a3... applying our approach(1) 6,2308, ,,622/;A1 6,2309, ,c,621/;A1 4,2310, ,+,621/;A2 4,2311, ,+,621/;A3 4,2312, ,c,622/;A2 4,2313, ,+,622/;A3 Add context information <PID>/<interrupt context> s: softirq h: hardirq n: NMI : no interrupt context 38

40 49 printk fragmentation problem approach(2) Our apporach (2) Current "fragment flag" Three types c: first fragment of a line +: all following fragments : no fragment Just hint whether fragmentation occurred or not (i.e. sometimes incorrect) Users can roughly understand where messages were divided. Note, that these hints about continuation lines are not necessarily correct, and the stream could be interleaved with unrelated messages, but merging the lines in the output usually produces better human readable results. Documentation/ABI/testing/devkmsg 39

41 410 printk fragmentation problem approach(2) Our apporach (2) Test: Run three kernel threads A, B, and C which output following 3 messages. pr_info("{a, B, C}1"); printk("{a, B, C}2"); printk("{a, B, C}3 n"); normally Result by applying only approach(1) 6,225186, ,,2288/;B1 6,225187, ,c,2289/;C1 4,225188, ,+,2289/;C2 4,225189, ,+,2289/;C3 6,225190, ,c,2289/;C1 4,225191, ,+,2289/;C2C3 4,225192, ,,2288/;B2 6,225193, ,c,2287/;A1 4,225194, ,+,2287/;A2 4,225195, ,+,2287/;A3 4,225196, ,,2288/;B3 <header>;a1a2a3 <header>;b1b2b3 <header>;c1c2c3 No fragment? c: first fragment of a line +: all following fragments : no fragment No fragment? No fragment? 40

42 411 printk fragmentation problem approach(2) Our apporach (2) Change the meaning of "fragment flag" If a message is fragmented, add 'f' flag. If a message has ' n', don't add any flag. (i.e. '') Do not use 'c'/'+' flag. Current ABI is changed, but users can automatically connect fragmented messages by using new fragment flag. Result by applying our approach (1) and (2) 6,8219, ,f,538/;B1 6,8220, ,f,539/;C1 4,8221, ,f,539/;C2 4,8222, ,,539/;C3 4,8223, ,f,538/;B2 6,8224, ,f,537/;A1 4,8225, ,,538/;B3 4,8226, ,,537/;A2A3 fragmented still fragmented The end of the fragmented messages 41

43 412 printk fragmentation problem current status Current patch Any comments are welcome! [RFC PATCH 0/2] printk: Add context information to kernel messages from /dev/kmsg 42

44 5 Summary We are doing community activities for realizing Linux which satisfies RAS requirements for missioncritical systems. Bug fixing (avoid the deadlock on Kdump) Add features (tunable serial RX trigger, tunable saved_cmdlines, and fragmented printk support) These activities can be used for not only missioncritical systems but also other systems. For example, fragmented printk is a big problem for a support division of system integrators. 43

45 Any questions? 44

47 Legal statements Linux is a registered trademark of Linus Torvalds. All other trademarks and copyrights are the property of their respective owners. 46

RAS Enhancement Activities for Mission-Critical Linux Systems

RAS Enhancement Activities for MissionCritical Linux Systems Hitachi Ltd. Yoshihiro YUNOMAE 01 MissionCritical Systems We apply Linux to missioncritical systems. Banking systems/carrier backend systems/train