Energy-aware Fault-tolerant and Real-time Wireless Sensor Network for Control System
Thesis Proposal
Wenchen Wang, Computer Science, University of Pittsburgh
Committee:
- Dr. Daniel Mosse, Computer Science, University of Pittsburgh (Advisor)
- Dr. Rami Melhem, Computer Science, University of Pittsburgh
- Dr. Youtao Zhang, Computer Science, University of Pittsburgh
- Dr. Daniel Cole, Mechanical Engineering and Materials Science, University of Pittsburgh
Outline
- Background and Motivation
  - Wireless control systems
  - Major challenges
  - Thesis statement
- Preliminary work
  - Fault-tolerant network design
  - Network reconfiguration: time-correlated faults
- Proposed work
  - Network reconfiguration: space-correlated faults
  - Real-time network flow scheduling
- Timeline
Background and Motivation
Wireless Control Systems
- Applications: e-health, smart home, power grid, etc.
Wired vs. Wireless Control System (WCS)
- Wired control system
  - The remote controller sends control signals to the actuator and receives measurements from the sensors of the plant
  - Not easy to deploy and maintain
- Wireless control system (WCS)
  - Control signals and measurements travel between the remote controller and the plant over a wireless network
  - Network imperfections: delay and message loss
Major Challenges of WCS
- Instability [Zhang CS 01, Jusuf ICCSII 12]
  - When the physical system is unstable, the plant or the device can be damaged, leading to serious safety issues and financial loss.
- Performance degradation [Li ICCPS 16]
  - Network imperfections can induce additional error, the network-induced error.
  - [Figure: wired control system output vs. wireless control system output, with the network-induced error marked]
Current Solutions
- Control system solutions [De AC 08, Shi IJC 10]
- Network solutions
  - Fault tolerance [Han RTAS 11]
  - Real-time scheduling [Hong ECRTS 15]
- Network and control system co-design solutions (limited work)
  - Simulator development [Li ICCPS 15]
  - Network protocol redesign [Gatsis ICCPS 16]
- Limitations
  - No study addresses the control system stability issue from the network perspective
  - No research addresses time/space-correlated link failures in WCS
  - Little research on the impact of network real-time performance on control quality
My Proposal
- [Figure: WCS loop with controller, wireless network, actuator, plant, and sensors; control signals and measurements cross the wireless network, which is annotated with energy consumption]
- Instability
  - P1: How do we guarantee control system stability?
- Performance degradation
  - P2: How do we reduce network-induced error for a single control system?
  - P3: How do we reduce the total network-induced error for multiple control systems?
Thesis Statement
Is it possible to build a power-aware, fault-tolerant, real-time wireless sensor network for control systems?
- P1: How do we guarantee control system stability?
  - Fault-tolerant network design (completed) [Wang IRI 16]
- P2: How do we reduce network-induced error for a single control system?
  - Network reconfiguration: time-correlated faults (completed) [Wang RTAS 17: WiP, Wang ECRTS 17 submitted]
  - Network reconfiguration: space-correlated faults (future)
- P3: How do we reduce the total network-induced error for multiple control systems?
  - Real-time network flow scheduling (future) [Wang RTAS 17: WiP]
Fault-tolerant Network Design (Completed)
P1: How do we guarantee control system stability?
Background (Preliminary Work 1)
- Based on a fault-tolerant wireless protocol: ridesharing [Gobriel SECON 06]
  - TDMA scheduling
  - A node has one primary parent and multiple backup parents
- Link failures
  - Link success ratio (LSR)
  - A link fails with probability (1 - LSR)
- Network reliability
  - Delivery ratio (DR)
Background: Our Control System
- Primary heat exchanger system (PHX) in a small modular reactor (SMR) of a nuclear power plant (NPP)
  - Transfers power from inside the reactor to the outside
  - Temperature and mass flow rate
Problem Statement
- Control system stability requirement, expressed as network health (NH):
  $NH = p_1 \tau_{network}^2 + p_2 \tau_{network} + p_3 (1 - DR)$
  where $\tau_{network}$ is the network delay, DR is the delivery ratio, and $p_1$, $p_2$, $p_3$ are constants
  - The control system is stable when NH satisfies its condition relative to 0
- Objective
  - Satisfy the stability requirement on NH
  - Minimum energy consumption
- Solution
  - Fault-tolerant node placement design
  - Computation model design to select the best node placement with the minimum number of relay nodes
Fault-tolerant Node Placement Design
- K-connected region
  - K edge-disjoint paths from sensors to virtual roots
  - Consumes fewer nodes, less flexible
- Relay region
  - One line of primary nodes
  - Several lines of backup nodes
  - Nodes placed as close as possible
  - More flexible, consumes more nodes
- Node placement set creation
  - Activate backup paths / backup nodes
Computation Model
- Network health estimation on the node placement set
  - Delivery ratio
  - Network delay
- Choose the best node placement design for a given average LSR
  - NH satisfies the stability requirement
  - Minimum number of relay nodes -> minimum energy consumption
Computation Model: Delivery Ratio Estimation
- Expected number of messages received by the remote controller (RC):
  $DR = \sum_{i=1}^{m} (p_i^{RC} \cdot i)$
  where $p_i^{RC}$ is the probability that the RC receives i messages and m is the total number of messages sent from the sensors
- State: the message-receiving situation for a level, a sorted array $m_0, m_1, \ldots, m_n$ with probability $p_i$ that depends on LSR
  - [Figure: example state at level l with its probability]
- $p_i^{RC}$ calculation
  - The probabilities of the final states at the RC level correspond to $p_i^{RC}$ $(1 \le i \le m)$
  - State generation: the states of level (l+1) produce the intermediate states of level l (probabilities $p_1, \ldots, p_5$)
  - State combination: intermediate states are merged into the final states of level l (probabilities $p_1 + p_3$, $p_2 + p_4$, $p_5$)
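The state-combination step and the expected-message sum can be sketched as follows. This is a minimal illustration, not the model's implementation: representing a state as the tuple of message ids it holds, the example probabilities, and the final division by m to obtain a ratio are all assumptions made for the sketch.

```python
from collections import defaultdict


def combine_states(intermediate):
    """Merge intermediate states holding the same messages by summing their probabilities."""
    final = defaultdict(float)
    for messages, prob in intermediate:
        # Sorting makes (3, 2, 1) and (1, 2, 3) the same state (assumed representation).
        final[tuple(sorted(messages))] += prob
    return dict(final)


def expected_received(final_states):
    """Sum of p_i^RC * i, where i is the number of messages in a final state."""
    return sum(len(state) * prob for state, prob in final_states.items())


# Hypothetical intermediate states at the RC level for m = 3 sensor messages.
intermediate = [
    ((1, 2, 3), 0.4),
    ((1, 2), 0.2),
    ((3, 2, 1), 0.2),  # same messages as the first state
    ((2, 1), 0.1),     # same messages as the second state
    ((), 0.1),
]
m = 3
final_states = combine_states(intermediate)
expected = expected_received(final_states)
delivery_ratio = expected / m  # normalisation by m is an assumption of this sketch
print(final_states, expected, delivery_ratio)
```

Five intermediate states collapse into three final states here, mirroring the p1 + p3, p2 + p4, p5 pattern on the slide.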
Computation Model: Network Delay and NH
- Worst-case network delay estimation:
  $\tau_{network} = slot \times N$
  where slot is the TDMA scheduling time slot and N is the total number of nodes
- NH estimation:
  $NH = p_1 \tau_{network}^2 + p_2 \tau_{network} + p_3 (1 - DR)$
- Node placement selection with the minimum number of nodes, given LSR
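The two estimates on this slide reduce to a few lines of arithmetic. The sketch below only evaluates the formulas; the slot length, node count, delivery ratio, and the constants p1, p2, p3 are hypothetical placeholders, and the stability comparison itself is left out.

```python
def worst_case_delay(slot, n_nodes):
    """Worst-case network delay: one TDMA slot per node."""
    return slot * n_nodes


def network_health(delay, delivery_ratio, p1, p2, p3):
    """NH = p1 * delay^2 + p2 * delay + p3 * (1 - DR)."""
    return p1 * delay ** 2 + p2 * delay + p3 * (1.0 - delivery_ratio)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
slot = 0.01   # seconds per TDMA slot (assumed)
n_nodes = 30  # total number of nodes (assumed)
delay = worst_case_delay(slot, n_nodes)
nh = network_health(delay, delivery_ratio=0.9, p1=1.0, p2=2.0, p3=3.0)
print(delay, nh)
```

With these placeholders the delay is 0.3 s and NH evaluates to 0.09 + 0.6 + 0.3 = 0.99.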
Evaluation
- Computation model
  - Up to 3 lines of backup nodes
  - Up to 4 edge-disjoint paths
- Simulation
  - Up to 7 lines of backup nodes
  - TOSSIM simulator [Levis SenSys 03]
- Metrics

  Metric   Meaning
  DR       Delivery ratio
  NH       Network health
  MinCMR   Minimum number of nodes from the computation model results
  MinSR    Minimum number of nodes from the simulation results

- Comparison
Computation Model Results
- The inflection points occur when a complete line of backup nodes has been added.
- The slope decreases as more lines of backup nodes are added.
- With the NH computation results, we can estimate the best node placement design for a given LSR.
Simulation Results
- Adding more nodes does not always help
  - DR reaches its maximum at 52 nodes
  - NH decreases
Minimum Number of Nodes Comparison: Computation vs. Simulation

RSSI   LSR    LSR stdv   MinCMR   MinSR   Diff
-64    0.93   0.020      26       26       0%
-70    0.88   0.024      29       30      -3.4%
-76    0.82   0.031      33       32       3.0%
-82    0.77   0.035      37       39      -5.4%
-84    0.71   0.037      46       42       8.7%

The computation model is accurate, with an average absolute difference of 4.1%.
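The Diff column and the 4.1% average can be checked directly from the node counts. The slide does not state how Diff is normalised; dividing by MinCMR is an inference that happens to reproduce every row, and averaging the absolute values reproduces the quoted figure.

```python
# (RSSI, MinCMR, MinSR) rows copied from the comparison table.
rows = [(-64, 26, 26), (-70, 29, 30), (-76, 33, 32), (-82, 37, 39), (-84, 46, 42)]


def relative_diff(min_cmr, min_sr):
    """Signed difference in percent, relative to MinCMR (inferred normalisation)."""
    return (min_cmr - min_sr) / min_cmr * 100.0


diffs = [relative_diff(cmr, sr) for _, cmr, sr in rows]
mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)

for (rssi, _, _), d in zip(rows, diffs):
    print(f"RSSI {rssi}: {d:+.1f}%")
print(f"average absolute difference: {mean_abs_diff:.1f}%")
```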
Network Reconfiguration: Time-correlated Faults (Completed)
P2: How do we reduce network-induced error for a single control system?
Background (Preliminary Work 2)
- Sensors sense and send measurements to the controller periodically, with the sensing sampling period
- The controller calculates the control signal with the control sampling period
- [Figure: WCS loop with a sensing sampling period of 0.05 s and a control sampling period of 0.1 s]
Problem Statement
- Trade-off between delivery ratio and delay
  - Higher delivery ratio -> more redundant nodes -> more delay
  - Optimal network configuration
- Time-correlated link failures [Baccour TOSN 12]
  - Network reconfiguration
- Objective: network-induced error reduction for a single control system
- Solution: network reconfiguration framework
Network Reconfiguration Framework
- Input: network configuration set
  - The network node placement set
- Offline
  - Optimal network configuration table indexed by LSR values
- Online
  - LSR estimation at run time
  - Centralized network reconfiguration
Offline Computation
- Network imperfection model
  - Define the total induced delay to the control system, $\tau$, as
    $\tau = \left\lceil \frac{\tau_{network} + n_{loss} \cdot ssp}{csp} \right\rceil \cdot csp$
    where $n_{loss}$ is the number of consecutive message losses (a function of the delivery ratio dr), ssp is the sensing sampling period, and csp is the control sampling period
  - Estimate $\tau$ for each network placement
- Optimal network placement
  - Given LSR, the placement with the minimum estimated $\tau$
  - Optimal network placement table indexed by LSR values
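A small sketch of the offline selection step, reading the network imperfection model as: network delay plus n_loss sensing sampling periods, rounded up to a whole number of control sampling periods. The rounding, the candidate placements, and their delay and loss figures are assumptions for illustration; only ssp = 0.05 s and csp = 0.1 s come from the background slide.

```python
import math


def total_induced_delay(network_delay, n_loss, ssp, csp):
    """Network delay plus lost sensing periods, rounded up to a multiple of csp."""
    raw = network_delay + n_loss * ssp
    # round() guards against floating-point noise such as 3.0000000000000004
    periods = math.ceil(round(raw / csp, 9))
    return periods * csp


SSP = 0.05  # sensing sampling period (s)
CSP = 0.1   # control sampling period (s)

# Hypothetical per-placement estimates for one LSR value.
candidates = {
    "Placement 1": {"network_delay": 0.08, "n_loss": 3},
    "Placement 8": {"network_delay": 0.12, "n_loss": 1},
    "Placement 20": {"network_delay": 0.25, "n_loss": 0},
}

delays = {
    name: total_induced_delay(c["network_delay"], c["n_loss"], SSP, CSP)
    for name, c in candidates.items()
}
best = min(delays, key=delays.get)
print(delays, best)
```

Repeating this selection for each LSR value of interest would fill one row of the offline table per LSR.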
Online Reconfiguration
- LSR estimation on the network produces the estimated LSR for the remote controller
- The estimated LSR indexes the offline table to find the optimal estimated placement:

  LSR   Estimated node placement
  0.8   Placement 1
  0.5   Placement 8
  0.2   Placement 20

- The online reconfiguration algorithm takes the optimal estimated placement and issues the new node placement to the network
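The table lookup at run time might look like the sketch below. The three entries are the ones shown on the slide; picking the entry whose LSR is nearest to the estimate is an assumption, since the slide does not say how estimates that fall between entries are resolved.

```python
# Offline table entries as shown on the slide: LSR -> optimal placement.
OFFLINE_TABLE = {0.8: "Placement 1", 0.5: "Placement 8", 0.2: "Placement 20"}


def lookup_placement(estimated_lsr, table=OFFLINE_TABLE):
    """Return the placement whose table LSR is closest to the estimate (assumed rule)."""
    nearest_lsr = min(table, key=lambda lsr: abs(lsr - estimated_lsr))
    return table[nearest_lsr]


print(lookup_placement(0.75))  # closest entry is 0.8
print(lookup_placement(0.55))  # closest entry is 0.5
print(lookup_placement(0.10))  # closest entry is 0.2
```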
Online Reconfiguration: LSR Estimation
- During each LSR interval (LSRI), each node records its own average LSR over all its receiving links
- Every LSRI, each node sends out its own LSR
- A parent node averages all its children's LSRs and its own LSR
- The RC estimates the average LSR over all the links
- Example with LSRI = 20 s, where node 3 is the parent of nodes 1 and 2
  - T = [0 s, 19 s]: nodes 1, 2, and 3 record LSR_1, LSR_2, and LSR_3
  - T = 20 s: node 3 computes LSR_3 = (LSR_1 + LSR_2 + LSR_3) / 3
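The per-level averaging can be written as a short recursion over the routing tree. The tree shape matches the slide's three-node example; the LSR values and the dictionary-based representation are hypothetical.

```python
def aggregate_lsr(node, own_lsr, children):
    """Average a node's own LSR with the aggregated LSRs reported by its children."""
    reported = [aggregate_lsr(child, own_lsr, children) for child in children.get(node, [])]
    return (own_lsr[node] + sum(reported)) / (1 + len(reported))


# Node 3 is the parent of nodes 1 and 2, as in the slide's example.
children = {3: [1, 2]}
own_lsr = {1: 0.9, 2: 0.6, 3: 0.75}  # hypothetical per-node averages over one LSRI

lsr_3 = aggregate_lsr(3, own_lsr, children)
print(lsr_3)  # (0.9 + 0.6 + 0.75) / 3
```

Note that each parent takes an unweighted average of already-averaged values, so in deeper trees the result is not weighted by the number of links below each child.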
Online Reconfiguration: Centralized Reconfiguration Algorithms
1. Direct Jump to Optimal (DO)
2. Multiplicative Increase Conservative Decrease (MICD)
3. Adaptive Control (AC)
- [Figure: number of nodes over time (t1 to t4) for DO, MICD, and AC, moving from the current number of nodes to the estimate; shown for an increase (current 20, estimate 30) and a decrease (current 30, estimate 20)]
- Considering consecutive losses (CL)
  - Add k more nodes whenever there are m consecutive losses
  - Variants: CL-DO, CL-MICD, and CL-AC
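The slide names the algorithms without giving their update rules, so the sketch below is illustrative only. DO follows directly from its name. The AC rule (closing a fraction alpha of the gap per step) is an assumed form, and MICD is omitted because its increase and decrease factors are not given. The CL rule follows the slide's description, with "m consecutive losses" interpreted as "at least m".

```python
import math


def direct_jump(current, estimate):
    """DO: move straight to the estimated optimal number of nodes."""
    return estimate


def adaptive_control(current, estimate, alpha):
    """AC (assumed form): close a fraction alpha of the gap, moving at least one node."""
    step = alpha * (estimate - current)
    moved = math.ceil(step) if step > 0 else math.floor(step)
    return current + moved


def with_consecutive_losses(n_nodes, consecutive_losses, m, k):
    """CL: add k nodes once at least m consecutive losses have been observed."""
    return n_nodes + k if consecutive_losses >= m else n_nodes


# Increase from 20 to 30 nodes, as in the slide's figure, with a hypothetical alpha.
trajectory = []
nodes = 20
for _ in range(4):
    nodes = adaptive_control(nodes, 30, alpha=0.5)
    trajectory.append(nodes)
print(trajectory)
```

With alpha = 0.5 the AC sketch approaches the estimate over four steps, while `direct_jump(20, 30)` reaches it in one.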
Evaluation
- Case study: one PHX
- Simulator: WCPS [Li ICCPS 15]
- Offline simulation
  - Static RSSI
- Online simulation
  - Dynamic RSSI: dynamic LSR over time
  - LSRI
- Metrics
  - RMS error (RMSE): network-induced error (compared with the wired control system)
  - Network lifetime (days)
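The RMSE metric compares the wireless system's output against the wired system's output sample by sample. A minimal sketch with made-up power traces (the values are not simulation results):

```python
import math


def rmse(wireless_output, wired_output):
    """Root-mean-square error between two equally sampled output traces."""
    if len(wireless_output) != len(wired_output):
        raise ValueError("traces must have the same number of samples")
    squared = [(a - b) ** 2 for a, b in zip(wireless_output, wired_output)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical power output samples in MW.
wired = [100.0, 102.0, 104.0, 106.0]
wireless = [100.0, 101.0, 104.0, 108.0]

error_mw = rmse(wireless, wired)
print(error_mw)  # sqrt((0 + 1 + 0 + 4) / 4)
```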
Offline Table
- The optimal number of nodes increases as the LSR decreases
Network Imperfection Model vs. Offline Simulation Results
- The network imperfection model is accurate
- The network-induced delay is statistically correlated with the power output RMSE (Pearson correlation r = 0.993, p < 0.001)
Online Results: Sensitivity Analysis of LSRI
- [Figure: power output RMSE (MW) and network lifetime (days) vs. LSRI (2, 4, 8, 12, 16, 20) for static30, DO, AC, MICD, CL-DO, CL-AC, and CL-MICD]
- The best static scheme performs worse than the dynamic schemes
- The LSRI value affects the performance of the schemes that do not consider CL
- The schemes that consider CL are not affected by the LSRI value
Network Reconfiguration: Space-correlated Faults (Future)
P2: How do we reduce network-induced error for a single control system?
Motivation and Problem Statement (Proposed Work 1)
- Spatial link failures caused by interference sources (IS) affect network reliability [Low CIMCA 05, Fadel CC 15] -> control system performance
  - Mobile phone, WiFi, radio jammer
- A mobile IS has not been fully studied in WSNs
- Objective: network-induced error reduction for a single control system
Methodology
- Build a space-correlated fault model with one moving IS
  - Moves at a certain speed
  - Determines which links fail and with what probability
- Study strategies to tolerate space-correlated link failures
  - Distributed network reconfiguration algorithm
- Conduct a case study in an NPP with a single PHX
  - Compare network reconfiguration strategies against the baseline from the second preliminary work
Real-time Network Flow Scheduling (Future)
P3: How do we reduce the total network-induced error for multiple control systems?
Motivation: Observations (Proposed Work 2)
- Test the network-induced error on one PHX
  - Different reference functions with one ramp
    - Power change amount (PCA)
    - Power change duration (PCD)
  - Different delivery ratios and delays
- Example reference function: Ramp30 (PCA: 10 MW, PCD: 30 s)
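A single-ramp reference function is fully described by its PCA and PCD. The sketch below assumes a linear ramp that starts at t = 0 and a hypothetical starting power of 100 MW; only the Ramp30 parameters (PCA 10 MW, PCD 30 s) come from the slide.

```python
def ramp_reference(t, start_power, pca, pcd):
    """Required power at time t for a linear ramp of pca MW over pcd seconds."""
    if t <= 0:
        return start_power
    if t >= pcd:
        return start_power + pca
    return start_power + pca * t / pcd


# Ramp30 from the slide: PCA = 10 MW, PCD = 30 s. The 100 MW start is assumed.
for t in (0, 15, 30, 45):
    print(t, ramp_reference(t, start_power=100.0, pca=10.0, pcd=30.0))
```

A shorter PCD or a larger PCA gives a steeper ramp, which is the case where the observations indicate that network delay matters most.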
Motivation: Observations
- [Figure: power output RMSE vs. PCD (15 s to 120 s) for delays of 0.1 s to 0.5 s; PCA: 10 MW, DR: 0.9; one region is annotated "RMSEs are similar"]
  - For reference functions with shorter PCDs, the network delay becomes a more significant factor.
- [Figure: power output RMSE vs. PCA (2 MW to 10 MW) for delays of 0.1 s to 0.5 s; PCD: 30 s, DR: 0.9; one region is annotated "RMSEs are similar"]
  - For reference functions with higher PCAs, the network delay becomes a more significant factor.
Motivation: NPP Demands
- Multiple small modular reactors (SMRs) in an NPP
  - Different PHXs may have different power demands
  - Dynamic application demands -> different reference functions over time
- Cross-layer real-time scheduling
  - Inject the application demands into the network layer to change measurement deadlines dynamically
  - Assign smaller deadlines to more urgent application demands (smaller PCDs or larger PCAs)
Problem Statement
- Network flows
  - A set of m end-to-end network flows $F = \{F_1, F_2, \ldots, F_m\}$
  - $F_i$ is associated with a source $s_i$, a destination $d_i$, a period $p_i$, and a deadline $D_i$
- Control system application demands
  - Control systems have different reference functions with multiple ramps
  - [Figure: required power over time (t0 to t5) for reference functions ref1 and ref2]
- Objective: reduce the total network-induced error for multiple control systems:
  $error = \sum_{i=1}^{n} RMSE_i$
Methodology
- Define the deadline for each network flow according to the offline control system analysis
  - Related to PCA and PCD
- Study a cross-layer real-time scheduling algorithm to schedule network flows dynamically
- Conduct a case study in an NPP with three PHXs and evaluate the results on WCPS
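Since the deadline formulation is proposed work, the following is only one hypothetical way the idea could be prototyped: derive each flow's deadline from the ramp rate PCA / PCD so that steeper ramps get smaller deadlines, then serve flows in deadline order. The scaling rule, `reference_rate`, `min_deadline`, the 0.1 s periods, and the three PHX demands are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    source: str
    destination: str
    period: float    # seconds
    deadline: float  # seconds


def assign_deadline(period, pca, pcd, reference_rate=0.2, min_deadline=0.02):
    """Hypothetical rule: shrink the deadline as the ramp rate (MW/s) grows."""
    rate = pca / pcd
    scaled = period * reference_rate / max(rate, reference_rate)
    return max(min_deadline, scaled)


# Hypothetical demands for three PHXs: (name, PCA in MW, PCD in s).
demands = [("PHX1", 10.0, 30.0), ("PHX2", 10.0, 120.0), ("PHX3", 10.0, 15.0)]

flows = [
    Flow(name, source=f"{name}-sensors", destination="controller",
         period=0.1, deadline=assign_deadline(0.1, pca, pcd))
    for name, pca, pcd in demands
]

# Earliest deadline first: the most urgent demand is served first.
schedule_order = [f.name for f in sorted(flows, key=lambda f: f.deadline)]
print(schedule_order)
```

Under this rule a flow whose ramp rate is at or below `reference_rate` keeps a deadline equal to its period, and the deadlines would be recomputed whenever a control system's reference function changes.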
Summary and Timeline
Summary

Challenge | Problem | Solution
Instability | Stability guarantee | Fault-tolerant network design
Performance degradation | Network-induced error reduction for a single control system | Network reconfiguration: time-correlated faults; network reconfiguration: space-correlated faults
Performance degradation | Network-induced error reduction for multiple control systems | Real-time network flow scheduling
Timeline

Date | Content | Deliverable results
May - Aug. 2017 | Dynamic network flow scheduling algorithm design and implementation on WCPS | Network deadline formulation and a WCS with the function of dynamic network flow scheduling
Sep. - Dec. 2017 | Measure the performance of WCS with dynamic network flow scheduling | A paper for publication
Jan. - Feb. 2018 | Finish the implementation of the bitvector protocol [Wang ICESS 15] and the space-correlated fault model on WCPS | A WCS with a fault-tolerance protocol to deal with space-correlated link failures
March 2018 | Come up with a network reconfiguration algorithm and implement it on WCPS | A WCS with the function of network reconfiguration for space-correlated link failures
April 2018 | Measure the performance of WCS with the network reconfiguration mechanism | A paper for publication
May - Jun. 2018 | Thesis writing | Thesis ready for defense
Jul. - Aug. 2018 | Thesis revising | Completed thesis
Backup Slides
Contributions and Impact
- A computation model to satisfy control system stability with minimum energy consumption
- A network reconfiguration framework to address time-correlated and space-correlated link failures in wireless control systems
- Exploration of cross-layer network flow scheduling to enhance the overall performance of multiple control systems
Comparison: Computation vs. Simulation
Offline Computation
- Network imperfection model
  - Transform network delay and message losses into the total network-induced delay $\tau = T_{used} - T_{sensed}$
  - Define the total network-induced delay as
    $\tau = \left\lceil \frac{\tau_{network} + n_{loss} \cdot ssp}{csp} \right\rceil \cdot csp$
    where $n_{loss}$ is the number of consecutive message losses, ssp is the sensing sampling period, and csp is the control sampling period
  - [Figure: timeline example. Sensors send messages $M_0$ to $M_5$ on a time axis from 0.0 to 0.7; at the remote controller the induced delays are $\tau_0 = 0.2$, $\tau_1 = 0.2$, $\tau_2 = 0.3$, $\tau_3 = 0.4$, $\tau_4 = 0.2$, $\tau_5 = 0.2$]
  - Each network configuration corresponds to a different $\tau$ estimate for different LSR values
  - $n_{loss}$ is estimated from the delivery ratio dr as a sum over $i = 0, \ldots, n$
Online Results: Sensitivity Analysis of LSRI
- [Figure: power output RMSE (MW) and network lifetime (days) vs. LSRI (2, 4, 8, 12, 16, 20) for static30, DO, AC, MICD, CL-DO, CL-AC, and CL-MICD]
- The static scheme is worse than the dynamic schemes
- LSRI affects the performance of the schemes that do not consider CL
- The schemes that consider CL are not affected by the LSRI value
Sensitivity Analysis of α Values
- [Figure: power output RMSE (MW) vs. α (0.1 to 0.9) for AC and CL-AC]
Online Results: AC vs. CL-AC (LSRI = 2 s)
- CL-* schemes add more nodes to the network when there are consecutive losses
Interference Source Examples
- Interference sources
  - An operator walking around with a mobile phone [Baccour TOSN 12]
  - A mobile robot connected via WiFi [Lin RTSS 09]
  - A mobile radio jammer [Wei FGCS 16]
- Interference example: office building [Lin RTSS 09], >20% difference
  - PDR: packet reception ratio
Interference Source Examples
- Microwave interference on IEEE 802.15.4 [Guo 12 TIM]
  - PER: packet error ratio, 1 - PDR
Distributed Network Reconfiguration Algorithm
- The primary node in each level decides how many nodes to activate or deactivate
- Compared with the centralized algorithms in preliminary work 2:
  - More accurate
    - Reconfiguration according to the average LSR estimate is not enough
    - Local information will improve the detection and tolerance of space-correlated faults
  - Lower overhead: saves network bandwidth
    - No need to send the LSR estimate to the remote controller periodically
    - No need to broadcast the new configuration to all the nodes in the network
Motivation: Observations
- [Figure: power output RMSE vs. delivery ratio (0.9 to 0.5) for delays of 0.1 s to 0.5 s; PCD: 30 s, PCA: 10 MW]
- Delay has a more significant effect on control system performance
- Set a deadline according to the application demands
  - Smaller deadlines for reference functions with shorter PCDs or more aggressive PCAs
- Schedule the network flows dynamically across layers