

A HOLISTIC APPROACH TO MULTIHOP ROUTING IN SENSOR NETWORKS

by

ALEC LIK CHUEN WOO

B.S. (University of California, Berkeley) 1998
M.S. (University of California, Berkeley) 2001

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY

Committee in charge:
Professor David Culler, Chair
Professor Eric Brewer
Professor Steve Glaser

Fall 2004

The dissertation of ALEC LIK CHUEN WOO is approved:

Chair                                                      Date

                                                           Date

                                                           Date

University of California, Berkeley
Fall 2004

A HOLISTIC APPROACH TO MULTIHOP ROUTING IN SENSOR NETWORKS

Copyright 2004
by
ALEC LIK CHUEN WOO

Abstract

A HOLISTIC APPROACH TO MULTIHOP ROUTING IN SENSOR NETWORKS

by

ALEC LIK CHUEN WOO

Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor David Culler, Chair

The dynamic and lossy nature of wireless communication poses major challenges to reliable, self-organizing multihop networks. Non-ideal link characteristics are especially problematic with the primitive, low-power radio transceivers found in sensor networks and raise new issues that routing protocols must address. We redefine the basic notion of wireless connectivity in terms of probabilistic links, and demonstrate that link statistics can be captured dynamically through an efficient yet adaptive link estimator. This probabilistic notion of connectivity changes the usual concept of a neighbor and introduces new problems with neighborhood management: the neighbor table on a sensor node is of fixed size and cannot always be used to gather link statistics about all neighbors, yet the process of selecting the most competitive neighbors requires a comparison with the link statistics of those neighbors that are not in the table. Together, link estimation and neighborhood management build a probabilistic connectivity graph which can be exploited by a routing algorithm to increase reliability.

Together, these three processes constitute our holistic approach to routing. We study and evaluate link estimation, neighborhood table management, and reliable routing protocol techniques, focusing on the many-to-one, periodic data collection workload commonly found in sensor network applications today. Our final system uses a variant of an exponentially weighted moving average estimator, frequency-based table management, and minimum transmission cost-based routing. Our analysis ranges from large-scale, high-level simulations to in-depth empirical experiments and emphasizes the intricate interactions between the routing topology and the underlying connectivity graph, which underscores the need for a whole-system approach to the problem of routing in wireless sensor networks.

Professor David Culler
Dissertation Committee Chair

TO MY PARENTS: MARY and SIU SHAN

Contents

List of Figures
List of Tables

1 Introduction

2 Background
   Sensor Networking Platform and Implications
      Hardware Platform (Mica Motes)
      Software Platform (TinyOS)
      TinyOS Network Architecture
      Implications
   Design Space of Routing in Sensor Networks
      Network-Wide Dissemination
      Tree-based Routing
      Any-to-Any Routing
      Implications
   Detailed Roadmap
   Related Work: A High-Level Picture
      Packet Radio
      Mobile Ad Hoc Networks (MANET)
      Sensor Networks

3 Understanding Link Characteristics
   Connectivity, Range, and Link Dynamics
      Physical Connectivity and Communication Range
      Time Variations
      Obstructions and Mobility
      Irregular Connectivity Cell
   Implications: Connectivity and Hop-Count
   Modeling the Observed Link Characteristics
      Binomial Approximation of Stationary Packet Loss Dynamics
   3.4 Synthetic Trace Generation
   Effective Channel Capacity: Single and Multihop
   Received Signal Strength and Link Quality
   Related Work

4 Connectivity: A Probabilistic Perspective
   Characterizing Connectivity using Link Estimators
   Link Estimation as Part of Network Self-Organization
   Estimator Design Framework and Methodology
      Metrics of Evaluation
      Error, Stability, and Memory Relationship
      Confidence Interval Approximation
   Estimator Design and Evaluation
      Terminology
      Tuning Objectives
      Candidate Estimator Design and Evaluation
   Candidate Estimator Comparisons
      Stable Estimators
      Agile Estimators
      Performance based on Empirical Traces
      Confidence Interval Estimation with WMEWMA
   Alternative Estimation Techniques
   Related Work
   Summary and Multihop Routing Implications

5 Neighborhood Management under Limited Memory
   Dense and Fuzzy Neighborhoods
   Challenges of Neighborhood Discovery under Limited Memory
   An On-line Neighborhood Selection Process
      Adaptive Down-sampling Insertion Policy
      Cache-Based Eviction and Reinforcement
      Frequency-Based Eviction and Reinforcement
   Evaluation
      Methodology
      Results
      Effect of Adaptive Down-Sampling
      Eviction and Reinforcement Policy
      Other Goodness Metrics
   Related Work
   Multihop Routing Implications

6 Cost-Based Routing
   Distributed Tree Building Process
   Overview of the System Routing Architecture
   Underlying System Issues
      Rate of Parent Change
      Packet Snooping
      Counting-To-Infinity Problem
      Cycles
      Duplicate Packet Elimination
      Queue Management
      Relationship to Link Estimation
   Cost Metrics for Connectivity-Based Routing
   Related Work
      Table-Driven Routing
      Source-Initiated On-Demand Routing
      Summary
   Summary

7 Evaluation
   Evaluation Methodology
      Candidate Routing Protocols
      Evaluation Metrics
   Network Graph Analysis
      Effect of Neighborhood Management using Routing Cost
   Packet Level Simulations
      A Packet-Level Simulator
      Simulation Results on Routing
   Empirical Experiments
      Experiments over an Indoor 5x10 Grid Network (Mica)
      Results over a 30-node Irregular Indoor Mica Network
      Results over an Irregular Indoor Mica2 Network
   Network Instability under Congested Traffic
      Techniques to Mitigate Network Instability
      Out-bound Estimation Decay Window
      Spreading Route Update Messages
      Estimator Tuning and Confidence Interval
      Technique Evaluation
      Link Estimation of the Root Node and Stability
      Adaptivity and Stability
   Summary

8 Concluding Remarks

Bibliography

List of Figures

1.1 Sensornet hardware platform evolution over time
Map of Great Duck Island and the locations of all the motes deployed
A Mica mote
Design layout of a SPEC mote
Connectivity of a cell measured using 150 motes on an open tennis court with an RFM power setting of 70
A holistic view showing the cross-layer interactions of routing
Reception probability of all links in a network, with a line topology on a tennis court. Note that each link pair appears twice to indicate link quality in both directions
After 20 minutes, the sender is moved from 15 ft to 8 ft from the receiver and remained stationary for four hours
Link quality variation over a 7 hour period in an indoor laboratory environment
Obstruction effects on packet loss behavior. A person deliberately stands beside the receiver in the indicated interval of minutes
Movement effects on packet loss behavior. Transmitter is deliberately moved to different distances at various times
Cell connectivity of a node in a grid with 8-foot spacing as generated by our link quality model
Quantile of empirical data against quantile of binomial distribution
Time series comparison of empirical traces with simulated traces
Channel capacity of the Mica/RFM platform using TinyOS 1.0 radio stack
Channel capacity of the Mica2/Chipcon platform using different versions of the TinyOS radio stack
Relationship of RSSI signal strength and link quality on the Mica2/Chipcon platform
Example showing strong RSSI values may not be a good indicator for link quality

4.1 General framework of passive link estimators
P̂(t) for different estimators at both stable and agile configuration
P̂(t) for different estimators at both stable and agile configuration
Output from the stable WMEWMA estimator using empirical data input
Confidence interval estimation with respect to the WMEWMA(30,0.5) estimator for different link quality
Illustration of the potential neighbors of a center node in a dense network. The darker shaded region shows the effective region while the lighter region shows the transitional region. The cross indicates the center node
Downsampling process
Insertion and reinforcement in Frequency algorithm
Cumulative distribution function showing the link quality distribution of the 207 neighbors of a center node in an 80x80 grid network with 4 feet spacing using our empirical link model
Contour plot on yield of the FREQUENCY algorithm for different cell densities and table sizes with no down-sampling for insertion
Contour plot on yield of the FREQUENCY algorithm for different cell densities and table sizes with a down-sampling rate of 50% for insertion
Number of good neighbors maintainable at different densities with a table size of 40 entries
Yield for different table sizes and cell densities
Distributed tree building algorithm framework
Distributed tree building algorithm framework with link estimation incorporated
Message flow chart to illustrate the core components for implementing our routing subsystem
Typical data structure of the neighbor table. ROUTE_TABLE_SIZE determines the size of the neighbor table
Hop distribution from graph analysis of a 400-node network with 8 feet grid size
Path reliability to tree root from graph analysis of a 400-node network with 8 feet grid size
Insertion and reinforcement in Frequency algorithm using routing cost difference
Percentage of time spent in the neighbor table of the different neighbors vs. their difference in routing cost relative to the receiving node running the FREQUENCY algorithm. The cross indicates the node chosen as the parent
Percentage of time spent in the neighbor table of the different neighbors and their difference in routing cost relative to the receiving node running the FREQUENCY algorithm with routing cost filtering. The cross indicates the node chosen as the parent

7.6 Screen shot of the packet-level simulator
Hop distribution from simulations
Cumulative distribution function of the distances of all the links in the network using MT over graph analysis and packet-level simulation
Path reliability over distance from simulations
Stability from simulations
End-to-end success rate over distance from simulations
Deployment on the foyer in the Hearst Mining building
Indoor reception probability of all links of a network in a line topology at low transmit power setting (70) in the foyer
Hop distribution for the indoor 50-node deployment
Average hop over distance contour plot for MT at power 70 for the indoor 50-node deployment
Non-sink node next hop link quality for MT in the foyer
End-to-end success rate over distance in the foyer
Actual and expected routing cost as computed using the MT cost function
Stability of the entire network in the foyer
End-to-end success rate versus hop in an office environment
Stability for MT in an office environment
End-to-end success rate of MT on Mica2 deployed in an office environment
Link estimation of a node to its neighbor over time in an office environment
node network stability under congested load (Original)
Network-wide link estimation changes on the logical connectivity graph over time (Original)
node network end-to-end success rate under congested load
Route instability of a node: distribution of time spent on different parents (a) and the parent distribution of all the route switches of the node (b)
Variations of link quality estimations of the different parents selected by a node over an experiment with congested traffic
Variations of link quality estimations of the different parents selected by a node over an experiment, with congested traffic and the overflow error fixed
node network stability under congested load with overflow error fixed
Empirical cumulative distribution functions of the parent switching cost difference of a 21-node network under congested load, with and without the overflow error
Network-wide link estimation changes on the logical connectivity graph over time
node network end-to-end success rate under congested load
node network stability under congested load with stabilizing techniques
Empirical cumulative distribution function of the parent switching cost difference of a 21-node network under congested traffic, with stabilizing techniques including confidence interval filtering, larger parent switching threshold, phase-shifted route update messages, and OutBoundDecayWindow tolerating up to 6 consecutive losses

7.36 Network-wide link estimation changes on the logical connectivity graph over time
node network end-to-end success rate under congested load
Link quality of the tree root as estimated by a near-by node using the minimum data rate relaxation under congested load
Link quality of the tree root as estimated by a near-by node under congested traffic load, with the relaxation in link estimation removed
node network stability under congested load, with relaxation in link estimation of the tree root removed
Empirical cumulative distribution function of the parent switching cost difference of a 21-node network under congested load, with relaxation in link estimation of the tree root removed
Network-wide link estimation changes on the logical connectivity graph over time
node network end-to-end success rate under congested load
node network stability under congested load, with the parent switching threshold relaxed to its original setting (0.75 transmission)
Network-wide link estimation changes on the logical connectivity graph over time
node network end-to-end success rate under congested load
Network-wide link estimation changes on the logical connectivity graph over time
node network end-to-end success rate under congested load, with a periodic interfering traffic
node network stability under congested load, with a periodic interfering traffic
node network stability under congested load, with one of the nodes disabled in the middle of the experiment

List of Tables

2.1 TinyOS Media Access Control Parameters on Mica and Mica2
Summary of the differences among the different wireless networks
Definition of p(t) to model Figure
Terminology used for describing link estimator design
Simulation results of all estimators in stability settings
Simulation results of all estimators in agility settings

Acknowledgments

David Culler is the best advisor I could ever ask for in my life. He has transformed me from an undergraduate student, a novice in computer science, to a researcher working closely at the forefront of the field. The completion of this thesis would not have been possible without his guidance over these years. David always pushed me hard to pursue a wider and deeper understanding of problems and often challenged my designs and assumptions. I have also learned how to articulate my ideas, argue and compromise with others. He spent endless hours with me, improving my communication skills in both speaking and writing. He also recommended that I take my first acting class in my life! I regard him as my father figure for life in general.

Terence Tong has demonstrated the best qualities one can expect from a Berkeley undergraduate student. He is intelligent, responsible, and very dedicated to conducting research. He has contributed many days and nights in simulating, building, and running the systems with me. Without his help, the completion of this work would have taken longer.

I would also like to thank the TinyOS/NEST team: Jason Hill, Philip Levis, Sam Madden, Joe Polastre, Cory Sharp, Robert Szewczyk, and Kamin Whitehouse. We have been working hard together on many demos, papers, TinyOS releases, and tutorials. Most importantly, the whole process was full of fun and we have established life-long friendships.

I had an opportunity to help start the Intel Berkeley Research Laboratory, which allowed me to collaborate with many great researchers: Philip Buonadonna, Brent Chun, Kevin Fall, David Gay, Wei Hong, Alan Mainwaring, and Matt Welsh.

I am lucky to have had many friends to support me over these years to help me get through the tough times. I would like to thank Horton Hua, Freddy Mang, Allen Miu, Ada Poon, Wilson So, Hayden So, and Victor Wen.

This work was supported, in part, by the Defense Department Advanced Research Projects Agency (grants F C-1895 and N ), the National Science Foundation under Grant No. , the California MICRO program, and Intel Corporation. Research infrastructure was provided by the National Science Foundation (grant EIA ).

Chapter 1

Introduction

The information technology revolution over the last forty years has been driven by the miniaturization of technology following the prediction of Moore's Law. Not only does computing become more powerful as transistor density keeps increasing exponentially, but devices with the same computing power are also shrinking in size. With new fabrication techniques that create micro electro-mechanical structures (MEMS), low-power microscopic sensors can be manufactured at very low cost. By joining CMOS technology with advances in MEMS, it is possible to embed intelligence with sensing capability all on a tiny platform. Together, these developments help to bring the vision of potentially dust-size computing platforms into reality. Low-cost CMOS-based RF radios have become sufficiently low-power to support low data rate communication on these tiny nodes. The result is a new platform, called sensor networks, that is capable of performing wireless communication, some local processing, data storage, and sensing, all within the physical size of a typical coin. Future platforms will have the potential to fit within a cubic millimeter of volume.

Besides having a small physical size, this new computing platform of a network of sensors is very different from traditional computing. Nodes are not expected to support a user or even have any user interfaces. They are stand-alone devices, with limited resources in memory, computation power, and energy. With wireless capability, they are not expected to be plugged into a wired infrastructure, where power and data bandwidth can be abundant. In fact, with their small physical sizes, they can easily be embedded into the physical environment to collect interesting information. Although each of these devices is a tiny computation platform of its own, they can support powerful services in an aggregated form by interacting and collaborating with one another. In particular, these platforms can collaborate and perform local processing to infer interesting phenomena from noisy information about the environment. By self-organizing into a network, they can propagate interesting data to nodes that demand it, and move data to an infrastructure for higher-level processing. All in all, this new platform provides a new tier of computing that will make information technology more pervasive and bring it closer to the physical environment.

Recent efforts in research and development have rapidly advanced the field of sensor networks. Figure 1.1 shows the evolution of the hardware platforms, called motes, developed at UC Berkeley. While many mote generations are built from off-the-shelf components, a newer generation of motes, such as the SPEC mote, demonstrates the possibility of creating an integrated sensor node on a single chip. Although such a small platform is still in its infancy, all the other sensor-node platforms are already in production, and supported by many kinds of sensing hardware, such as light, temperature, humidity, acceleration, etc. On the software front, there has also been an effort to create an operating system customized for this new computing platform.

Figure 1.1: Sensornet hardware platform evolution over time (WeC 1/00, Rene 11/00, Mica 1/02, Mica2 9/02, Mica2dot 9/02, SPEC 5/03).

Such a system is called TinyOS [37], which provides a programming and runtime environment with flexible hardware abstractions, network stacks, and light-weight concurrency support. Both the hardware platforms and the TinyOS operating system are readily available to researchers for developing new applications and systems to advance this new computing paradigm.

The potential applications of this new computing technology are rich and span many different disciplines, including scientific research, military usage, consumer markets, and applications in the interest of society. For scientific research, sensor network technology can be a wide-area monitoring tool that allows scientists to collect potentially long-term data for understanding both microscopic and macroscopic phenomena in the physical environment. The sensor nodes are expected to be low-cost enough that many of them can be used for monitoring data in high resolution over a targeted area. Various wildlife habitat monitoring projects, such as the Great Duck Island project at UC Berkeley [53, 74], the James Reserve project at UCLA [18], and the ZebraNet project at Princeton [46], have started to collaborate with biologists and open a new way to passively monitor wildlife with sensor networks. Figure 1.2 shows a picture of Great Duck Island and the relative positions of the 100 motes deployed on the island.

Figure 1.2: Map of Great Duck Island and the locations of all the motes deployed (legend: single hop weather, single hop burrow, multi hop weather, multi hop burrow; scale 10m).

The project used this thesis work to successfully build a multihop network to collect habitat information over the island.

For military purposes, sensor networks can be deployed over an open field to passively collect information about intruding soldiers or vehicles. Since these devices are small, they are difficult to discover. Furthermore, the networking capability allows them to control and collect data over large areas using multihop communication. Interesting potential military applications include detecting and tracking enemy vehicle movements or even automated pursuit. Recent work in this kind of application can be found in [11, 15, 71].

In commercial applications, sensor networks can provide intelligent indoor lighting and temperature control in buildings to conserve energy. Profiling electrical energy usage at the outlets at home provides a novel approach to understanding energy consumption distribution, so that consumers can obtain feedback for more economical energy usage. Precision agriculture can rely on sensor networks to optimize watering schedules and increase yield per unit area. Asset management is yet another potential application: sensor networks can monitor and track important assets during transit or while in storage. These different examples show a wide variety of potential applications that can take advantage of sensor network technology. The point is to demonstrate that research in this new computing paradigm can impact our lives through many different potential applications.

Compared with existing mobile wireless networks, the different application scenarios and resource limitations of sensor networks require a different kind of networking support. First, a sensor network system is likely to be deployed in an uncontrolled environment, where nodes may fail or be obstructed from each other due to environmental effects and changes over time. Second, the lack of infrastructure support requires a different network topology formation compared to the common single-hop wireless local area networks. Third, energy constraints on these nodes can only support short-range communication. Therefore, a multihop networking topology is required for sensor networks, where nodes locally communicate with nearby neighbors using short-range communication; nodes would relay messages for communication that goes beyond immediate neighbors. For example, a multihop networking topology would be required for nodes to propagate messages to a remote gateway for higher-level processing or archival purposes.

Maintaining such a topology can be challenging. For scalability reasons, distributed, local rules should be used rather than a centralized approach. Constraints in memory limit the amount of state a routing protocol can maintain on each node. Running in an uncontrolled environment requires the system to be able to adapt robustly to failures and environmental changes without the need for a network administrator. Thus, having the system self-organize into a reliable network for multihop communication and self-adapt to potential changes is one of the most fundamental system building blocks for sensor networks.

Such an ad hoc, self-organizing routing problem is not an entirely novel research topic, as there exists a rich literature in packet radio networks and mobile computing. Nevertheless, the problem needs to be revisited in the context of sensor networks for various reasons. First, the lossy, short-range wireless radios can break assumptions made about connectivity at the routing layer, which can hinder both the robustness and reliability of the routing protocol. Second, tight resource constraints together with lossy characteristics introduce new challenges that routing protocols cannot neglect. Third, the traffic assumptions in sensor networks are very different from those in traditional wireless computing. Finally, there is still no comprehensive systematic study that is specifically tailored towards routing issues and performance using real sensor networking nodes and traffic patterns.

The major contribution of this thesis is to provide a thorough study of achieving a robust and reliable multihop wireless networking system using the Berkeley sensor networking platform. In particular, the routing process must use only simple, distributed local rules and must address many of the issues unique to this computing platform, including limitations of memory, bandwidth, and processing power.

Since the low-power CMOS-based radios used in most of the sensor networking platforms carry very different connectivity characteristics from what the networking community usually assumes, the challenge is to identify these differences and understand their implications for protocol design. These implications will lead us to a new understanding of the wireless routing problem for sensor networks, identify important subproblems and their interactions, introduce important metrics to study, and impact the overall approach to studying the routing process as a whole-system design problem. We ground our study on extensive empirical measurements and experimentation.

The usual concept of the communication range is defined by the distance where a sharp fall-off of connectivity occurs. Before this fall-off, communication is considered to be reliable. In reality, we identify that the RF communication range on the sensor nodes actually consists of three distinct regions: effective, transitional, and clear. In particular, the transitional region is a region where link quality can vary significantly; it also constitutes a large portion of the communication range. We therefore advocate a probabilistic view on link connectivity and use such a perspective throughout the whole routing process. We argue that the process of routing should be separated into three subproblems: link quality characterization, neighborhood management, and cost-based routing. Each of these subproblems is a local process that a node must perform to achieve reliable routing. We carefully study each of these local processes and understand their interactions. Together they provide an effective routing solution as a whole. The solution is implemented in TinyOS and evaluated using actual sensor nodes at different scales. The system is released to the community. The experience gained in this thesis can provide valuable guidance for future development of more advanced routing systems for sensor networks.

This thesis is organized into 8 chapters. Chapter 2 provides a background overview of sensor networks and the platform that we based our study on, and points out the corresponding implications for protocol design. Chapter 3 presents an empirical study of the wireless characteristics found on our sensor-net platform, which motivates the need to treat the problem of routing as three local processes. We then discuss each of the three subproblems in the next three chapters: the link estimation process in Chapter 4, neighborhood management in Chapter 5, and routing in Chapter 6. In Chapter 7, we combine the three subproblems and evaluate the system as a whole using large-scale, high-level simulations and in-depth empirical experiments. We conclude and discuss future work in Chapter 8.

Chapter 2

Background

We begin by describing the resource constraints found on a typical sensor networking system available today. These constraints guide our design decisions for the rest of the chapters. Although these constraints may be relaxed as technology advances, designing under these constraints will allow our work to cope with more extreme platforms in the future, such as the SPEC mote [38]. We also provide a brief overview of the TinyOS operating system and the details of its network architecture that we used for our protocol development and empirical study. This chapter also introduces the design space of multihop routing in sensor networks by analyzing the different requirements arising from a set of important sensor networking applications today. By surveying the routing protocols in the literature, we discuss how they fit into the design space, and motivate why it is necessary to revisit the problem of routing and identify important subproblems unique to sensor networks. Finally, we present a more detailed roadmap describing a holistic approach to routing, from the lowest level of defining connectivity to the network-level of reliable multihop communication.

2.1 Sensor Networking Platform and Implications

The sensor networking open platform developed by the NEST project at UC Berkeley [4] provides both hardware (motes) and software systems for researchers to conduct sensor networking research. We used the Mica and Mica2 hardware platforms in our study; these motes can be purchased from Crossbow [22]. They are supported by the TinyOS [37] open-source operating system, which also provides a complete suite of programming and development tools. In this section, we describe in detail our hardware and software platform, the network architecture in TinyOS, and the platform implications for sensor networking protocol design.

2.1.1 Hardware Platform (Mica Motes)

We used two generations of Berkeley Mica motes [36], Mica and Mica2. Except for the different RF radios, they are similar in terms of their physical sizes and resource limitations. On the Mica platform, each node consists of an 8-bit, 4MHz Atmel Atmega103 microprocessor with 128kB of programmable memory and 4kB of data memory [6]. It follows the Harvard architecture, with separated program memory and data memory. The program memory can store read-only data. The network device is an RF Monolithics 916MHz transceiver [5], using amplitude shift keying (ASK) modulation at the physical layer. The processor is capable of driving the radio to deliver 40 kbps of raw data. The RF transmit power of the radio is tunable in TinyOS, with 0 being the maximum transmit power at 1.5dBm and 100 yielding no communication range at all.

Figure 2.1: A Mica mote (labeled parts: Atmel processor, 4MHz clock, 916MHz antenna, 51-pin extension bus, on/off switch, LEDs).

Each node also has a standard UART interface, allowing it to be configured as a base station for relaying data to a PC. Batteries are typically used to power the entire sensor node, yielding a lifetime of about a week if the node is always on for processing and communication (assuming 10mA of current consumption over a battery with 1800mAh capacity). Figure 2.1 shows the form factor of a Mica node, with the major parts on the top side of the node labeled.

The second generation of the Mica platform (Mica2) uses an Atmega128L microprocessor [12], with a faster processor clock running at 7.38MHz, but the amount of programmable and data memory remains the same.

The network device is a Chipcon CC1000 FSK-based RF transceiver [2], driven by the processor to deliver 38.4 kbps of raw data. Unlike the Mica, both the RF baseband frequency and transmit power are tunable. Our study uses the 433MHz version of the Mica2.

2.1.2 Software Platform (TinyOS)

TinyOS's design philosophy is to support the natural sensor networking needs for high concurrency and efficient modularity over a very limited platform, while allowing designers the flexibility to innovate new protocols or experiment with new, extreme hardware platforms. TinyOS uses a component-based programming model, with every component providing and using a set of well-defined interfaces. Programs (applications) and the entire operating system are built by wiring together customized or standard components as a component graph. Since each system functionality is implemented as a component, programmers can change the system behavior simply by replacing or modifying the components. This makes innovating new protocols easy, since no predefined mechanisms or protocols are hidden or required by TinyOS.

Programming in TinyOS uses a dialect of the C programming language called nesC [30], which comes with the TinyOS release. The language directly supports the component-based programming model of TinyOS. By holistically analyzing both the application and the system as one component graph, cross-layer optimization for efficiency and code size can be done more easily. Furthermore, static analysis beyond traditional compiler features is also supported. For example, at compile time, potential race conditions can be detected by the nesC compiler to enhance overall system robustness.

2.1.3 TinyOS Network Architecture

TinyOS provides an Active Message (AM) abstraction at the link layer. The packet format is simple, with a 5-byte header, a message payload, and a 2-byte CRC checksum. The header contains the destination address field (2 bytes), an AM handler field (1 byte), a group ID (1 byte), and the packet length (1 byte). Although the default maximum packet size is small, only 36 bytes, almost all applications are satisfied by this maximum packet length; they either generate very little sensory data per packet or perform their own data aggregation or fragmentation at a level above. Note that the destination address is used for link addressing, and a promiscuous mode for packet sniffing is also possible. The AM handler acts as a dispatch mechanism by specifying the correct higher-level handler to invoke for each packet reception. It is analogous to a network port. This form of dispatch is naturally supported by the nesC language using parameterized interfaces. Link-layer acknowledgments are supported for both link-layer broadcast (on Mica) and unicast messages. However, we only use unicast message acknowledgments in our study. Message buffer allocation is done statically above the network layer; no copying is done across the entire network stack for either transmission or reception, except for exchanging data down at the hardware register with the radio. It is important to note that once the network layer has accepted a message buffer for reception or transmission, the application must not modify the buffer, to avoid buffer corruption, until control of the buffer is returned through the SendDone or Receive event in TinyOS.
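The frame layout just described can be summarized as a C sketch. The field names below (dest, type, group, length, data, crc) are illustrative rather than the exact identifiers in the TinyOS source, and the 29-byte payload bound simply follows from the 36-byte maximum minus the 5-byte header and 2-byte CRC.

    #include <stdint.h>

    /* Sketch of the TinyOS Active Message frame described above.
     * Sizes follow the text: 5-byte header + payload + 2-byte CRC,
     * 36 bytes maximum per packet. */
    #define AM_MAX_PACKET_LEN 36
    #define AM_HEADER_LEN      5
    #define AM_CRC_LEN         2
    #define AM_MAX_PAYLOAD (AM_MAX_PACKET_LEN - AM_HEADER_LEN - AM_CRC_LEN) /* 29 */

    typedef struct {
        uint16_t dest;                 /* link-layer destination address (2 bytes) */
        uint8_t  type;                 /* AM handler id, dispatched like a port    */
        uint8_t  group;                /* group ID separating co-located networks  */
        uint8_t  length;               /* payload length in bytes                  */
        uint8_t  data[AM_MAX_PAYLOAD]; /* application payload                      */
        uint16_t crc;                  /* CRC checksum over the frame              */
    } am_packet_t;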

Below the AM layer, the TinyOS radio stack provides different levels of support in software, depending on where the hardware/software boundary lies, which is governed by the choice of the radio. In traditional computing, much of this low-level support, such as the MAC layer, resides in the chip's firmware and often cannot be accessed or changed. However, the flexibility of the hardware/software boundary in TinyOS allows protocol designers to probe down and change low-level protocols or even registers inside the radio. This is exemplified by the radio stacks of the two Mica platforms. At the high level, the two stacks support the common packet-level interface of Active Messages. However, below the packet-level abstractions, the two radio stacks are quite different in both architecture and implementation. Although the default TinyOS CSMA-based MAC protocol and link-layer acknowledgment semantics are similar on the two platforms, the underlying mechanisms and the choice of parameters are different. We discuss these differences in the next section; a more detailed discussion can be found in [48].

Mica Radio Stack

On the Mica, the RFM radio only provides a bit-level interface. Therefore, the processor must encode and decode each byte using a DC-balanced scheme. The default byte encoding scheme is Single-bit-Error-Correction-and-Double-bit-Error-Detection (SECDED), which has a 1-to-3 encoding overhead ratio. For each packet successfully received with a correct CRC checksum, a link-level acknowledgment is sent by the receiver. A simple CSMA-based MAC is employed [77]; it adds a random delay before listening for an idle channel and backs off with a random delay over a predefined window when the channel is busy.
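As a rough illustration of the MAC behavior just described (an initial random delay, carrier sensing, and a random backoff over a predefined window while the channel is busy), the following C sketch uses hypothetical helper functions (channel_is_idle, random_delay, send_bytes) and rounded window constants; the real TinyOS MAC is interrupt-driven and operates on raw bits rather than in a blocking loop.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical platform hooks; the real stack works at the bit level. */
    extern bool channel_is_idle(void);            /* carrier sense                  */
    extern void random_delay(uint16_t max_ms);    /* wait a uniform random interval */
    extern void send_bytes(const uint8_t *p, uint8_t len);

    #define INITIAL_BACKOFF_MS 4   /* order of the initial backoff in Table 2.1    */
    #define CONGEST_BACKOFF_MS 7   /* order of the congestion backoff window       */

    /* Simplified CSMA send: delay, sense, back off while busy, then transmit. */
    void csma_send(const uint8_t *pkt, uint8_t len)
    {
        random_delay(INITIAL_BACKOFF_MS);         /* initial random delay          */
        while (!channel_is_idle())                /* channel busy: back off        */
            random_delay(CONGEST_BACKOFF_MS);     /* random delay in the window    */
        send_bytes(pkt, len);                     /* channel idle: transmit        */
    }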

Since there is no direct support for carrier sensing or received signal strength indicator (RSSI) reading by the RFM, which only provides the raw baseband signal, carrier sensing is done by monitoring incoming data bits.

Mica2 Radio Stack

On the Mica2, the CC1000 radio provides a byte-level interface and performs its own bit-level encoding using the standard Manchester encoding scheme. An empirical forward-error-correction study on this radio suggests that any additional byte-level encoding does not provide significant benefits to justify the extra processing cost [43]. Therefore, the default is to use the hardware-based Manchester encoding with no forward error correction. Another major difference is the way carrier sensing for detecting an idle channel is done on the CC1000. The Mica2 radio stack improves carrier sensing by performing automatic gain control and on-line estimation of the noise floor in software, and compares this estimate against the sampled baseband energy at the time of idle channel detection. The Mica radio stack lacks this capability because the RFM radio does not support an accurate received signal strength indicator (RSSI) on the baseband, even though RSSI values can be obtained in the radio stacks on both platforms. Note that RSSI values reported on both Mica and Mica2 are obtained through the processor's 10-bit ADC channel; calibration and conversion are required if dBm units are desired. Although the MAC protocol is similar between the two stacks, the different mechanisms in carrier sensing and data movement make the two MAC layers very different. Furthermore, the parameters used in the two MAC layers are also different, as they are empirically tuned in each case; the parameters are summarized in Table 2.1.
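One plausible way the Mica2 idle-channel check described above could be realized is sketched below (it could serve as the channel_is_idle hook in the earlier CSMA sketch). The read_rssi sample source, the exponential-average weight, and the busy margin are all assumptions for illustration, not the actual CC1000 stack code.

    #include <stdbool.h>
    #include <stdint.h>

    extern uint16_t read_rssi(void);   /* hypothetical: raw baseband energy sample (10-bit ADC) */

    static uint16_t noise_floor;       /* on-line estimate of the noise floor, raw ADC units    */

    /* Track the noise floor with a slow exponential average of channel samples.
     * The real stack is more careful about when it folds samples in. */
    static void update_noise_floor(uint16_t sample)
    {
        noise_floor = (uint16_t)((7u * noise_floor + sample) / 8u);
    }

    /* The channel is treated as idle when the sampled energy is close to the floor. */
    bool channel_is_idle(void)
    {
        uint16_t sample = read_rssi();
        update_noise_floor(sample);
        const uint16_t BUSY_MARGIN = 16;   /* arbitrary threshold above the floor */
        return sample < (uint16_t)(noise_floor + BUSY_MARGIN);
    }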

    Mote                         Mica               Mica2 (TinyOS 1.13)
    (Granularity)                (raw bits)         (raw bytes)
    Max. Initial MAC Backoff     3.2 ms             4.1 ms
    Max. Congest Backoff         3.2 ms             6.6 ms
    Ack. Overhead                                   6.6 ms

Table 2.1: TinyOS Media Access Control Parameters on Mica and Mica2.

2.1.4 Implications

These observations of the sensor networking platforms today show that a sensor node is limited in resources across several dimensions: compute power (1-4 MIPS), data memory (1s-10s of kB), data bandwidth (10s of kbps), and available energy (battery size). This implies that protocol design for this space must be simple and keep as little state as possible due to limited memory. Furthermore, it must minimize communication overhead, since communication is costly in both energy and bandwidth. In fact, the amount of bandwidth available for multihop communication is 3 times lower than the channel capacity in a single cell, because a packet needs to occupy a communication cell 3 times during a multihop relay.

Another interesting observation is an imbalance in the ratio between compute power and available memory on a sensor node. According to the Amdahl-Case Rule, a balanced system should have 1 MByte of memory for each MIPS (millions of instructions per second) of processor power and each Mbps (megabits per second) of bandwidth. However, in our sensor network platforms, the ratio of memory is at least 3 orders of magnitude less than processing power. One reason is that memory takes up a significant amount of chip real estate, which affects the size of a sensor node, especially for a mote-on-a-chip platform such as SPEC [38].
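Back-of-the-envelope arithmetic, taking the 4 MIPS end of the range quoted above and the 4 kB of data memory on the Mica, makes both observations concrete:

    \[
    B_{\text{multihop}} \;\approx\; \tfrac{1}{3}\,B_{\text{cell}},
    \qquad
    \frac{\text{memory expected at 4 MIPS (Amdahl-Case)}}{\text{memory available}}
    \;\approx\; \frac{4\ \text{MB}}{4\ \text{kB}} \;=\; 10^{3}.
    \]

That is, a relayed packet occupies a communication cell three times, leaving roughly one third of the single-cell capacity for multihop traffic, and the data memory falls short of a "balanced" configuration by about three orders of magnitude.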

Figure 2.2: Design layout of a SPEC mote (blocks: I/O pins, memory banks, frequency synthesizer, ADC, crystal driver, CPU and accelerators, 900MHz transmitter).

Figure 2.2 shows a picture of the layout of the SPEC chip. Visually, the memory consumes about 20%-30% of the chip area, but the size of the memory (3 kB) is three orders of magnitude smaller than expected, given the processor speed (4MHz). It is expected that as technology improves over time, miniaturization will allow more space for memory. However, we believe that the same imbalance may still exist, since every silicon-based functionality will scale down at the same rate.

The use of a low-cost and low-power CMOS-based RF transceiver as the primary communication device has significant implications for protocol design. Figure 2.3 shows an empirical connectivity graph of a typical communication cell of a node in a 150-node network using the RFM radio. The node at the center indicated by a cross transmits; other nodes count the number of packets successfully received.

The contour map illustrates the fall-off in packet success rate, or link quality, in terms of the percentage of packets received from the sender node under no other traffic interference. The graph shows that connectivity is very noisy over a large portion of the communication cell, with only a small number of nodes able to receive the sender's packets well. That is, connectivity is not a clear-cut distinction between connected and not connected; it is probabilistic and can span from 0% to 100%. This empirical observation is important because it breaks the typical assumption of the circular-disc connectivity model, in which communication is good up to some radius r and non-existent beyond; we call this boolean connectivity. As a result, protocols that rely on such an assumption can suffer significantly in real deployment.

We use a typical multihop routing process in sensor networks as an example. Based on observing packets from other nodes and performing a set of local rules, which may generate additional packets, the network must form and maintain a multihop routing topology to support some higher-level communication pattern, such as data collection or aggregation into a specific node. For example, the data sink node could announce its desire to receive data and its network depth from itself, namely zero. Nodes that receive this packet can determine their network depth (one) and start generating packets, which carry this new network depth and the node's own address along with the data. A periphery of depth-two nodes learns about nodes of depth one and can start sourcing data for the sink. However, these packets will have to be routed through one of the nearer (or lesser-depth) nodes. Each node hears packets from several neighbors and chooses one with a smaller depth as its parent to route its traffic over the next hop. Progressively, more distant nodes learn of parents, and a spanning tree is formed and continually maintained as data flows toward the sink. It will route around obstacles and find alternative paths when nodes fail or join.
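A minimal sketch of this naive beacon-based tree construction is shown below. The message format, function names (broadcast_beacon, on_beacon), and timer mechanics are hypothetical; the point is only to make the depth-comparison rule concrete before discussing why it fails, and Chapter 6 develops the cost-based protocol actually used in this thesis.

    #include <stdint.h>

    #define INFINITE_DEPTH 0xFF

    typedef struct {
        uint16_t origin;   /* address of the beaconing node   */
        uint8_t  depth;    /* its current depth from the sink */
    } beacon_t;

    static uint8_t  my_depth  = INFINITE_DEPTH;  /* the sink initializes this to 0 */
    static uint16_t my_parent = 0xFFFF;          /* next hop toward the sink       */

    extern void broadcast_beacon(const beacon_t *b);  /* hypothetical radio send   */

    /* Periodically advertise our own depth so farther nodes can join the tree. */
    void beacon_timer_fired(uint16_t my_addr)
    {
        beacon_t b = { my_addr, my_depth };
        if (my_depth != INFINITE_DEPTH)
            broadcast_beacon(&b);
    }

    /* On hearing a beacon, adopt the sender as parent if it is strictly closer
     * to the sink. Merely hearing the packet is taken as proof of connectivity,
     * which is exactly the assumption questioned below. */
    void on_beacon(const beacon_t *b)
    {
        if (b->depth != INFINITE_DEPTH && b->depth + 1 < my_depth) {
            my_parent = b->origin;
            my_depth  = (uint8_t)(b->depth + 1);
        }
    }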

The problem with this elegant, simple algorithm, and with many of its variants, is that low-power radio communication is lossy, as shown in Figure 2.3, and highly variable due to external sources of interference in the spectrum, contention with other nodes, multipath effects, obstructions, and other changes in the environment, as well as node mobility. Therefore, the assumption of good connectivity upon hearing a message, made by this simple algorithm, breaks down: simply hearing a packet is not a good enough basis for determining that two nodes are connected. This approach can yield poor reliability in shortest-path routing, such as the simple scheme discussed above, since long links with low reliability are likely to be selected for communication, as they tend to yield the shortest paths. Since end-to-end reliability is the product of the link success rates at each hop, selecting unreliable links has an exponential effect on the end-to-end packet success rate. In general, it is likely to be much better to take more hops using more reliable (typically shorter) links. Therefore, it is essential to build a routing topology upon reliable links, which is the main focus of this thesis.

Before we continue, we should establish a definition framework to help clarify our explanations in the rest of the thesis. The usual concept of connectivity is defined relative to whether a receiver can hear a sender; we call this physical connectivity. In Figure 2.3, physical connectivity is scoped by the contour lines. Boolean connectivity assumes all links with physical connectivity are good. However, as illustrated in Figure 2.3, this assumption is invalid in sensor networks and raises a question about the definition of the communication range.
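To spell out the multiplicative effect just mentioned (a hedged, illustrative calculation with made-up link rates): writing p_i for the success rate of the link used at hop i, the end-to-end delivery probability of an n-hop path, ignoring retransmissions, is

    \[
    P_{\text{end-to-end}} \;=\; \prod_{i=1}^{n} p_i .
    \]

For instance, two hops over 90% links deliver 0.9 x 0.9 = 81% of packets end-to-end, whereas a single minimum-hop route over one 50% link delivers only 50%, so the "shorter" path is worse despite using fewer hops.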

Figure 2.3: Connectivity of a cell measured using 150 motes on an open tennis court with an RFM power setting of 70.

The usual concept of the communication range is defined relative to a small bit-error rate with free-space propagation between a sender and a receiver. However, Figure 2.3 shows that with such a large variation in link quality across the different receivers, the communication range becomes a fuzzy concept. In Chapter 3, we will return to this issue and describe how we characterize the communication range into three different regions.

2.2 Design Space of Routing in Sensor Networks

In sensor networks, the design space of routing is driven by the communication scenarios required by specific applications. Although sensor networking is still in its infancy, researchers have already been developing a vast set of potential applications.

The communication scenarios in these applications can be quite varied, but they can be mainly classified into few-to-many data dissemination, many-to-one tree-based routing, and any-to-any routing.

2.2.1 Network-Wide Dissemination

Network-wide dissemination is one of the basic forms of data communication in sensor networks. One important scenario, especially for a query system in a sensor network, is to disseminate interest in data from one or a few source nodes to the whole or a subset of the network. For example, an application may only be interested in data that matches some application-specific predicates under a certain sampling rate. With the interest disseminated throughout the network, only nodes that have discovered the data of interest would need to report. Another general usage is for issuing commands for application-specific control, as done in an automated pursuit application [71], or for general network-wide retasking. The usual mechanism to support dissemination is to flood the entire network, an approach taken by a few of the important sensor networking querying systems, such as Directed Diffusion [40] and TinyDB [52]. For retasking with small updates, Trickle [60] uses randomized local rules to control the send rate for scaling and to expedite the rate of dissemination. For general retasking, Deluge [39] extends the work in Trickle to support dissemination of large data objects reliably. More advanced dissemination protocols that are under development attempt to exploit the geographical locations or semantic information on each node to make dissemination efficient by reducing redundant or useless retransmissions.

For example, [51] uses the semantic information of a range query to suppress query propagation to nodes that have data outside the range of the query. Many routing protocols use network-wide dissemination as a mechanism to discover and build a network topology. In particular, the reverse paths of the dissemination establish a routing tree topology towards the origin of the dissemination. Many mobile ad hoc routing protocols take this approach to create routing paths on demand. We will discuss them further later in this chapter.

2.2.2 Tree-based Routing

Tree-based routing supports a vast set of data-collection applications, such as environmental or habitat monitoring. In tree-based routing, each node can potentially be both a router and a data source. The data will be forwarded either to a common destination or to multiple destinations. These destinations are often called the sink nodes. To support such a communication pattern, all the nodes must self-organize themselves into a network with a spanning-forest topology, composed of different trees with the tree roots being the sink nodes. The sink node is a base station or bridge where data is collected for processing on a more powerful machine, such as a PC. Sensor-network querying systems, such as Cougar [79] or TinyDB [52], have been developed to support in-network processing and aggregation over a network with a tree-based topology. Directed Diffusion [40] also operates in this form of communication scenario, with multihop routing trees built for each data sink node. Multiple sink nodes can coexist because each sink node may be interested in different data, and thus they form different trees to collect it. In this case, a forest of multiple trees can be built by running tree-based routing in each instance.

2.2.3 Any-to-Any Routing

Unlike tree-based routing, where the final destination of traffic is always the sink node, any-to-any routing supports data delivery from any node to any node in the network, similar to Internet routing. A few important in-network storage systems in the literature, e.g., [66, 49], rely on this any-to-any communication infrastructure. Similar to tree-based routing, every node is both a router and a data source; however, since the destination can be any node, a network-wide addressing and discovery scheme is necessary, which could be challenging given that nodes may be moved or fail, and that new nodes may join. The common approach in the literature is to use physical geographical information to build a coordinate system for network-wide naming. An alternative approach is to use a virtual coordinate system based on connectivity information, as discussed in [65]. In both cases, different nodes may carry the same network address since they may be at the same location or scope. Since the primary usage scenario is in-network storage, this naturally takes local redundancy into account. This implies that the final destination can be a cluster of nodes rather than a single location. Nevertheless, tree-based routing can be classified as a subproblem of any-to-any routing. In solving any-to-any routing, the issues of network-wide naming and discovery must also be resolved. Node mobility can further complicate these issues.

2.2.4 Implications

The design space of routing in sensor networks is different from that in the Internet and MANET (Mobile Ad Hoc Networks). The Internet is a wide-area wired network with applications generating many independent flows of traffic originating from and destined to anywhere in the network.

Node failures or link congestion can occur within the Internet, but the complications from wireless links do not exist. A MANET is a local area network; its traffic pattern consists of many pairs of independent traffic flows. The main research challenge for this kind of network is mobility handling. In contrast, a sensor network system typically operates as a collaborative system, with many data sources routing related traffic to interested sink nodes. Since packets are often application data units, in-network processing is a key technique to minimize energy through data aggregation. A simple and robust spanning-forest routing mechanism matches very well with many of the intended traffic models in sensor networks and also opens up tree-based aggregation opportunities. While any-to-any routing can also provide tree-based routing, it is a challenge to build and maintain an any-to-any routing topology efficiently in a scalable way. In addition, only a few specialized systems in the network storage design space today require such routing support. Therefore, tree-based or spanning-forest routing support seems to be an emerging common networking abstraction, which reinforces the results studied in [48].

Another implication is the emerging need to maintain a stable network topology to perform in-network processing. Node mobility is much less of a concern in sensor networks than in traditional mobile computing, since nodes are relatively static. While the routing system must cope with node failures and connectivity changes due to environmental effects, frequent route changes may incur a high overhead for higher-level in-network processing algorithms to adapt. Therefore, maintaining a stable, adaptive routing topology becomes one of the important criteria for routing in sensor networks.

2.3 Detailed Roadmap

From the platform and routing implications that we have discussed in this chapter, we learn that designing a routing system for sensor networking carries a different set of challenges from those in the Internet and MANET. This thesis seeks to revisit the problem of routing in the sensor network context by taking a holistic approach. The implications in this chapter allow us to conceptualize the challenges involved and lead us to identify and isolate important routing subproblems and understand their interactions. The following is a detailed roadmap illustrating the evolution process; we identify critical problems across the different layers and take a whole-system perspective to evaluate and explore the intricate interactions among these layers and the global effects.

We begin, in Chapter 3, by providing a rich set of empirical observations indicating the lossy and noisy nature of wireless connectivity on our sensor network platforms. Unlike the boolean connectivity model, connectivity in real deployments can be noisy and time-varying. Our experimental data show that link quality fall-off varies substantially with respect to different receivers; in fact, the communication range, where all links would have good connectivity, is surprisingly short in our data. Therefore, even without interference, to conclude from hearing a message that a link would exhibit good connectivity is a poor assumption. Nonetheless, many wireless routing protocols today still rely on this boolean connectivity assumption. The result is a high end-to-end packet loss rate, as discussed before. Link retransmissions at each hop can overcome such potential unreliability. However, retransmissions would carry a high cost in bandwidth and energy, which are precious resources in sensor networks. This motivates us to identify a subproblem in which every node must characterize link quality for each neighboring node using an on-line link estimation process.
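As a preview of Chapter 4, the sketch below shows one plausible shape for such an estimator: losses are inferred from gaps in a per-neighbor link sequence number, the success rate is computed over a window of expected packets, and the windowed rate is smoothed with an exponentially weighted moving average. The window size, the smoothing weight, and the field names are placeholders, not the tuned WMEWMA parameters evaluated later.

    #include <stdint.h>

    #define EST_WINDOW     8   /* packets expected per estimation window (placeholder) */
    #define EST_ALPHA_NUM  1   /* EWMA smoothing weight alpha = 1/2 (placeholder)      */
    #define EST_ALPHA_DEN  2

    typedef struct {
        uint8_t last_seq;      /* last link sequence number heard from this neighbor  */
        uint8_t received;      /* packets received in the current window              */
        uint8_t expected;      /* packets expected (received plus inferred losses)    */
        uint8_t quality;       /* smoothed inbound link quality, 0..255 ~ 0..100%     */
    } link_est_t;

    /* Update the estimate for one neighbor on every packet overheard from it. */
    void link_est_update(link_est_t *e, uint8_t seq)
    {
        uint8_t gap = (uint8_t)(seq - e->last_seq);   /* 1 = no loss, k = k-1 losses */
        e->last_seq  = seq;
        e->received += 1;
        e->expected += gap;

        if (e->expected >= EST_WINDOW) {
            uint8_t window_rate = (uint8_t)((255u * e->received) / e->expected);
            /* quality = (1 - alpha) * quality + alpha * window_rate */
            e->quality = (uint8_t)(((EST_ALPHA_DEN - EST_ALPHA_NUM) * e->quality +
                                     EST_ALPHA_NUM * window_rate) / EST_ALPHA_DEN);
            e->received = 0;
            e->expected = 0;
        }
    }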

In Chapter 4, we discuss such local processes and explore a set of candidate link estimators. By characterizing the link quality of each potential neighbor, high-level protocols can exploit routes with high reliability in both directions of communication. This approach changes the assumption about connectivity. Instead of accepting the boolean connectivity assumption, we treat connectivity as a probabilistic metric and define it relative to link estimation. We should think of connectivity as a statistical relationship, P_ij(t), representing the probability of successful packet transmission from node i to node j at time t. We call this probabilistic connectivity throughout this chapter.

However, estimating P for each neighbor can be costly in memory, since statistical history must be maintained for each node. This issue is particularly prominent in sensor networks because of their high cell density, a consequence of the short sensing range. Since the connectivity cell is irregular, as in Figure 2.3, a high node density implies that a node can potentially hear many nodes near it or very far away. All these nodes are neighboring nodes if we take the common concept of neighborhood as defined relative to physical connectivity. We call these nodes potential neighbors. Since connectivity is probabilistic rather than defined within a specific cell radius, the number of potential neighbors is not bounded, since there is a non-zero probability of hearing a node far away. As a result, it is difficult to bound the number of potential neighbors a priori; non-uniform density in actual deployments further complicates the problem. Consequently, each node faces the challenge of managing a potentially unbounded number of potential neighbors. In addition, connectivity to each potential neighbor must be estimated in order to identify a reliable subset of them for routing. The tight constraint in data memory renders maintaining link statistics for each potential neighbor impractical, as the number of them can also grow without bound. Therefore, a local process must exist to dynamically manage a notion of neighborhood, which consists of a subset of neighbors suitable for routing, using only constant memory. This subproblem is unique and especially important for routing in sensor networks.
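One plausible shape for such a constant-memory process, in the spirit of the down-sampled insertion and FREQUENCY eviction-and-reinforcement policies examined in Chapter 5, is sketched below. The table size, the down-sampling probability, and the exact reinforcement rule are placeholders for illustration, not the tuned policy.

    #include <stdint.h>
    #include <stdlib.h>

    #define TABLE_SIZE        16   /* fixed neighbor table size (placeholder)       */
    #define INSERT_SAMPLE_PCT 50   /* down-sampling rate for insertion attempts     */

    typedef struct {
        uint16_t addr;             /* neighbor address                              */
        uint8_t  freq;             /* reinforcement counter; 0 marks a free entry   */
    } neighbor_t;

    static neighbor_t table[TABLE_SIZE];

    /* Called for every overheard packet source. Residents are reinforced; a
     * non-resident is considered only with some probability and may displace
     * an entry whose counter has decayed to zero. */
    void neighbor_heard(uint16_t addr)
    {
        int i, victim = -1;
        for (i = 0; i < TABLE_SIZE; i++) {
            if (table[i].freq > 0 && table[i].addr == addr) {
                table[i].freq++;                      /* resident: reinforce        */
                return;
            }
            if (table[i].freq == 0)
                victim = i;                           /* remember a free slot       */
        }
        if ((rand() % 100) >= INSERT_SAMPLE_PCT)      /* down-sample insertions     */
            return;
        if (victim < 0) {                             /* table full: decay everyone */
            for (i = 0; i < TABLE_SIZE; i++) {
                if (--table[i].freq == 0)
                    victim = i;
            }
        }
        if (victim >= 0) {                            /* adopt the newcomer         */
            table[victim].addr = addr;
            table[victim].freq = 1;
        }
    }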

Figure 2.4: A holistic view of the cross-layer interactions of routing: discover and characterize connectivity; manage the neighborhood (keep the good neighbors and build a connectivity graph); and select good routes.

In addition, connectivity to each potential neighbor must be estimated in order to identify a reliable subset of them for routing. The tight constraint on data memory makes maintaining link statistics for every potential neighbor impractical, since their number is also unbounded. Therefore, a local process must exist to dynamically manage a notion of neighborhood, consisting of a subset of neighbors suitable for routing, using only constant memory. This subproblem is unique and especially important for routing in sensor networks. These two subproblems, link quality estimation and neighborhood management under limited memory, led us to concretely establish our holistic approach of identifying the core underlying problems of routing in sensor networks. Figure 2.4 illustrates an overview of this holistic perspective on routing. Any routing system, before performing any route exploration, must first self-discover a network connectivity graph, which is analogous

to building a physical map before determining how to travel from one place to another. The lossy wireless characteristics found on our sensor network platform complicate this map-building process, because link quality must be quantified to determine connectivity. Link quality estimation therefore becomes a fundamental map-building process for routing, with each node participating in a distributed fashion to build a distributed connectivity graph in which each link is weighted by its link estimate. We call such a graph the derived connectivity graph. Above the link estimation process sits neighborhood management. It assists the graph-building process by dynamically maintaining a subset of good neighbors with reliable links suitable for routing. In particular, it must deal with the challenge of using only constant memory on each node to maintain a subset of neighbors, regardless of the actual cell density. The result is a distributed graph, a subset of the derived connectivity graph, that continuously adapts to changes in link quality, node failures, and new nodes joining. We call this the logical connectivity graph. In Chapter 5, we present a study of such a process and explore which management policies are effective in realizing this goal. The routing process should run on top of this logical connectivity graph. For example, a node would be a neighbor only if its link quality exceeded some threshold, say 75%. Connectivity is not necessarily symmetric, but nodes can broadcast link estimates to neighbors, or assume that most good links are roughly symmetric and verify the reverse direction of links that are actually used. Route selection using this probabilistic connectivity approach can greatly improve packet delivery reliability in the shortest-path algorithm discussed before. Moreover, if we have a good estimate of the link success rates, we can use

more sophisticated path selection rules. For example, we may assume inductively that all nodes nearer the sink maintain an estimate of the path success rate from themselves to the sink. Once we have an estimate of the local link success rate, we can combine the two estimates locally, assuming that the success rate along the path from the neighbor is independent of how we reach the neighbor, to select the parent that gives the maximum success rate to the sink, and we can record this product as our own path reliability estimate for use by our children (a small sketch of this computation is given below). All of the above routing processes form a routing topology, as exemplified by the top layer in Figure 2.4, which is a subgraph of the logical connectivity graph. The resulting network traffic in turn influences the underlying connectivity graph, which introduces changes that propagate all the way back up to the routing topology. Understanding these cross-layer interactions is one of the goals of this thesis. In this thesis, we focus on tree-based routing, since it is the most common and important routing service in sensor networks. Exploring appropriate tree-based routing protocols and cost functions that take our holistic approach and utilize the derived connectivity graph is the goal of Chapter 6. We also present an overall architecture that illustrates how the three local processes in Figure 2.4 can be combined to yield a complete tree-based routing system. The routing system faces many important underlying issues; we discuss each of them and provide the understanding and mechanisms needed to cope with them. In Chapter 7, we present the architectural framework and the implementation details of our routing system in more depth. The goal of that chapter is to evaluate the different routing cost functions and protocols discussed in Chapter 6, using extensive simulations and empirical experiments.
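The following is a minimal Python sketch of the parent-selection rule described above (not the TinyOS implementation evaluated in later chapters). The field names and data layout are illustrative, and issues such as cycle avoidance and stale advertisements are deferred to Chapters 6 and 7.

```python
def choose_parent(neighbors):
    """neighbors: list of dicts with
         'id'     : neighbor identifier
         'link_p' : estimated success rate of the local link to that neighbor
         'path_p' : the neighbor's advertised success rate from itself to the sink
       Assuming independence, the end-to-end estimate through a neighbor is the
       product of the two; pick the neighbor that maximizes it."""
    best = max(neighbors, key=lambda n: n['link_p'] * n['path_p'], default=None)
    if best is None:
        return None, 0.0
    my_path_p = best['link_p'] * best['path_p']   # advertised to our own children
    return best['id'], my_path_p
```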

We evaluate how carrying the probabilistic connectivity concept all the way to the routing layer affects global, network-wide system properties, such as the end-to-end success rate of packet delivery, the hop-count distribution of the network, and topology stability. In addition, we study the intricate cross-layer interactions to understand how the three local processes influence each other, as in a closed-loop system. These analyses help us understand the routing dynamics when a holistic approach is taken. In Chapter 8, we summarize our contributions and discuss a few important extensions that we plan to pursue to improve this work, especially taking the holistic approach all the way to the application level, so that in-network processing can influence routing decisions.

2.4 Related Work: A High-Level Picture

The related work on wireless ad-hoc routing spans a rich body of literature, from packet radios to mobile computing and sensor networks. On the one hand, these networks require routing protocols that are tailored to specific platform characteristics, resource constraints, and application needs; on the other hand, they address many similar issues. Therefore, one of the challenges is to analyze the protocols from these different networks and understand the background that influences the particular approach each takes and the role it plays in the overall picture of wireless ad-hoc routing. In this section, we attempt to paint such a picture and provide a high-level discussion of the relevant related work in packet radios, mobile computing, and sensor networks. We

47 2.4. RELATED WORK: A HIGH-LEVEL PICTURE 31 leave the protocol details for the related work section in each of the remaining chapters Packet Radio Packet radio research, such as the DARPA packet-radio project [41], began studying wireless ad-hoc routing protocols around the 1970s. One of the primary goals of these projects has been to devise protocols that guide the wireless nodes to self-organize into a multihop mobile packet-radio network. It was assumed that the dominant traffic pattern would comprise many independent traffic flows similar to that in the Internet today and in the any-to-any traffic model. Nevertheless, packet radio research sets a foundation of what an ad hoc routing protocol should be like; it should be scalable, expandable, and robust against network dynamics, such as mobility and node failures. Such research goals are influential on the later kinds of wireless networks, such as mobile computing and sensor networks. Similar to sensor networks, typical packet-radio nodes were also primitive, with low-bandwidth wireless technology and limited compute power, memory and energy. They also suffered from the lossy and noisy wireless characteristics. However, instead of taking the probabilistic approach to connectivity that we advocate and building routing topologies upon our definition of connectivity, their approach was to use a link estimator to identify links with low link quality and avoid using them for route selection [41]. In this thesis, we take the holistic approach and integrate link estimation into routing cost functions. There also exists work [54, 70] that overcomes the limitation of memory resources in maintaining neighborhood information in dense networks. In [70], neighbor selection is

done using a random selection process similar to percolation in random graphs; it focuses on the neighbor selection layer while retaining the boolean connectivity assumption. A better approach is presented in [54], which performs link estimation and relies on a candidacy list of potential neighbors to build up link estimates before considering insertion into the neighbor table. We discuss the details of these protocols in Chapter 5. At the routing level, there exist various packet-radio protocols that are distance-vector based but use routing cost functions other than shortest path with hop counts. These include least-interference routing [73], least-resistance routing [62], and maximum-minimum residual capacity routing [14]. These metrics enhance the reliability of communication by routing over less congested or less interfered paths. While such factors are important for improving the quality of service in routing, these protocols still rely on the boolean connectivity assumption; they neither take a probabilistic view of connectivity similar to ours nor define routing cost functions that directly address the fundamental lossy characteristics, which is the theme of this thesis.

2.4.2 Mobile Ad Hoc Networks (MANET)

In the early 1990s, mobile computing networks began to emerge along with the advent of laptop computers and local-area wireless networks. The technology became prevalent as short-range, spread-spectrum radios became affordable and protocol specifications became standardized, with IEEE 802.11 [1] being the most widely used. Although mobile computing and packet-radio networks share similar high-level design goals, the degree of mobility in mobile computing is assumed to be high, because users are expected to carry

their laptops and move around within an office or a building. The idea is to maintain a multihop network over a set of laptop computers to support any-to-any traffic among these nodes, and users are expected to move within an area such that the network remains connected. Supporting mobility efficiently, rather than constructing optimal routing paths such as shortest paths, therefore became the first priority for MANET routing protocols. However, as mobile computing has become more pervasive in indoor environments, a significant amount of ad-hoc communication research has shifted to traffic patterns in which nodes communicate directly with a base station. That is, mobile nodes do not need to forward each other's messages, since each node can communicate directly with one or more base stations, and mobility handling reduces to a problem of handoff from one base station to another. This kind of infrastructure mode has become the dominant form of wireless communication for portable computing. Building upon the packet-radio literature, many important ad-hoc routing protocols for mobile computing have emerged. However, improvements in technology and different usage scenarios have led these protocols to leave out some of the system issues that packet-radio protocols had addressed. In particular, the abundant memory and compute resources of a laptop computer relax the constraints on protocol simplicity and the requirement for neighborhood management. Since mobility yields intermittent connectivity, having some connectivity is often adequate, as packet losses can usually be recovered through link retransmissions. As a result, many of these protocols assume boolean connectivity rather than performing link quality estimation, as such information may become stale quickly or be handled by underlying link-layer mechanisms.

There are improvements over the basic distance-vector protocol. For example, DSDV [58] uses a simple mechanism to convey route freshness that guarantees cycle-free topologies and solves the well-known counting-to-infinity problem [9]; both problems can arise easily when nodes are moving, since stale routing information can lead to cycles. Although the routing cost function still uses hop count, DSDV optimizes for path freshness before minimizing path length. Another improvement in the mobile computing literature is source-initiated on-demand routing, which exploits the fact that, since traffic is composed of many independent flows, the system should only maintain state to support actual traffic. DSR [44] and AODV [59] fall into this category. These protocols rely on the source node, which must know the destination's network address, to initiate route discovery to the destination through some form of flooding; the reverse path of the flood is used as the routing path. They also assume that link qualities are generally good, although AODV does have mechanisms to avoid routing over asymmetric links. The goal of these protocols is to identify a path to the destination quickly rather than to optimize for a metric such as shortest paths. A special kind of receiver-based, source-initiated on-demand routing called Gradient Routing (GRAd) [67] has also emerged. It supports the same any-to-any traffic model as other on-demand routing protocols. It first establishes a gradient of routing cost using the same route discovery mechanism to build up reverse-path routing. However, instead of the sender selecting the next node to forward a message via a unicast packet, all forwarding is done as a local broadcast by the sender, and each receiver decides whether it should forward

51 2.4. RELATED WORK: A HIGH-LEVEL PICTURE 35 or not. This form of receiver-based routing is resilient to mobility, but may carry a high communication overhead in redundant forwarding. Another kind of on-demand routing protocol, TORA [57], that is based on a link reversal technique has also emerged. It assumes boolean connectivity and links are bidirectionally good in general. It first discovers a network through flooding and builds a directed-acyclic graph (DAG) rooted at the source. To handle mobility and link failure, it relies on the discovered DAG and maintains the DAG through link reversal mechanisms. Since the topology is a DAG, it is always loop free. The protocol is more complex than DSR and AODV and requires global time synchronization to establish temporal order Sensor Networks The advent of sensor networking in the late 1990s, as pioneered and exemplified by Directed Diffusion [40], shows a form of networking that is like packet radio, but supports very different traffic scenarios and applications. In general, sensor network nodes are relatively immobile; however, connectivity variations and node failures can be quite frequent. The resource constraints are tight in both packet radios and sensor networking. Multihop traffic is the norm of communication. One of the major characteristics of sensor networks is to overlap communication and computation in the form of in-network processing. Since communication is more expensive than computation in term of energy, processing within the network helps to reduce the amount of multihop communication and to increase network life time. Therefore, in-network processing is a key to cope with the tight resources available in sensor networks.
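As a toy illustration of why in-network processing saves communication, the following sketch counts packet transmissions per collection epoch in a simple chain of nodes; the numbers and the chain topology are hypothetical and are not taken from any experiment in this thesis.

```python
def tx_per_epoch(n_nodes, aggregate):
    """Packet transmissions per data-collection epoch in an n-node chain rooted
    at the sink. Without aggregation, every reading is forwarded hop by hop;
    with in-network aggregation (e.g., a partial average), each node transmits
    exactly one packet combining its own reading with its child's aggregate."""
    if aggregate:
        return n_nodes                       # one (partial) aggregate per node
    return n_nodes * (n_nodes + 1) // 2      # reading at depth i travels i hops

print(tx_per_epoch(10, aggregate=False))  # 55 transmissions
print(tx_per_epoch(10, aggregate=True))   # 10 transmissions
```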

Directed Diffusion shows a sample framework of how in-network processing and multihop routing can be done to support the intended applications of sensor networking. The framework does not dictate the underlying protocols and implementations, allowing applications the flexibility to devise new protocols. The idea is to have a sink node issue an interest in some particular data, similar to route discovery in on-demand protocols, except that discovery is now destination-initiated and every node in the network can potentially be a source of data. Nodes holding data that match the interest send it back along the reverse paths of the routing tree, with intermediate nodes performing in-network processing such as aggregation. Note that such a traffic pattern does not require a network-wide addressing scheme, since nodes need not know the address of the sink node; they simply need to know the link address of the next hop.

Table 2.2 summarizes the different dimensions across the different types of networks discussed in this section.

|                      | Packet Radio                           | Mobile Computing                                       | Sensor Networks                                                      |
| Mobility             | Depends on scenarios                   | Treated as first priority                              | Relatively immobile; adapt slowly                                    |
| Resource constraints | Limited                                | Abundant                                               | Limited                                                              |
| Traffic pattern      | Independent flows, multihop, any-to-any| Independent flows, single-hop to base station, any-to-any | Correlated, multihop, many-to-one (few), in-network processing    |
| Addressing           | Internet-like addressing               | Internet addressing                                    | Network-address-protocol free                                        |
| Source of variations | Users and the environment              | Mainly from users                                      | Mainly from the environment                                          |

Table 2.2: Summary of the differences among the different wireless networks.

The source-initiated on-demand routing protocols, used mainly to support independent flows in mobile computing, do not match the many-to-few data collection traffic

pattern in sensor networks. However, as discussed in Section 2.2.1, this network-wide dissemination process can be used to establish a reverse-path routing tree for data collection if the route discovery is sink-initiated. That is, sink-initiated tree-based routing faces the same issues as source-initiated on-demand routing, since the discovery process is similar and the reverse path is likewise used to route data. For routing in packet radio and mobile computing, topology stability is generally not a concern, since nodes are expected to move anyway and getting data reliably to the destination is the ultimate goal. For sensor networks, however, topology stability is very useful, since it benefits in-network processing by allowing high-level algorithms to rely on a stable routing tree to perform aggregation. Therefore, achieving topology stability is one of the main goals of our study. Besides tree-based routing, there is other related work on routing in sensor networks that falls under any-to-any routing based on establishing either geographic or virtual coordinates, as discussed in Section 2.2. The geographic approach requires additional localization support, such as a Global Positioning System (GPS), while the virtual coordinate approach is still in its early stage of research [65]. Many of these routing protocols for sensor networks assume that lossy connectivity is hidden by low-level mechanisms, so that routing can safely operate on a well-defined boolean connectivity graph and rely on link-failure mechanisms to adapt to changes. As discussed in the previous section, we take a different approach and expose the underlying connectivity to the routing layer as a probabilistic metric, so that it can make the best routing decisions when different degrees of connectivity are encountered.

54 38 Chapter 3 Understanding Link Characteristics The starting point for development of a practical topology formation and routing algorithm is to understand the dynamics and loss behavior of wireless connectivity in sensor networks under various circumstances. Rather than carry along with a detailed model of the channel or the propagation physics, we have sought a simple characterization of connectivity through empirical studies over our sensor networking platform discussed in Chapter 2. Our experimental results show that connectivity does not resemble the circular-disc model used in many formulation of distributed algorithms. To the contrary, it is irregular, time-varying and probabilistic. We present a simple model to approximate these empirical results, such that synthetic packet traces resembling real-world packet losses can be generated to support higher level protocol design and simulations. We also measure the actual channel capacity under periodic traffic and the effectiveness of using received signal strength to predict link

55 3.1. CONNECTIVITY, RANGE, AND LINK DYNAMICS 39 quality. All these observations lead us to stress the need to take this probabilistic concept of connectivity all the way up to the routing level. 3.1 Connectivity, Range, and Link Dynamics With primitive, low-power radios, sensor networks face wireless characteristics that tend to be more noisy and lossy than those found in typical wireless computer networks. Thus, we carefully characterize connectivity observed on our sensor network platform. We perform many empirical experiments to study packet loss behavior over distances across many nodes, qualitatively define the structure of the communication range, observe time variations of link quality, and capture the effects of obstructions and node mobility. Although these experiments are done over the two different Mica platforms, the overall results are similar and yield the same implications for high level protocols Physical Connectivity and Communication Range We measured packet loss rates between many different pairs of nodes at many different distances over a long period of time. Each node is scheduled to transmit packets at a uniform rate and other nodes record the successful reception of these packets. That is, only one node transmits at any given time, and for each transmitter, we obtain numerous measurements at different distances. With sequence numbers embedded in all packets, we can infer losses and generate a sequence of success/loss events that would constitute packet loss traces. We vary the placement and environment of the nodes to explore how they may affect connectivity.

56 3.1. CONNECTIVITY, RANGE, AND LINK DYNAMICS 40 One such representative measurement is summarized in Figure 3.1. It shows a scatter plot of how link quality varies over a distance for a collection of many pairs of Mica nodes. The nodes are placed as a line 3 inches above the ground in an open tennis court; the first 20 nodes are placed 2 feet apart in the line while the rest are 4 feet apart. Each node is scheduled to transmit 200 packets at 8 packets/s, with a power level setting of 50. A number of other settings show analogous structure. As expected, for a given power setting there is a distance within which essentially all nodes have good connectivity. The size of this effective region increases with transmit power. There is also a point beyond which essentially all nodes have poor connectivity. However, in this clear region, some very distant nodes occasionally do receive packets successfully. Between these two points is the transitional region, where the average link quality falls off relatively smoothly, but individual pairs exhibit high variation. Some relatively close pairs have poor connectivity, while some distant pairs have excellent connectivity. A fraction of the pairs have intermediate loss rates and asymmetric links are common in the transitional region. This three-region communication structure is observed on both Mica and Mica 2 platform. These observations imply that the usual concept of the communication range of a pair of nodes can be quite misleading when we consider many pairs of nodes together. The general concept of connectivity used in most studies, such as [58, 44, 59, 57], is either a sharp fall off of link quality at the end of the communication range (as defined by the specification of the radio) or a degradation in link quality over distance at the same rate for all nodes, since they follow the same path-loss model. In fact, the communication range consists of three unique regions, with the noisy, transitional region making up most of the

communication range and being very sensitive to the particular sender and receiver pair.

Figure 3.1: Reception probability (reception success rate versus distance in feet, with the mean and standard deviation of link quality) of all links in a network with a line topology on a tennis court; the effective, transitional, and clear regions are marked. Each circle represents the link quality of a directed edge in the topology, and edges at the same distance can have very different reliability. Note that each link pair appears twice, to indicate link quality in both directions.

Such a large transitional region can give a false impression that the reliable communication range is very large, especially when a few long reliable links do exist. In a dense deployment, nodes are close together and many neighboring nodes fall within the effective region, so good connectivity for routing should exist. If the deployment is too sparse, most of the neighbors fall in the clear region and a network cannot be established. If the network is only dense enough that all of its links fall within the transitional region, reliable routing is still difficult, since the underlying links that make up the derived connectivity graph for routing can have large variations in reliability. Therefore, we stress the importance of spacing nodes within the effective communication range in actual deployments. One can achieve this by measuring the effective

58 3.1. CONNECTIVITY, RANGE, AND LINK DYNAMICS 42 communication range in each deployment at the desired transmit power, and using the resulting estimation of the effective communication range to guide the nodal distance. An alternative is to rely on protocols to configure the system to achieve this property and adapt to a given deployment Time Variations Our observations have focused on the relationship between link quality and distance. We now turn to observations that study the time variations of link quality even when nodes are stationary. We start with a fixed source node sending to a receiver at a given distance in an indoor environment. Figure 3.2 shows a situation where a transmitter sends 8 packets/s to a receiver 15 feet apart for the first twenty minutes. Note that the mean is about 20%, but the fluctuations range from ±20% to ±10%, using a sample size of 240 packets. Although there are no observable interferences, link quality varies a lot within such a short period. The same pair of nodes were placed 8 feet apart after the first 20 minutes and remained stationary for more than four hours. We see that link quality again undergoes abrupt changes in these four hours. For example, it exhibits a mean of about 65% with about a ±10% swing, using a sample size of 240 packets. Despite the fluctuations, this mean and the fluctuation swing remain relatively stable over the course of the experiment. This implies that if a link is characterized, the time window for the characterization to predict future link quality can be relatively long, given there is no observable interference from other traffic. However, this is not true in all cases. Instances exist where the mean and the degree of variations in link quality can vary over time so much that link characterization

needs to adapt to these changes quickly.

Figure 3.2: Reception probability over time (minutes). After 20 minutes, the sender is moved from 15 ft to 8 ft from the receiver and remains stationary for four hours.

Figure 3.3 shows link quality varying over a pair of nodes deliberately placed close to the end of their communication range in an indoor laboratory environment. Link quality varies from 0% up to 70% over a period of 7 hours. This evidence suggests that even when nodes are immobile and no observable influences occur in the physical environment, link quality can vary significantly over time. As a result, agility in link estimation becomes an important metric in our study of link estimators in the next chapter.

3.1.3 Obstructions and Mobility

In many cases, obstructions from a moving object, such as a person, can affect the quality of links between nodes. We attempt to capture such effects by measuring how

the packet reception rate changes when a person stands in the vicinity of the receiver.

Figure 3.3: Link quality variation over a 7-hour period in an indoor laboratory environment.

Figure 3.4 shows the result of an experiment in which a person deliberately stands beside the receiver for about five minutes (from minute 15 to minute 20). It shows that the reception probability is very sensitive to the person's position, with discrete changes of substantial magnitude: at some positions the obstruction blocks communication, at others it has no effect, and at others it actually improves matters. These traces are from an indoor environment; however, our outdoor traces show similar behavior. Figure 3.5 shows a more complete scenario that involves moving a sender and receiver pair to different distances. Although node mobility is not expected in many sensor networks, it is easy to envision situations in which nodes are moved by external forces in the environment, such as the wind or the moving objects being sensed. Again, we observe how the packet reception rate of a link changes as we deliberately move the sending node to different distances from the receiver.

Figure 3.4: Obstruction effects on packet loss behavior. A person deliberately stands beside the receiver in the interval from minute 15 to minute 20.

The experiment begins with the sender placed 14 feet from the receiver. After 9.5 minutes, the sender is placed 8 feet from the receiver. At 17.5 minutes, it is placed 4 feet from the receiver. At 21 minutes, it is moved back to 12 feet from the receiver. Finally, at 26.5 minutes, it is placed 4 feet from the receiver again. The results show a strong correlation of link quality with the distance between the two nodes. In these experiments we have observed instances of abrupt changes and substantial variations in link quality between a pair of nodes. These instances reinforce the need for link estimation to track such changes quickly and accurately in the derived connectivity graph, so that the routing process can be aware of them and adapt to them.

62 3.1. CONNECTIVITY, RANGE, AND LINK DYNAMICS 46 Figure 3.5: Movement effects on packet loss behavior. Transmitter is deliberately moved to different distances at various times Irregular Connectivity Cell Our observations have focused on connectivity issues between a pair of nodes. The lossy nature can best be seen if we observe a typical connectivity cell over a set of nodes in a two-dimensional field. Figure 2.3 shows such a cell which illustrates how the packet reception probability of a sender falls off over a 150-node network deployed as a grid on an open tennis court. The experiment is done over the RFM radio with a power setting of 70. As seen from the graph, the connectivity cell is very irregular, with the effective region covering a much smaller area than the transitional region. Furthermore, there exist many nodes whose probability of reception is less than 20%; these nodes would be treated as neighbors at the protocol level if the boolean connectivity assumption is used. Not shown is the degree of link quality asymmetries, which is expected to be significant in the transitional

63 3.1. CONNECTIVITY, RANGE, AND LINK DYNAMICS 47 region Implications: Connectivity and Hop-Count An important implication of this section is to give a new perspective on the definition of connectivity and a new understanding of communication range and hop-count. The lossy data reported in this section led us to define connectivity relative to link estimation. That is to say, without knowing the actual link quality, connectivity is meaningless. For example, a link with more than 95% loss rate is not useful at all. In the process of building a derived connectivity graph for routing purposes, therefore, the link quality of each edge must be defined for the graph to be meaningful. With this probabilistic view of connectivity, we can provide a better definition of communication range. Our data shows that the communication range indeed consists of three distinct regions: effective, transitional, and clear. A conservative approach would take the communication range as the effective region where the link qualities for all links in that region is above 90% in both directions. As shown in our data, this is much shorter than the observed connectivity cell. The definition of a hop-count becomes more complicated. The usual concept is to bind hop-count with connectivity. However, when connectivity is probabilistic, the concept of a hop-count needs to be revisited. One can define all nodes with any physical connectivity as a one-hop neighbor. However, many of these one-hop neighbors would be far away and have very unreliable links. Once they are considered as neighbors, they are attractive for routing since they may yield shortest routing paths. This is the reason why routing protocols

that simply rely on hearing a message to define connectivity perform poorly. Therefore, we only define nodes that a node hears as potential neighbors. Following our probabilistic perspective on connectivity, one can define a one-hop neighbor relative to link estimation, for example, as a node with link quality above some threshold. This in effect creates a logical connectivity graph that the neighborhood management process is responsible for. We will revisit the concept of a hop-count as we discuss the neighborhood management process in Chapter 5 and the routing process in Chapter 6.

3.2 Modeling the Observed Link Characteristics

Modeling the essence of the time-varying characteristics of the three-region connectivity structure, when applied to a large field of nodes as seen in the previous section, is the main objective of this section. Capturing these behaviors is an important step towards making simulations of the packet loss dynamics of real networks possible for protocol design and evaluation. Rather than adopting detailed models that explain the complex sources of packet loss in a real network, we abstract these complexities with a probabilistic link behavior model for simulations, built from traces collected empirically. We first compute the packet loss mean and variance from the traces collected for Figure 3.1 to create a link quality model with respect to distance. For each directed node pair at a given distance, we associate a link quality (packet loss probability) based on the mean and variance extracted from the empirical data, assuming the variance follows a normal distribution. An instance showing how this model captures a node's connectivity cell is shown in Figure 3.6; it matches well with the spatial irregularity shown in the empirical observations in Figure 2.3. This model of packet loss characteristics is used in all simulation studies in this thesis.
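A minimal sketch of this model is shown below. The (distance, mean, standard deviation) entries are placeholders standing in for the empirical curves behind Figure 3.1, and the normal draw is clipped to a valid probability.

```python
import bisect
import random

# Placeholder empirical curves: (distance_ft, mean_prr, std_prr) triples.
# In the real model these come from the measurements behind Figure 3.1.
EMPIRICAL = [(5, 0.95, 0.03), (15, 0.80, 0.10), (30, 0.45, 0.25), (50, 0.10, 0.10)]

def sample_link_quality(distance_ft, rng=random):
    """Assign a reception probability to one directed link at this distance."""
    dists = [d for d, _, _ in EMPIRICAL]
    i = min(bisect.bisect_left(dists, distance_ft), len(EMPIRICAL) - 1)
    _, mean, std = EMPIRICAL[i]
    p = rng.gauss(mean, std)          # normal variation around the empirical mean
    return min(1.0, max(0.0, p))      # clip to a valid probability

# Each directed edge gets its own draw, so the two directions of a pair (or two
# pairs at the same distance) can end up with very different reliability.
```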

Figure 3.6: Cell connectivity of a node (the sender) in a grid with 8-foot spacing, as generated by our link quality model (reception probability plotted over the grid X and Y coordinates).

3.3 Binomial Approximation of Stationary Packet Loss Dynamics

The previous section illustrates how empirical traces are used to assign the distribution of packet losses over a distance. In this section, we investigate whether a binomial approximation, or coin flipping, is adequate to capture the instantaneous variations of packet loss, given a fixed average packet loss behavior. Since the outcome of packet reception is either a loss or a success, a simple model is to treat each packet reception as a Bernoulli trial, with 1 denoting a success and 0 denoting a loss, where p equals the probability of success

and $1 - p$ the probability of loss. For each link, the value of $p$ (or $1 - p$) can be obtained from the packet loss probability model discussed above. The binomial approximation assumes that the trials are independent and identically distributed. In reality, this may not be the case, but we can compare this approximation with the empirically collected traces and evaluate whether it is valid. To investigate whether packet loss in our traces follows the binomial distribution when there is no observable physical influence, we plot quantiles extracted from the stationary portion of our data in Figure 3.2 against quantiles derived from the theoretical Binomial distribution. The expected value, or average packet success rate, of the data set is 65%, so we set the expected value of the Binomial distribution to this value. The resulting quantile-quantile graph is shown in Figure 3.7. By a quantile, we mean the fraction of points below the given value. If the data in Figure 3.2 follow the Binomial distribution, the data set should be linear along the 45-degree line. Figure 3.7 shows a good match when the quantile is near the mean, but a slight deviation at both extremes. This suggests that the empirical data have a larger degree of variance than the Binomial distribution model. Nonetheless, Figure 3.7 suggests that the Binomial distribution is a fairly good model to approximate the instantaneous dynamics of packet loss. Furthermore, the Binomial distribution also supports the macroscopic behavior observed in empirical link quality variations: variation is significant when packet loss is around 50%, while it is minimal at both extremes (0% and 100%).
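The quantile-quantile comparison can be reproduced along the following lines, assuming SciPy is available. The window size of 240 packets matches the sample size used above, but the helper itself is only an illustration.

```python
import numpy as np
from scipy.stats import binom

def qq_points(trace, window=240, p=0.65, num_quantiles=20):
    """trace: array of 0/1 reception outcomes.
    Returns (empirical_quantiles, binomial_quantiles) of per-window success
    counts; if the trace were truly Bernoulli(p), the points would lie near
    the line y = x."""
    trace = np.asarray(trace)
    n_windows = len(trace) // window
    counts = trace[:n_windows * window].reshape(n_windows, window).sum(axis=1)
    q = np.linspace(0.05, 0.95, num_quantiles)
    empirical = np.quantile(counts, q)
    theoretical = binom.ppf(q, window, p)
    return empirical, theoretical
```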

Figure 3.7: Quantile of empirical data against quantile of the theoretical Binomial distribution (the line y = x is shown for reference).

3.4 Synthetic Trace Generation

In this section, we expand our link quality model so that we can synthetically generate packet traces that resemble empirical traces, as a means of initial evaluation. This ability allows us to evaluate protocols and link estimators using mostly synthetic traces, over which we have full control and the information required to drive systematic studies. The previous sections allow us to model packet loss dynamics for a given loss probability. To model changes in link quality resulting from mobility or obstacles at the receivers, we make the loss probability p a piecewise function of time, p(t). To generate a synthetic trace similar to that in Figure 3.5, we define p(t) as the sequence of steps shown in Table 3.1. These values are chosen by partitioning the traces of Figure 3.5 into five regions and matching the average reception probability, computed over 30-second intervals, within each region.

Figure 3.8: Time-series comparison of empirical traces with simulated traces.

Table 3.1: Definition of p(t) used to model Figure 3.5 (five piecewise-constant steps; t in minutes).

The resulting trace derived from p(t) using the binomial approximation is surprisingly close to the empirical trace, as shown in Figure 3.8. The simulated trace captures the essence of the empirical trace, except for a smaller degree of variance due to the deficiency of the binomial approximation. This form of synthetic trace generation is used heavily in evaluating the different link quality estimators in Chapter 4.
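A minimal sketch of the trace generator follows: the piecewise reception probability p(t) drives an independent Bernoulli draw per packet. The breakpoints and probabilities below are placeholders standing in for the actual values of Table 3.1.

```python
import random

# Placeholder piecewise schedule: (end_time_minutes, reception_probability).
# The real values are the five steps of Table 3.1, extracted from Figure 3.5.
SCHEDULE = [(9.5, 0.6), (17.5, 0.9), (21.0, 0.97), (26.5, 0.7), (35.0, 0.97)]

def p_of_t(minutes):
    for end, p in SCHEDULE:
        if minutes <= end:
            return p
    return SCHEDULE[-1][1]

def synthetic_trace(duration_min=35.0, pkts_per_sec=8, rng=random):
    """Generate a 0/1 reception trace by flipping a biased coin per packet."""
    total = int(duration_min * 60 * pkts_per_sec)
    return [1 if rng.random() < p_of_t(i / (60.0 * pkts_per_sec)) else 0
            for i in range(total)]
```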

69 3.5. EFFECTIVE CHANNEL CAPACITY: SINGLE AND MULTIHOP Effective Channel Capacity: Single and Multihop Another important issue that we need to understand about the link layer is the difference between the channel bit rate, as defined by physical hardware capability, and the effective channel bandwidth, as defined by the performance of the media access control (MAC) layer when multiple nodes share and contend for the same wireless channel. Since the channel is normally shared among different nodes in a common connectivity cell, only one transmitter can access the channel and send at a given time; otherwise, packet collisions will occur. The goal of the MAC layer is to arbitrate such channel accesses among the different senders to avoid collisions. As a result, in order to quantify the actual deliverable bandwidth at the link layer under heavy traffic conditions from multiple senders, we need to measure it explicitly for the two Mica platforms. Both platforms use a similar CSMA MAC layer as discussed in the previous chapter. Figure 3.9 shows how the channel utilization changes as the offered load increases by adding more transmitting nodes. Each node is set to send periodically at 10 packets/sec. The channel utilization peaks when the offered load is about 30 packets/sec, which is equivalent to about 75% channel utilization. As the number of nodes increases to 4, the offered load reaches 40 packets/s, which is close to the theoretical capacity, and significant backoff is seen as a result. The effective bandwidth drops back to about 20 packets/sec (or about 50% utilization). Figure 3.10 shows the channel capacity for the ChipCon radio on the Mica 2 platform. A series of improvements from TinyOS-1.1 over channel utilization under different offered load is shown here. The new improved B-MAC [61] not only increases the channel

capacity to about 50 packets/sec from 42 packets/sec, but also achieves an 85% channel utilization under congested traffic. Furthermore, channel utilization is sustained at roughly the same level, without much degradation, as the offered load increases.

Figure 3.9: Channel capacity of the Mica/RFM platform using the TinyOS 1.0 radio stack (aggregate delivered bandwidth in one cell, in packets/s, versus the number of nodes, each with an offered load of 10 packets/s).

For multihop traffic, it is important to realize that the effective bandwidth available is only 1/3 of the measured single-cell utilization. This is a theoretical limit, because each multihop packet occupies a communication cell three times during forwarding: from a child to its one-hop parent, from the one-hop parent to the child's two-hop parent, and finally from the two-hop parent to the three-hop parent. In these three cases, the packet occupies the one-hop parent's cell three times. If the path has more hops, the packet occupies the communication cell of the parent at each hop three times, and thus the effective bandwidth is reduced to 1/3. As a result, bandwidth is a tight resource for multihop traffic.
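As a back-of-the-envelope illustration of this reduction (the node count is hypothetical; the MAC figures are the B-MAC numbers quoted above):

```python
def per_node_rate(mac_capacity_pps, utilization, nodes_in_cell):
    """Sustainable per-node origination rate in a collection tree: each forwarded
    packet occupies a parent's cell roughly three times, so only about 1/3 of the
    delivered MAC bandwidth is available to multihop traffic, shared by all
    nodes whose traffic crosses that cell."""
    effective = mac_capacity_pps * utilization / 3.0
    return effective / nodes_in_cell

# e.g., B-MAC: ~50 pkt/s at ~85% utilization, 20 nodes routing through one cell
print(per_node_rate(50, 0.85, 20))   # ~0.7 packets/s per node
```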

Figure 3.10: Channel capacity of the Mica2/Chipcon platform using different versions of the TinyOS radio stack.

3.6 Received Signal Strength and Link Quality

One of the interesting link-layer characteristics related to link quality is received signal strength. It would be very attractive if link quality could be reliably inferred simply by measuring the received signal strength of each packet. Theoretically, the bit-error rate (BER) is expected to correlate directly with the received signal-to-noise ratio of the packet, and the packet error rate is a function of the BER and the coding. A measure similar to the signal-to-noise ratio that can be obtained on the current Mica platforms is the received signal strength indicator (RSSI). To determine whether RSSI values can be used as an indicator of link quality, we need to collect experimental data.

The RFM radio on the Mica platform does not directly support RSSI measurements; we would need to measure the baseband signal indirectly to infer RSSI values. The CC1000 radio, however, provides fairly accurate RSSI measurements by default. Therefore, we performed our study on the Mica 2 platform. The experiments were done over an open grass field, with 20 nodes deployed 2 meters apart in a line topology and situated 3 inches above the ground. Each node took turns transmitting 200 packets, one every 250 ms, while all other nodes listened and collected link reliability statistics. Figure 3.11 shows the relationship between average RSSI values and link reliability from these experiments; each data point represents a link in one direction. Note that lower RSSI values on the mote mean a stronger received signal. The data show that if the RSSI value is below a threshold of around 300, link quality is very good, at about 90% reliability. In general, the graph shows a good correlation between RSSI and link quality. However, the circled region reveals that some links end up having zero reliability, because of failing CRC checksums, even though their RSSI values are below the 300 threshold. These are not asymmetric links; they are in the clear region, as only a few packets were received and no packets at all were received in the opposite direction. If we used 300 as an RSSI threshold to infer reliable links, these links would be false positives, because they in fact have zero reliability. One may argue that a stronger threshold, such as 200, would filter out the weak links and leave only reliable ones. A stronger threshold may achieve this goal; however, outside of a controlled experiment, collisions can affect link quality even when the received signal is very strong. Figure 3.12 shows that, under other traffic conditions, the reception probability of a link drops below 10% even though the received signal strength stays relatively strong and stable.

Figure 3.11: Relationship of RSSI signal strength and link quality on the Mica2/Chipcon platform.

All in all, these results show that RSSI provides a good hint of link reliability, but situations where unreliable links have strong RSSI values do occur. Furthermore, as with any other threshold-based selection, it is difficult to find a generic threshold that works in all cases. Nevertheless, a strong RSSI value is certainly a useful hint of a potentially reliable link, and it can be a useful mechanism to quickly select reliable links for higher-level protocols; one example of this usage is discussed in [71].
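A small sketch of such an RSSI-based pre-filter, and of how its false positives can be counted against measured reliability, is shown below. The 300 threshold is the one examined in Figure 3.11 (lower raw readings mean a stronger signal on this radio), and the 50% reliability cutoff is an arbitrary illustration.

```python
def rssi_prefilter(links, threshold=300):
    """links: list of dicts with 'rssi' (mean raw reading; lower = stronger)
    and 'prr' (measured packet reception rate).
    Returns the links the threshold accepts, plus the false positives it lets
    in: strong-RSSI links whose measured reliability is nevertheless poor."""
    accepted = [link for link in links if link['rssi'] < threshold]
    false_positives = [link for link in accepted if link['prr'] < 0.5]
    return accepted, false_positives
```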

Figure 3.12: Example showing that strong RSSI values may not be a good indicator of link quality (reception probability and RSSI values over time, in minutes).

3.7 Related Work

Packet loss characteristics in sensor networks have also been studied extensively in other research efforts. A thorough study of the link characteristics of the Mica/RFM platform over indoor, outdoor, and habitat environments has been done in [82]. The loss characteristics over distance are similar to what we observed, with a large portion of the communication range (the transitional region) consisting of links with large variations in reliability and degree of asymmetry. Furthermore, that study showed that these characteristics persist across the different environments studied and that forward-error-correction coding does not help to reduce the fraction of unreliable links in the gray (transitional) region. They also found that received signal strength is not a good indicator of link quality. Similar results on packet loss behavior over distance are documented in [20]. In another experimental study [17], done on both Mica and Mica2 platforms across different environments, the time variations of packet loss are also similar to our findings. These studies concluded that imperfect hardware calibration across the different radios and antennas on each node is likely the main reason for the wide variations in link quality and asymmetry

in the transitional region.

There exists extensive prior work on modeling loss characteristics in various wireless networks. For example, [33] used a trace-based approach to modeling wireless errors, [13] collected error traces on WaveLAN and developed a Gilbert model for packet losses, and [8] collected GSM traces and created a Markov-based channel model. The loss characteristics observed in these experimental studies differ from the results we observed on our platform. Much of the WaveLAN and GSM work was done over one or a few pairs of nodes; without a large number of nodes collecting packet loss traces, it does not capture the lossy characteristics, such as the extent of the transitional region, that we observed in our experiments. Furthermore, these more sophisticated wireless platforms potentially react very differently to background interference, environmental effects, and mobility than our low-power radios. Nonetheless, we draw upon the methodology developed in these studies to build an empirical characterization of our regime and to study how well the established techniques carry over.

3.8 Connectivity: A Probabilistic Perspective

In this chapter, we have shown, through many empirical measurements, that wireless connectivity is far from a circular-disc model and that it is much more appropriate to take a probabilistic perspective. That is, connectivity should be defined relative to link estimation. A connectivity cell not only falls off irregularly, but the communication range can also be classified into three distinct connectivity regions, each with very different link quality characteristics. The relationship among node placement, effective communication range, and RF transmit power is an important one to understand at each deployment site.

It should be characterized to ensure that there are neighboring nodes that fall in the effective communication region and are connected by reliable links. As a result, blindly using geographic techniques to establish routes would yield poor routing performance. The majority of links are expected to fall in the transitional region, or gray area, where link quality can vary significantly and link asymmetry is common. There also exist nodes in the clear region that have low, but non-zero, connectivity. These complications make it unsafe to assume that a node is a neighbor simply because a message was heard from it, since such a node can fall in the clear region and the probability of hearing it again can be very low. Furthermore, our empirical results show that link quality can be time-varying even when nodes are immobile and no observable physical influences are present. Therefore, nodes must run an on-line local process to discover neighbors and maintain statistics that characterize link quality probabilistically for each neighbor; this is the first local process of the holistic approach shown in Figure 2.4. This process lays the foundation for the network to discover itself and characterize connectivity, building a derived connectivity graph in which each edge carries bi-directional link qualities. Together with the neighborhood management process, a connectivity graph is built. We take this probabilistic perspective all the way up to the routing layer, where a reliable routing topology is built upon the discovered connectivity graph. This holistic approach is a fundamental design choice that we take in coping with the lossy dynamics found in sensor networks. We discuss the three local processes in detail in the next three chapters.

77 61 Chapter 4 Characterizing Connectivity using Link Estimators Our empirical study of the wireless characteristics of our sensor networking platform has led us to take a probabilistic perspective on connectivity. That is, connectivity should be defined relative to the link quality obtained through link estimation. Thus, an on-line, distributed link estimation process is an important building block for self-organizing network protocols. Following our holistic approach, reliable multihop routing must be built upon a self-discovered connectivity graph. Each node must locally collect statistical measurements of its connectivity quality with respect to its neighboring nodes in creating such a graph. Higher-level protocols can use these statistics to select paths that are efficient and reliable for multihop communication. Designing such an estimator is not as straightforward as it might seem because it must strike a balance between stability, agility, and resource usage as a sensor network is

78 4.1. LINK ESTIMATION AS PART OF NETWORK SELF-ORGANIZATION 62 highly resource-constrained. Thus, simplicity and efficiency are the two important design principles that we follow. As a result, we take a passive rather than an active approach to link estimation. We propose a general estimator framework that allows us to consider different kinds of estimation schemes within the same evaluation platform. We describe a set of metrics that are important for evaluating the different estimators. These metrics are compared in order to find the best link estimator. We also study the intricate relationships among agility, stability, and the amount of history required that will help us understand the effects in tuning each estimator. With the methodology explained, we define the objectives in tuning the estimators, and present many different candidate estimator designs along with the tuned parameters in meeting the tuning objectives. Such process allows us to fairly identify the best estimator among our candidate estimators. The related work on link quality estimation is rather narrow, but abundant. We attempt to give an overview of the different techniques that researchers have used. Finally, we state the limitations of our link estimation approach, and address implications of these limitations for multihop routing protocols that build upon link estimation. 4.1 Link Estimation as Part of Network Self-Organization Vast networks of low-power, wireless devices, such as sensor networks, raise a family of challenges that are variants of classical problems in a qualitatively new setting. One of these is the question of link loss rate estimation based on observed packet arrivals. Traditionally, this problem has been addressed in such contexts as determining the degree of error coding redundancy or the expected number of retransmissions on a link. In sensor

networks, the problem arises as part of network self-organization, as nodes must discover neighbors and estimate the link qualities among them in order to build a connectivity graph for higher-level processes such as routing. To realize our probabilistic view of connectivity, each node keeps track of a set of nodes that it hears, either through packets addressed to it or packets it snoops off the channel, and builds a statistical model of the connectivity to and from each. Thus, we would like to gain a good estimate of $P_{ij}(t)$ at each node j from the packets it hears (and does not hear), so we can use this to define the weight of each edge in a connectivity graph. Maintaining many local link success (or loss) rate estimates is essential for self-organization into multihop routing topologies in sensor networks. However, there are several challenges. The storage capacity of the nodes is very constrained and their processors are not very powerful, so the estimators must use very little space and be simple to compute. Furthermore, it is not sufficient that the estimator eventually converge, since the link status changes fairly quickly. We want the estimator to be agile and to have a small settling time, so that route selection can adapt reasonably quickly to changes in the underlying connectivity. However, there are also transient variations in the link, and we want a stable estimator, so that the routing topology does not change chaotically. Moreover, fluctuations and errors in the estimate may introduce (temporary) cycles in the routing graph, due to inconsistent partial information used in route selection. These desires are clearly in conflict: stable estimators tend to be less agile, and agile estimators tend to be less stable, especially ones that are inexpensive to compute. The constraints and conflicts discussed above motivate us to investigate the be-

80 4.2. ESTIMATOR DESIGN FRAMEWORK AND METHODOLOGY 64 havior of a wide range of link estimators in the context of low-power wireless connectivity for the purpose of multihop route selection. The basis for estimation is the sequence of packets that a node observes. Thus, we can view this as a series of binary events over time. We only get to observe the 1s directly - the arrival of a packet. However, when we receive a packet we can infer the intervening zeros from the sequence number. Of course, if we stop hearing from a node, the zeros are silent; we cannot, in general, know the expected packet rate from the node, even if we know its sample rate, since it may be performing local data compression, as well as routing traffic for other nodes. So, additional measures are required to estimate silent losses, which should be incorporated into the design of link quality estimators in general. Our proposed link estimation process only yields an in-bound quality estimation because only packet reception statistics are collected. However, obtaining the out-bound link quality estimate (success rate of a node s packet as received by neighboring nodes) is as important, since it measures the success rate of forwarding a packet from a node among its neighboring nodes and reveals if link asymmetry occurs. Therefore, a simple and efficient mechanism is required for nodes to obtain out-bound link quality estimates among its neighboring nodes. We will discuss the details in the later part of this chapter. 4.2 Estimator Design Framework and Methodology Our goal is to design a link reliability estimator that is responsive, yet stable, reasonably accurate, simple with little computational requirement, and memory efficient. While there exist potentially many estimation approaches, we focus on ones that passively

snoop on packet arrivals and maintain statistics to estimate link quality. Such an approach can be generalized into a framework, as shown in Figure 4.1, such that different kinds of estimation techniques can be fitted into it and evaluated using the same input and output format. The inputs to the framework include external events from packet arrivals, M, and internal periodic timer events, T. We assume that each packet contains a source ID and a link sequence number. Since a lost packet does not generate any message arrival event, we can only infer packet loss events from the gaps in the link sequence number. Therefore, although M denotes a packet arrival event, it is equivalent to signaling zero or more packet loss events followed by a packet success event. If we denote successes as 1s and losses as 0s, M always consists of zero or more 0s followed by a 1.

The periodic timer event provides a synchronous input to the estimator that allows it to better estimate losses when message events are infrequent. For example, if a node were to disappear, no further message events would occur, yet the connectivity estimate should go to zero. One temporal assumption is that higher-layer protocols can provide a minimum message rate, R, for neighboring nodes. If R is known, estimators can safely infer the minimum number of packet losses over the time period, T, and compensate accordingly. Since the minimum message rate is usually much lower than the actual data rate, a conservative R can be used if R is not known, such that good links can still be estimated correctly while bad links that are heard infrequently will not be mistaken for good ones.

The above process yields an in-bound estimation of the link quality of neighboring nodes. For out-bound estimation, each node must collect from the neighboring nodes

the in-bound link estimation of itself. That is, nodes must exchange their in-bound link quality estimates with others in order to establish bi-directional link quality estimation. Since such a process is similar to a local broadcast, one efficient approach is to piggyback in-bound link estimates on the minimum data rate traffic that is required by link estimation. Such traffic is often realized as route update messages in routing protocols. We will discuss how this is realized in Chapter 6.

The out-bound link estimation can become stale when nodes are moved, obstructed, or disappear. Therefore, a decay mechanism is required to prune such information as it becomes stale. We choose a binary exponential decay mechanism to age an out-bound link estimation if it is not updated for a period of time, which is defined by the parameter OutBoundDecayWindow. That is, if the out-bound link estimation of a neighbor is not updated after a period defined by OutBoundDecayWindow, then at each T event the out-bound estimation is halved until it reaches zero or is updated again. We will revisit the effect of this on routing in Chapter 7.

Our high-level evaluation methodology is as follows. For each estimator, there is a continuous tuning space. To make fair comparisons among different estimators, we pick two meaningful points in the tuning space as our tuning objectives. One point is to tune for best agility given a stability target. The other is to tune for best stability given an agility target. Individual estimators are well tuned to meet these objectives before being compared against each other. We use the simulated trace, denoted as Ŵ(t), shown in Figure 3.8, as the trace generator output, M, for the tuning process. The data rate was set to 8 packets/s, which is the same as that of the empirical trace, denoted as W(t), also shown in Figure 3.8. We also set

R to this value in order to derive the best out of these estimators.

Figure 4.1: General framework of passive link estimators. The inputs are message arrival events M from a trace generator, periodic timer events T, and the minimum data rate constant R; the outputs are a stable estimation P̂ and an agile estimation P̂.

4.2.1 Metrics of Evaluation

We use a set of metrics to evaluate our estimator designs relative to our framework in Figure 4.1; they include settling time, crossing time, mean square error, coefficient of variance, and sum of errors. Recall from Section 3.4 that a synthetic trace can be generated using a step function p(t). Table 3.1 shows a p(t) created based on one of the empirical traces, which we call W(t). This p(t) is used to generate the input M as indicated in Figure 4.1, and P̂ is the current estimation of p. The following defines the metrics in greater detail.

Settling time is the length of time after a step in p(t) before P̂ reaches within ±ε% of p(t) and remains within that error bound. We use a threshold of ε = 10%. Crossing time is the length of time after a step in p(t) before P̂ first crosses within ±ε% of p(t). Since p is known to us at all times, we can compute the mean square error, (p - P̂)², which not only captures the degree of error but also places a higher penalty on large overshoots or undershoots. Coefficient of variance measures how

stable an estimator is after reaching steady state. Sum of errors is used to capture whether the estimator is biased, which may lead to systematic errors. Finally, memory resources and computation complexity measure the degree of efficiency and simplicity of each estimator design.

One important concept to clarify is the measurement of settling time and crossing time. Since both of them depend on the packet arrival rate, they are measured in terms of the number of packet opportunities rather than raw elapsed time. If the average packet arrival rate is known, the two metrics can be converted back to the time domain.

4.2.2 Error, Stability, and Memory Relationship

It is important to understand the tradeoff between estimation stability, agility, and the amount of history used to generate the estimation. Section 3.3 shows that a binomial distribution can be an adequate approximation of the channel variations where the average link quality remains roughly constant. With this independently and identically distributed (i.i.d.) assumption, we can use the central limit theorem to learn the relationship between the number of samples required and the corresponding error bound on our mean estimation (link quality) with a 95% confidence interval. From that, we can infer the relationship among error, stability, and the potential memory requirement.

By the central limit theorem, to yield a 95% confident mean estimation with at most an ε% error of a Bernoulli process, the minimum number of samples n required can be expressed as n > 4p(1 - p)/ε², where p ∈ [0, 1] is the true mean. Although this approximation requires a large n to begin with, various relationships can still be learned from it. First, the true mean p has a non-linear effect on n. The worst case occurs when p = 0.5; this will maximize n for a given ε.
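The bound above follows from the normal approximation to the binomial: the half-width of a 95% confidence interval around the sample mean of n Bernoulli trials is roughly z·sqrt(p(1 - p)/n) with z ≈ 1.96 ≈ 2, so requiring that half-width to be at most ε gives

    z · sqrt(p(1 - p)/n) ≤ ε   ⟹   n ≥ z²·p(1 - p)/ε² ≈ 4p(1 - p)/ε².

For example, p = 0.5 and ε = 0.1 give n ≥ 4(0.25)/(0.01) = 100, and p = 0.9 gives n ≥ 4(0.09)/(0.01) = 36, which are the figures quoted below.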

Second, changes in ε have an inverse, quadratic effect on n. That is, halving the error requires increasing the history size by four times. Third, agility also has a quadratic tradeoff with error, since a smaller n tends to increase ε quadratically. Finally, the expression shows that achieving a 10% error requires n > 100 for p = 0.5 (n > 36 for p = 0.9 if we do not take the worst case). In general, these relationships imply that many samples (O(1/ε²)) are required to achieve a stable and accurate estimator. Agile estimation is possible, but the error will be large. We will explore these effects in simulation when we study our candidate estimators in detail.

4.2.3 Confidence Interval Approximation

Estimating the confidence interval of the estimation P̂ from the tuned estimator can be valuable for higher-level protocols. The typical method is to use the normal approximation of the binomial distribution, which is an appropriate approximation when n is large and p is around 0.5 [47]. However, since p can range from 0 to 1, one would like to understand what technique should be used to estimate the confidence interval for different values of p. Results from [47] show that the normal approximation has less than 4% error as long as p ≥ 0.2 (note that the behavior is symmetric about p = 0.5). Thus, for small p, the Poisson approximation should be used to estimate the confidence interval. It is certainly useful to use the normal approximation, since we would like to estimate semi-good links (p around 0.5). However, for links with large or small p, estimating the confidence interval may not be useful at all. Very bad links are not utilized for routing, while for the very good links, the variance of P̂ is low and the confidence interval is

within our error tolerance for the worst case. Therefore, we do not find an urgent need to approximate the confidence interval using the Poisson distribution.

4.3 Estimator Design and Evaluation

In this section, we first introduce the basic terminology that we will use to standardize the descriptions of the six different estimator designs that we will discuss in detail. We then discuss our tuning objectives such that each estimator can be compared fairly at the end. The six estimator designs are EWMA (exponentially weighted moving average), Flip-Flop EWMA, moving average, time-weighted moving average, Flip-Flop packet loss and success interval with EWMA, and WMEWMA (window mean with EWMA). These estimators are chosen because they are simple estimators that utilize relatively small storage space.

4.3.1 Terminology

We first establish the relevant terminology that we will use to present the different estimator designs. The symbols and the corresponding definitions are summarized in Table 4.1. If the input is an M event, a packet must have been received successfully. Therefore, we set t equal to the current time stamp. To calculate m, we take the sequence number of the successfully received packet in M and subtract from it the last sequence number heard plus one. In general, the number of missed packets accumulated since the last estimation, l, is max(m, k).
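As a concrete illustration of this bookkeeping, the following C sketch maintains t and the last sequence number heard, and derives m, k, and l as defined above. The structure and function names are our own illustrative choices, not those of the actual implementation; time is assumed to be in seconds and R in packets per second.

#include <stdint.h>

/* Per-neighbor bookkeeping shared by all estimators (illustrative sketch). */
typedef struct {
    uint8_t  last_seqno;  /* last link sequence number heard from this neighbor */
    uint32_t last_time;   /* t: time stamp of the last M event */
} link_state_t;

/* M event: a packet with sequence number seqno arrives at time now.
 * Returns l, the number of missed packets to report to the estimator. */
uint16_t on_message(link_state_t *s, uint8_t seqno, uint32_t now, uint16_t rate_R)
{
    if (seqno == s->last_seqno)
        return 0;                                        /* duplicate packet: nothing missed */
    uint16_t m = (uint8_t)(seqno - s->last_seqno) - 1;   /* m: gap inferred from the sequence number */
    uint16_t k = (uint16_t)(rate_R * (now - s->last_time)); /* k: losses implied by the minimum rate R */
    uint16_t l = (m > k) ? m : k;                        /* l = max(m, k) */
    s->last_seqno = seqno;
    s->last_time  = now;
    return l;
}

/* T event: the periodic timer fires at time now with no packet, so m = 0 and l = k. */
uint16_t on_timer(const link_state_t *s, uint32_t now, uint16_t rate_R)
{
    return (uint16_t)(rate_R * (now - s->last_time));
}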

  Symbol   Definition
  P̂        The current estimation.
  T        The periodic timer event.
  M        The last message arrival event.
  m        The number of currently known missed packets, based on the link sequence number, since the last estimation was done.
  t        The time stamp of the last M event.
  R        The minimum data rate (packets/s).
  k        The estimated number of missed packets based on R and the time elapsed since t.
  l        The number of missed packets accumulated since the last estimation was done.

Table 4.1: Terminology used for describing link estimator design.

Note that the process of maintaining k, t, and the last sequence number heard, and the calculation of l, applies to all estimators and is orthogonal to the actual estimator's algorithm. For example, if no messages have been received for the entire period T, then m = 0, k = R·T, and l = k.

4.3.2 Tuning Objectives

We tune each estimator design to satisfy two different objectives: stability and agility. For the stability objective, we aim to minimize the settling time while requiring the total error ε < 10%. For the agility objective, we aim to minimize the total error while requiring the crossing time (i.e., the time until P̂ reaches within ±10% of p) to be within 40 packet opportunities. The crossing time is chosen somewhat arbitrarily, since our concern is to reveal the different shortcomings among estimators when they are tuned to the same objective. However, 40 packet opportunities is slightly fewer than half of what we would expect for the most reactive stable estimators, using the central limit theorem with a binomial distribution assumption and a 10% interval with 95% confidence. With these tuned estimators,

we compare settling time, crossing time, mean square error, and coefficient of variance, as well as memory resources and computational requirements. Note that settling time is only meaningful when we consider the stability objective, because the agility objective would yield a much greater error, which would undermine the meaning of the settling time.

4.3.3 Candidate Estimator Design and Evaluation

EWMA (Exponentially Weighted Moving Average)

The exponentially weighted moving average (EWMA) estimator is very simple and memory efficient, requiring only constant storage of the last estimation for any kind of tuning. Since EWMA is so simple and widely used, we use it as the basis for comparison with the other estimator designs. EWMA computes a linear combination of infinite history, weighted exponentially. It has the property of being reactive to small shifts and is often used as a responsive change detector in many statistical process control applications [56].

The estimator works as follows. Let 0 < α < 1 be the tuning parameter. At any M or T event, repeat P̂ = P̂·α for l times. If it is an M event, then compute P̂ = P̂·α + (1 - α). The implementation of EWMA takes 4 bytes (floating point) or 1 byte (fixed point) to store P̂, and the amount of computation involved is 2 multiplications and 1 addition.

Figure 4.2(a) shows P̂(t) of the tuned, stable estimator, with α = 0.99. It reveals that to keep within 10% error, EWMA is already set very close to its maximum gain of 1. With such a large gain, agility is not to be expected. The crossing time for EWMA is 167 packets while the settling time is close to 180 packets. Figure 4.2(b) shows the agile version. It is probably not a useful estimator, since it has large overshoots and undershoots, which is expected because EWMA is sensitive to small shifts. Nevertheless, the agile version is good for detecting the disappearance of a neighboring node over a relatively short time. Note that even a small decrease of α from 0.99 has a large effect on agility and error, something we do not normally see in other contexts. Furthermore, in practice, representing α using fixed point to avoid heavyweight floating point operations may create extra complexity, since α needs to be quite precise.
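A minimal C sketch of this update rule, using floating point for clarity (a fixed-point version would represent both P̂ and α as scaled integers, which is where the precision concern above arises):

/* EWMA link estimator (sketch); alpha is the tuning parameter, 0 < alpha < 1. */
typedef struct {
    float p_hat;   /* current estimate of the reception probability */
    float alpha;
} ewma_t;

/* Called on an M or T event with the l losses inferred for this event;
 * packet_received is nonzero only for an M event. */
void ewma_update(ewma_t *e, unsigned losses_l, int packet_received)
{
    for (unsigned i = 0; i < losses_l; i++)
        e->p_hat *= e->alpha;                               /* one decay per loss */
    if (packet_received)
        e->p_hat = e->p_hat * e->alpha + (1.0f - e->alpha); /* credit the success */
}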

Flip-Flop EWMA (α_stable, α_agile)

A flip-flop between two EWMA estimators, one with a stable setting and one with an agile setting, is suggested in [50] to be a good estimator that provides both agility and stability. Such a design uses statistical process control to dynamically estimate an upper error bound, which is used as a switching policy between stability and agility. That is, if the spot value of the stable estimation is beyond the estimated upper bound limit, the estimator automatically switches to the agile setting; otherwise, the stable estimation is used.

To explore the effectiveness of such a flip-flop design in our sensor network context, we follow a similar flip-flop approach using the agile and stable tunings found in the EWMA study above. Since α_stable is tuned to have a 10% noise margin, a simpler switching policy is to switch whenever the difference between the outputs of the two EWMAs is greater than 10%. Note that such a switch can go in two directions, and we simulated both of them. One is to be agile by default: when the agile output deviates by more than 10% from the stable estimation, we fall back on the stable estimation. The resulting P̂(t) is shown in Figure 4.2(c). The other approach is to be stable by default, but switch to the agile estimation so that

sudden changes, such as mobility, can be detected much earlier. Similarly, 10% is used as the switching threshold, and the resulting P̂(t) is shown in Figure 4.2(d). These two graphs suggest that the flip-flop idea does not provide an advantage over the simple EWMA in our setting. This is because the fluctuations of the agile estimator are so bad that it only introduces instability and error. The study in [50] does not show the dynamics of either estimator separately over time, so it is difficult to isolate why it does so much better in that setting.

Moving Average (n)

The moving average estimator is another simple estimator that is widely used for packet loss rate estimation, including in IGRP routers. The algorithm works as follows. Let n be the tuning parameter specifying the maximum number of bits of a sliding history window, h. At any given event, append l zeros to the end of the window, and append a 1 to the end if it is an M event. The window h is logically left-shifted by the corresponding number of bits inserted. Then, P̂ = (Σ_{i=1}^{n} h(i)) / n. To avoid a large error in P̂ when there are only a few samples, the estimator gives no estimation (P̂ = 0) if the number of samples is below some threshold, φ. The implementation of such an algorithm takes n/8 bytes of storage, and the amount of computation involved in computing P̂ is n bit shifts, 1 addition, and log2(n) shifts rather than a full division. For ease of implementation, the tuning process takes n in multiples of 8.

Figure 4.2(e) illustrates P̂(t) of the moving average estimator tuned for the stability

objective with n = 144. This estimator achieves a settling and crossing time of about 120 packets, a much shorter time than EWMA tuned for the same error objective. However, with n = 144, or 18 bytes of storage per link estimation, it is expensive to keep track of link quality for a reasonable number of neighboring nodes. Figure 4.2(f) shows the agile case with n = 24. Tuned for the same agility objective, the moving average estimator appears to have less error and variance than the EWMA. Compared to Figure 4.2(f), Figure 4.2(b) shows that EWMA is more sensitive to small changes.

Time-Weighted Moving Average (TWMA) (n, w)

The moving average estimator applies the same weight to all packets within the sliding window. A common improvement is to apply a weighting function that places heavier weight on more recent samples, so that the estimation can be more adaptive to temporal changes. The basic algorithm works the same as the moving average except for the addition of a time-weighting function, w. Thus, the tuning parameters for this estimator are n and w. In our study, we stick to one weighting function, w, and only tune n. While w is not the perfect function, it serves the purpose of observing the effect of weighting.

The w that we choose is a sequence of coefficients that weight the elements of the sliding window differently. Let h be the sliding window with elements in {0, 1}, and let s be the number of elements currently in h. Then, w is a sequence of length s, and the weight that it applies to the most recent s/2 elements in h is 1. For the remaining s/2 samples, the weight decreases linearly from 1 down to 1/(s/2), with the smallest weight applied to the most stale element. Therefore, P̂ = (Σ_{i=1}^{s} w(i)·h(i)) / (Σ_{i=1}^{s} w(i)), where s ≤ n.
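The following C sketch computes this weighted estimate over the current window contents; with w(i) = 1 for every i it reduces to the plain moving average above. For readability, the window is passed as one byte per sample rather than the packed n/8-byte bit vector described in the text.

/* Time-weighted moving average over a window of 0/1 samples (sketch).
 * h[0] is the most stale sample and h[s-1] the most recent; s <= n. */
float twma_estimate(const unsigned char *h, int s)
{
    int half = s / 2;                 /* size of the full-weight, most recent portion */
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < s; i++) {
        float w = 1.0f;               /* most recent s/2 samples get weight 1 */
        if (half > 0 && i < s - half) {
            w = (float)(i + 1) / (float)half;   /* linear decay; 1/(s/2) at h[0] */
            if (w > 1.0f)
                w = 1.0f;
        }
        num += w * (float)h[i];
        den += w;
    }
    return (den > 0.0f) ? (num / den) : 0.0f;
}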

Figure 4.2: P̂(t) for the different estimators in both stable and agile configurations. (a) Stable EWMA (α = 0.99) and p(t). (b) Agile EWMA estimation. (c) Flip-Flop EWMA (α_stable = 0.99, α_agile), which uses the stable estimation if the agile estimation goes beyond 10% from the stable estimation. (d) Flip-Flop EWMA (α_stable = 0.99, α_agile), which uses the agile estimation if it goes beyond 10% from the stable estimation. (e) Stable moving average (n = 144) and p(t). (f) Agile moving average (n = 24).

The implementation of this estimator also takes n/8 bytes to store the sliding window h as bits. As for the amount of computation, since h(i) ∈ {0, 1}, the multiplications can be turned into additions. As a result, there are n additions and 1 division. This carries more complexity than the moving average. Since w is different for different s, a fixed-size lookup table can be used to store all w, given that n is fixed. Note that, for ease of implementation, the tuning process also takes n in multiples of 8.

Figure 4.3(a) illustrates P̂(t) of the tuned, stable TWMA estimator with n = 168. Figure 4.3(b) shows the agile version with n = 32. Visual comparison with the corresponding figures for the moving average shows that the two are very similar, and both have better settling times and less high-frequency fluctuation than EWMA. However, the effect of the weighting function is twofold. First, it increases the history required to achieve our stability objective from 144 to 168, making it less memory efficient than the moving average. Second, it is likely that w requires floating point operations, which we try to avoid. Nevertheless, as indicated in Table 4.2, w improves over the moving average by decreasing the average settling time from 122 to 113 packet times, while maintaining the same amount of error and variation. As for the agile case, n = 32 also requires more memory than the 24 of the moving average case. However, the resulting mean square error and coefficient of variance are smaller than those of the EWMA and the moving average, as shown in Table 4.3.

Flip-Flop Packet Loss and Success Interval with EWMA (FFPLSI) (α_success, α_loss, ff)

The packet loss interval is the number of consecutive successful packets between two successive packet loss events. That is, it measures the number of 1s between two

0s. The greater the interval, the better the reception probability. An estimation of the packet loss interval adapts slowly to bursts of packet successes, but reacts quickly to bursts of packet losses. The estimator works as follows. The tuning parameter is α_loss. Let I be the current loss interval average computed using an EWMA. Let i be the most recent count of consecutive successes when a packet loss is detected through either a T or an M event. The average I is computed as follows: for each i, I = I·α_loss + (1 - α_loss)·i. At any instant, P̂(t) = I/(I + 1); the 1 is added in the denominator to avoid any division by 0.

The packet success interval is the reverse of the packet loss interval. That is, it measures the number of 0s between two 1s. The estimation of this average corresponds to the average burst of errors. Therefore, the greater the interval, the worse the quality of the link. Unlike the packet loss interval, the packet success interval adapts slowly to bursts of packet losses, but it reacts quickly to bursts of packet successes. The computation is similar to that of the packet loss interval, with I being the current average packet success interval computed by an EWMA. Similarly, i is the most recent count of consecutive losses when a packet success is detected. The tuning parameter is α_success. For each i, I = I·α_success + (1 - α_success)·i. At any instant, P̂(t) = 1/(I + 1); the 1 is added in the denominator to avoid any division by 0.

The flip-flop mechanism can be used to capture the best of both worlds. For stability, the packet loss interval should be used when successes are frequent (e.g., P̂ ≥ 50%), while the packet success interval should be used when losses are frequent (e.g., P̂ < 50%). We call this configuration ff = STABLE. For agile estimations, it should be the reverse, and we call this ff = AGILE.
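A C sketch of the two interval averages and the corresponding estimates; which of the two P̂ values is reported at any moment is chosen by the ff switching rule just described. Field and function names are illustrative.

/* Flip-flop packet loss / success interval estimator (sketch). */
typedef struct {
    float loss_interval;     /* EWMA of runs of successes between losses */
    float success_interval;  /* EWMA of runs of losses between successes */
    float alpha_loss;
    float alpha_success;
} ffplsi_t;

/* A packet loss was detected after i consecutive successes. */
void ffplsi_on_loss(ffplsi_t *f, unsigned i)
{
    f->loss_interval = f->loss_interval * f->alpha_loss
                     + (1.0f - f->alpha_loss) * (float)i;
}

/* A packet success was detected after i consecutive losses. */
void ffplsi_on_success(ffplsi_t *f, unsigned i)
{
    f->success_interval = f->success_interval * f->alpha_success
                        + (1.0f - f->alpha_success) * (float)i;
}

/* Estimates from each interval; the 1 in the denominators avoids division by 0. */
float ffplsi_p_from_loss_interval(const ffplsi_t *f)
{
    return f->loss_interval / (f->loss_interval + 1.0f);
}

float ffplsi_p_from_success_interval(const ffplsi_t *f)
{
    return 1.0f / (f->success_interval + 1.0f);
}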

Since EWMA is used for averaging, the implementation of this estimator is very efficient. For each entry, it only takes 2 bytes to store the intervals and 2 bytes (fixed point) or 8 bytes (floating point) to store P̂. Like EWMA, parameter tuning does not affect the storage requirement.

Figure 4.3(c) shows P̂(t) for the tuned, stable (ff = STABLE) estimator, with α_success = 0.98 and α_loss = 0.98. This estimator is very stable and smooth around both extremes at 0 and 100%. However, the slow rising edges show that its settling and crossing times are much larger than those of the other estimators, even the EWMA. The agile case is shown in Figure 4.3(d), with α_success = 0.85 and α_loss = 0.85. This estimator is not a good agile candidate, since its fluctuations are large.

Window Mean with EWMA (WMEWMA) (t, α)

So far, all the estimators that we have discussed update the estimation on every M event. It is possible to perform low-pass filtering by taking an average over a time window and adjusting the estimation using the latest average. This average is actually an observation of P̂, and EWMA can be used for further filtering to yield a better estimation. The tuning parameters are the time window, t, and α for the EWMA. Let t be the time window, represented as the number of message opportunities between two T events, and let 0 < α < 1. The algorithm works as follows. P̂ is only updated at each T event. Let r be the number of received messages (i.e., the number of 1s from the M events) during this time interval. Thus, at the time of each T event, the mean µ = r/(r + l) is computed, and P̂ = P̂·α + (1 - α)·µ.
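A C sketch of this update; r and l are accumulated between timer events and the estimate is refreshed only when the T event fires. The relaxation of α to 0.5 discussed later in the chapter is what allows a shift-only fixed-point version of the same logic.

/* Window Mean with EWMA (WMEWMA) link estimator (sketch). */
typedef struct {
    unsigned r;      /* packets received in the current window */
    unsigned l;      /* packets inferred lost in the current window */
    float    p_hat;  /* current link quality estimate */
    float    alpha;  /* EWMA gain, 0 < alpha < 1 */
} wmewma_t;

/* M event: one packet received, preceded by 'losses' inferred losses. */
void wmewma_on_message(wmewma_t *w, unsigned losses)
{
    w->l += losses;
    w->r += 1;
}

/* T event: fold the window mean into the EWMA and start a new window.
 * silent_losses are the losses inferred from the minimum rate R for any
 * portion of the window in which nothing was heard. */
void wmewma_on_timer(wmewma_t *w, unsigned silent_losses)
{
    w->l += silent_losses;
    if (w->r + w->l > 0) {
        float mu = (float)w->r / (float)(w->r + w->l);       /* window mean */
        w->p_hat = w->p_hat * w->alpha + (1.0f - w->alpha) * mu;
    }
    w->r = 0;
    w->l = 0;
}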

For each entry in this estimator, it takes 2 bytes to store r and l, and 1 byte (fixed point) or 4 bytes (floating point) to store P̂. The amount of computation involves 2 additions, 1 division, and 2 multiplications. The computation is done per T event rather than per M event. Similar to EWMA, this estimator's storage requirement is independent of parameter tuning.

Figure 4.3(e) shows P̂(t) of the tuned, stable estimator with t = 30 message times and α = 0.6. The observed settling time and crossing time are relatively small, and the high-frequency components in the estimation are clearly removed as compared to Figure 4.2(a) in the EWMA case. In fact, the settling time of this estimator is comparable to the fastest time-weighted moving average, as shown in Table 4.2. Figure 4.3(f) shows the agile version with t = 10 message times and α = 0.3. Although the windowing has a low-pass filtering effect, using a small t actually creates variations that the EWMA is sensitive to. As a result, the performance in the agile scenario does not show a significant improvement over the EWMA.

4.4 Candidate Estimator Comparisons

With the controlled study of our candidate estimators in hand, we return to the question of which estimator is best relative to the metrics we consider, for both tuning objectives.

Figure 4.3: P̂(t) for the different estimators in both stable and agile configurations. (a) Stable TWMA estimation (n = 168) and p(t). (b) Agile TWMA estimation (n = 32). (c) Stable FFPLSI estimation (α_success = 0.98, α_loss = 0.98, ff = STABLE) and p(t). (d) Agile FFPLSI estimation (α_success = 0.85, α_loss = 0.85, ff = AGILE). (e) Stable WMEWMA (t = 30, α = 0.6) and p(t). (f) Agile WMEWMA (t = 10, α = 0.3).

4.4.1 Stable Estimators

Table 4.2 summarizes the results of the stable estimators. We first look at the sum of errors. The value should approach 0 if the estimator is unbiased. For all the estimators, the sum of errors is small, showing that they are not biased. The mean square error penalizes estimators that have large overshoots or undershoots. FFPLSI is very stable at the extremes. As a result, it achieves the smallest mean square error, as expected, though the other estimators come close to it. The coefficient of variance measures the effectiveness of the estimator in staying near the true value. FFPLSI has the largest coefficient of variance while EWMA has the best. Again, the values for the other estimators are relatively close. The major determining factor is the settling time. Moving Average, TWMA, and WMEWMA all have much smaller settling times than the rest of the estimators. It is desirable to have the most agile estimator that can still stay within 10% of the true value, even if it does not have the best mean square error and coefficient of variance. One would hope that the crossing times for EWMA and FFPLSI would be much smaller than the actual settling times. However, from the figures of P̂(t) and Table 4.2, it is clear that the crossing times are only slightly smaller than the settling times. Another important constraint is storage space. Since Moving Average and TWMA do not have constant storage requirements, WMEWMA seems to be the best choice, given that it is well balanced in all dimensions.

4.4.2 Agile Estimators

Table 4.3 summarizes the performance of the agile estimators. For the sum of errors, FFPLSI is five times larger than its stable counterpart. EWMA also increases three times.

This suggests that these two estimators can be biased in an agile configuration. Our agility settings decrease the settling time by 5 to 10 times relative to the stability settings. However, the mean square error and coefficient of variance increase by roughly the same factor. It appears that the agile estimator is only useful for discovering a significant change in link reliability quickly, such as the disappearance of a node.

Table 4.2: Simulation results of all estimators in stability settings. For each estimator, the table reports the settling time (packets), crossing time (packets), mean square error (%²), coefficient of variance, sum of errors, storage per entry (bytes), and computation.
  EWMA (α = 0.99): sum of errors 0.10%; 1 byte per entry; 2 mul, 1 add.
  Moving Average (n = 144): sum of errors 0.26%; n/8 bytes per entry; 1 add, n + log(n) shifts.
  TWMA (n = 168): sum of errors 0.23%; n/8 bytes per entry; 1 div, n add.
  FFPLSI (α_success = 0.98, α_loss = 0.98): sum of errors -0.18%; 2 bytes per entry; 4 mul, 2 add.
  WMEWMA (t = 30, α = 0.6): sum of errors 0.16%; 3 bytes per entry; 2 mul, 1 div, 2 add.

4.4.3 Performance Based on Empirical Traces

Since our study suggests that WMEWMA is a good estimator, we now focus on its performance based on input from empirical traces. Figure 4.4 shows how the WMEWMA estimator, tuned for the stability objective, performs on the empirical trace input that shaped the trace generator. Our final choice of estimator tracks the empirical trace well. The

degree of overshoot and undershoot is higher than in simulation. This is expected, since the real traces W(t) have larger variances than Ŵ(t). As a result, estimators tuned using Ŵ(t) should be tuned for more stability when applied in real situations.

Table 4.3: Simulation results of all estimators in agility settings. For each estimator, the table reports the settling time (packets), mean square error (%²), coefficient of variance, and sum of errors.
  EWMA: settling time 21 packets; sum of errors 0.28%; 2 mul, 1 add.
  Moving Average (n = 24): settling time 23 packets; sum of errors 0.24%; 1 add, n + log(n) shifts.
  TWMA (n = 32): settling time 25 packets; sum of errors 0.22%; 1 div, n add.
  FFPLSI (α_success = 0.85, α_loss = 0.85): settling time 23 packets; sum of errors -0.98%; 4 mul, 2 add.
  WMEWMA (t = 10, α = 0.3): settling time 21 packets; sum of errors 0.18%; 2 mul, 1 div, 2 add.

4.4.4 Confidence Interval Estimation with WMEWMA

We can improve our link estimator by using the normal approximation to derive confidence intervals for the estimated mean. We use the WMEWMA estimator and study how the confidence interval changes across different link qualities, P. We relax α from 0.6 to 0.5 to capture the likelihood of using bit shifting rather than divisions in real implementations. Figure 4.5 shows the 95% confidence interval approximation when P changes from 90% to 50%. According to the normal approximation equation, the confidence interval lies between

[P̂ - z·σ_n/√n, P̂ + z·σ_n/√n]. Since n is fixed by the estimator tuning and z can be looked up from the normal distribution table, the variance σ_n affects the confidence interval calculation the most. For a binomial distribution, σ_n depends on P. Figure 4.5 shows that the 95% confidence interval can vary from 6% to 11% as P changes. Since the expected 10% noise margin agrees with the range of the estimated confidence interval, we believe that an on-line approximation of the confidence interval can be omitted unless higher-level protocols require an accurate estimate of it.

Figure 4.4: Output from the stable WMEWMA estimator using empirical data input.

Figure 4.5: Confidence interval estimation with respect to the WMEWMA(30, 0.5) estimator for different link qualities.

4.5 Alternative Estimation Techniques

The resource constraints on our platform significantly limit the amount of processing and storage one can do, which narrows the choice of estimators. Computing a statistically meaningful median already raises a storage concern, given that there

are potentially many neighboring nodes one needs to estimate. Despite the rich literature on estimation techniques, such as linear regression, the Kalman filter, or hidden Markov models, their use is not practical for such a low-level mechanism; some of them may even require a detailed model of the channel, which is difficult to obtain for all kinds of environments. There may exist other estimation techniques at the packet level that are more effective than the ones we have explored. Without a channel model, they would be non-Bayesian estimators, and their performance, as approximated by the central limit theorem, would be very close to what we have already achieved with our candidate estimators.

The new IEEE 802.15.4 radios, such as the Chipcon CC2420 [2], provide hardware link quality indicator support on a per-packet basis at the physical layer. The units of measurement are not normalized to a reception rate probability, but such support can be very useful to augment

our link quality estimation at the packet level. The hardware is new, and future work is required to utilize this capability. For the studies in the remaining chapters, we use the WMEWMA estimator and explore its effectiveness with respect to the routing layer.

4.6 Related Work

Passive probing to estimate link reliability for wired and Wi-Fi types of wireless networks is well established. In wired networks, it is widely deployed over the Internet in protocols such as the Interior Gateway Routing Protocol (IGRP) [35] and the Enhanced IGRP (EIGRP) [10]. Reliability is measured as the percentage of packets that arrive undamaged on a link. It is reported by the network interface hardware or firmware, and is calculated as a moving average. In IGRP, the link reliability of a route equals the minimum link reliability along the path. This is one example illustrating how link estimation is used in the context of routing in the Internet.

In 802.11 wireless networks such as [1], a wide range of link estimation techniques have been proposed and implemented. It is necessary to perform link-level estimation because 802.11 only characterizes links at the frame level and uses this together with the signal-to-noise ratio to determine the appropriate bandwidth setting for links to communicate reliably. Such information is not necessarily exposed by the firmware and also depends on whether the operation mode is infrastructure or ad hoc. Received signal strength has been used to infer link quality in systematic studies of link characteristics [28]. However, [24] shows that using the received signal strength to infer link quality in 802.11 networks is not accurate, and they propose a packet-

level window moving average estimator based on periodic broadcasts of link probes. The decoupling of link estimation from the routing protocol prevents [24] from piggybacking information over routing packets; thus, such an estimation technique is no longer passive. Another approach is based on burst-wise C/I estimates [7], in which a carrier signal estimate is calculated by convolving a training sequence with the channel impulse response; however, this requires support from the radio hardware and a correct impulse response of the channel. Other methods that utilize the acknowledgment history of the most recently transmitted packets have also been proposed [16, 55]; however, these mechanisms require explicit packet transmissions to each neighbor.

There is also a rich literature on network performance estimation, especially in the context of multicast and overlay topology management. Most of these efforts focus on active probing by injecting measurement traffic into the system, since direct measurement is not built in as it is for link interfaces. For example, special multicast probes are used to estimate the internal multicast network packet loss rate and infer the overall topology [63]. To minimize the power consumed by link estimation, we focus on passive techniques and avoid sending probe packets by piggybacking over route update packets.

Most prior work using passive estimation seeks to estimate a value from a large set, where each observation is itself a direct measurement instead of an event. For example, the round-trip time estimator in TCP [76] can adjust its estimation based on each round-trip time measurement. In contrast, we must estimate the probability of reception from each discrete boolean event: the arrival of a packet or the silent failure to receive a packet. Thus, estimators that have proved effective in other regimes may not be effective here. For

example, a highly relevant work [50] studies the behavior of a collection of estimators in a context where both responsiveness and stability are desired in the face of sudden changes. By measuring round-trip delay, they calculate the latest available bandwidth and filter it with an estimator. They assert that flip-flopping, based on statistical process control, between two EWMAs with agile and stable gain settings provides the best estimator. Since each measurement reveals the latest estimate of available bandwidth, it can determine whether the latest bandwidth falls within a certain prediction. If the agile estimation deviates so much that the process is likely to go out of control, the flip-flop drops the agile estimation and relies on the stable prediction. We do not have the same ability on a sample-by-sample basis, and, as observed in our estimator studies, such a flip-flop scheme does not yield significant benefits in our regime.

4.7 Summary and Multihop Routing Implications

The goal of this chapter has been to explore the design space and select a simple, passive link estimator that performs well according to our metrics and can be efficiently implemented on our resource-constrained platform. Such resource limitations allow us to filter out many complex or inefficient estimation techniques and help us focus on a few efficient estimators. Without an accurate a priori channel model, these estimators are non-Bayesian. Through a systematic study of six different estimators, we found that EWMA over an average time window (WMEWMA) performs best overall. In our study, it provides stable estimations within a 10% error, with a corresponding reaction time to large link quality changes of about 100 packets, which agrees with the lower bound approximated by

the central limit theorem. Furthermore, the storage requirement is constant for all tunings, making it an attractive estimator for our resource-constrained platform. Our study has narrowed the estimator design space down to this one estimator and suggests a reasonable parameter setting that yields the above performance.

Real-time estimation of link reliability is vital in any self-organizing network, wired or wireless, since routing paths should never be constructed over poor links. Understanding how the performance of link estimation may affect higher-level algorithms is important for our holistic design. Our results suggest that one can either build an agile estimator with large errors or a stable estimator with a settling time of about 100 packets. For routing purposes, the stable estimator is a clear choice. Even so, the estimate will only be accurate to within about 10%. In choosing a parent for routing, fluctuations of up to about 10% should be tolerated before switching to a better alternative. For routing algorithms that use cost metrics composed of link estimations as aggregated routing costs, care must be exercised to avoid cycles due to variations and errors in the estimates. The stability of the estimator can affect the global stability of the topology, especially if the routing cost metric is built upon the estimations. Although all the link estimator designs that we consider are passive, they all rely on an implicit minimum data rate set by the application. Such a minimum data rate is often realized as beacons (such as route updates) and will affect both bandwidth utilization and the rate of topology adaptation. All in all, while link estimation is purely an underlying mechanism, such a minimum data rate is a policy that higher-level protocols should consider.

Chapter 5

Neighborhood Management under Limited Memory

We have laid out the groundwork of using passive link estimators to characterize link quality and assign a weight to each edge in building a connectivity graph for routing. The next step in our holistic approach is for each node to build up its local neighborhood using a fixed-size neighbor table, which is often small due to the memory constraints of the platform; such a logical neighborhood defines the local connectivity options of a node. The sum of all the local neighborhood information from the entire network thus forms a distributed logical connectivity graph for routing.

The usual concept of defining a neighbor is based on boolean connectivity. With the probabilistic view of connectivity, as defined relative to link estimation, neighborhood becomes a fuzzy concept. Therefore, in this chapter, we revisit the basic concept of neighborhood management under this probabilistic approach. The challenge in such a process is to achieve network scalability while using only limited

resources on each node; in typical deployments, there would be more potential neighbors than a node, with its limited memory, can keep track of. Thus, each node must identify a subset of neighbors with reliable connectivity. However, we cannot rely exclusively on link estimation to determine whether a node should be tracked as a neighbor, since estimation itself requires memory. How should we determine whether a potential neighbor might be a good neighbor to keep in the neighbor table? We describe a framework for such a local process, borrowing techniques from cache design policies and from data-stream estimation techniques in the database literature to solve the problem. A thorough evaluation of the different techniques is presented, and the best is selected for the routing study in the remaining chapters. We survey the related work, with most of the prior work found in the packet radio literature. We then discuss the overall implications of neighborhood management for routing.

5.1 Dense and Fuzzy Neighborhoods

Chapter 3 shows that the connectivity cell of a node is irregular and that the communication range consists of three distinct regions. To ensure reliable links, nodes are typically spaced within the effective region of the communication range. Since the transitional region is much larger and has a mix of many good and bad links, this is likely to create a relatively dense network. In addition, there are many potential neighbors with unreliable links that are not suitable for routing. For example, Figure 5.1 illustrates the potential ratio of the number of nodes in the effective region (darker region) to that in the transitional region (lighter region). That is, in a typical deployment, the number of nodes in the effective region is small compared with that in the transitional region. Furthermore, not all the nodes in

the transitional region have bad links. The darker circles in the shaded region represent nodes that have good links suitable for routing. Simply hearing a message cannot determine whether a node is in the effective region, or whether a node in the transitional region has good or bad links.

One simple approach is to use a link quality threshold and only consider nodes above the threshold as neighbors. While this technique is simple, it is difficult to determine one appropriate value for all deployments. For example, a sparse network would require a very different value from a dense network. If the node layout is not uniform, it is difficult to expect a single threshold to apply to the entire network. In addition, interference from other traffic and environmental effects can lead to link quality fluctuations around the threshold. The result is links coming and going over time, which may lead to network partitions. In the next section, we discuss further how memory constraints make this thresholding approach impractical.

5.2 Challenges of Neighborhood Discovery under Limited Memory

The typical approach to neighbor discovery is to record information about all nodes from which packets are received (potential neighbors), either as a result of passive traffic monitoring or of active probing through beacons. Link quality can then be estimated and used for neighbor discovery. This implies that memory resources must potentially be allocated for each potential neighbor. Even though the link estimator that we have chosen is simple and memory efficient, placing too many memory resources in such a low-level operation is not efficient from the overall system point of view. For example, if an entry of a neighbor

table requires 10 bytes for maintaining and estimating the quality of each neighboring node, a node that hears 50 nodes would require a 500-byte neighbor table, which is 1/8 of the total memory available to the entire node on the Mica or Mica2 platform.

Figure 5.1: Illustration of the potential neighbors of a center node in a dense network. The darker shaded region shows the effective region while the lighter region shows the transitional region. The cross indicates the center node.

Furthermore, in a dense network, not only does a node receive packets from more potential neighbors than it can represent in its neighbor table, but most of these potential neighbors have unreliable links that are not suitable for routing to begin with. As a result, how does a node determine, over time, in which nodes it should invest its limited neighbor table resources to maintain link statistics? The problem is that if a node is not in the table and the table is full, there is no place to record the link statistics of that node. As a result, the receiver cannot determine the link quality of this node and decide whether to invest precious memory

resources in it or in the current set of neighbors in the table. Therefore, the processes of link estimation and neighborhood management are mutually dependent. In fact, neighborhood management must itself infer link reliability without the use of link estimation. Controlling the transmission power to adjust cell density does not solve the problem either, since that parameter is often application or deployment specific. For example, [64] adjusts the transmit power to control the topology and minimize the energy required to transport data. All in all, it is fundamental that sensor network applications can only maintain statistics about a subset of the potential neighbors. That is, we need an on-line neighborhood selection process that keeps track of a set of good neighbors with a limited-size table, regardless of cell density. The selection criteria for neighbors depend heavily on the nature of the higher-level protocol or application. For routing, there are many ways to define good neighbors, but we first focus on the basic problem of finding the reliable ones without the need for link estimation.

5.3 An On-line Neighborhood Selection Process

The neighborhood management process essentially has three components: insertion, eviction, and reinforcement. For each incoming packet upon which neighbor analysis is performed, the source is considered either for insertion, if it does not reside in the table, or for reinforcement, if it does. If the source is not present and the table is full, the node must decide whether to evict another node from the table. We seek to develop a neighborhood management algorithm that will keep a sufficient number of good (reliable) neighbors in the table regardless of cell density. Ultimately, the goodness criterion should reflect which nodes are most useful for routing. For example,

we would want to discard nodes with low-quality links, as they are poor routing neighbors. With a link quality distribution as indicated by Figure 3.1(a), a node in a field of sensors will hear from many more weakly connected, distant nodes than from well-connected ones. However, a node should hear from the well-connected nodes more frequently, since a smaller fraction of their packets is lost, assuming every node has a roughly uniform transmit rate. Therefore, we rely on reception frequency (or rate) to infer the likelihood of a link being reliable. Since frequency can be a traffic-dependent measure, a fairer comparison of link quality is obtained by using only periodic messages, such as beacons. The management algorithm should prevent the table from being polluted by many low-utility neighbors, but at the same time allow new, valuable neighbors to enter.

In this section, we describe such an on-line process and relate how it can be approached with traditional cache design techniques. We also take an alternative approach by borrowing techniques from the database community. We focus on passive neighborhood discovery, where nodes snoop on periodic data messages. Insertions are always performed if the table is not full, while evictions are performed only if the table is full. An adaptive down-sampling insertion policy, which governs the rate of insertion into the table when it is full, is used to avoid the table being polluted by unreliable neighbors. Within the table, each entry contains the relevant data for link estimation, table management data, and all relevant information related to routing, as defined by the routing layer. When a node is evicted from the table, its link estimation along with the routing information is lost.
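To make the division of responsibilities concrete, the following C sketch shows one possible layout of a fixed-size neighbor table entry together with the three operations the management policy must provide. The field names and sizes are illustrative assumptions, not the layout of the actual TinyOS implementation.

#include <stdint.h>

#define NEIGHBOR_TABLE_SIZE 16     /* fixed at compile time by the memory budget */
#define INVALID_ADDR        0xFFFF

typedef struct {
    uint16_t addr;        /* neighbor address; INVALID_ADDR marks a free slot */
    uint8_t  freq_count;  /* reinforcement counter (used by the FREQUENCY policy) */
    uint8_t  flags;       /* table management state, e.g. a CLOCK reference bit */
    uint8_t  last_seqno;  /* link estimation state for this neighbor ...         */
    uint8_t  in_quality;  /* ... in-bound estimate, e.g. the WMEWMA output       */
    uint8_t  out_quality; /* out-bound estimate learned from the neighbor        */
    uint8_t  route_cost;  /* routing-layer information attached to the entry     */
} neighbor_entry_t;

static neighbor_entry_t table[NEIGHBOR_TABLE_SIZE];

/* Returns the resident entry for addr, or 0 if the node is not in the table,
 * in which case the insertion policy decides whether it may claim a slot
 * (evicting a victim if the table is full). Reinforcement updates a resident
 * entry; eviction discards its link estimation and routing state. */
neighbor_entry_t *neighbor_find(uint16_t addr)
{
    for (int i = 0; i < NEIGHBOR_TABLE_SIZE; i++)
        if (table[i].addr == addr)
            return &table[i];
    return (neighbor_entry_t *)0;
}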

5.3.1 Adaptive Down-sampling Insertion Policy

Upon hearing from a non-resident node, we must determine whether to insert it when the table is full. No historical information can be used, since there is no table entry allocated. In some cases, geographic information or the signal strength associated with the packet can guide the selection process. However, geographic data is often absent and does not account for obstructions, while signal strength can be highly variable. Therefore, we look for a simple statistical method.

The insertion policy should avoid over-running the neighbor table with a high rate of insertions, in order to establish a stable set of neighbors. To avoid over-running the table, the insertion rate must be much lower than the rate of reinforcement, so that nodes in the table can stay long enough to get reinforced before being evicted by new insertions. That is, the eviction rate due to new insertions cannot be greater than the rate of reinforcement. For the periodic traffic commonly found in sensor networks, the maximum insertion/eviction rate can be estimated as the data rate multiplied by the number of neighbors with physical connectivity; the reinforcement rate is just the data rate. Therefore, controlling the insertion rate is critical, and one simple technique is to rely on probabilistic down-sampling: when a node not in the table is encountered, we only consider it for insertion with some probability, p. Neighbors already in the table are reinforced as normal. This is similar to the sticky policy described in [32].

The down-sampling rate, p, in effect controls the insertion rate and needs to adapt to different cell densities. One simple approach is to set p to the ratio of the neighbor table size, T, and the number of distinct potential neighbors, N. This ratio, in the worst case, evicts all T entries if every beacon message from all potential neighbors is received.

Downsampling(N, T, n)
  Input: the number of neighboring nodes N, the neighbor table T, and the node n being considered.
  Output: none.
  if n is in T
      call Reinforce(n, T)
  else if rand(0, 1) ≤ min(T/N, 1)
      call Insert(n, T)

Figure 5.2: Down-sampling process.

Since not all potential neighbors have good links, nodes in the table have a chance to get established before being evicted. Our assumption of periodic messages can be relaxed, because insertion is simply a mechanism that allows nodes to establish themselves as neighbors in forming the logical connectivity graph; if a node is heard frequently, it is likely to be a good neighbor. This down-sampling process is summarized in Figure 5.2. We will investigate the effect of changing this ratio later in the chapter. To estimate N, there is prior work in the database literature on estimating the number of distinct values over a continuous stream [31]. However, in our case, when periodic beacons are present, we simply count the average number of beacons received over a period to determine N.

5.3.2 Cache-Based Eviction and Reinforcement

For on-line eviction and reinforcement, a simple approach is to borrow techniques from traditional cache policies, since they also aim to maintain the most frequent data or instructions in a limited table. We consider FIFO, Least-Recently Heard (LRH), and the

CLOCK algorithm approximation to LRU. For FIFO, eviction is based on order of entry, with the entry that has resided in the table longest being the candidate for eviction; no reinforcement is performed. For LRH, a resident entry is made most-recently-heard upon a message reception from the node, thereby reinforcing it in the table. The entry that has not been heard from for the longest time will be removed upon eviction. For the CLOCK algorithm, reinforcement sets the reference bit to 1. On eviction, the table is scanned, clearing reference bits, until an unreferenced entry is found.

5.3.3 Frequency-Based Eviction and Reinforcement

A similar problem, that of using limited memory to find the most frequently occurring tokens in a data stream, appears in the database literature. One effective policy is the FREQUENCY algorithm [25]. The algorithm is shown in Figure 5.3 and works as follows. It keeps a frequency count for each entry in the table. A node is reinforced by incrementing its count. A new node is inserted into the table if there is an entry with a count of zero to be replaced; otherwise, the count of every entry is decremented by one, the insertion fails, and the new candidate is dropped. The most frequent entries are retained in the table. In contrast with all the cache policies in our study, considering a node for insertion does not always lead to an eviction. This is an important difference, since it affects how well the algorithm can maintain its entries, as we see in the next section.

Insert(n, T)
  Input: the node n to be inserted and the neighbor table T.
  Output: SUCCESS or FAIL.
  if there is an entry e in T with e.counter = 0
      use e to store n in table T
      return SUCCESS
  else
      foreach entry e in T
          e.counter = e.counter - 1
      return FAIL

Reinforce(n, T)
  Input: the node n and the table T.
  Output: SUCCESS or FAIL.
  if n is stored in an entry e of T
      e.counter = e.counter + 1
      return SUCCESS
  else
      return FAIL

Figure 5.3: Insertion and reinforcement in the FREQUENCY algorithm.

5.4 Evaluation Methodology

We explore the effectiveness of the different eviction and reinforcement policies described above by evaluating them in simulation. The simulation setup works as follows. We use the probabilistic link model for connectivity derived from Figure 3.1. We simulated a large, dense network of 6400 nodes placed uniformly as a grid. Using an 80x80 grid with 4-foot spacing, so that the effective region covers nodes within 3 grid points in either direction, we consider the neighborhood of a typical node near the center of a dense network. Such a node in this simulated scenario has 207 potential neighbors, i.e., nodes from which it hears at least one packet; each node transmitted 100 packets in the simulation. Figure 5.4

shows the cumulative distribution function of the link quality of all of this node's potential neighbors. About 30% of the nodes have link quality greater than 75%, while about 40% of the nodes have link quality less than 25%. As expected, many potential neighbors have unreliable links and only a small fraction of the links have very good reliability. Repeating the study for different grid spacings shows that this ratio remains roughly constant (25% to 30%) as the number of potential neighbors ranges from 20 to 200. This is expected for a grid layout, but it is also true for any uniformly random layout. For this study, we define a good neighbor to be a potential neighbor with link quality greater than 75%.

Recall that the goal of a neighbor management policy is to retain as many good neighbors in the table as possible, regardless of cell density. To evaluate the different policies, we measure yield, i.e., the fraction of good neighbors that are found in the table more than 75% of the time. This yield metric captures two notions about neighbors in general. In a sparse network, most of the neighbors would stay in the table, but only a few have good links. In a dense network, many potential neighbors would have good links, but they may not stay in the table for long, because the management policy may not be able to maintain a stable set of neighbors. Yield captures these two scenarios very well.
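As a sketch of how this metric could be computed from a simulation, assuming the simulator records, for each good neighbor, the fraction of the run during which it was resident in the table:

/* Yield: the fraction of good neighbors (link quality > 0.75) that were
 * resident in the neighbor table more than 75% of the time (sketch). */
double compute_yield(const double *residency_fraction, int num_good_neighbors)
{
    if (num_good_neighbors == 0)
        return 0.0;
    int retained = 0;
    for (int i = 0; i < num_good_neighbors; i++)
        if (residency_fraction[i] > 0.75)
            retained++;
    return (double)retained / (double)num_good_neighbors;
}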

Figure 5.4: Cumulative distribution function showing the link quality distribution of the 207 neighbors of a center node in an 80x80 grid network with 4 feet spacing, using our empirical link model.

Effect of Adaptive Down-Sampling

Figure 5.5 shows a contour plot of the yield of FREQUENCY at different cell densities and table sizes, with the down-sampling insertion policy disabled. In contrast, Figure 5.6 shows the same case but with the down-sampling rate, p, set to 50%. The difference is dramatic: with down sampling, the contour lines are pushed much farther toward the lower right corner. This indicates that a much smaller table can be used to maintain all the good neighbors. Without down sampling, the table is polluted by many of the unreliable neighbors, and a larger table is therefore required to maintain the good ones. This demonstrates the importance of the down-sampling insertion policy.
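The down-sampling insertion policy itself is simple to realize: a newly heard node is only passed to the insertion routine with probability p, and p is adapted toward the ratio of table size to the number of distinct neighbors heard, as discussed below. The following C sketch is illustrative only; the random-number source and the way the neighbor count is tracked are assumptions, not our exact implementation.

```c
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE 32

static uint16_t neighbors_heard = 0;  /* distinct sources heard so far (assumed tracking) */

/* Adaptive down-sampling rate: roughly TableSize / NumberNeighbors,
 * capped at 1.0 when the table can hold every potential neighbor. */
static double downsample_rate(void) {
    if (neighbors_heard <= TABLE_SIZE)
        return 1.0;
    return (double)TABLE_SIZE / (double)neighbors_heard;
}

/* Gate in front of the insertion policy: returns 1 if the newly heard
 * node should be passed to Insert(), 0 if it should be dropped. */
int consider_for_insertion(void) {
    double p = downsample_rate();
    return ((double)rand() / (double)RAND_MAX) < p;
}
```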

Figure 5.5: Contour plot of the yield of the FREQUENCY algorithm for different cell densities and table sizes, with no down sampling for insertion.

The system is adaptive as long as p adapts to the changing N. We experimented with different cell densities and found that as long as p is greater than TableSize/NumberNeighbors, or T/N, the results are very similar. This is because the insertion rate is already low enough to avoid overrunning the table. Thus, our adaptive scheme follows this ratio to adjust the down-sampling rate for different cell densities.

Eviction and Reinforcement Policy

We evaluate the different policies by setting the table size to be constant and measuring the yield as node density increases. Figure 5.7 shows how the different policies perform at different densities with a table size of 40 entries. We can analyze Figure 5.7 by breaking it into three regions. First, as expected, all policies perform well when the table can

Figure 5.6: Contour plot of the yield of the FREQUENCY algorithm for different cell densities and table sizes, with a down-sampling rate of 50% for insertion.

hold all the potential neighbors. Second, when the number of potential neighbors exceeds the table size, but the number of good neighbors is less than the table size, all policies still maintain most of the good neighbors. However, when the number of good neighbors exceeds the table size, i.e., when the number of potential neighbors is three times the table size, the cache-based policies are unable to hold onto a subset of good neighbors, even though good neighbors are plentiful. In contrast, FREQUENCY retains at least 20 neighbors, a 50% yield, even at high densities. We continue our evaluation by changing two variables independently: cell density and table size relative to cell density. We vary the cell density from 20 to 220 potential neighbors and the table size from 150% of the number of potential neighbors to 50%. The results are shown in Figure 5.8. In each figure, the x-axis shows the number of good

Figure 5.7: Number of good neighbors maintainable at different densities with a table size of 40 entries.

neighbors in the cell as the number of potential neighbors increases from 20 to 220. The y-axis shows the yield. The series of sub-figures shows the yields of the different management policies as the table size decreases from 1.5 times the number of potential neighbors to only half of it. Note that the uniform layout keeps the number of good neighbors at about 25% to 30% of the number of potential neighbors, even as the number of potential neighbors increases. The FREQUENCY policy performs much better than all other policies across all the scenarios. In Figure 5.8(a), where the table size is greater than the number of good neighbors, all policies perform very well. As the table size decreases, all of the cache policies start to degrade significantly. When the table size is only half of the number of good neighbors, the cache policies can no longer maintain any good neighbors, as indicated in Figure 5.8(d). In contrast, the FREQUENCY policy maintains a yield of about 30% of the

good neighbors, even though its size can only fit 50% of them, which is 60% efficiency. In fact, across all the table sizes, the average efficiency of FREQUENCY is 70%, which is very effective. Furthermore, the yield of the FREQUENCY policy experiences smaller fluctuations across different densities compared to the other cache policies. We conclude that FREQUENCY is very effective in maintaining a subset of good neighbors in a fixed-size table, even for densities much greater than the table size. For example, with a table of 32 entries, this policy yields at least 10 good neighbors at all measured densities. One of the reasons why the cache policies under-perform at high densities is that they are designed to evict an entry for each insertion, while FREQUENCY drops the insertion when no entries are replaceable. We believe this is the main reason why FREQUENCY can maintain a stable set of good neighbors even at very high density.

5.6 Other Goodness Metrics

The frequency count goodness metric is the most basic way to infer reliability for neighborhood management. However, there are many other ways such a metric can be augmented. For example, the neighborhood management policy can take into account routing cost, geographical location, energy/lifetime of a neighbor, time scheduling issues, or aggregation opportunity. A wide variety of metrics can be defined, so the design space is large. In this study, we take the basic goodness criteria and focus mostly on the frequency metric. However, in Chapter 7, we augment the route table management policy further by taking routing cost into account as the goodness metric. The idea is to avoid maintaining sibling nodes (nodes with roughly the same routing cost) in the table since they are likely

Figure 5.8: Yield for different table sizes and cell densities. Each sub-figure plots yield (the percentage of good neighbors that stay in the table at least 75% of the time) against the number of good neighbors |S|, for the FIFO, LRH, CLOCK, and FREQUENCY policies: (a) table size equals 1.5 times the number of good neighbors; (b) table size equals the number of good neighbors; (c) table size equals 75% of the number of good neighbors; (d) table size equals 50% of the number of good neighbors.

124 5.7. RELATED WORK 108 to find their own routing paths. Therefore, it is more advantageous to free up the entries for potential parents or children. One simple technique is to apply a threshold on the routing cost difference between neighboring nodes. 5.7 Related Work Similar problems were studied in the packet radio literature, as they also needed to manage a large set of potential neighbors with limited on-node memory. One such prior work randomly selects neighboring nodes with physical connectivity into the routing table and limits a node with a maximum allowed degree [70]. This approach is similar to creating random graphs, which establish links between potential neighbors with a probability p. When a potential neighbor is heard, a coin is flipped with probability p. If successful, a handshake occurs between the two nodes to define this neighborhood relationship. It relies on random graph percolation theories, which proved that random selection can create a globally connected graph given that the value of p is high. Once a node has reached its maximal degree, the process of neighbor selection ends. However, to allow space to accommodate new nodes when links die, the protocol always keeps the node degree to a steady state below the maximum degree. If the protocol detects that the routing graph is not connected, links will be randomly evicted to re-establish the neighborhood relationships; such a process repeats if partition at the routing graph persists. The above mechanism shows a simple distributed scheme to build logical connectivity graphs. However, it treats every potential neighbor equally; it does not address the issue of lossy connectivity and the necessity to select neighbors with good links. Our ap-

125 5.7. RELATED WORK 109 proach integrates this process by having a biased sampling reinforce nodes that are heard frequently and allow routing layers to have an opportunity to influence the selection process. Another approach to neighborhood management that considers individual link quality is discussed in [54]. The idea is to build up a candidacy list for potential neighbors to be considered for inserting into the neighbor table. The paper assumes that if the node already has a two-or-more-hop routing path to a potential neighbor at the routing graph, it is not necessary to keep this potential neighbor in the candidate list. Otherwise, this potential neighbor would stay in the candidate list for its link estimation to build up. If the candidate is determined to be good (have link quality above some threshold), it may replace a table entry if there is one worse than it. If the candidate is bad, it will be removed from the candidate list. It is unfortunate that no evaluation of the algorithm is presented in [54]. However, this algorithm is close to the holistic approach that we advocate for neighborhood management. Note that the neighbor exclusion criteria requires any-to-any routing support since a node should not become a new logical neighbor if there is already a route to it. The problem of neighborhood management has aspects in common with cache management and with statistical estimation techniques in databases. There is a growing body of work on gathering statistics on Internet packet streams using memory much smaller than the number of distinct classes of packets. Heuristics are used in [27] to identify a set of most frequently occurring packet classes. Two algorithms are presented in [32] to identify all values with frequency of occurrence exceeding a user specified threshold. A sliding window approach is used in [23] that can be generalized to estimate statistical information of a data stream. Finally, [25] showed a simple FREQUENCY algorithm that estimates the frequency

126 5.8. MULTIHOP ROUTING IMPLICATIONS 110 of Internet packet streams with limited space. We explored these different techniques during our design process and identified the FREQUENCY approach as our solution. 5.8 Multihop Routing Implications We have demonstrated that it is possible for a local process to maintain a stable subset of good neighbors using a limited size neighbor table much smaller than the actual number of potential neighbors. With such a stable set of neighbors, statistics can be collected for them such that link estimation can be performed. This subset of neighbors defines the local connectivity options of the node. Together, a logical connectivity graph with each edge characterized by link estimation is discovered. The definition of a neighbor is now relative to a set of local rules judging all the nodes that are heard, and only the most competitive ones will be kept as neighbors by the neighbor table. It is important to note that in deriving such a reliable neighborhood, no explicit threshold setting is required and the problem of setting too high a threshold that results in network partition would not occur. Furthermore because the FREQUENCY algorithm works well regardless of cell densities, higher-level protocols can adjust the cell density without affecting the robustness of the routing system. That is, the logical connectivity graph built by the nodes will adapt to the different cell density according to its best ability with the limited resources. We have shown how to use link reliability as one of the basic criteria for neighborhood selection. This definition of the selection criteria can be further augmented to achieve better selections. Since higher-level services may need to specify its own goodness

metric for neighborhood management, designing a flexible neighborhood abstraction and architecture such that different services can cooperate to influence the choice of neighbors will be important research that is left as future work. In the next chapter, we revisit the high-level picture of what role neighborhood management plays in the context of network self-organization and study the overall routing problem.

Chapter 6

Cost-Based Routing

With the local processes of link estimation and neighborhood management, a distributed logical connectivity graph is created. Each edge on the graph is characterized by link quality as a probabilistic metric of reliability. Routing protocols should build topologies upon this graph, and the resulting topology is a subgraph of the logical connectivity graph. The primary focus of this chapter is to explore the design of such a routing process to form a stable and reliable routing topology. We first introduce a typical distributed distance-vector tree formation process and extend it into a general framework that supports different kinds of cost-based routing. We focus on tree formation since data collection is the most common communication pattern for sensor networks; it also brings forward the issues that need to be considered in any pattern. Such a tree formation process has to be integrated with the other two processes: link estimation and neighborhood management. We give an overview of how these three processes work together to form a routing subsystem, and present a set of underlying system issues that arise when it is implemented. With the routing frame-

work in place, we discuss the different routing cost functions that run upon the logical connectivity graph. Finally, we survey the relevant related work in the context of wireless ad hoc routing and discuss how the design approaches in the literature differ from ours.

6.1 Distributed Tree Building Process

As discussed in Chapter 2, many sensor network applications require a basic form of tree-based routing for data collection and tree-based in-network processing. In this section, we focus on a general framework for such a distributed tree building process in the context of distance-vector based routing with an arbitrary routing cost function. Such a framework can be extended to form multiple trees rooted at different nodes when a spanning forest topology is required.

We first discuss the basic process of a distributed distance-vector based routing protocol in the context of tree building. The root of the tree, or the sink node, always has a routing cost of 0. For all other nodes, the routing cost is initialized to infinity. Every node in the network periodically transmits route messages, which contain the node address and the estimated routing cost to the tree root. Upon reception of route messages from neighboring nodes, each node extracts the information from the message and stores it in the neighbor table. As the neighbor table is updated, a local parent selection function is invoked to select the best parent. For basic distance-vector routing, the best parent is the one that carries the smallest routing cost, or distance, to the root of the tree. Once parent selection has identified the best parent, the routing cost of the node is computed by adding the link

Input: Last routing message received M; neighbor table T.
Output: Success or Fail.
TreeBuild(M, T)
(1)  if sinknode
(2)      PathCost = 0
(3)      Parent = itself
(4)  else
(5)      PathCost = ∞
(6)      Parent = nil
(7)      update T with M(SourceAddress, RoutingCost)
(8)      foreach entry e in T
(9)          e.LinkCost = EvalLinkCost(e)
(10)         RoutingCost = EvalCost(e.PathCost, e.LinkCost)
(11)         PathCost = Min(PathCost, RoutingCost)
(12)     Parent = node with the Min(PathCost)
(13) Send Route Message
(14) if Parent = nil
(15)     return FAIL
(16) else
(17)     return SUCCESS

Figure 6.1: Distributed tree building algorithm framework.

cost to the routing cost of the parent. The next route message would convey the new cost to neighboring nodes. The above is formalized in the general framework shown in Figure 6.1. Line 9 in Figure 6.1 shows the EvalLinkCost function, which defines the cost of a link, and line 10 shows the EvalCost function, which combines the link cost with the routing cost of the path through each potential parent. For example, for the cost functions to perform shortest-hop routing, PathCost and LinkCost would be in units of hop-counts, EvalLinkCost would return 1 for all links in T, and EvalCost would be a simple addition. One of the assumptions of such a distributed tree building process is that no matter what the routing cost function is, the link cost must be positive and EvalCost must always increase

131 6.1. DISTRIBUTED TREE BUILDING PROCESS 115 the cost. Figure 6.2 shows a more advanced framework that takes the probabilistic nature of connectivity into the routing layer. As discussed in Chapter 4, out-bound link estimations among neighbors are obtained through piggybacking in-bound link estimations over route messages, which are sent periodically to disseminate routing information and maintain the minimum data rate for link estimation. Line 10 shows such a mechanism of obtaining outbound link estimation feedback from the node that originates the route message. With both in-bound and out-bound link estimations, asymmetric links can be avoided, depending on the actual criteria defined by the EvalLinkCost function. Line 13 shows a simple mechanism that avoids creating two-hop cycles. It works by simply not choosing immediate children as potential parents. Line 15 avoids selecting a parent that does not have a parent. This does not imply that this potential parent is connected to the root of the tree. Line 17 breaks cycles that are more than one hop when they are detected by the cycle detection mechanism. Line 19 ensures that the potential parent is connected to the root of the tree, a mechanism to cope with the counting-to-infinity problem, which is discussed later in the chapter. Finally, with the understanding that link quality estimation has at least 10% fluctuations, line 26 provides a hysteresis to lower the potential occurrences of route flapping. Note that the switching threshold is useful for error-prone routing cost functions such as those derived from link estimations. With the high-level tree building framework presented, we turn to investigate an appropriate routing cost function for sensor networks using such a framework.

Input: Last routing message received M; neighbor table T.
Output: Success or Fail.
TreeBuild(M, T)
(1)  if sinknode
(2)      PathCost = 0
(3)      Parent = itself
(4)      return SUCCESS
(5)  else
(6)      OldParent = Parent
(7)      OldPathCost = ∞
(8)      PathCost = ∞
(9)      Parent = nil
(10)     update T with M(SourceAddress, RoutingCost, OutboundLinkEstimation)
(11)     foreach entry e in T
(12)         e.LinkCost = EvalLinkCost(e)
(13)         if e is a child
(14)             continue
(15)         if e has no parent
(16)             continue
(17)         if a cycle is detected with e
(18)             continue
(19)         if e.PathCost != 0 and e.RootConnected is FALSE
(20)             continue
(21)         RoutingCost = EvalCost(e.PathCost, e.LinkCost)
(22)         if e.addr = OldParent
(23)             OldPathCost = RoutingCost
(24)         PathCost = Min(PathCost, RoutingCost)
(25)     Select new Parent with minimum PathCost
(26)     if (OldPathCost != ∞) and (OldPathCost - PathCost <= SwitchThreshold)
(27)         Keep using OldParent and OldPathCost
(28) if Parent = nil
(29)     return FAIL
(30) else
(31)     return SUCCESS

Figure 6.2: Distributed tree building algorithm framework with link estimation incorporated.

6.2 Overview of the System Routing Architecture

In this section, we present the underlying details of the core mechanisms supporting the general framework shown in Figure 6.2. We first focus on the core system architecture in Figure 6.3, which captures the high-level interactions of all the components implementing the routing framework.

There are several concurrent processes operating together in Figure 6.3. Upon receiving a message by snooping on the channel, if the source node of the message is not in the table and the message is neither a route message nor an originated data message, the message is dropped, since the neighborhood management assumes the same message rate for each node when considering it for table insertion. This restriction can be relaxed as discussed in Chapter 5. If the message is not dropped, the source node will be considered for insertion into the neighbor table by the Table Management component. If the source node is already in the neighbor table or is to be inserted into the table, the information in the table needs to be updated. Figure 6.4 shows the data structure of the neighbor table. It contains node status and routing entries for neighbors. Its fields include: MAC address of the neighbor, neighbor's parent address, routing cost, children information, internal management flags, duplicate packet elimination information, reception (in-bound) link quality, send (out-bound) link quality, and link estimator statistics. These fields are updated by the different components depending on the information in each incoming message. For example, the link estimator maintains estimates of the in-bound (reception) link quality of each neighbor in the neighbor table. Out-bound estimations are piggybacked on the route messages; the Neighbor Table component would extract the

Figure 6.3: Message flow chart illustrating the core components that implement our routing subsystem: the Estimator, Table Management, Neighbor Table, Parent Selection, Cycle Detection, the originating and forwarding queues, and the message filter.

estimates and store them accordingly. The Neighbor Table component also decays the out-bound estimation if it is not updated within the period specified by OutBoundDecayWindow, which is defined in Section 4.2. Parent selection is run periodically to identify one of the neighbors for routing. The Timer component generates timing events to run the parent selection component, which broadcasts (locally) a route message to disseminate routing information to neighbors after completing the parent selection. That is, both parent selection and route message updates run at the same rate. Such a process is described in the pseudocode previously shown in Figure 6.2. The route messages include the parent address, the estimated routing cost to the sink, and a list of reception link estimations of neighbors. When a route message is received from a node that is resident in the neighbor table, the corresponding entry is updated. Otherwise,

typedef struct TableEntry {
    uint16_t id;              // Neighbor MAC Address
    uint16_t parent;          // Neighbor's Parent's MAC Address
    uint16_t cost;            // Neighbor's Routing Cost
    uint8_t  hop;             // Neighbor's Hop-Count
    uint8_t  rootconnected;   // Neighbor's Path Connection to Root
    uint8_t  childliveliness; // For cycle detection
    uint8_t  flags;           // Internal management flags
    uint8_t  lastpacketno;    // For duplicate packet elimination
    uint16_t missed;          // Estimator statistics
    uint16_t received;        // Estimator statistics
    int16_t  lastseqno;       // Estimator statistics
    uint8_t  liveliness;      // Outbound decay window
    uint8_t  receiveest;      // Inbound estimation
    uint8_t  sendest;         // Outbound estimation
} TableEntry;

TableEntry NeighborTbl[ROUTE_TABLE_SIZE];

Figure 6.4: Typical data structure of the neighbor table. ROUTE_TABLE_SIZE determines the size of the neighbor table.

the neighbor table manager decides whether to insert the node or drop the update. Data packets originating from the node, i.e., outputs of local sensor processing, are queued for sending with the parent as the destination. Incoming data packets are selectively forwarded through the forwarding queue with the current parent as the destination address. The corresponding neighbor table entry is flagged as a child to avoid cycles in parent selection. Packet sequence numbers are used to suppress forwarding of duplicate packets, as shown in Figure 6.4. When cycles are detected on forwarding packets, parent selection is triggered with the current parent demoted to break the cycle. Such a cycle detection process can be eliminated if the routing protocol is guaranteed to be cycle-free.
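To illustrate how the fields of Figure 6.4 are used on the forwarding path, the following plain-C sketch shows one plausible handler for an incoming data packet: it marks the sender as a child, suppresses duplicates using the stored sequence number, and hands the packet to the forwarding queue. The FLAG_CHILD bit, packet layout, and helper functions are illustrative assumptions, and the TableEntry declaration of Figure 6.4 is assumed to be in scope.

```c
#include <stdint.h>
#include <stddef.h>

#define FLAG_CHILD 0x01            /* assumed bit within TableEntry.flags */

typedef struct {
    uint16_t origin;               /* originating node address (assumed layout) */
    uint8_t  seqno;                /* routing-layer sequence number */
} DataPacketHeader;

/* Assumed helpers provided elsewhere in the routing subsystem. */
extern TableEntry *table_lookup(uint16_t addr);
extern int forward_queue_enqueue(const DataPacketHeader *pkt);

/* Handle a data packet received from a neighboring node. */
void handle_data_packet(uint16_t sender, const DataPacketHeader *pkt) {
    TableEntry *e = table_lookup(sender);
    if (e == NULL)
        return;                    /* sender not in the neighbor table: drop */

    e->flags |= FLAG_CHILD;        /* sender routes through us: never pick it as parent */

    if (pkt->seqno == e->lastpacketno)
        return;                    /* retransmitted duplicate: drop */
    e->lastpacketno = pkt->seqno;

    forward_queue_enqueue(pkt);    /* queue toward the current parent */
}
```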

6.3 Underlying System Issues

We now present the issues underlying the routing process described in Figure 6.3. They include the rate of parent selection, packet snooping, the counting-to-infinity problem, cycles, duplicate packet elimination, queue management, and the relationship to link quality estimation.

Rate of Parent Change

Regardless of the routing algorithm, routes can be changed whenever the parent selection algorithm is scheduled to run. For fast adaptation, it is tempting to schedule the parent selection component to evaluate new routes for every route update received from neighboring nodes and to generate a new route message whenever the parent changes. However, a domino effect of route changes is likely to be triggered across the entire network, especially when routing costs are very sensitive. This not only creates topology instability, but also leads to unbounded message overhead, since a parent change can cause more route update messages. To address this issue, we limit the rate of parent change and attempt to bound the message overhead when the network is unstable. We simply run the parent selection algorithm synchronously using the timer event. That is, routes are evaluated on a periodic basis as a route damping mechanism, rather than asynchronously upon receiving a route update, except when a cycle is detected. Thus, the rate of parent change is bounded over time. This also bounds the route message update rate and conveniently defines the minimum data rate for link estimation. Therefore, the rate of parent selection can affect the adaptation

rate for topology and link quality. For sensor networks that are relatively immobile, it is possible to reduce this rate once the network has stabilized.

Packet Snooping

Given that the wireless network is a broadcast medium, a lot of information can be extracted by snooping packets on the channel. Link estimation is one example. At the routing level, since each node is a router, snooping on forwarded packets allows a node to learn about all its children, which is useful to prevent cycle formation. Furthermore, snooping on a neighboring node's messages is a quick way to learn about its parent, which decreases the chance of stale information causing a direct two-hop cycle. The same technique can also be used to prune children quickly in the case of a network partition. When a node with an unreachable route receives a forwarding message from its child, it NACKs by forwarding the child's message with a NO ROUTE address. All neighboring nodes, including its children, that snoop on this packet can quickly learn about the unreachable route. In fact, this provides natural feedback deep down into the tree that the routing path has become invalid. Packet snooping requires support from the underlying data link layer. In particular, the data link layer should not produce link-level acknowledgments for packets not destined to it, while allowing the flexibility for a higher level to snoop on different kinds of packets. Our sensornet platform allows such flexibility in packet snooping; however, some other platforms, such as the one in [3], may need to disable automatic link-level acknowledgments as the cost of supporting packet snooping.
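A minimal sketch of the NACK-by-forwarding mechanism described above is shown below, in plain C. The NO_ROUTE constant, packet layout, and helper functions are assumed names used only for illustration; they are not the exact interfaces of our implementation.

```c
#include <stdint.h>

#define NO_ROUTE 0xFFFE            /* assumed sentinel destination address */

typedef struct {
    uint16_t dest;                 /* next-hop destination address */
    uint16_t origin;               /* originating node */
    uint8_t  payload[29];          /* assumed payload size */
} RoutePacket;

extern int  have_route(void);      /* nonzero if we currently have a valid parent */
extern void link_send(RoutePacket *pkt);
extern void forward_to_parent(RoutePacket *pkt);
extern void invalidate_route(void);

/* Forwarding hook: if we have no route, rebroadcast the child's packet
 * addressed to NO_ROUTE so that snooping children learn the path is dead. */
void forward_or_nack(RoutePacket *pkt) {
    if (have_route()) {
        forward_to_parent(pkt);
    } else {
        pkt->dest = NO_ROUTE;      /* NACK: children snooping this give up on us */
        link_send(pkt);
    }
}

/* Snooping hook: overhearing a NO_ROUTE packet sent by our own parent
 * tells us that our path to the root has become invalid. */
void snoop_packet(const RoutePacket *pkt, uint16_t sender, uint16_t my_parent) {
    if (pkt->dest == NO_ROUTE && sender == my_parent)
        invalidate_route();
}
```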

Counting-To-Infinity Problem

The classic counting-to-infinity problem occurs when a network partition causes the routing distances to increase slowly, requiring many messages to detect. A simple solution is for nodes that are connected to the tree root to set a flag periodically. The flag is propagated over route messages and is stored in the neighbor table. Other nodes that wish to join the network must select parents with the flag set. Since nodes that are not connected to the tree root cannot set their own flags, the selected routing path must be connected to the root. Every node expires the flag after a period, and the process repeats. If the flag of the current parent remains unset after it has expired for some time, it is assumed that the path is unreachable due to a network partition. If no other potential parents have their flags set, the node becomes disjoint from the tree; when all nodes become disjoint from the tree, the tree is pruned automatically. This mechanism, which is reflected on line 19 in Figure 6.2, solves the counting-to-infinity problem efficiently and also works for multiple tree roots.

Cycles

For many-to-one routing over relatively stationary sensor networks, we use simple mechanisms to mostly avoid loop formation and to break cycles when they are detected, rather than employing heavyweight protocols with inter-nodal coordination. DSDV [58] provides an attractive approach to avoiding cycles in mobile networks, but it requires sequence number propagation and tuning of the sequence number settling time, which may differ in each deployment.

139 6.3. UNDERLYING SYSTEM ISSUES 123 We rely on techniques similar to poison-reverse or split-horizon [34]. By monitoring forwarding traffic and snooping on the parent address in each neighbor s messages, neighboring child nodes can be identified and will not be considered as potential parents. We only need to maintain this information for nodes in the neighbor table. Route invalidation when a node becomes disjoint from a tree and tree pruning by NACKing children s traffic are used to prune stale routing information, which leads to cycles. With these simple mechanisms, cycles may potentially occur and must be detected. Since each node is a router and a data source, cycles can be detected quickly when a node in a loop originates a packet and sees it returning. That is, one of the nodes in a cycle can detect it. This mechanism works as long as the queue management policy avoids letting forwarding traffic starve originated traffic. (Otherwise, packets may get stuck in a loop in the middle of a route without detection.) This level of fairness is an appropriate policy in any case. Once a cycle is detected, discarding the parent by choosing a new one or becoming disjoint from the tree will break it. Alternatively, a Time-To-Live field can be added, but we did not use one in our evaluation Duplicate Packet Elimination Duplicate packets can be created upon retransmission when the ACK is lost. Without duplicate packet elimination, they will be forwarded, creating a multiplicative effect and wasting more bandwidth and energy. To avoid duplicate packets from link retransmissions, the routing layer uses a different sequence number from the link sequence number and stores it in the neighbor table to detect retransmitted packets as shown in Figure 6.4. When a

140 6.3. UNDERLYING SYSTEM ISSUES 124 duplicate packet is received from a child, the same sequence number would match the one stored in the neighbor table, and the corresponding packet is dropped. This approach relies on in-order packet delivery during retransmission and assumes that the neighbor table is able to track children Queue Management Nodes high in the tree forward many more messages than they originate. Care must be taken to ensure that forwarding messages does not entirely dominate the transmission queue, since it would prevent the node from originating data and undermine cycle detection. We separate the forwarding and originating messages into two queues so that upstream bandwidth is allocated according to a sharing policy that attempts to bias against upstream forwarding traffic. The policy that we implemented is very simple. With the assumption that originating data rate is low compared to that of forwarding messages, we give priority to originating traffic over traffic from distant nodes. For data collection it is possible to estimate the ratio of forwarding to originating packets by counting the descendents of each parent, but a general treatment of fair queuing is beyond the focus of this study Relationship to Link Estimation The usual approach to routing assumes links are either good or bad. Therefore, link failure detection is employed and is based on a fixed number of consecutive transmission failures. Our approach is to define connectivity relative to link estimation and incorporate it with the routing cost function. Thus, the stability and agility of link estimation can directly affect the stability of the routes and the rate of route adaptation. We will explore

such a stability issue in Chapter 7.

6.4 Cost Metrics for Connectivity-Based Routing

In this section, we propose different routing cost functions for the distance-vector based tree building process described in Figure 6.2. They include shortest path, shortest path with threshold, path reliability, and minimum transmission. They all instantiate the EvalLinkCost and EvalCost functions in Figure 6.2.

The traditional cost function for distance-vector routing is shortest path using hop-count. In power-rich wired networks with highly reliable links, retransmissions are infrequent and hop-count adequately captures the underlying cost of packet delivery to the destination. Hop-count is also well defined in a wired network. For shortest path using hop-count, the EvalLinkCost and EvalCost functions would be implemented as follows.

Input: Neighbor table entry e.
Output: Routing cost in hop-count.
EvalLinkCost_ShortestPath(e)
(1)  return 1

Input: PathCost in hop-count; LinkCost in hop-count.
Output: Routing cost in hop-count.
EvalCost_ShortestPath(PathCost, LinkCost)
(1)  return PathCost + LinkCost

However, with lossy links, as found in many sensor networks, hop-count based on physical connectivity is not an appropriate cost function. As explained in Chapter 2, shortest path with hop-count tends to select links at the edge of the connectivity cell, because these links usually yield paths with minimal hop-count in reaching the destination.

If link estimation is not used, these links on the cell edge are likely to be unreliable for data delivery. We explore the actual routing performance of the shortest path cost function in the next chapter. Nevertheless, shortest path routing can still be useful in unreliable networks, provided that we define it relative to some probabilistic connectivity. A simple technique is to apply shortest path routing only to links that have estimated link quality above a predetermined threshold. For this shortest path with link quality threshold, the EvalLinkCost and EvalCost functions would be implemented as follows.

Input: Neighbor table entry e.
Output: Routing cost in hop-count.
EvalLinkCost_ShortestPathLinkThreshold(e)
(1)  if e.sendEst > Threshold and e.recvEst > Threshold
(2)      return 1
(3)  else
(4)      return ∞

Input: PathCost in hop-count; LinkCost in hop-count.
Output: Routing cost in hop-count.
EvalCost_ShortestPathLinkThreshold(PathCost, LinkCost)
(1)  return PathCost + LinkCost

As discussed in Chapter 2, this has the effect of increasing the depth of the network, since links above the threshold are not likely to be close to the edge of the connectivity cell. The assumption of this routing cost function is that each link on the logical connectivity graph should have link quality greater than the threshold in either or both directions; this assumption may break down in actual deployments. We investigate these issues in the next chapter. A different way to incorporate link quality into a routing cost function is path

reliability, which is the product of the link qualities along the entire path in the forward direction. The EvalLinkCost and EvalCost functions for path reliability would be implemented as follows.

Input: Neighbor table entry e.
Output: Routing cost in path reliability.
EvalLinkCost_PathReliability(e)
(1)  if e.sendEst > 0 and e.recvEst > 0
(2)      return e.sendEst
(3)  else
(4)      return ∞

Input: PathCost in log(path reliability); LinkCost in path reliability.
Output: Routing cost in log(path reliability).
EvalCost_PathReliability(PathCost, LinkCost)
(1)  return PathCost + log(LinkCost)

Such a metric would yield the path with the highest likelihood of success in reaching the base station without considering any link retransmissions. The logarithm turns multiplications into additions. It is used in [80] to optimize the end-to-end success rate to the base station. While this cost metric does not require any threshold tuning, it has a tendency to exploit short, reliable links, which can yield routing paths with many short hops. Furthermore, this routing cost function assumes no link retransmissions, which are essential to cope with the exponential packet drop in multihop routing. Thus, we do not study this routing cost function.

An alternative way to utilize link quality information is to use the expected number of transmissions along the whole path as the cost metric for routing. That is, the best path is the one that minimizes the total number of transmissions (including retransmissions) in delivering a packet over potentially multiple hops to the destination. We call this the

Minimum Transmission (MT) metric, which is also proposed in [21]. This metric takes both hop-count and link retransmissions into consideration during route selection. That is, a link retransmission is similar to increasing the hop-count by one. With links of varying quality, a longer path with fewer retransmissions may be better than a shorter path with many retransmissions. In considering the expected number of transmissions of a link, it is important to determine link quality in both directions, since losing an acknowledgment would also trigger a useless retransmission. The EvalLinkCost and EvalCost functions for MT would be implemented as follows.

Input: Neighbor table entry e.
Output: Routing cost in expected number of transmissions.
EvalLinkCost_MT(e)
(1)  if e.sendEst > 0 and e.recvEst > 0
(2)      return 1 / (e.sendEst x e.recvEst)
(3)  else
(4)      return ∞

Input: PathCost in expected number of transmissions; LinkCost in expected number of transmissions.
Output: Routing cost in expected number of transmissions.
EvalCost_MT(PathCost, LinkCost)
(1)  return PathCost + LinkCost

Note that MT also eliminates the need for predetermined link quality thresholds. However, the stability of MT routing is potentially an issue, since it utilizes link estimations in a non-linear fashion. Thus, for MT a noise margin should be used in parent selection to enhance stability.
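As a concrete illustration, the sketch below computes the MT link cost directly from the 8-bit estimates stored in the neighbor table of Figure 6.4 and applies a noise margin when deciding whether to switch parents. The fixed-point scaling (255 representing 100%, cost in sixteenths of a transmission) and the specific margin value are illustrative assumptions, not the exact constants of our implementation.

```c
#include <stdint.h>

#define INFINITE_COST  0xFFFF
#define SWITCH_MARGIN  16        /* assumed noise margin, in scaled cost units */

/* MT link cost: expected transmissions = 1 / (sendEst * recvEst), with the
 * 8-bit estimates interpreted as fractions of 255. The result is scaled by 16,
 * so a perfect link costs 16 (i.e., cost is in sixteenths of a transmission). */
uint16_t mt_link_cost(uint8_t send_est, uint8_t recv_est) {
    if (send_est == 0 || recv_est == 0)
        return INFINITE_COST;
    uint32_t product = (uint32_t)send_est * (uint32_t)recv_est;   /* at most 255*255 */
    return (uint16_t)((16UL * 255UL * 255UL) / product);
}

/* Route cost through a candidate parent: parent's path cost plus link cost. */
uint16_t mt_route_cost(uint16_t parent_path_cost, uint8_t send_est, uint8_t recv_est) {
    uint16_t link = mt_link_cost(send_est, recv_est);
    if (parent_path_cost == INFINITE_COST || link == INFINITE_COST)
        return INFINITE_COST;
    uint32_t total = (uint32_t)parent_path_cost + link;
    return total >= INFINITE_COST ? INFINITE_COST : (uint16_t)total;
}

/* Hysteresis: switch parents only if the candidate beats the current
 * parent's cost by more than the noise margin. */
int should_switch_parent(uint16_t current_cost, uint16_t candidate_cost) {
    return (uint32_t)candidate_cost + SWITCH_MARGIN < current_cost;
}
```

The integer arithmetic avoids floating point on the mote, and the noise margin realizes the SwitchThreshold hysteresis of Figure 6.2 for this error-prone, estimation-derived cost function.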

145 6.5. RELATED WORK Related Work Many ad hoc routing protocols exist in the computing literature. In general, they can be classified into two categories: table-driven and source-initiated on-demand routing. The distance-vector based routing protocols, such as Bellman-Ford [45], DSDV [58] and our protocol in Figure 6.2, fall in the table-driven category. Another kind of table-driven protocol is link-state routing, such as OSPF [42] in the Internet. Link-state protocols are not attractive in ad hoc networks because they require significant overhead in maintaining an up-to-date global knowledge of the entire routing topology on each node. The more recent ad hoc routing protocols in the literature of mobile computing take the on-demand approach. The on-demand approach suits mobile computing traffic well, as it comprises many independent pair-wise data flows in the network and thus routes are established and maintained to support only the actual data flows, which reduces the amount of state and protocol overhead. For sensor network applications, such a model of many independent pair-wise traffic flows is not common. Instead, each node would originate data that needs to be forwarded to the data sinks. Thus, source-initiated on-demand routing does not fit well with the sensor network traffic model. In contrast, sink-initiated on-demand routing, where an interested sink node would initiate a route discovery to establish reverse-path routes, is the norm in sensor networks. Since these two kinds of on-demand routing share a common set of underlying issues that affect routing performances, understanding source-initiated on-demand routing is also important for sensor network research.

146 6.5. RELATED WORK Table-Driven Routing We discuss, in more detail, some of the key table-driven routing protocols in the literature. The amount of prior work in this space is large, but there exists relevant work, such as [68], that provides a survey of the overall space if the reader is interested in probing further. The Destination-Sequenced Distance-Vector Routing protocol (DSDV) presented in [58] is a table-driven routing protocol, which improves upon the distributed Bellman-Ford routing algorithm [45] by providing loop-free topologies and solving the counting-to-infinity problem. Every node maintains a routing table that records the routing distance in hopcount to all other nodes in the network. In addition, each destination has a sequence number and sends it along with each route message. The destination increments the number to provide a temporal order of its route freshness. Routes to the destination are constructed using the latest sequence number, and any stale routes would entail a smaller sequence number. Thus, a routing path is chosen based on freshness before considering the shortest hop path; if multiple routing paths to the same destination have the same freshness, the selection will be based on shortest hop-count. By ensuring that no stale information is used and the route always descends downhill along the hop-count gradient, no cycles will result. Instability can occur because the rate of route information propagation over different paths may vary, and a node may switch its route simply because a fresh new route is found. To avoid this problem, each node keeps track of the settling time of its best route and does not change route until the settling time has expired. In contrast to the approach we take in defining connectivity relative to link esti-

147 6.5. RELATED WORK 131 mation, DSDV assumes connectivity to be bimodal, either good or bad, and relies on link failure detection to avoid routing over failed links. A link is declared failed if a fixed number of consecutive packet transmissions occurs without being acknowledged by the receiver. In wireless networks, where connectivity is lossy, such a mechanism would result in many false positives of link failures, which lead to network instability. In the next chapter, we provide the details in comparing the routing performance of DSDV to our approach. The Expected Transmission cost function, discussed in [24], is the same as our MT cost function, since both take connectivity as a probabilistic metric defined by link estimation and have the routing layer exploit this information. The study was performed over wireless networks using laptop size computers over a static deployment. Since the memory resource is not a concern on such platforms, they did not investigate the issue of neighborhood management under memory constraints. They modify the DSDV protocol to use the expected transmission cost function. Another study in [26] explores the same routing cost function under mobility and concludes that such a cost function performs best when the network is static. These studies provide empirical evidence in other wireless networks that the MT cost function can yield better performance than the typical hop-count based cost functions. There are other kinds of protocols that exemplify the flexibility of defining different kinds of routing cost functions over the same distance-vector based routing framework. There exist routing cost functions that optimize the network for network lifetime or energy consumption [19, 69, 72]. While these protocols demonstrate great reduction in energy consumption, they also assume links are boolean in general and neglect the lossy wireless

148 6.5. RELATED WORK 132 characteristics. In the packet-radio literature, there are table-driven ad hoc routing protocols that attempt to enhance the reliability of communication by modifying the cost function to route over less congested or interfered paths, such as Least-interference routing [73], Leastresistance routing [62], and Maximum-minimum residual capacity routing [14]. These protocols do not directly address the fundamental issue of lossy connectivity. Instead, they attempt to form topologies to load balance the network. Unfortunately, we are unable to find empirical evaluations of these protocols on real packet radio nodes Source-Initiated On-Demand Routing In this section, we discuss some of the key source-initiated on-demand routing protocols in the mobile computing literature. Many of the issues of this kind of protocol are similar to the sink-initiated on-demand routing for sensor networks, such as Directed Diffusion [40]. For on-demand routing, the discovery process generally begins with the source node flooding the entire network to discover its destination node. The destination node or intermediate nodes with a route to the destination reply to the source using the reverse path of the flood. This path will be the routing path for the source to communicate with the destination. Because the reverse path is used for routing, these kinds of protocols assume links are symmetric. The Ad Hoc On-Demand Distance Vector (AODV) protocol presented in [59] is an on-demand routing protocol that improves upon the DSDV protocol. During route

discovery, each node records the first packet that it receives, discards subsequent redundant packets, and rebroadcasts the route request packet. The destination replies along the first path that reaches it in the flood. The reply message flows along the reverse path and sets up the route table for each node in the path. Like DSDV, AODV utilizes destination sequence numbers to ensure all routes are loop-free and contain the most recent route information. The Dynamic Source Routing (DSR) protocol discussed in [44] is an on-demand routing protocol based on the concept of source routing. Each node maintains route caches containing the source routes that it hears. Route discovery relies on the same flooding process, except that each route request contains the evolving source routing path; a node adds its address to the path before rebroadcasting the request. If a node has a route in its cache that can reach the destination, it will reply to the source without rebroadcasting the route request. If not, the end destination will reply to the source using the shortest path it hears. Because DSR uses source routing, nodes overhearing the traffic can learn new routes or improve old routes in the cache. Although the route cache is used to suppress some of the rebroadcasts, there are still many redundant rebroadcasts across the network. The Temporally Ordered Routing Algorithm (TORA) [57] is another protocol that is built upon link reversal. During route discovery, a flood is used to establish a directed acyclic graph (DAG) rooted at the source, with the destination node being the sink of the DAG. A height metric is used to set up a gradient in such a DAG. Reversing links in the DAG is the primary mechanism to deal with node mobility and link failures. Therefore, nodes need to maintain accurate routing information about all adjacent (one-hop) nodes, and link symmetry is always assumed. The height metric is composed of many

150 6.5. RELATED WORK 134 parameters, including the logical time of a link failure, a propagation ordering parameter, a node ID, and other protocol specific information. Since timing is important in determining the height, TORA assumes that all nodes have synchronized clocks. A different kind of on-demand routing is proposed in GRAd [67]. Like other ondemand routing protocols, if node A needs to talk to node B, it first floods the network and B replies using the reverse path. All the intermediate nodes will record the routing cost to node A in hop-counts. Unlike other on-demand protocols, GRAd exploits the local broadcast nature of the wireless medium; rather than sending unicast messages to the next hop for multihop forwarding, messages are sent as local broadcasts carrying the source address, the end destination address, a sequence number, and the remaining routing cost to the destination. Neighboring nodes that receive the message and have a lower routing cost than the remaining routing cost forward the message. That is, route selection becomes receiver based. Since many nodes may qualify to relay the message, especially in a dense network, this may seed a local broadcast storm. GRAd relies on the MAC layer s backoff to arbitrate the order of accessing the channel during the broadcast storm, which only provides a limited spreading in time among the different rebroadcasting nodes. The sequence number and the source address create a unique identifier for each message and GRAd uses it to suppress redundant forwarding by removing them from the MAC layer s queue and limits the extent of the broadcast scope. This approach to routing is resilient to mobility with high reliability; however, the potential dissemination overhead to support such reliability can be high. We found no empirical data in the literature to measure the extent of this overhead.

151 6.5. RELATED WORK 135 For all of these on-demand routing protocols, reverse path routing based on route discovery flooding is a fundamental mechanism for these protocols to perform well as they assume links are either good or bad and symmetric. As we have discussed throughout this thesis, these assumptions are not true in reality, as shown by our empirical findings and other related work. This implies that special attention to link characteristics must be made before relying on reverse-path routing. Another problem is timing control of the rebroadcast to avoid potential broadcast storm problems, which may result in nodes left out from the flood or selecting inefficient routing paths. Prior work in [29] has shown that blindly selecting the first node as parent in the route discovery flood can yield very unreliable routing paths and the broadcast storm problem can have interesting effects on the resulting topology. These shortcomings may be acceptable for mobile computing since the protocols are designed to cope with mobility and an inefficient routing path is better than disconnection. For sensor networks that are relatively static, optimizing for efficient and reliable routing paths is an important way to cope with the tight limitation of bandwidth and energy. Thus, a careful tree-building process using sink-based on-demand routing must address both the link quality and broadcast-storm issues. One sample design of this careful tree-building process that addresses both of these issues is discussed in [71]. A more recent work building upon GRAd is GRAdient Broadcast [81]. It is a sink-initiated on-demand routing protocol that builds a gradient using the RF transmission energy as the cost metric. Like GRAd, each message has a credit or remaining cost, and the receiver that has a smaller cost would forward the packet, with a random delay before each forwarding to avoid potential collisions. However, since energy rather than hop-count

152 6.5. RELATED WORK 136 is used as the cost metric, the GRAdient Broadcast has a higher granularity to scope the potential number of receivers. Furthermore, the nonlinear decrease of the credit in the packet allows the protocol to further limit the scope of the travel paths towards the sink. As with GRAd, no experimental data is found on the performance of the protocol Summary We have discussed a range of important approaches to ad hoc routing that exist in the literature. None of the work takes the holistic approach that we advocate in defining connectivity as a probabilistic concept and carrying it from link estimation to neighbor management and routing cost function. However, it is possible to extend these protocols to build reliable topologies in sensor networks as long as they run upon a concretely discovered logical connectivity graph, and use cost functions other than shortest hop in order to exploit the link quality information over such a connectivity graph. Protocols that require a flood discovery process must be carefully done to avoid potential broadcast-storm problems, which would yield ill-formed trees. The resulting tradeoff in relying on such logical connectivity graphs is a decrease in responsiveness to mobility, as the logical connectivity graph governs the rate of adaptation. Nonetheless, since sensor networks are relatively static, this tradeoff is acceptable for many applications. The many-to-few routing characteristics reduce the amount of state required at the routing layer to O(destinations) since the network only needs to know how to route to a few destinations. However, the majority of the state, if our holistic approach is taken, would be used for managing local link quality statistics and neighborhood information, which is

153 6.5. RELATED WORK 137 governed by the size of the neighbor table. This favors protocols that require a routing table anyway, such as DSDV, AODV, and TORA. However, it is an extra overhead for protocols such as DSR, which only maintains a cache of recent routes. Our protocol demonstrates one of the simplest ways to achieve tree-based routing. With a slight overhead in maintaining destination information, our work can improve upon DSDV or AODV. The communication overhead in sending periodic route messages or flooding the entire network periodically as in AODV can be configured to be the same. Receiver-based protocols such as GRAd would require an O(cell density) state maintenance in the worst case, since a node must maintain information about each transmitted packet from all its potential neighbors within a time window. For example, a lucky long link with a potential neighbor would generate a retransmission. Maintaining a neighbor table with a logical connectivity graph is an effective way to bound the state required and limit the scope of dissemination. However, it would reduce the resilience against mobility. This kind of protocol relies on primitive MAC layer support to deal with broadcast storm issues. Techniques for careful tree building should be applied here to avoid potential collisions that may hinder dissemination. Like DSDV and AODV, our protocol utilizes only one routing path. Other protocols such as DSR and TORA can use multiple paths to enhance the reliability of data delivery. Receiver-based routing protocols may yield higher reliability since many redundant routing paths may potentially be used. There certainly exists a tradeoff in the overall degree of reliability and routing efficiency between the various approaches, but we do not study their effects in this thesis.

154 6.6. SUMMARY Summary Many factors have influenced the design of the distributed routing process that we present in this chapter. Since the common sensor networking applications are data-collection oriented, designing an efficient and reliable routing layer that supports a spanning forest topology, with each sink node maintaining its own tree, becomes an important challenge. Such a challenge is complicated by the lossy wireless connectivity and limited resources found over our sensor network platform. When we explore the rich set of available protocols in the literature, we find that most of them either assume the connectivity graph is given or are easy to discover since links are bimodal (good or bad) and symmetric in general. While these assumptions may be true on those platforms, they do not hold in the sensor networking regime as we have shown in Chapter 3. This motivates us to define connectivity relative to link estimation and brings forward such a probabilistic view to the routing layer. It opens up new cost functions for routing and clarifies the fundamental concept of a hop, which is relative to how the cost function views connectivity and how competitive a node is with respect to the neighborhood selection criteria. With these understandings, we look back on some of the protocols in the literature and try to evolve them with our probabilistic view on connectivity. We decide to extend the tree-based routing protocol based on the classic distributed Bellman-Ford algorithm, where the routing cost function needs not be hop-count as long as it increases monotonically. The end result is a general distance-vector based routing framework that incorporates two other important local routing processes, link estimation and neighborhood management, while leaving the routing cost function open. The key remaining question is to identify an

appropriate cost function to run over the discovered connectivity graph, with each edge characterized by its link estimate, so as to yield reliable and stable topologies. We have identified interesting cost functions in this chapter. In the next chapter, we implement our routing framework and the different cost functions in order to perform an extensive evaluation through both simulations and empirical experiments.

Chapter 7

Evaluation

In this chapter, we turn the design framework presented in the previous chapter into real implementations in order to evaluate and understand the performance of the different cost functions when they are integrated with the processes of link estimation and neighborhood management. Our evaluation methodology has three levels, spanning from high-level, large-scale simulations to empirical experiments in real deployment settings. Each level of the evaluation process allows us to narrow the scope towards a smaller set of workable solutions. We set up the relevant evaluation metrics and present our customized simulation framework for running graph-level and packet-level simulations. Since simulations can only approximate reality to a certain degree of fidelity, we also evaluate our design empirically on networks of reasonable size and even deliberately drive the network into congestion to observe the effects on the routing protocol. This evaluation process allows us to arrive at a working solution for tree-based multihop routing that can support the common data collection applications found in sensor networks. In the process of interpreting the data, we gain

an understanding of how performance can be affected by some of the subtle interactions among the three local routing subprocesses. Understanding these issues is important for identifying some of the root causes that hinder performance metrics such as end-to-end success rate and topology stability.

7.1 Evaluation Methodology

Having established the framework for concrete implementations of a variety of routing protocols and the underlying building blocks in Chapter 6, this section seeks to compare and evaluate a suite of distance-vector routing protocols in the context of data collection over a large field of networked sensors. We proceed through three levels of evaluation. The ideal behavior of these protocols, with perfect link estimation and no traffic effects, is assessed on large (400-node) networks using a simple analysis of network graphs with link qualities obtained from our probabilistic link characterization. The dynamics of the estimators and the protocols are then captured in abstract terms using a packet-level simulator. A wide range of protocols is investigated on 100-node networks under simulation. This narrows the set of choices and sheds light on key factors. The best protocols are then tested in greater detail on real networks on the scale of 50 nodes.

7.1.1 Candidate Routing Protocols

The set of routing protocols under evaluation includes broadcast, Destination Sequenced Distance Vector (DSDV), shortest path, shortest path with threshold, and minimum transmissions. We discuss the details of these protocols in the context of our evaluation.

Broadcast is a simple protocol that builds a network routing topology by periodically flooding the entire network from the root. For each flood, the sequence number is incremented so that the tree built in the previous round is discarded and rebuilt. The parent selection mechanism is simple: upon reception of the first broadcast message of a round, each node selects the source address of that message as its parent for routing. This mechanism uses no link estimation and requires no neighborhood management. This form of routing essentially captures the route discovery phase of many of the on-demand, reverse-path based routing protocols found in the mobile computing literature, such as DSR [44] and AODV [59]. The difference is that instead of the source initiating the route discovery flood, the sink node, being the root of the tree, originates the flood, and all the sources take the reverse path to send data to the sink node. In Chapter 6, we discuss how such a simple form of tree building can lead to unreliable routing trees. Nonetheless, since it is a fundamental mechanism used by many on-demand routing protocols, we evaluate its performance as well. It is important to point out, however, that better routing trees can be built if the tree building process is done carefully. In particular, a mechanism is required to control the timing of rebroadcasts so that the broadcast storm problem can be avoided. Such a mechanism can be implemented above the MAC layer to dampen the rate of rebroadcast and thus eliminate the storming effect. Furthermore, instead of relying solely on the first broadcast packet to build the tree and discarding subsequent packets, it is possible to compare the subsequent packets to the first one and select the best among these choices. The best one

could be chosen using a combination of criteria, such as hop count, received signal strength, or link quality if such information is available. In this thesis, we do not evaluate the performance of routing trees built with these improvements. Nonetheless, these techniques have been shown to build reasonable routing trees in one of the sensor network applications for intruder detection [71].

Shortest Path (SP and SP(t)) are the conventional distance-vector routing cost functions that we discussed in Chapter 6, following the framework in Section 6.2. In SP, a node is a neighbor if a packet is ever received from it. For SP(t), a node is a neighbor only if its link quality exceeds a tunable threshold t, as shown in Chapter 6. Thus, shortest path routing is performed within a subgraph of high quality links. Based on Figure 3.1 in Chapter 3, we consider two values for t. With t = 70%, we consider only links in the effective region, while leaving a significant noise margin for the estimators. With t = 40%, we also allow most of the good links in the transitional region, resulting in larger, less regular cells. In the implementation of the link estimators on the real platform, unsigned bytes are used to represent link quality from 0 to 100%.

Destination Sequenced Distance Vector (DSDV). We customize the general DSDV protocol to fit our framework shown in Figure 6.2 while preserving the essence of the protocol: a parent is chosen based on the freshest sequence number from the root while maintaining a minimum hop count when possible. That is, only nodes in the neighbor table that do not have failed links and that have the latest sequence number from the tree root are considered in the for loop on line 11 in Figure 6.2. Similar to SP, DSDV ignores link quality and considers all nodes it hears as neighbors.

The original DSDV protocol suggests a damping mechanism, through the use of a settling timer, to avoid route flapping due to different propagation delays of route messages. We instead use the periodic parent selection mechanism for route damping: a parent is changed only when the period is up or when a node becomes disjoint from the network, for example due to link failure. To detect link failure, a fixed number of consecutive packet losses to the next hop is used, as in the original DSDV protocol. When link failure is detected, a node becomes disjoint from the network and declares the route unreachable to its neighbors through periodic route messages. Our DSDV also exploits packet snooping for early detection of unreachable routes: since each node still sends its data traffic using the broadcast address when its route becomes unreachable, snooping on a parent's traffic allows our DSDV to detect an unreachable route without waiting for any route messages.

Minimum Transmission (MT) uses the expected number of transmissions as its cost metric. In the actual implementation on the sensor nodes, the routing cost computations are done using unsigned 32-bit integers. Each link estimate is represented as an unsigned byte to avoid floating point calculations, with 255 representing 100% reliability. The routing costs in EvalLinkCost are computed using these estimates, rounded to the nearest integer, and scaled by 256 to avoid maintaining floating point numbers. The SwitchThreshold on line 26 in Figure 6.2 is set to a default value of 0.75 of a transmission.

7.1.2 Evaluation Metrics

We define four important metrics for evaluating the performance of these protocols.

Hop Distribution measures the routing depth of nodes throughout the network, which reflects both end-to-end latency and energy usage.

Path Reliability and End-to-End Success Rate are two ways to estimate the end-to-end reliability of all the nodes in the network to the root of the tree. Path reliability approximates the end-to-end reliability of a routing path in the absence of retransmissions. It is calculated like the path reliability routing cost function discussed in Chapter 6: by taking the product of the link qualities, in the forwarding direction, along the path from each node to the root, we can infer the probability of reaching the sink node without any link retransmissions. We only use this metric for network graph analysis. For simulations and empirical experiments, we can directly measure the end-to-end success rate, which is the number of packets received at the sink for a node divided by the number that node originated. A maximum number of link retransmissions is performed at each hop. Losing packets before they reach the sink not only wastes energy and network resources, but also degrades the quality of the application. Another subtle issue is fairness: nodes far away from the sink are likely to have a lower end-to-end success rate than nodes that are close. A breakdown of the success rate by hop or distance should reveal this behavior.

Stability measures the total number of route changes in the network in each route update cycle, since the parent selection mechanism shown in Figure 6.2 runs at the same rate as the route updates. We use this metric to evaluate the stability of the routing topology.
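To make the relationship between these metrics and the link estimates concrete, the following C sketch (our own illustration, not the TinyOS implementation) computes the expected path reliability of a route as the product of the forward link qualities, and the corresponding MT path cost as the sum of expected transmissions per hop, using the unsigned-byte link representation and the scale-by-256 fixed-point convention described in Section 7.1.1; the function names and the exact rounding are assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Link quality is stored as an unsigned byte: 255 means 100% reliable. */

    /* Expected path reliability without retransmissions: the product of the
       forward link qualities q_i/255 over all hops toward the sink. */
    static double path_reliability(const uint8_t *q, int hops) {
        double r = 1.0;
        for (int i = 0; i < hops; i++)
            r *= (double)q[i] / 255.0;
        return r;
    }

    /* MT link cost: expected transmissions 1/p, kept in fixed point by
       scaling by 256, so a value of 256 means exactly one transmission. */
    static uint32_t mt_link_cost(uint8_t q) {
        if (q == 0)
            return UINT32_MAX;                              /* unusable link */
        return (uint32_t)((256UL * 255UL + q / 2) / q);     /* rounded 256*(255/q) */
    }

    static uint32_t mt_path_cost(const uint8_t *q, int hops) {
        uint32_t cost = 0;
        for (int i = 0; i < hops; i++)
            cost += mt_link_cost(q[i]);
        return cost;
    }

    int main(void) {
        /* Example: a three-hop path with roughly 90%, 70%, and 100% links. */
        uint8_t q[] = { 229, 178, 255 };
        printf("path reliability = %.3f\n", path_reliability(q, 3));
        printf("MT path cost     = %.2f expected transmissions\n",
               mt_path_cost(q, 3) / 256.0);
        return 0;
    }

For this example the sketch reports a path reliability of about 0.63 and an MT cost of about 3.5 expected transmissions, showing how MT penalizes the 70% hop even though it barely affects the hop count.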

Figure 7.1: Hop distribution from graph analysis of a 400-node network with 8-foot grid spacing.

7.2 Network Graph Analysis

The first method of evaluation is to explore the different routing cost functions using high-level graph analysis. That is, given a static connectivity graph with fixed probabilistic link qualities on all edges, derived from inter-node distance, we compute optimal trees using the distributed Bellman-Ford algorithm [45] for each routing cost function, including SP, SP(70%), SP(50%), and MT. Without packet-level dynamics, only hop distribution and path reliability are meaningful in this case. Nonetheless, this high-level analysis enables us to explore large scale networks; it also establishes optimistic bounds on routing costs. We analyze a network of 400 nodes, organized as a 20x20 grid with 8-foot spacing. The sink node is placed at a corner to maximize network depth. Connectivity information is derived from the data shown in Figure 3.1. Figure 7.1 shows the expected hop-count distribution for the four different cost metrics.
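The graph-analysis procedure itself is straightforward; the sketch below (an illustrative re-implementation, not the code used to produce Figures 7.1 and 7.2) runs the Bellman-Ford relaxation over a small, invented link-quality matrix using the MT edge cost of 1/p expected transmissions. Substituting a different edge_cost function yields the SP and SP(t) variants.

    #include <math.h>
    #include <stdio.h>

    #define N 4    /* tiny example network; node 0 is the sink */

    /* p[i][j]: probability that node i receives node j's packets (0..1).
       In the real analysis these values come from the distance-based
       connectivity model of Figure 3.1; this matrix is invented. */
    static const double p[N][N] = {
        { 0.0, 0.9, 0.4, 0.0 },
        { 0.9, 0.0, 0.8, 0.3 },
        { 0.4, 0.8, 0.0, 0.9 },
        { 0.0, 0.3, 0.9, 0.0 },
    };

    /* MT edge cost: expected transmissions to cross the link.  Replacing
       this function (e.g., a constant 1 for links above a threshold) gives
       the SP and SP(t) cost functions. */
    static double edge_cost(double prob) {
        return prob > 0.0 ? 1.0 / prob : INFINITY;
    }

    int main(void) {
        double cost[N];
        int parent[N];
        for (int i = 0; i < N; i++) { cost[i] = INFINITY; parent[i] = -1; }
        cost[0] = 0.0;                              /* the sink */

        /* Classic Bellman-Ford relaxation: at most N-1 rounds. */
        for (int round = 0; round < N - 1; round++)
            for (int u = 1; u < N; u++)
                for (int v = 0; v < N; v++) {
                    /* Forwarding direction: v must receive u's packets. */
                    double c = cost[v] + edge_cost(p[v][u]);
                    if (c < cost[u]) { cost[u] = c; parent[u] = v; }
                }

        for (int u = 1; u < N; u++)
            printf("node %d: parent %d, MT cost %.2f transmissions\n",
                   u, parent[u], cost[u]);
        return 0;
    }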

Figure 7.2: Path reliability to the tree root from graph analysis of a 400-node network with 8-foot grid spacing.

SP builds a very shallow network, while the rest yield deeper networks with wider hop distributions. With most nodes being 2 and 3 hops away from the root in a network of 160-foot extent, many of these links must cover 40 to 50 feet. This suggests they lie at the border of the transitional and clear regions of Figure 3.1 and have very low quality. Figure 7.2 shows the corresponding expected path reliability. Indeed, reliability for SP drops below 5% for nodes at distances greater than 50 feet. Protocols that utilize link quality estimates yield much higher path reliability by taking more, higher quality hops. For SP(70%), the lowest expected path reliabilities for two- and three-hop paths are (0.7)^2 = 49% and (0.7)^3 = 34.3%, respectively. SP(50%) takes advantage of links in the transitional region for fewer, longer hops, but reliability suffers as a result. MT takes reliability into account and performs best, without the need to set a threshold. This higher path reliability comes with the tradeoff of a slightly higher hop count for MT.

7.3 Effect of Neighborhood Management using Routing Cost

Recall from Section 5.6 that neighbor selection can go beyond basic link quality inference through frequency estimation. In particular, it is possible for the routing layer to influence the neighbor selection process, so that better neighbors are retained for routing purposes. In this section, we show one instance of this approach by extending the FREQUENCY algorithm discussed in Chapter 5 to incorporate routing information for neighbor selection. From the routing layer perspective, one approach is to influence the neighbor

selection to avoid maintaining sibling nodes that are unlikely to be used for routing. By sibling nodes, we mean neighboring nodes that have almost the same routing cost to the destination. For example, if we consider hop count, all neighboring nodes having the same hop count are unlikely to become either a parent or a child. Therefore, it is better to avoid maintaining these neighbors and to save the precious table entries for neighbors that may be potential parents or children. To explore the effect of this neighbor selection based on routing cost difference, we simulate the FREQUENCY neighbor table management process, using the same simulation method as described in Chapter 5, on an 80x80 grid network with each node transmitting 100 packets. In addition, we use the routing tree built with the MT cost function in the network graph analysis of the previous section to determine the routing cost of each node in the network. The modification to the FREQUENCY algorithm is in the insertion and eviction process, shown in Figure 7.3: the neighbor to be inserted must have a routing cost that differs from the node's own routing cost by at least CostDiff, where CostDiff is a tunable parameter, as shown on lines 1 and 3 in Figure 7.3. Similarly, priority of eviction is given to nodes whose absolute routing cost difference is less than CostDiff. To evaluate the effectiveness of this approach, we first examine what kinds of neighbors are maintained by the FREQUENCY algorithm without such a routing cost influence, which should guide us in finding an appropriate value of CostDiff for MT. We choose a center node in the 400-node grid network built using the network graph analysis. Figure 7.4 shows the dynamics of the neighbor table of such a node running only the FREQUENCY algorithm with a table size of only 40 entries. It shows that for most of the

Input: Node n to be inserted; n's advertised routing cost, c; the local node's routing cost, PathCost; neighbor table T.
Output: SUCCESS or FAIL.
Insert(n, c, T)
(1)  if |c - PathCost| < CostDiff
(2)      return FAIL
(3)  if there is an entry e in T where e.counter = 0 or |e.PathCost - PathCost| < CostDiff
(4)      use e to store n in table T
(5)      return SUCCESS
(6)  else
(7)      foreach entry e in T
(8)          e.counter = e.counter - 1
(9)      return FAIL

Input: Node n and neighbor table T.
Output: SUCCESS or FAIL.
Reinforce(n)
(1)  if n is in T at entry e
(2)      e.counter = e.counter + 1
(3)      return SUCCESS
(4)  else
(5)      return FAIL

Figure 7.3: Insertion and reinforcement in the FREQUENCY algorithm using routing cost difference.

neighbors that spend a relatively long time in the neighbor table, many of the routing cost differences with respect to the receiving node are close to 0, as indicated in the top-center portion of the graph. The circle with a cross in its center indicates the routing cost difference between this particular node and its parent. As expected, that cost difference is around 1 transmission. Figure 7.4 shows that it is inefficient to maintain neighbors with a very small routing cost difference when the table is full, as there are many potential parents or children with an MT routing cost difference between 1 and 2 transmissions. This shortcoming can

Figure 7.4: Percentage of time spent in the neighbor table by the different neighbors vs. their routing cost difference relative to the receiving node, running the FREQUENCY algorithm without the cost threshold (table size = 40). The cross indicates the node chosen as the parent.

be minimized if we augment the FREQUENCY algorithm with the routing cost influence and set CostDiff to about a quarter of a transmission, which should eliminate many of the neighbors around the center of Figure 7.4. Figure 7.5 shows the result with such an augmentation implemented. It shows that among the nodes that spend a long time in the neighbor table, only a few have a routing cost difference close to 0; instead, a majority of the neighbors in the table have a cost difference of about 1. Furthermore, it is also interesting to observe that neighboring nodes with a cost difference of 2 are frequent enough to be maintained by the table. This is because these nodes become more competitive as the number of candidate nodes competing for the neighbor table

Figure 7.5: Percentage of time spent in the neighbor table by the different neighbors vs. their routing cost difference relative to the receiving node, running the FREQUENCY algorithm with routing cost filtering (cost threshold = 0.25, table size = 40). The cross indicates the node chosen as the parent.

resources decreases with such a routing cost filter. This is advantageous for routing, since these new nodes have a smaller routing cost and can become potential parents. These results demonstrate how such a simple technique can effectively achieve the original goal of maintaining non-sibling neighbors. The simulation results that follow explore how this technique affects overall routing performance.

7.4 Packet Level Simulations

We turn to packet-level simulations to understand the dynamic behavior of the routing protocols and their interactions with link estimation and neighbor table management.

We build a custom discrete-time, event-driven network simulator in MATLAB for all of our packet-level simulations. While the results of our network graph analysis allow us to drop the shortest path routing cost function from consideration, the packet-level simulations allow us to further explore the other cost functions, as well as other protocols such as Broadcast and DSDV.

7.4.1 A Packet-Level Simulator

We have developed a packet-level simulator using MATLAB. It is a discrete-time, event-driven simulator that uses the MATLAB graphical environment to provide a visualization of the evolution of the network topology. The architecture of the simulator is modular, such that different radio models, interference models, media access control protocols, routing protocols, and applications can be mixed and matched in each simulation. The simulator also allows the user to configure the network with different kinds of node placements, such as a grid or a random layout. Figure 7.6 is a screen shot of the simulator, showing a sample routing tree topology where the root of the tree is located at the lower left corner. We implemented the network stack found in TinyOS 1.0 for this simulator. The low-level details of link acknowledgment and media access are also captured by the simulator, while the radio connectivity model is based on Figure 3.1. To capture the effect of collisions, we performed an empirical study using a setup similar to that of the connectivity study in Chapter 3. The Mica nodes were laid out as a line topology 3 inches above the ground on an open tennis court. Instead of scheduling the network to have only one transmitter at a time, we use an RF

Figure 7.6: Screen shot of the packet-level simulator.

broadcast to synchronize two transmitters to send packets within a bit time of each other, with the media access control layer disabled. Nodes that are not scheduled to transmit record all received packet traces. We vary the transmit power and the transmit schedule. The resulting traces suggest interesting observations for understanding the collision behavior among three nodes at different distances in our line topology: a sender, a receiver, and a collider that also transmits. To distinguish the sender from the collider, we consider the sender to be the node physically closer to the receiver. In general, with the same transmit

power used on the sender and the collider, noticeable interference is observed only if the collider is within the transitional region of the receiver. That is, the receiver can still receive some packets from the sender if the sender is in the effective region of the receiver, even when the collider is at the same or a greater distance from the receiver. If the sender is in the transitional region of the receiver, a fraction of the packets from both the collider and the sender are received. Almost no reception is possible if the collider is within the effective region of the receiver. Based on the above empirical observations, instead of pursuing a statistically accurate model of interference and collisions, we simply approximate the essence of the observed behavior with a simple probabilistic model in simulation. Such a model builds upon the probabilistic reception link qualities among all nodes obtained from our radio connectivity model. Assume p_{i,j} is the probability that node i successfully receives node j's message. Let node b be the receiver and node a be the sender. The probability that b receives a's message given that there are k colliders equals p_{b,a} * prod_{i=1..k} (1 - p_{b,i}), where i ranges over the colliders. This model captures the effect that, if the colliders are in the effective region, the probability of the receiver receiving the sender's packet in the presence of strong interference is small. The probabilistic interference behavior in the transitional region is also captured by this model.

7.4.2 Simulation Results on Routing

Using the simulator described in this chapter, we analyze the different candidate routing protocols over a 100-node network, placed as a 10x10 grid with 8-foot spacing, with the sink node located at the lower left corner of the grid. Again, 8-foot spacing is chosen

because the grid spacing, even diagonally, is close and within the edge of the effective region shown in Figure 3.1. The simulation time for each experiment is 2000 seconds. Each node offers a periodic traffic load of one data packet every 10 seconds and one route packet every 20 seconds. With such a traffic load, the network is fairly congested, especially with a maximum of 2 retransmissions per link. The route packet generation rate is higher than what would be used in practice; for example, in the Great Duck Island application, a route packet is generated every 2 minutes. We increase the rate for the convenience of reducing simulation time. With a simulation time of 2000 seconds, one route packet every 20 seconds corresponds to 100 rounds of the parent selection cycle. We simulate all the protocols except SP, since the graph analysis in the previous section has shown its poor performance, confirming our experience in practice. For protocols that utilize link estimation, WMEWMA is used with the stable settings described in Chapter 4. For MT, we additionally consider the effect of using the FREQUENCY algorithm, with the routing cost difference selection, to manage a neighbor table of only twenty entries. We call this case MTTM. All other protocols use a table large enough to hold all neighbors. Figure 7.7 shows the resulting hop distributions. These agree with the graph analysis fairly well, even though this network has half the physical extent in each dimension of that used in the graph analysis. In both evaluation approaches, MT and the two SP(t) cost functions all yield a network that is about 10 hops deep, with most nodes having a hop count of 6. Since link quality information is fixed rather than estimated in the graph analysis, the quality of long links is stable for the routing protocols to exploit. Furthermore, the presence

of network traffic in the simulations eliminates some of these long links, yielding longer routing paths. Figure 7.8 shows a CDF of the physical distances spanned by all the links in the network, for both graph analysis and simulation using the MT cost function. As expected, most of the routing links in the packet-level simulation cover shorter distances than those in the graph analysis; for example, up to 60% of the links are below 12 feet in the packet simulation, compared with 20% in the graph analysis. In Figure 7.7, SP(40%), Broadcast, and DSDV all have tight hop distributions, though wider than SP in the graph analysis. SP(70%) and MT yield wider spreads in hop distribution and generally take more hops. For DSDV, about 15% of the nodes have no routes, or infinite hop counts, at the end of the simulation; these nodes have become disjoint from the network as a result of link failures or unreachable routes. Without link quality information, long, unreliable links are likely to be selected for routing, and these are likely to experience link failures, causing nodes and their children to become disjoint from the network. We observe the average actual path reliability, obtained by accumulating the link qualities of each packet that moves through the network, in Figure 7.9. The top graph includes the protocols that utilize only high quality links in route formation. These yield relatively high path reliability even at 100 feet (or 6 to 9 routing hops). The differences between MT and SP(70%) are much smaller than under graph analysis. This is because link estimation has at least ±10% error compared with the perfect information available in graph analysis, and thus SP(70%) has fewer opportunities to greedily exploit less reliable links close to 70%. As a result, the actual path reliability of SP(70%) is slightly better than in the graph analysis.
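For reference, the probabilistic reception model of Section 7.4.1 that underlies these simulation results reduces to a few lines of code. The sketch below is our own illustration of the formula p_{b,a} * prod(1 - p_{b,i}); the function name and the example probabilities are invented.

    #include <stdio.h>

    /* Probability that receiver b hears sender a while k other nodes (the
       colliders) transmit concurrently, following the simple model of
       Section 7.4.1: P = p_ba * prod_i (1 - p_bi), where p_xy is the
       collision-free probability that x receives y. */
    static double rx_prob_with_colliders(double p_ba, const double *p_bi, int k) {
        double prob = p_ba;
        for (int i = 0; i < k; i++)
            prob *= 1.0 - p_bi[i];
        return prob;
    }

    int main(void) {
        /* Sender in the effective region (90%); one collider in the
           transitional region (40%) and one far away (5%).  Values are
           invented for illustration. */
        double p_bi[] = { 0.40, 0.05 };
        printf("reception probability = %.3f\n",
               rx_prob_with_colliders(0.90, p_bi, 2));
        return 0;
    }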

Figure 7.7: Hop distribution from simulations.

Note that MTTM shows only a slight drop in path reliability relative to MT in Figure 7.9, while its hop distribution is shallower than those of all the other link estimation based cost functions. This demonstrates the effectiveness of influencing neighbor selection at the neighborhood management layer using routing cost information. Because sibling nodes are excluded from the neighbor table when it is full, neighbors that are less reliable but cover greater physical distance are retained, yielding networks with shorter hop counts without sacrificing much path reliability. In the bottom graph of Figure 7.9, although SP(40%) exploits link estimates in determining the next hop, its higher tolerance of lossy links yields poor performance similar to DSDV and Broadcast. Protocols having similar hop distributions yield similar path reliability over distance; a higher typical hop count yields higher path reliability over distance.

Figure 7.8: Cumulative distribution function of the distances of all the links in the network using MT, for graph analysis and packet-level simulation.

Figure 7.10 shows the stability over time of the routing structures in the face of stochastic variations in packet loss and the associated estimation error. Broadcast and DSDV are highly unstable. Broadcast is unstable because its parent selection mechanism is opportunistic: it depends on which parent is heard first during a route discovery flood, and that can differ from flood to flood. DSDV suffers because poor links trigger link failure detection, which causes nodes to repeatedly leave and rejoin the tree. The other protocols yield stable routing trees. MTTM is the most stable, as indicated in the graph. For all the other protocols, the size of the neighbor table is unlimited and can hold all the neighbors a node can hear. For MTTM, the number of potential parents is limited by the table size. As a result, the number of alternative parents in the neighbor table is reduced, while still

Figure 7.9: Path reliability over distance from simulations. Panel (a): MT, MT with table management, and SP(70%); panel (b): SP(40%), DSDV, and Broadcast.

Figure 7.10: Stability from simulations (parent changes per 20-second route update).

presenting some good parents. Furthermore, the dynamic neighbor management process acts as a low-pass filter and dampens the parent selection process. This result suggests that constrained resources can actually improve selection stability. Figure 7.11 shows that, given a maximum of two link-layer retransmissions, the end-to-end success rate is close to 90% for the protocols that utilize high quality links. SP(40%) suffers non-negligible packet loss. DSDV suffers from nodes joining and disjoining from the network, while Broadcast performs very poorly even with retransmissions. In all of the simulation runs, no routing cycles occur. Furthermore, MTTM shows no significant difference in overall performance; it maintains an adequate number of good choices for route formation to succeed. Packet-level simulations allow us to explore the protocol dynamics and investigate protocol design issues that go beyond the capability of graph analysis. However, our

Figure 7.11: End-to-end success rate over distance from simulations.

interference and collision model is not adequate to capture reality. Therefore, we rely on empirical studies to further validate our investigation in the next section.

7.5 Empirical Experiments

The previous high-level evaluation processes are good for exploring issues at scale and for identifying early designs that yield poor performance. However, the models of wireless communication used by these simulations are primitive, and details of the hardware and software systems are often missing. As a result, many issues that are not present in high-level simulations appear as surprises during real deployment, and the performance can be very different. Our original goal is to overcome the noisy real-world wireless characteristics in order to build reliable networks. To fully evaluate our designs, we evaluate our systems empirically.

Our graph analysis and simulation results allow us to narrow the evaluation space further and focus on SP(40%), SP(70%), and MT in realistic settings. We implement these protocols and the WMEWMA link estimator on the TinyOS platform.

7.5.1 Experiments over an Indoor 5x10 Grid Network (Mica)

Our first realistic testbed was a 50-node Mica network placed as a 5x10 grid with 8-foot spacing in the foyer of the Hearst Mining building on the UC Berkeley campus, as shown in Figure 7.12. The nodes were placed on cups 3 inches above the ground, since ground reflection can significantly reduce the range of these radios. The sink node was placed in the middle of the short edge of the 5x10 grid to avoid potential interference from the metal building supports at the corners of the grid. It was attached to a laptop computer over a serial port interface for data collection. A typical run lasted about three hours and was performed at night when pedestrian traffic was low. We found that to set the radio transmission power levels appropriately and to understand the behavior of the protocols, we had to repeat the connectivity vs. distance study of Figure 3.1 in this indoor setting. We deployed a 10-node line topology network diagonally across the foyer with 8-foot inter-node spacing. To have several hops while preserving good neighbor connectivity, we wanted to find the lowest power setting at which the effective region would cover the grid spacing. Figure 7.13 shows the reliability scatter plot for a low transmit power setting. The fall-off is more complex, presumably due to various multipath effects, even though the space is quite open. At 8 feet, most of the links are above 90%. It is apparent that a significant number of reliable, long links exist, with

Figure 7.12: Deployment in the foyer of the Hearst Mining building.

a few of them covering more than half of the network extent. We performed the data collection experiments with the above transmission power setting for SP(70%), SP(40%), and MT. The maximum number of link retransmissions was two. The link estimator setting for WMEWMA was (t = 30, α = 0.5). We used a neighbor table size of 30 in all our 50-node experiments. The traffic load was one data packet every 30 seconds and one route packet every 60 seconds per node, which offered an aggregate load of about 2.5 packets/s (50 nodes × (1/30 + 1/60) packets/s), roughly 30% of the available multihop bandwidth. This load was lower than in the simulation study because of the lower effective bandwidth of real nodes, and all the nodes used a randomized start time to avoid bursty traffic. We also explored the effect of tripling the data rate and route update rate for MT, without any rate control, to deliberately drive the network into congestion. To expedite the warm-up phase of the estimator, the route update rate was one route packet every 10 seconds for the first 10 minutes.
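For completeness, the behavior of the WMEWMA(t, α) link estimator used in these experiments can be approximated by the following sketch, assuming the window-mean-then-EWMA form described in Chapter 4 with the stable setting (t = 30, α = 0.5) and the unsigned-byte quality representation used on the motes; this is an illustration of the estimator's structure, not the TinyOS code.

    #include <stdint.h>
    #include <stdio.h>

    /* A minimal WMEWMA(t, alpha)-style estimator sketch: every t message
       opportunities, the success rate observed over the window is folded
       into the running estimate with weight alpha.  Quality is kept as an
       unsigned byte (255 = 100%), as on the motes. */
    typedef struct {
        uint8_t  quality;    /* current estimate, 0..255              */
        uint16_t received;   /* packets heard in the current window   */
        uint16_t expected;   /* packets expected (from sequence gaps) */
    } link_estimator_t;

    #define WINDOW   30      /* t = 30 message opportunities */
    #define ALPHA_N  1       /* alpha = 1/2, i.e., 0.5       */
    #define ALPHA_D  2

    /* Call once per message opportunity from this neighbor:
       heard = 1 if the packet arrived, 0 if a sequence gap shows a loss. */
    static void estimator_update(link_estimator_t *e, int heard) {
        e->expected++;
        if (heard)
            e->received++;
        if (e->expected < WINDOW)
            return;
        uint16_t window_avg = (uint16_t)((255UL * e->received) / e->expected);
        /* quality = alpha*quality + (1 - alpha)*window_avg, in integers. */
        e->quality = (uint8_t)((ALPHA_N * e->quality +
                               (ALPHA_D - ALPHA_N) * window_avg) / ALPHA_D);
        e->received = e->expected = 0;
    }

    int main(void) {
        link_estimator_t e = { 0, 0, 0 };
        /* Synthetic 70%-reliable stream: 7 heard out of every 10. */
        for (int i = 0; i < 150; i++)
            estimator_update(&e, (i % 10) < 7);
        printf("estimated quality = %u/255 (~%.0f%%)\n",
               (unsigned)e.quality, 100.0 * e.quality / 255.0);
        return 0;
    }

After 150 opportunities (five windows) the estimate climbs from 0 to about 67%, approaching the true 70%; this gradual convergence is the price paid for the stability that the routing layer needs.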


More information

The Internet of Things. Thomas Watteyne Senior Networking Design Engineer Linear Technology, Dust Networks product group

The Internet of Things. Thomas Watteyne Senior Networking Design Engineer Linear Technology, Dust Networks product group 1 The Internet of Things Thomas Watteyne Senior Networking Design Engineer Linear Technology, Dust Networks product group Important! ٧ DREAM seminar 8 April 2014, UC Berkeley Low-Power Wireless Mesh Networks

More information

CE693: Adv. Computer Networking

CE693: Adv. Computer Networking CE693: Adv. Computer Networking L-13 Sensor Networks Acknowledgments: Lecture slides are from the graduate level Computer Networks course thought by Srinivasan Seshan at CMU. When slides are obtained from

More information

15-441: Computer Networking. Wireless Networking

15-441: Computer Networking. Wireless Networking 15-441: Computer Networking Wireless Networking Outline Wireless Challenges 802.11 Overview Link Layer Ad-hoc Networks 2 Assumptions made in Internet Host are (mostly) stationary Address assignment, routing

More information

Wireless and Mobile Networks 7-2

Wireless and Mobile Networks 7-2 Wireless and Mobile Networks EECS3214 2018-03-26 7-1 Ch. 6: Wireless and Mobile Networks Background: # wireless (mobile) phone subscribers now exceeds # wired phone subscribers (5-to-1)! # wireless Internet-connected

More information

Reminder. Course project team forming deadline. Course project ideas. Friday 9/8 11:59pm You will be randomly assigned to a team after the deadline

Reminder. Course project team forming deadline. Course project ideas. Friday 9/8 11:59pm You will be randomly assigned to a team after the deadline Reminder Course project team forming deadline Friday 9/8 11:59pm You will be randomly assigned to a team after the deadline Course project ideas If you have difficulty in finding team mates, send your

More information

Intra and Inter Cluster Synchronization Scheme for Cluster Based Sensor Network

Intra and Inter Cluster Synchronization Scheme for Cluster Based Sensor Network Intra and Inter Cluster Synchronization Scheme for Cluster Based Sensor Network V. Shunmuga Sundari 1, N. Mymoon Zuviria 2 1 Student, 2 Asisstant Professor, Computer Science and Engineering, National College

More information

A Survey - Energy Efficient Routing Protocols in MANET

A Survey - Energy Efficient Routing Protocols in MANET , pp. 163-168 http://dx.doi.org/10.14257/ijfgcn.2016.9.5.16 A Survey - Energy Efficient Routing Protocols in MANET Jyoti Upadhyaya and Nitin Manjhi Department of Computer Science, RGPV University Shriram

More information

MAC LAYER. Murat Demirbas SUNY Buffalo

MAC LAYER. Murat Demirbas SUNY Buffalo MAC LAYER Murat Demirbas SUNY Buffalo MAC categories Fixed assignment TDMA (Time Division), CDMA (Code division), FDMA (Frequency division) Unsuitable for dynamic, bursty traffic in wireless networks Random

More information

Lecture 6: Vehicular Computing and Networking. Cristian Borcea Department of Computer Science NJIT

Lecture 6: Vehicular Computing and Networking. Cristian Borcea Department of Computer Science NJIT Lecture 6: Vehicular Computing and Networking Cristian Borcea Department of Computer Science NJIT GPS & navigation system On-Board Diagnostic (OBD) systems DVD player Satellite communication 2 Internet

More information

Wireless and Mobile Networks Reading: Sections 2.8 and 4.2.5

Wireless and Mobile Networks Reading: Sections 2.8 and 4.2.5 Wireless and Mobile Networks Reading: Sections 2.8 and 4.2.5 Acknowledgments: Lecture slides are from Computer networks course thought by Jennifer Rexford at Princeton University. When slides are obtained

More information

KSN Radio Stack: Sun SPOT Symposium 2009 London.

KSN Radio Stack: Sun SPOT Symposium 2009 London. Andreas Leppert pp Stephan Kessler Sven Meisinger g : Reliable Wireless Communication for Dataintensive Applications in Sensor Networks Sun SPOT Symposium 2009 London www.kit.edu Application in WSN? Targets

More information

Performance Evaluation of Various Routing Protocols in MANET

Performance Evaluation of Various Routing Protocols in MANET 208 Performance Evaluation of Various Routing Protocols in MANET Jaya Jacob 1,V.Seethalakshmi 2 1 II MECS,Sri Shakthi Institute of Science and Technology, Coimbatore, India 2 Associate Professor-ECE, Sri

More information

Reservation Packet Medium Access Control for Wireless Sensor Networks

Reservation Packet Medium Access Control for Wireless Sensor Networks Reservation Packet Medium Access Control for Wireless Sensor Networks Hengguang Li and Paul D Mitchell Abstract - This paper introduces the Reservation Packet Medium Access Control (RP-MAC) protocol for

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 2, April-May, 2013 ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 2, April-May, 2013 ISSN: Fast Data Collection with Reduced Interference and Increased Life Time in Wireless Sensor Networks Jayachandran.J 1 and Ramalakshmi.R 2 1 M.Tech Network Engineering, Kalasalingam University, Krishnan koil.

More information

Chapter 5 (Week 9) The Network Layer ANDREW S. TANENBAUM COMPUTER NETWORKS FOURTH EDITION PP BLM431 Computer Networks Dr.

Chapter 5 (Week 9) The Network Layer ANDREW S. TANENBAUM COMPUTER NETWORKS FOURTH EDITION PP BLM431 Computer Networks Dr. Chapter 5 (Week 9) The Network Layer ANDREW S. TANENBAUM COMPUTER NETWORKS FOURTH EDITION PP. 343-396 1 5.1. NETWORK LAYER DESIGN ISSUES 5.2. ROUTING ALGORITHMS 5.3. CONGESTION CONTROL ALGORITHMS 5.4.

More information

Wireless Networks. CSE 3461: Introduction to Computer Networking Reading: , Kurose and Ross

Wireless Networks. CSE 3461: Introduction to Computer Networking Reading: , Kurose and Ross Wireless Networks CSE 3461: Introduction to Computer Networking Reading: 6.1 6.3, Kurose and Ross 1 Wireless Networks Background: Number of wireless (mobile) phone subscribers now exceeds number of wired

More information

Routing protocols in WSN

Routing protocols in WSN Routing protocols in WSN 1.1 WSN Routing Scheme Data collected by sensor nodes in a WSN is typically propagated toward a base station (gateway) that links the WSN with other networks where the data can

More information

Interfacing Java-DSP with Sensor Motes

Interfacing Java-DSP with Sensor Motes Interfacing Java-DSP with Sensor Motes by H. M. Kwon, V. Berisha and A. Spanias Ira A. Fulton School of Engineering, Department of Electrical Engineering, MIDL Lab Arizona State University, Tempe, AZ 85287-5706,

More information

References. The vision of ambient intelligence. The missing component...

References. The vision of ambient intelligence. The missing component... References Introduction 1 K. Sohraby, D. Minoli, and T. Znadi. Wireless Sensor Networks: Technology, Protocols, and Applications. John Wiley & Sons, 2007. H. Karl and A. Willig. Protocols and Architectures

More information

Wireless Embedded Systems ( x) Ad hoc and Sensor Networks

Wireless Embedded Systems ( x) Ad hoc and Sensor Networks Wireless Embedded Systems (0120442x) Ad hoc and Sensor Networks Chaiporn Jaikaeo chaiporn.j@ku.ac.th Department of Computer Engineering Kasetsart University Materials taken from lecture slides by Karl

More information

Geographical Routing Algorithms In Asynchronous Wireless Sensor Network

Geographical Routing Algorithms In Asynchronous Wireless Sensor Network Geographical Routing Algorithms In Asynchronous Wireless Sensor Network Vaishali.S.K, N.G.Palan Electronics and telecommunication, Cummins College of engineering for women Karvenagar, Pune, India Abstract-

More information

Routing in Sensor Networks

Routing in Sensor Networks Routing in Sensor Networks Routing in Sensor Networks Large scale sensor networks will be deployed, and require richer inter-node communication In-network storage (DCS, GHT, DIM, DIFS) In-network processing

More information

Wireless and WiFi. Daniel Zappala. CS 460 Computer Networking Brigham Young University

Wireless and WiFi. Daniel Zappala. CS 460 Computer Networking Brigham Young University Wireless and WiFi Daniel Zappala CS 460 Computer Networking Brigham Young University Wireless Networks 2/28 mobile phone subscribers now outnumber wired phone subscribers similar trend likely with Internet

More information

Introduction and Statement of the Problem

Introduction and Statement of the Problem Chapter 1 Introduction and Statement of the Problem 1.1 Introduction Unlike conventional cellular wireless mobile networks that rely on centralized infrastructure to support mobility. An Adhoc network

More information

Unicast Routing in Mobile Ad Hoc Networks. Dr. Ashikur Rahman CSE 6811: Wireless Ad hoc Networks

Unicast Routing in Mobile Ad Hoc Networks. Dr. Ashikur Rahman CSE 6811: Wireless Ad hoc Networks Unicast Routing in Mobile Ad Hoc Networks 1 Routing problem 2 Responsibility of a routing protocol Determining an optimal way to find optimal routes Determining a feasible path to a destination based on

More information

6.9 Summary. 11/20/2013 Wireless and Mobile Networks (SSL) 6-1. Characteristics of selected wireless link standards a, g point-to-point

6.9 Summary. 11/20/2013 Wireless and Mobile Networks (SSL) 6-1. Characteristics of selected wireless link standards a, g point-to-point Chapter 6 outline 6.1 Introduction Wireless 6.2 Wireless links, characteristics CDMA 6.3 IEEE 802.11 wireless LANs ( wi-fi ) 6.4 Cellular Internet Access architecture standards (e.g., GSM) Mobility 6.5

More information

Chapter 6 Route Alteration Based Congestion Avoidance Methodologies For Wireless Sensor Networks

Chapter 6 Route Alteration Based Congestion Avoidance Methodologies For Wireless Sensor Networks Chapter 6 Route Alteration Based Congestion Avoidance Methodologies For Wireless Sensor Networks Early studies shows that congestion avoid in wireless sensor networks (WSNs) is a critical issue, it will

More information

Wireless Local Area Networks (WLANs) and Wireless Sensor Networks (WSNs) Primer. Computer Networks: Wireless LANs

Wireless Local Area Networks (WLANs) and Wireless Sensor Networks (WSNs) Primer. Computer Networks: Wireless LANs Wireless Local Area Networks (WLANs) and Wireless Sensor Networks (WSNs) Primer 1 Wireless Local Area Networks (WLANs) The proliferation of laptop computers and other mobile devices (PDAs and cell phones)

More information