A SPECTRAL METHOD FOR NETWORK CACHE PLACEMENT BASED ON COMMUTE TIME


By PRIYANKA SINHA

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA
2013

© 2013 Priyanka Sinha

To my parents (Mr. Pradip Sinha and Mrs. Rekha Sinha), and my uncle (Mr. Shakti Chatterjee)

ACKNOWLEDGMENTS

This thesis would not have been possible without the guidance and the help of Dr. John M. Shea. I would like to thank him for his encouragement and his valuable guidance in the preparation and completion of this work. I would also like to thank my family and friends for all their invaluable support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Problem Overview
    Literature Review
    Schemes that Aim to Improve Data Access Efficiency
        Improve Data Access Efficiency with Single Data Item
        Improve Data Access Efficiency with Multiple Data Items
    Schemes that Aim to Improve Energy Consumption
    Contribution and Organization of this Thesis

2 SYSTEM MODEL AND PROBLEM FORMULATION
    System Model
        Topology
        Link Model
    Problem Formulation
        Expected Commute Time
        Using Spectral Embedding to Express Commute Time as Euclidean Distance
        Optimality Criterion

3 CLUSTERING ALGORITHM
    Selection of Algorithm
    Partitioning Around Medoids (PAM) Algorithm
    PAM for Cache Selection
    PAM Parameters and Performance
    Motivating Example

4 NETWORK SIMULATION
    Simulation Results

5 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Simulation parameters

LIST OF FIGURES

2-1 The Gilbert-Elliott model
Minimum average commute time over multiple runs
Average commute time between vertices and medoids as a function of number of iterations in the PAM clustering algorithm
Clusters formed by distance-based PAM clustering
Clusters formed by commute-time based PAM clustering
Average access latency vs $p_{gb}$ for $p_{bg} = 0.8$ and routing frequency =
Average access latency vs $p_{gb}$ for $p_{bg} = 0.6$ and routing frequency =
Average access latency vs $p_{gb}$ for $p_{bg} = 0.2$ and routing frequency =
Average access latency vs $p_{gb}$ for $p_{bg} = 0.8$ and routing frequency =
Average access latency vs $p_{gb}$ for $p_{bg} = 0.8$ and routing frequency =

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

A SPECTRAL METHOD FOR NETWORK CACHE PLACEMENT BASED ON COMMUTE TIME

By Priyanka Sinha

May 2013

Chair: John M. Shea
Major: Electrical and Computer Engineering

Information caching in networks can be used to reduce latency and increase reliability in accessing information and also to reduce network traffic. In wireless networks, changing channel conditions may impact the availability of caches, and this should be taken into account when determining where in the network caches will be placed. In this thesis, we investigate this problem and propose that the expected commute time between a node and its corresponding cache is a good measure to optimize because it takes into account both the distance between the node and the cache and the number of paths between the node and the cache. We then develop an efficient way to place the caches using spectral clustering. The performance of cache placement based on expected commute time is compared to the performance of cache placement based on Euclidean distance. The results show that, for most network topologies, commute-time-based clustering provides better access latency than distance-based clustering.

CHAPTER 1
INTRODUCTION

1.1 Problem Overview

In communications networks, information may be cached at locations throughout the network to reduce the cost of accessing that information. For instance, by caching the information close to the nodes accessing it, the information can generally be accessed at lower latency, higher speed, and with higher reliability. In the Internet, this approach has spawned the creation of content delivery companies, such as Akamai. Caching is also important in military networks, and especially in distributed wireless networks in which communications over long routes are prone to failure. In this thesis, we focus on the case of cache placement in wireless ad hoc networks, and the discussion will be based on this scenario. However, most of the ideas and techniques are applicable to other wireless and wired communications networks with minimal modification.

An important consideration is where to place caches in the network. The caches should be distributed throughout the network so that the information is easily accessible by the other nodes in the network. Here, "easily accessible" may include various criteria, such as short paths from a node to the nearest cache. Given such a path, additional criteria could be that it have high reliability, high capacity, or low congestion. Reliability can be further enhanced if the cache is accessible via multiple possible routes.

In this thesis, we propose to use techniques from spectral graph theory to optimize the placement of caches in a distributed wireless network with dynamic link states. Information is cached at a set of nodes that minimizes the expected commute time to nearby nodes that may access the caches. As we discuss further below, the expected commute time between two nodes is a measure that decreases not only when the length of any path connecting the two nodes decreases but also when the number of paths connecting the two nodes increases.
In addition, by using a spectral embedding of the node adjacency information into a high-dimensional Euclidean space, the expected commute time among nodes in a graph can be calculated using Euclidean distance. This latter approach allows for caches to be placed using simple representative-based clustering algorithms and also allows for more efficient, approximate optimization of the expected commute time by embedding the nodes of the graph in a lower-dimensional Euclidean space. Results are presented to demonstrate the effectiveness of the algorithms.

1.2 Literature Review

Since wireless networks have limited communication bandwidth, data caching may be a useful approach to improve the efficiency of data access. A number of recent and past works have tackled the problem of cache placement in wireless networks, and they can be broadly categorized based on the following criteria:

1. Optimization objective:
   (a) Improve the data access efficiency
   (b) Improve the energy consumption
   (c) Improve the rate of utilization and cache-hit ratio

2. Number of data items in network:
   (a) Cache placement in a network with a single data item
   (b) Cache placement in a network with multiple data items. The class of problems with multiple data items can be further classified on the basis of the size of the data:
       i. Uniform-size data items
       ii. Nonuniform-size data items

3. Optimization approach: Optimal cache placement for a network with a general graph topology and a single type of data item is generally formulated as one of two different graph-theory problems:
   (a) In the facility location problem, the goal is to minimize the sum of the total costs (cache setup cost + access cost) incurred due to caching at each node in a certain cache placement, without any constraint, and
   (b) In the k-median problem, the goal is to minimize the total access cost with a maximum of k cache nodes.

   Cache placement problems can be further classified in terms of the complexity of the algorithms that solve them. For instance, both the facility location and k-median problems are NP-hard, meaning that an algorithm for solving either of them can be translated into one for solving any problem in NP (nondeterministic polynomial time). Certain works, such as [1], are formulated as APX-hard problems (approximable problems that do not have a polynomial-time approximation scheme unless P = NP). To find a solution to these cache placement problems, we must overcome the hardness or the nonapproximability. Several NP-hard or APX-hard problems have been solved using constant-factor approximation algorithms (polynomial-time approximation algorithms with approximation ratio bounded by a constant) after circumventing the hardness or the nonapproximability. For example, [1] overcomes a nonapproximability problem by choosing to maximize the reduction in total access cost instead of minimizing the total access cost. Several works overcome the hardness of an NP-hard problem by considering tree networks instead of general graph topologies [2], [3], [4], [5].

4. Centralized vs. distributed: Compared to a centralized algorithm, a distributed algorithm has the advantage of being implementable in a network with dynamic traffic.

5. Memory constraint: Some of the existing works take into consideration that the nodes in the network may have limited memory and hence pose the problem with a memory constraint [6]. In other work, memory is not considered a constraint.

Since these classification criteria overlap, we use the first criterion (optimization objective) as our primary criterion in describing the existing literature on cache placement in wireless networks.
1.3 Schemes that Aim to Improve Data Access Efficiency

1.3.1 Improve Data Access Efficiency with Single Data Item

In [7], the cache placement problem is posed as a trade-off between overhead cost due to cache placement and access latency. A polynomial-time algorithm is designed to approximately solve the NP-hard problem of minimizing the weighted sum of overhead cost and access latency. The algorithm can be implemented in a distributed and asynchronous fashion.

In [8], a hybrid cache-placement scheme is developed that carries out an optimal trade-off between the dissemination and access overhead cost and the access latency. The proposed scheme uses a routing navigational graph, which captures the potential relationships among the nodes in the routing paths based on the current data access patterns, and a clustering strategy that partitions the multihop wireless network in order to pick suitable nodes for cache placement from a set of nodes related to the user's application. This approach helps the cache placement scheme adapt to changes in data access patterns while minimizing the number of cache nodes. The scheme results in a smaller overhead cost than flooding and achieves a significant improvement when the number of nodes is large.

In [1], an evolutionary approach is proposed for finding an optimal web proxy cache placement that minimizes the average response time for accessing the web content. Compared to traditional approaches such as dynamic programming and packet-level simulation, the evolutionary approach is reported to give results similar to those of packet-level simulation for simple networks while being computationally faster. The evolutionary algorithm handles large-scale networks as well as the dynamic programming approach does.

Optimizing the cache placement to trade off between the total traffic cost and average access delay in wireless multi-hop ad hoc networks is considered in [2]. Since dynamic network topologies are considered, the approach is called Dynamic Cache Placement (DCP). Unlike other data access efficiency optimization problems, DCP takes the impact of contention in the wireless network into account: hop counts, which are often used to measure the total cost of caching, result in different performance depending on the contention and traffic loads on the paths.
Three kinds of traffic flows are considered in the caching system: Access Flow, the traversal of nodes for data access; Reply Flow (RF), the traversal of nodes for replying to data requests from cache nodes; and Update Flow, the traversal of nodes for updating cache content. DCP aims to select candidate nodes for cache placement so as to reduce the access traffic flows and increase the update traffic flows as far as possible, and to select cache nodes with fewer contentions from the candidates.

In [9], an effective and low-cost cache placement scheme for mobile P2P networks is proposed, along with a scheme to update the cache placement as the network evolves. Both schemes are implementable in a decentralized manner. In [10], a heuristic cache-distribution algorithm is developed that aims at improving document download latency by improving the overall network latency. This scheme estimates the traffic at each cache of a mesh network, and based on the traffic, each cache is assigned a suitable percentage of the total storage capacity of the network. Refs. [11] and [12] design optimal polynomial-time dynamic programming algorithms for solving k-median problems in undirected and directed trees, respectively. In other works, [13] considers the placement of k transparent caches, [14] considers a cost model involving reads, writes, and storage, and [15] presents a distributed algorithm for sensor networks to reduce the total power expended.

1.3.2 Improve Data Access Efficiency with Multiple Data Items

Optimizing cache placement in ad hoc networks with multiple types of data items is the focus of [16], in which three different algorithms are proposed. In the first, each node caches the items most frequently accessed by it. The second approach eliminates replications among neighboring nodes introduced by the first approach. The third approach requires creation of stable groups to gather neighborhood information and determine caching placements. The approach in [16] is extended in [17] and [18] by generalizing the above approaches for push-based systems and updates, respectively.
Here, [17] uses a push-based approach to shorten the average response time for data access, and [18] tries to improve data accessibility for systems in which the data items are updated periodically.

Several other references also consider cache placement with multiple data types. Ref. [19] suggests transparent replica placement in tree networks to minimize total data transfer cost. To support data access in a multiple data item environment, [3] devises three simple distributed caching techniques: CacheData (caches data items that are passing by), CachePath (caches the path to the nearest cache of the passing-by data item), and HybridCache (caches the data item if its size is small enough, or the path to the data otherwise). They use the LRU (least recently used) policy for cache replacement. Ref. [20] proposes a 20.5-approximation non-distributed algorithm for a non-apx optimal cache placement with uniform-size multiple data items, as no polynomial-time solution exists for nonuniform-size data items. However, their approach (as the authors themselves note) is not amenable to an efficient distributed implementation.

Ref. [6] is a similar work that minimizes the total data access cost in ad hoc networks with multiple uniform-size data items (generalizable to nonuniform-size data items) and nodes with limited memory capacity. A centralized, tractable algorithm with a provable performance bound is developed. The algorithm also lends itself to a natural distributed implementation. Namely, the authors devise a centralized 4-approximation algorithm (2-approximation for uniform-size data items) and a localized distributed algorithm that is based on the approximation algorithm and is capable of handling mobility of nodes and dynamic traffic conditions.

In [21], a data caching algorithm is proposed for ad hoc networks with multiple data items whose nodes exchange information items in a peer-to-peer manner. Upon receiving requested information, each node determines the cache drop time of the information or which content to replace with the newly arrived information.
A near-optimal cache placement is proposed to maximize the reduction in overall access cost while meeting the limited memory constraint, which in turn leads to better bandwidth usage and energy savings. The algorithms proposed in this paper are analytically tractable with a provable performance bound in a centralized setting and are also amenable to a natural distributed implementation.

In [22], an effective and low-cost cache placement strategy, combined with an update scheme, is proposed that is suitable for decentralized implementation in a mobile peer-to-peer network. This paper also compares its placement and update scheme with various placement-only schemes, namely Global Benefit Based Cache Placement (GBCP), Local Benefit Based Cache Placement (LBCP), Cluster Based Cache Placement (CBCP), and Random Placement (RAND), and establishes that a combination of placement and update does better than these placement-only schemes in terms of the average hop count required to transmit a segment of data.

1.4 Schemes that Aim to Improve Energy Consumption

In [4], cache-placement algorithms are developed to minimize the overall access cost with an update cost constraint, thus reducing energy consumption and improving resource efficiency. Dynamic programming is used to solve the optimal cache-placement problem for tree topologies, and a polynomial-time algorithm is developed to approximately solve the NP-hard cache placement problem for general graph topologies. Distributed implementations of these algorithms are also developed.

In [5], a caching scheme that optimally trades off between energy consumption and access latency in wireless ad hoc networks is developed. The problem is a special case of the connected facility location problem, which is known to be NP-hard. A polynomial-time algorithm is developed that provides a sub-optimal solution in arbitrary network topologies. This algorithm can be implemented in a distributed and asynchronous manner. In the case of a tree topology, the algorithm gives the optimal solution.

An energy-conserving caching scheme for wireless sensor networks is developed in [23].
Finding the locations of the nodes for caching data to minimize communication cost corresponds to finding the nodes of a weighted minimum Steiner tree whose edge weights depend on the edge's Euclidean length and its data traffic rate. This tree is called a Steiner Data Caching Tree (SDCT). Expressions determining the exact location of a Steiner point for a set of three nodes based on their location are derived along with their data refresh rate requirements. Based on these (optimality) results, a dynamic, distributed, energy-conserving application-layer service for data caching and asynchronous multicast is presented. A review of the various data caching techniques in wireless sensor networks (WSNs) is presented in [24].

In [15], a distributed application-layer service for cache placement and asynchronous multicast in wireless sensor networks is proposed. It places replicas of requested data items and updates them so as to minimize the frequency of communication, which results in reduced communication overhead and hence reduced power consumption.

1.5 Contribution and Organization of this Thesis

The existing work on cache placement focuses on networks in which the links are reliable. In wireless mesh and ad hoc networks, depending on the communication frequencies and mobility rates, the links may often experience outages because of multipath fading. Thus, in this thesis we focus on the design of a cache placement strategy to improve performance in the presence of link failures.

The rest of this document is organized as follows. In Chapter 2, the system model and the proposed metric for optimizing cache placement are presented. In Chapter 3, we describe how spectral clustering algorithms can be used to approximate the optimal cache placements. In Chapter 4, we describe a network simulation that was used to compare the performance of the proposed cache placement algorithm with a reference algorithm, and performance results are presented to show the advantages of the cache placement algorithm we propose. Finally, in Chapter 5, conclusions are drawn and possible extensions to this work are discussed.

CHAPTER 2
SYSTEM MODEL AND PROBLEM FORMULATION

2.1 System Model

We consider a wireless network with static topology but time-varying communication links. This scenario can model a slowly moving ad hoc network over short time frames and is sufficient to demonstrate whether the proposed cache-placement techniques can improve performance in the presence of link quality fluctuations. For the purposes of this study, at any given time, communication over a link between two radios is either possible (the link is up) or not possible (the link is down). Links are assumed to transition between up and down according to a random process. Thus, we can characterize the network in terms of its topology and the link model.

2.1.1 Topology

Consider first the full network topology, which consists of the set of communicators (nodes) along with the set of links when all links are up. The full network topology can be represented by a simple weighted graph $G = (V, E)$, where $V$ is the set of vertices (representing the nodes in the network) and $E$ is the set of edges connecting the vertices in $V$. For convenience, let $N = |V|$ be the number of vertices, or nodes, in the network. We assume that $G$ is a connected graph, which means that there is a path from any vertex to any other vertex. If an edge exists between two vertices $v_i$ and $v_j$, then those vertices have a nonzero similarity or affinity measure, $a_{ij} > 0$, which is the weight assigned to that edge. Larger weights indicate that communication is easier between the nodes, in terms of an appropriate measure, such as throughput or reliability. The weights can be collected into a weighted adjacency matrix $A = [a_{ij}]$, $i, j = 1, 2, \ldots, N$. Here $a_{ij} = 0$ if $v_i$ and $v_j$ do not share an edge or if $i = j$. The degree of vertex $v_i \in V$ is $d_i = \sum_{j=1}^{N} a_{ij}$. Let $D$ be the diagonal matrix with $D_{ii} = d_i$. An important matrix that we will utilize later is the (unnormalized) Laplacian matrix for $G$, which is $L = D - A$.
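The definitions of this subsection can be illustrated with a short sketch. It assumes NumPy; the function name `graph_laplacian` and the 4-node example graph are hypothetical and chosen only for illustration, not taken from the thesis.

```python
import numpy as np

def graph_laplacian(A):
    """Return (D, L) for a symmetric weighted adjacency matrix A.

    A[i, j] = a_ij > 0 if nodes i and j share an edge, 0 otherwise
    (including the diagonal). D is the diagonal degree matrix and
    L = D - A is the unnormalized Laplacian.
    """
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        raise ValueError("adjacency matrix must be symmetric (undirected graph)")
    if np.any(np.diag(A) != 0):
        raise ValueError("a simple graph has no self-loops")
    d = A.sum(axis=1)          # d_i = sum_j a_ij
    D = np.diag(d)
    return D, D - A

# Hypothetical example: edges 0-1, 1-2, 1-3, 2-3, all with unit weight.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D, L = graph_laplacian(A)
print(np.diag(D))   # node degrees
print(L)            # each row of L sums to zero
```

Every row of $L$ sums to zero by construction, which is why $L$ is singular and a pseudoinverse is needed later in this chapter.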

2.1.2 Link Model

As previously mentioned, at any given time, a given communication link may be either up or down. For convenience, we divide time into slots and characterize the state of each edge of $G$ in each slot. We assume that the states of different links are independent. This may not be true in situations such as shadowing; however, it will be true if the variation in link quality is caused by fading in a rich multipath environment. For most situations that cause link quality to fluctuate, such as fading or shadowing, the link quality will not be independent from slot to slot. To model the dependence between slots, in this thesis we use the Gilbert-Elliott channel model, which is based on a two-state discrete-time Markov chain. The two states are the good state and the bad state, where the link is up when the Markov chain is in the good state and the link is down when the Markov chain is in the bad state. A state diagram for the Gilbert-Elliott channel is shown in Figure 2-1.

[Figure 2-1. The Gilbert-Elliott model]

Let $p_{gb}$ denote the conditional probability that the next state is the bad state given that the current state is the good state. Similarly, let $p_{bg}$ denote the conditional probability that the next state is the good state given that the current state is the bad state. The Gilbert-Elliott model can be completely characterized by specifying the probabilities of transitioning to the opposite state. (The two remaining state transition probabilities are given by $p_{bb} = 1 - p_{bg}$ and $p_{gg} = 1 - p_{gb}$.) The expected number of slots for which a particular link stays in a given state is known as the state sojourn time, which depends on the probability of leaving that state. The state sojourn times for the good and bad states are $T_g = 1/p_{gb}$ and $T_b = 1/p_{bg}$, respectively.

2.2 Problem Formulation

We consider the problem of how to place $K$ caches among the $N$ nodes in the network to minimize the latency for the $N$ nodes to access the cached data. Let $\mathcal{C} \subseteq V$ be the subset of nodes at which data will be cached. We consider cache placement under the assumption that each node will access the single cache that it can access with the smallest cost. In a wireless network, even if the links are reliable, the time to fulfill cache requests may be extremely difficult to characterize because of contention issues and queuing delays. Thus, we consider instead minimizing a cost function that encodes features that impact latency. For example, if the links are reliable and stable and multi-path routing is not used, the cost may be the number of edges that must be traversed or the sum of a cost function computed from the weights on the edges (such as $a_{ij}^{-1}$). However, in networks with time-varying link quality, such measures may result in poor performance because they depend on a single route from the nodes to the caches, and these routes may break because of changes in link quality. Thus, it is desirable to use a distance measure that incorporates path length, link weights, and information about multiple routes between the nodes. One such measure is expected commute time.

2.2.1 Expected Commute Time

Expected commute time is defined in terms of a random walk on the graph $G$. Every vertex in the graph is associated with a state in a discrete-time homogeneous Markov chain. Let $s(t)$ be the state of the Markov chain at time $t$. We let the transition probabilities between states be proportional to the weights of the edges emerging from the states. Thus, the single-step transition probability from state $i$ to state $j$ is given by $p_{ij} = P[s(t+1) = j \mid s(t) = i] = a_{ij}/d_i$.
Since the graph is connected and the edges are not directed, the Markov chain is irreducible.
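The random walk just defined can be written down directly as a transition matrix $P = D^{-1}A$. The following is a minimal sketch assuming NumPy; the function name `transition_matrix` and the example graph are hypothetical. It also checks the standard fact (not stated in the text above) that the stationary distribution of a random walk on a connected undirected graph is proportional to the node degrees.

```python
import numpy as np

def transition_matrix(A):
    """Row-stochastic matrix of the random walk: p_ij = a_ij / d_i."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("every node needs at least one edge")
    return A / d[:, None]      # divide row i by d_i

# Hypothetical example: edges 0-1, 1-2, 1-3, 2-3, all with unit weight.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
P = transition_matrix(A)

# Candidate stationary distribution: pi_i = d_i / sum_k d_k.
d = A.sum(axis=1)
pi = d / d.sum()

print(P)
print(np.allclose(pi @ P, pi))   # stationarity check
```

Node 1 has three unit-weight edges, so the walker leaves it toward each neighbor with probability 1/3, while node 0 has a single edge and always moves to node 1.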

Consider the time to first reach some state $j$ from state $i$, $T_{ij}$. Formally, $T_{ij} = \min\{t \ge 0 \mid s(t) = j \text{ and } s(0) = i\}$. The expected (or average) first-passage time from state $i$ to state $j$ is $m(j|i) = E[T_{ij}]$. Details of the calculation of $m(j|i)$ are given in [25]. Note that $m(j|i)$ is not necessarily equal to $m(i|j)$, since they depend, respectively, on the probabilities of leaving state $i$ and leaving state $j$, which are in general different. Thus, $m(j|i)$ is not a distance measure. However, consider the expected commute time,

$$n(i, j) = m(j|i) + m(i|j), \tag{2-1}$$

which is the expected time for a random walker to first reach state $j$ and then to first return to state $i$. Then $n(i, j)$ is a valid distance measure [25].

The expected commute time $n(i, j)$ has the useful property that it decreases when any of the paths between $i$ and $j$ are shortened or if additional paths are added between $i$ and $j$. This can be shown via an isomorphism with electrical resistive networks and application of Rayleigh's Monotonicity Law [25, 26]. These properties make the expected commute time a good candidate for a distance measure to use in selecting cache locations in a communications network. It encodes not only the distance between the nodes and the caches but also the robustness of the cache to link failures, because lower expected commute time between nodes is also associated with multiple paths connecting the nodes.

2.2.2 Using Spectral Embedding to Express Commute Time as Euclidean Distance

Expected commute time has another property that makes it a good candidate as a distance measure. It can be computed using Euclidean distance by an appropriate embedding of the vertices of the graph into a high-dimensional Euclidean space. The details of this approach are given in [25] and summarized here for clarity.

Let $L^+$ denote the Moore-Penrose inverse of $L$. Note further that $L^+$ is the discrete Green's function for $L$ (with no boundary conditions) [27]. $L^+$ can be written in terms of the Laplacian matrix $L$ as

$$L^+ = \left(L - \frac{e e^T}{n}\right)^{-1} + \frac{e e^T}{n}, \tag{2-2}$$

where $n$ is the number of vertices of $G$ and $e = [1, 1, \ldots, 1]^T$. Let $V_G$ be the volume of the graph,

$$V_G = \sum_{i=1}^{n} d_i. \tag{2-3}$$

Then the expected commute time between nodes $i$ and $j$ is

$$n(i, j) = V_G\, (e_i - e_j)^T L^+ (e_i - e_j), \tag{2-4}$$

where $e_i$ is the unit vector of length $n$ with zeros in all positions except for the $i$th position, which is one.

Instead of computing the commute time using $L^+$ and (2-4), we propose to embed the vertices of $G$ as points in a Euclidean space where the commute time can be computed using Euclidean distance. Since $L^+$ is a real symmetric matrix, it has a spectral factorization of the form $L^+ = U \Lambda_p U^T$. Here $\Lambda_p$ is a diagonal matrix with the eigenvalues of $L^+$ on the diagonal, and $U$ is a matrix whose columns are the eigenvectors of $L^+$. Then (2-4) can be rewritten as

$$n(i, j) = V_G\, (x_i - x_j)^T (x_i - x_j) = V_G\, \| x_i - x_j \|^2, \tag{2-5}$$

where $x_i = \Lambda_p^{1/2} U^T e_i$. Thus, the coordinates of all of the embedded vertices are given by the columns of the matrix

$$X = \Lambda_p^{1/2} U^T. \tag{2-6}$$

As noted in [25], it is not necessary to compute $L^+$ to compute the spectral embedding given by (2-6). Let $\{\lambda_i\}$ be the eigenvalues of $L$, and $\{\lambda_i^+\}$ be the eigenvalues of $L^+$. Then $L$ and $L^+$ have the same eigenvectors, and $\lambda_i^+ = 1/\lambda_i$ (except for the eigenvalue 0, which is shared by both matrices). Thus, the projection in (2-6) can be carried out directly from the eigenvalues and eigenvectors of $L$.

2.2.3 Optimality Criterion

We wish to choose a subset $\mathcal{C} \subseteq V$ such that the expected commute time from the nodes $V$ to the caches $\mathcal{C}$ is minimized according to some cost criterion. As previously mentioned, we assume that each node is assigned to access one cache. Thus, the network is partitioned based on which caches the nodes are assigned to. Let $C(V_i)$ be the cache to which vertex $i$ is assigned, and let $V(C_j)$ denote the set of vertices assigned to cache $C_j$. Below, we assume that specifying $\{C(V_i)\}$ for all $V_i \in V$ implicitly specifies $\mathcal{C}$. We call the optimization criterion for selection of which nodes will act as caches and for assignment of nodes to caches the minimum average commute time (MACT):

$$\mathrm{MACT} = \arg\min_{\{C(V_i)\}} \frac{1}{|V|} \sum_{C \in \mathcal{C}} \sum_{v \in V(C)} n(v, C). \tag{2-7}$$

Note that the term $1/|V|$ is a constant that can be omitted in the computations. The allocation of caches can be solved efficiently via clustering, as detailed in the next chapter.
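As a rough numerical check of this chapter's formulas, the sketch below computes commute times twice: once from the pseudoinverse as in (2-4), and once from squared Euclidean distances in the spectral embedding (2-5)-(2-6), built from the eigenpairs of $L$ rather than $L^+$. It assumes NumPy; the function names, the `dims` truncation parameter, the tolerance, and the example graph are hypothetical. The helper `mact_cost` evaluates the average in (2-7) for one given cache set under nearest-cache assignment; it does not perform the minimization.

```python
import numpy as np

def commute_times_pinv(A):
    """All-pairs n(i, j) = V_G (e_i - e_j)^T L^+ (e_i - e_j), as in (2-4)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    Lp = np.linalg.pinv(np.diag(d) - A)        # Moore-Penrose inverse of L
    diag = np.diag(Lp)
    return d.sum() * (diag[:, None] + diag[None, :] - 2.0 * Lp)

def spectral_embedding(A, dims=None, tol=1e-10):
    """Return (X, V_G), with column i of X equal to x_i in (2-6).

    Uses eigenpairs of L directly: nonzero eigenvalues of L^+ are 1/lambda.
    The zero eigenvalue is dropped; it contributes nothing to distances.
    If dims is given, only the dims smallest nonzero eigenvalues of L
    (the largest of L^+) are kept, which gives an approximation.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    lam, U = np.linalg.eigh(np.diag(d) - A)    # ascending eigenvalues
    keep = lam > tol
    lam, U = lam[keep], U[:, keep]
    if dims is not None:
        lam, U = lam[:dims], U[:, :dims]
    X = (U / np.sqrt(lam)).T                   # Lambda_p^{1/2} U^T
    return X, d.sum()

def commute_times_embedding(A, dims=None):
    """All-pairs n(i, j) = V_G ||x_i - x_j||^2, as in (2-5)."""
    X, vol = spectral_embedding(A, dims)
    G = X.T @ X
    sq = np.diag(G)
    return vol * (sq[:, None] + sq[None, :] - 2.0 * G)

def mact_cost(N_ct, caches):
    """Average commute time from each node to its nearest cache."""
    return N_ct[:, list(caches)].min(axis=1).mean()

# Hypothetical example: edges 0-1, 1-2, 1-3, 2-3, all with unit weight.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
N_pinv = commute_times_pinv(A)
N_embed = commute_times_embedding(A)
print(np.allclose(N_pinv, N_embed))            # the two routes should agree
print(mact_cost(N_pinv, [1]), mact_cost(N_pinv, [0]))
```

In this example, node 1 sits on the triangle 1-2-3 and is reached from nodes 2 and 3 by two routes each, so it should yield a lower average commute time as a single cache than the leaf node 0.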

24 CHAPTER 3 CLUSTERING ALGORITHM 3.1 Selection of Algorithm As mentioned in [28] clustering algorithms can be broadly divided into two classes: based on hierarchical methods and based on partitioning methods. Hierarchical algorithms again can be of two main types: agglomerative and divisive. In agglomerative algorithms, every object forms a separate cluster, and in consecutive steps clusters are merged, until the desired number of clusters is achieved. In contrast, divisive clustering starts by assigning all objects to a single cluster, and splitting one cluster in each subsequent step. The splitting stops after desired number of clusters have been achieved. In this work we choose to work with partitioning algorithms because of an inherent disadvantage of hierarchical methods their inability to undo a merging or splitting of two clusters, even if their regrouping results in a smaller average dissimilarity in the new cluster. This property typically results in inferior clustering performance. On the other hand, a partitioning algorithm tries to find out the best clustering by putting the most similar objects together in a cluster. There are various types of partitioning algorithms like K-means, K-medians, K-medoids, and fuzzy analysis. We chose to work with K-medoid algorithms because unlike K-means problems, K-medoids clustering problems choose a set of K objects from the given set of objects to be the representative of the clusters and associates each of the rest of the objects to one of the chosen K representatives. In addition K-medoid algorithms are known to handle large data sets more efficiently and needs no modification for translation or orthogonal transformation of data points. Partitioning Around Medoids (PAM) is one of the best known K-medoid algorithms. Although PAM has a very high computational complexity, we selected PAM for our purpose as it provides us with very high quality clustering results and needs little modification for 24

handling Euclidean criteria, and in this thesis our aim is to achieve the best possible clustering quality.

Alternatively, the CLARANS algorithm can be used, with lower complexity. CLARANS is the acronym for Clustering Large Applications based upon RANdomized Search. The general problem of clustering can be viewed as the problem of searching a graph in which every node represents a solution, i.e., a set of $k$ medoids. Two nodes are called neighbors if their sets differ by only one object. Therefore each node has $k(n-k)$ neighbors, where $n$ is the number of objects and $k$ is the number of clusters. Each node is assigned a cost defined as the total dissimilarity between every object and the medoid of its cluster. PAM is thus a search for a minimum on this graph: at each step all the neighbors of the current node are checked, and the current node is replaced with the neighbor that yields the largest decrease in cost. Whereas PAM checks all the neighbors, CLARANS draws a sample of neighbors dynamically. This is the key difference between PAM and CLARANS, and it makes CLARANS more efficient and scalable than PAM.

3.2 Partitioning Around Medoids (PAM) Algorithm

PAM was developed by Kaufman and Rousseeuw and is documented in [29, Ch. 2]. The objective of PAM is to minimize the average dissimilarity between an object and its medoid. The algorithm starts with a BUILD phase in which medoids are selected, and then executes a SWAP phase in which alternate nodes are evaluated as medoids.

1. BUILD phase: In this phase PAM selects $K$ objects randomly from the given set of $N$ objects and calls them the medoid points. Next, each of the remaining $(N - K)$ objects is assigned to one of the clusters represented by those $K$ medoids on the basis of the object's similarity to the medoid objects. If a point $P_i$ has minimum dissimilarity with a medoid point $P_m$, compared to all other medoids, then $P_i$ is assigned to the cluster belonging to $P_m$. Thus the initial clusters are formed in the BUILD stage.

2. SWAP phase: Here we first compute the overall reduction in average dissimilarity obtained by replacing each medoid $O_m$ by each of the non-medoid objects $O_{m'}$ in the cluster. The replacement that provides the maximum reduction in overall average dissimilarity is then carried out. In this process we also consider the transfer of a non-medoid object $O_i$ from one existing cluster to the cluster belonging to the second nearest medoid $O_{m2}$, depending on

changes that are inflicted by replacing a medoid with a non-medoid point. There are four such situations:

(a) $O_i$ is initially assigned to the cluster belonging to the medoid point $O_m$. If $O_m$ is replaced by $O_{m'}$, which is more dissimilar to $O_i$ than the second nearest medoid point $O_{m2}$, then the point $O_i$ moves to the cluster represented by $O_{m2}$. This replacement increases the average dissimilarity, i.e., the cost of the replacement is positive and is given by $cost_i = \mathrm{dissimilarity}(O_i, O_{m2}) - \mathrm{dissimilarity}(O_i, O_m)$.

(b) $O_i$ is part of the cluster represented by $O_m$, and $O_{m2}$ is more dissimilar to $O_i$ than the non-medoid $O_{m'}$, so $O_i$ stays in the same cluster, which is now represented by $O_{m'}$. The associated cost may be negative or positive and is given by $cost_i = \mathrm{dissimilarity}(O_i, O_{m'}) - \mathrm{dissimilarity}(O_i, O_m)$.

(c) $O_i$ is part of a cluster represented by $O_{m2}$ and not $O_m$. Now $O_m$ is replaced by $O_{m'}$, while $O_i$ is more similar to its current medoid $O_{m2}$ than to $O_{m'}$. So $O_i$ stays in the same cluster, and the associated cost is $cost_i = 0$.

(d) $O_i$ is part of a cluster represented by $O_{m2}$ and not $O_m$. This time $O_i$ is less similar to its current medoid $O_{m2}$ than to $O_{m'}$, so when $O_m$ is replaced by $O_{m'}$, $O_i$ moves from the cluster represented by $O_{m2}$ to the cluster represented by $O_{m'}$. The associated cost is negative and is given by $cost_i = \mathrm{dissimilarity}(O_i, O_{m'}) - \mathrm{dissimilarity}(O_i, O_{m2})$.

The total cost (CT) of replacing an existing medoid $O_m$ by a non-medoid $O_{m'}$ is computed by summing the costs calculated above over all the non-medoids, i.e., $CT(O_m, O_{m'}) = \sum_i cost_i$. The pair $(O_m, O_{m'})$ with the most negative total cost is selected.

3.3 PAM for Cache Selection

In this work we use PAM to select a subset of the communicators to serve as caches. We use PAM to partition the nodes into $K$ clusters, and the medoids of these clusters are assigned as the caches.
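The BUILD and SWAP phases of Section 3.2 can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it takes a precomputed dissimilarity matrix `D` (Euclidean distances or commute times), uses the random BUILD step described in the text, and evaluates each candidate swap by recomputing the total cost directly, which gives the same total as summing the four per-object cost cases. For simplicity every non-medoid is considered as a replacement for every medoid. The names `pam` and `total_cost` are hypothetical.

```python
import random


def total_cost(D, medoids):
    """Sum over all objects of the dissimilarity to the nearest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))


def pam(D, k, rng=None, max_iter=100):
    """Greedy medoid search on an n x n dissimilarity matrix D."""
    if rng is None:
        rng = random.Random(0)
    n = len(D)
    # BUILD (as described in the text): pick k medoids at random.
    medoids = rng.sample(range(n), k)
    cost = total_cost(D, medoids)
    # SWAP: apply the single most cost-reducing (medoid, non-medoid) swap
    # until no swap has a negative total cost.
    for _ in range(max_iter):
        best_delta, best_swap = 0.0, None
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                delta = total_cost(D, trial) - cost  # CT(O_m, O_m')
                if delta < best_delta:
                    best_delta, best_swap = delta, (pos, cand)
        if best_swap is None:
            break
        medoids[best_swap[0]] = best_swap[1]
        cost = total_cost(D, medoids)
    # Assign every object to its nearest medoid.
    labels = [min(medoids, key=lambda m: D[i][m]) for i in range(n)]
    return sorted(medoids), labels, cost


# Usage: six points on a line forming two well-separated groups.
xs = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in xs] for a in xs]
medoids, labels, cost = pam(D, 2)
```

Recomputing the total cost for every candidate swap is simpler but slower than the incremental four-case update; the sketch favors clarity over speed.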
PAM is applied to find the cache assignments under two different approaches. In the first, the dissimilarity between two vertices is measured by the Euclidean distance between the vertices of the graph in $\mathbb{R}^2$. In the second, the dissimilarity between two vertices is given by the expected commute time between those vertices. The first approach is the most traditional form of PAM. The second approach can also be implemented directly with the PAM algorithm by using (2-5), which shows that

expected commute time can be computed as a Euclidean distance by using a spectral embedding of the vertices of the graph into a high-dimensional space. We note that a direct spectral embedding requires that the graph be connected, and we only consider this scenario in this work.

3.4 PAM Parameters and Performance

Since PAM depends on a randomized search, results may vary each time the algorithm is run. Therefore, in order to find the best result, the clustering algorithm is run several times for each topology, and the clustering corresponding to the minimum average cost is chosen. Figure 3-1 shows how the average cost obtained by commute-time based PAM in a 100-node topology with 5 clusters varies across multiple runs.

Figure 3-1. Minimum average commute time over multiple runs.
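The restart strategy described above can be sketched as a small wrapper around any randomized clustering routine. The wrapper name `best_of_runs`, the stand-in routine `toy_run`, and the convention that the routine returns a `(medoids, cost)` pair are assumptions made for illustration only.

```python
import random


def best_of_runs(cluster_once, n_runs, seed=0):
    """Run a randomized clustering routine n_runs times; keep the cheapest result.

    cluster_once(rng) is assumed to return a (medoids, cost) pair.
    """
    best = None
    history = []  # cost of every run, e.g. for a plot like Figure 3-1
    for run in range(n_runs):
        rng = random.Random(seed + run)  # independent, reproducible stream
        medoids, cost = cluster_once(rng)
        history.append(cost)
        if best is None or cost < best[1]:
            best = (medoids, cost)
    return best, history


# Usage with a stand-in routine that picks one medoid at random.
def toy_run(rng):
    m = rng.randrange(5)
    return [m], float(abs(m - 2))


best, history = best_of_runs(toy_run, n_runs=10)
```

Seeding each run separately keeps individual runs reproducible, which helps when a particular clustering needs to be inspected later.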

Figure 3-2. Average commute time between vertices and medoids as a function of number of iterations in the PAM clustering algorithm.

The results in Figure 3-2 show the average commute time between the non-medoid nodes and the medoids (where the caches will be placed) as a function of the number of iterations in the PAM algorithm. As expected, the plot is monotonically decreasing; however, the performance saturates after 6 iterations.

3.5 Motivating Example

We use an example of a small, simple network to demonstrate the difference between the results obtained by the distance-based and commute-time based clustering algorithms. A total of 18 nodes are partitioned into 2 clusters. Figure 3-3 shows the cluster assignment and cache assignment for the distance-based clustering algorithm, and Figure 3-4 shows the cluster assignment and cache assignment for the commute-time based clustering algorithm. Solid lines between vertices indicate that the vertices share a communication link. Red and blue node colors differentiate the two clusters, and the circled nodes are the medoids of the clusters, where the caches will be placed.

Consider the results when PAM is applied to this topology with the distance-based metric, which are shown in Figure 3-3. The results match intuition. The network is

partitioned down the middle into two equal-sized clusters, with the node near the middle of each cluster (nodes 0 and 9) assigned as the medoid.

Figure 3-3. Clusters formed by distance-based PAM clustering.

Figure 3-4. Clusters formed by commute-time based PAM clustering.

When the same topology is clustered using the commute-time based PAM clustering algorithm, we get different results. Although the two clusters are the same, the medoids of the clusters have changed to nodes 5 and 10, as shown in Figure 3-4. The medoids chosen by the commute-time based PAM can be reached by every node except for nodes 0 and 9 by

two paths, thus resulting in a lower commute time for those nodes. If one of the links on the ring fails, then with the commute-time based medoid assignment, the nodes will be able to reroute the caching traffic around the failed link, whereas the distance-based medoid assignment has a critical dependence for all nodes on particular links, such as the link between vertices 0 and 5. We note that expected commute time also has the advantage of providing a better medoid location based on network links even in the absence of link failures.

To make a rough estimate of the network performance under the two types of clustering, we compute the average hop count between each communicator (vertex) and its corresponding cache (medoid). Since in both cases the two clusters formed are symmetric, the average hop count is equal to the average hop count of either cluster. Let $hc_{dist}$ and $hc_{com}$ denote the average hop counts under distance-based clustering and commute-time based clustering, respectively, and let $hc(u, v)$ denote the hop count between vertices $u$ and $v$. Then

$$hc_{dist} = \tfrac{1}{9}\big(hc(9,10) + hc(9,11) + hc(9,17) + hc(9,12) + hc(9,16) + hc(9,13) + hc(9,15) + hc(9,14)\big) = 2.67,$$

$$hc_{com} = \tfrac{1}{9}\big(hc(0,1) + hc(0,2) + hc(0,8) + hc(0,3) + hc(0,7) + hc(0,4) + hc(0,6) + hc(0,5)\big) = 1.88.$$

So we see a performance improvement of approximately 30 percent from the use of commute-time based clustering. In the following chapter, we use a network simulation to see if this improvement and the potential robustness to link failures translate into improvements in cache access latencies.
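The hop-count estimate above can be reproduced for any topology with a breadth-first search from the medoid. The sketch below is illustrative only: the six-node ring is a hypothetical stand-in, since the 18-node topology of Figures 3-3 and 3-4 is not reproduced here, and the function names are assumptions. As in the expressions above, the medoid itself contributes zero hops but is counted in the cluster size.

```python
from collections import deque


def hop_counts(adj, source):
    """Minimum hop count from source to every reachable vertex (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_hop_count(adj, cluster, medoid):
    """Average hop count from the vertices of a cluster to its medoid."""
    dist = hop_counts(adj, medoid)
    members = list(cluster)
    return sum(dist[v] for v in members) / len(members)


# Hypothetical example: a ring of six vertices with the cache at vertex 0.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
avg = average_hop_count(ring, range(6), 0)

# Relative improvement implied by the two averages quoted in the text.
improvement = (2.67 - 1.88) / 2.67
```

With the quoted averages, `improvement` evaluates to roughly 0.296, which is the "approximately 30 percent" figure stated above.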

CHAPTER 4
NETWORK SIMULATION

4.1 Simulation

In this chapter, we report on results of using network simulation to evaluate the performance of the proposed clustering algorithms in random connected networks with time-varying links. We evaluate the performance of the distance-based and commute-time based cache placement algorithms by computing the total time required to complete a series of cache requests, from which we compute the average cache access latency. Each topology is simulated over many slots, and the access latencies are averaged over many randomly generated connected topologies. The simulation model is a slotted system, and the following activities take place in each slot:

1. Each non-medoid node in the network generates a cache request with cache request probability $p_{cache}$. A node that has already generated a cache request but has not yet completed the data access is not allowed to generate another cache request.

2. Nodes that have generated a cache request push the request packet to their respective send queues. Each node in the network is assigned an infinite queue, in which the packets to be transmitted are stored on a FIFO basis.

3. To emulate the fact that routing tables are typically updated much less often than packets are transmitted, the routing tables are updated by calculating the minimum-hop path between each pair of nodes during every $r$th transmission interval. We call $1/r$ the routing update frequency.

4. Each node with a non-empty send queue tries to send the first packet in its send queue to the next hop for that packet with transmission probability $p_T$ (listed as $p_{trans}$ in Table 4-1).

5. If a node transmits in an interval, then it uses its routing table to find a path between itself and the destination node.

6. If the link between the current node and the next node in the path is up, the packet is sent to the next node; otherwise the packet stays in the send queue of the current node.

7. After a data packet reaches the intended node, the data access is assumed to be completed, and the current time stamp is stored as the receive time for that node.

8. The difference between the transmit time and the receive time gives the data access time for the node.

9. At the end of each slot, the link state is updated according to the state transition probabilities of the channel.

Table 4-1. Simulation parameters

Parameter             Value
$p_{cache}$           0.05
$p_{trans}$           0.6
$p_{bg}$              0.8, 0.6, 0.2
$p_{gb}$              0.05 to 0.5
Number of nodes       100
Number of clusters    5
Routing frequency     0.01, 0.02, 0.05

Simulations were run for different values of the state transition probabilities. For each set of values, 50 randomly generated topologies were simulated. For each topology, the simulation was run for 10,000 slots. The data access times for a particular topology were averaged to produce the average latency for that topology, and the overall average latency was determined by averaging these values over the 50 different topologies. The parameters of the simulation are collected in Table 4-1.

Different topologies were generated by randomly varying the connectivity distance, the field size, and the coordinates of the vertices (nodes) in the network on the fly. Since we need a connected graph, after generating each random topology, a check is performed to make sure that the topology produced is a connected graph. If it is not, we try to convert it into a connected graph by varying the connectivity distance. The connectivity distance is the maximum distance between two nodes that share an edge. The parameter fieldsize determines the maximum range of the x- and y-coordinates in the topology.
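Two pieces of this setup can be sketched in a few lines: generating a random connected topology, and the per-slot link-state update of step 9. This is a sketch under stated assumptions rather than the simulator used for the results: the text says only that the connectivity distance is varied when the graph is not connected, so growing it by a fixed factor (`growth`) is an assumption, as are all function and parameter names. Here $p_{gb}$ is the good-to-bad transition probability and $p_{bg}$ the bad-to-good transition probability of the two-state link model.

```python
import math
import random
from collections import deque


def build_adjacency(points, radius):
    """Link every pair of nodes that lie within the connectivity distance."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def is_connected(adj):
    """Breadth-first search from node 0 must reach every node."""
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


def random_connected_topology(n, field_size, radius, rng, growth=1.1):
    """Place n nodes uniformly in a square field; enlarge the (positive)
    connectivity distance until the graph is connected (assumed strategy)."""
    points = [(rng.uniform(0, field_size), rng.uniform(0, field_size))
              for _ in range(n)]
    adj = build_adjacency(points, radius)
    while not is_connected(adj):
        radius *= growth
        adj = build_adjacency(points, radius)
    return points, adj, radius


def next_link_state(is_good, p_gb, p_bg, rng):
    """One slot of the two-state link model; returns True if the link is up."""
    if is_good:
        return rng.random() >= p_gb  # good -> bad with probability p_gb
    return rng.random() < p_bg       # bad -> good with probability p_bg


# Usage: a 100-node topology and one link-state update.
rng = random.Random(1)
points, adj, radius = random_connected_topology(100, 100.0, 10.0, rng)
link_up = next_link_state(True, p_gb=0.3, p_bg=0.8, rng=rng)
```

Under this link model the expected time a link spends in the bad state before recovering is $1/p_{bg}$ slots, so smaller values of $p_{bg}$ correspond to longer outages.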

4.2 Results

As $p_{bg}$ decreases, channels remain down for a longer period of time. We simulate the system and compute the average access latency for a fixed value of $p_{bg}$, plotted as a function of $p_{gb}$. As the following figures show, for a fixed $p_{bg}$, the access latency increases as $p_{gb}$ goes up, and commute-time based clustering gives a lower access latency than distance-based clustering. We also see that the access latency increases for lower values of $p_{bg}$, although commute-time based clustering still provides better performance. We also vary the routing frequency; as the results suggest, the overall access latency decreases as the routing frequency increases, and commute-time based clustering maintains its superior performance.

Figure 4-1. Average access latency vs $p_{gb}$ for $p_{bg}$ = 0.8 and routing frequency =

The results in Figure 4-1 show the average access latency as a function of the probability of transitioning from the good state to the bad state, $p_{gb}$, for the two different

clustering algorithms with $p_{bg} = 0.8$ and a fixed routing frequency. The results show that the commute-time based cache placement algorithm provides significantly better performance than cache placement based on Euclidean distance. For example, at $p_{gb} = 0.3$, the access latency for commute-time based cache placement is 20, whereas the access latency for distance-based cache placement is 130. For the values considered in this graph, commute-time cache placement produces a reduction in average access latency of at least 85%.

Figure 4-2. Average access latency vs $p_{gb}$ for $p_{bg}$ = 0.6 and routing frequency =

The results in Figure 4-2 show the average access latency as a function of the probability of transitioning from the good state to the bad state, $p_{gb}$, for the two different clustering algorithms with $p_{bg} = 0.6$ and a fixed routing frequency. The results show that the commute-time based cache placement algorithm provides significantly better performance than cache placement based on Euclidean distance. For example, at $p_{gb} = 0.3$, the access latency for commute-time based cache placement is 100,

whereas the access latency for distance-based cache placement is 550. For the values considered in this graph, commute-time cache placement produces a reduction in average access latency of at least 80%. Comparing this result with that of Figure 4-1, we see that the increase in the sojourn time in the bad state, i.e., the decrease in $p_{bg}$, degrades the performance of both algorithms, although the commute-time based algorithm still performs better than the distance-based algorithm.

Figure 4-3. Average access latency vs $p_{gb}$ for $p_{bg}$ = 0.2 and routing frequency =

The results in Figure 4-3 show the average access latency as a function of the probability of transitioning from the good state to the bad state, $p_{gb}$, for the two different clustering algorithms with $p_{bg} = 0.2$ and a fixed routing frequency. The results show that the commute-time based cache placement algorithm provides significantly better performance than cache placement based on Euclidean distance. For example, at $p_{gb} = 0.3$, the access latency for commute-time based cache placement is 200,

whereas the access latency for distance-based cache placement is 800. For the values considered in this graph, commute-time cache placement produces a reduction in average access latency of at least 75%. Comparing this result with those of Figure 4-1 and Figure 4-2, we see that the increase in the sojourn time in the bad state, i.e., the decrease in $p_{bg}$, degrades the performance of both algorithms, although the commute-time based algorithm still performs better than the distance-based algorithm.

Figure 4-4. Average access latency vs $p_{gb}$ for $p_{bg}$ = 0.8 and routing frequency =

The results in Figure 4-4 show the average access latency as a function of the probability of transitioning from the good state to the bad state, $p_{gb}$, for the two different clustering algorithms with $p_{bg} = 0.8$ and a fixed routing frequency. The results show that the commute-time based cache placement algorithm provides significantly better performance than cache placement based on Euclidean distance. For example, at $p_{gb} = 0.3$, the access latency for commute-time based cache placement is 100,
