Presented by: Dardan Xhymshiti Spring 2016

Size: px

Start display at page:

Download "Presented by: Dardan Xhymshiti Spring 2016"

Nelson French
5 years ago
Views:

1 Presented by: Dardan Xhymshiti Spring 2016

2 Authors: Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh* *Data-Intensive Systems Group, Aarhus University, Denmakr `Data-Intensive Applications and Systems Laboratory, EPFL, Switzerland Publication: Type: ICDE 2015 Research Paper 2

3 Skyline is expensive to compute especially in large datasets. The recent multi-core skyline algorithms does not effectively reduce the dominance tests. State-of-the art skyline algorithms outperform multi-core algorithms. Most of the multi-core Skyline algorithms use the Divide-&-Conquer approach which has two drawbacks: If the number of local skyline points is large, the merging step is expensive. Increasing the cardinality of the dataset, the computation becomes expensive. 3

4 To come up with a new multi-core algorithm, which eliminate as much as it can dominance tests. 4

5 Provide an overview about skyline operator. Introduces to the innovative skyline Hybrid algorithm. Provide experiments which shows that this algorithm outperforms multi-core and sequential algorithms. 5

6 How to increase the performance of skyline algorithms: Implementation in GPUs Implementation in Multi-Core CPUs. Implementing in distributing environments like MapReduce. The authors have developed an algorithm called: Hybrid The authors have chosen the Multi-Core CPU to do the implementation of the algorithm because of: Cheaper shared data structures and Parallel work need not be isolated. 6

7 All the skyline points are maintained in a shared global data structure. This data structure gets updated regularly and is read by all threads. The skyline points in the data structure are ordered to detect dominance relationships quickly. The processing of points is done in blocks that guarantee each point is compared to at most α points than in a sequential algorithm. 7

8 Sort-based algorithms (quickly detect dominance relationships) SFS (Sort Filter Skyline) LESS SaLSa (Sort and Limit Skyline algorithm) Object-based space partitioning (quickly detect incomparability) Object-Space partitioning BSky-Tree-P 8

9 Presented by: Dardan Xhymshiti Spring 2016

10 Authors: Michael Shekelyam Gregor Josse Matthias Schubert Institute of Informatics, Ludwig-Maximilians-University Munich Publication: Type: ICDE 2015 Research Paper 10

In many application areas, data is organized as a network of graph. Important task: compute a cost-optimal path between a start node and a target one.

11 In many application areas, data is organized as a network of graph. Important task: compute a cost-optimal path between a start node and a target one. Example: Road networks (Cost criteria: travel time, travel distance, energy consumption etc.) Computer networks (Cost criteria: bandwidth and the latency between routers.) Cost vector: when considering more than one criterion at a time, the cost of complete path is called cost vector. Cost criteria 1 Cost criteria 2 Cost criteria 3 11

12 How to define if a path is an optimal path? 1. Map the cost vector to a value by employing a monotonic combination function, and then sort the paths. The top n paths are the optimal ones. Problems: a) Hard to find a suitable function, b) Different types of cost might have different levels of scale. 2. Compute the pareto optimal (mathematical definition of Skyline) cost vectors. (This is also known as conventional path skyline). Problems: a) The number of pareto optimal paths might increase exponentially as function of distance and the amount of considered cost criteria. b) Showing to the user a large amount of results is not very helpful. 12

13 There actually exist solutions for computing linear path skylines, but they are restricted to the specific case of having just two cost criteria. Problem: cannot be generalized to more criteria. The number of skyline paths increases exponentially with the distance between the locations and the number of cost criteria. Thus the result set might be too big. 13

14 Come up with a new approach of computing the results set of path skylines, which provides better and faster results. 14

15 Recall: Conventional path skyline computes all potential optimal paths, but the result is too big. Idea: reduce the result set, to only show the paths which are optimal under a weighted sum function or linear combination of cost criteria. Intuitively saying: The user weights each type of cost with a percentage describing its importance. How to compute the linear path skyline? Naive approach: compute the conventional path skyline and then compute the convex hull on the resulting cost vectors. (Inefficient). 15

16 What is Convex Hull? Definition: In mathematics, the convex hull or convex envelope of a set X of points in the Euclidean plane or Euclidean space is the smallest convex set that contains X. 16

17 The authors come up with an algorithm called LSCH (Linear Skyline Convex Hull) which constructs a linear path skyline successively while only adding new paths which are members of the result set. Implementation overview: 1. To add a new cost vector, a single search is performed which combine the cost vectors based on the normal vectors of the hyper planes currently limiting the linear path skyline. 2. To identify the areas on the linear skyline where additional results might still exits, the algorithms applies multidimensional convex hull computation. 17

18 Experiments are run in two different types of networks: 1. Munich road network with five cost criteria. 2. Artificial lattice graphs that allow to simulate different problem instances and parameter settings. 18

19 Computing route Skylines algorithms. Convex hull algorithm. 19

20 Presented by: Shahab Helmi Spring 2016

21 Authors: Publication: Type: ICDE 2015 Research Paper 21

With the rapid development and continuing use of vehicle tracking technologies, it is possible to

22 A recent study suggests that the routes provided by a leading navigation service often fail to agree with the routes chosen by local drivers, why? Limited number of travel costs, e.g., distance or travel time. With the rapid development and continuing use of vehicle tracking technologies, it is possible to learn and update individual drivers driving preferences according to their trajectories. 22

23 It proposes a novel problem on personalized route recommendation based on big trajectory data. It proposes techniques to model and update driving preferences from drivers trajectories. The proposed driving preference model can support arbitrary number of travel costs of interest and distributions of cost ratios. Comprehensive experiments were done conducted on a substantial, real trajectory data set to show efficiency and effectiveness. 23

24 1. Indexing the road network. 2. Modeling drivers preferences from their trajectories considering multiple travel costs. 3. Selecting sub-trajectories according to source, destination, departure time and driver s preferences. 4. constructing a small graph (Zohreh) with appropriate edge weights reflecting how the driver would like to use the edges based on the selected trajectories. 5. Returning the shortest route in the small graph as the personalized route to the driver. 24

25 Route Planning. Route Planning Using Trajectories: no driver modeling: Most popular route (MPR) Time period-based most popular route (TPMPR) Top-k popular routes (TKPR) Personalized Route Planning: TRIP: closest work, tested over a smaller dataset, can only model travel time 25

26 GPS records: 52,211 taxis in Beijing. during to One GPS record is collected in every 5 seconds or less. Road network: 6 th street in Beijing. 28,342 vertices and 38,577 edges. 60 km60 km square region. 26

27 Trajectories: 32,379,248 trajectories. starts when a passenger got in the taxi and ends when the passenger got off the taxi. Travel Costs: Travel distance. Travel time. Fuel consumption. 27

28 Presented by: Shahab Helmi Spring 2016

29 Authors: Publication: Type: ICDE 2015 Research Paper 29

30 an episode (serial episode) is usually defined as a totally ordered set of events that occur relatively close to each other. of an episode: how frequently it occurs in a sequence. Frequency count methods: Window-based Non-overlapped occurrences Non-interleaved occurrences Total Frequency can capture the most intense correlation between events. 30

31 Previous studies on frequent episode mining (FEM) mostly process the data in offline mode. Shortcomings: This process may take hours or days. Testing whether an episode occurs in a sequence is an NP-complete problem. Real world application challenges: Fast-growing data. Recency effect: only freshets pattern from recent events are of interest (high frequency trading a trader holds stock only for 22 seconds in average). Time-critical analysis: a delay may lead to drastic loss (predictive maintenance of data centers). 31

32 Online FEM algorithm challenges: 1. Infrequent events at the current moment could become frequent in the future they cannot be discarded. 2. Then a compact and effective data structure is required to store all episode occurrences. 3. Mining all minimal occurrences of episode also becomes a big challenge over the growing sequence. For #1&#2 a data structure, named episode trie, is proposed. It s a prefix tree similar to on we have in FP-Growth. For #3, an algorithm, named mining frequent serial episode via last occurrence (MESELO), is proposed. Claimed to be the first online FEM algorithm. The last occurrence concept is defined for the first time: when was the last time in which an episode occurred using the last occurrence information, MESELO can generate new minimal frequent episodes after receiving new data. Complexity analysis: O(M W ), it s claimed that in most real world applications W is small, so it works well. 32

33 There is no previous work of online FEM. Frequent episode mining on event sequences (main difference is the frequency count method): Alarm sequences in telecom networks Web navigation logs Timestamped fault reports in car manufacturing plants Sales transactions and stock data News Breath first approaches (Apriori-based) Depth first approaches (prefix tree) 33

34 Online frequent itemset mining (approximate and exact methods): MOMENT Can-Tree SWIM FP-Stream FDPM They does not apply on episode mining since in frequent itemset mining the time information is not important while episodes are ordered according to their occurrence time. Hence, keeping the tree up-to-date is hard. Using the last occurrence concept it becomes more efficient. 34

35 Server 1 (for algorithm): Intel Xeon E GHz processor 32 GB memory Windows server 2008 Server 2 (for MySQL database): Intel Xeon E GHz processor 16 GB memory Linux 35

36 36

37 1. Online mode: MESELO- BS is similar to MESELO but does not use the concept of last episode occurrence. MinSupprt 2. Offline mode: the performance of MESELO is compare to baseline algorithms. X axis shows window size MinSupport is 10 Window size 37

Record Placement Based on Data Skew Using Solid State Drives

BPOE-5 Record Placement Based on Data Skew Using Solid State Drives Jun Suzuki 1,2, Shivaram Venkataraman 2, Sameer Agarwal 2, Michael Franklin 2, and Ion Stoica 2 1 Green Platform Research Laboratories,