A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval

1 A Combined Semi-Pipelined Query Processing Architecture For Distributed Full-Text Retrieval
Simon Jonassen and Svein Erik Bratsberg
Department of Computer and Information Science, Norwegian University of Science and Technology
The 11th International Conference on Web Information Systems Engineering, Hong Kong, China, December 2010

2 Outline
Introduction to distributed inverted indexes
Problem definition and motivation
Our approach
Experimental evaluation
Conclusions

3 Inverted index approach to IR apple.com
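
As a reminder of the data structure the rest of the talk builds on, here is a minimal sketch of a document-ordered inverted index. The three documents and the function name `build_inverted_index` are hypothetical and only serve as an illustration; they do not come from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs.

    Posting lists are sorted by doc_id, i.e. the index is document-ordered.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


# Hypothetical toy collection.
docs = {
    1: "apple pie recipe",
    2: "apple store opening hours",
    3: "sugar quota and sugar tariff",
}
index = build_inverted_index(docs)
print(index["apple"])  # posting list for "apple"
print(index["sugar"])  # posting list for "sugar"
```

A query is answered by looking up the posting list of each query term and combining the per-document contributions, rather than scanning the documents themselves.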

5 Document-wise partitioning Each node indexes a subset of documents

6 Document-wise partitioning
A query q is broadcast to all of the nodes and executed concurrently.
One of the nodes has to combine the results.

7 Document-wise partitioning
Main advantages: simple and fast!

8 Document-wise partitioning
Main problems:
All of the nodes are involved in the processing of each query.
q disk seeks on each node.
New nodes increase the overhead.
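
The broadcast-and-merge flow described on slides 5 to 8 can be sketched as follows. This is an illustration under stated assumptions, not the system used in the talk: scoring is a plain sum of term frequencies, the nodes are visited in a loop instead of concurrently, and all names and documents are hypothetical.

```python
import heapq
from collections import defaultdict


def build_index(docs):
    """Build a local index {term: {doc_id: term_frequency}} for one node."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def rank_key(item):
    """Order by score, breaking ties in favour of the smaller doc_id."""
    doc_id, score = item
    return (score, -doc_id)


def local_search(index, query_terms, k):
    """Score the documents of one node (toy scoring: sum of term frequencies)."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return heapq.nlargest(k, scores.items(), key=rank_key)


def broadcast_query(node_indexes, query_terms, k):
    """Send the query to every node, then merge the local top-k lists."""
    partial = []
    for index in node_indexes:  # sequential here; concurrent in a real system
        partial.extend(local_search(index, query_terms, k))
    return heapq.nlargest(k, partial, key=rank_key)


# Hypothetical collection, split over two nodes by document.
node_docs = [
    {1: "sugar tariff", 2: "sugar quota rate"},
    {3: "tariff rate", 4: "sugar sugar tariff"},
]
node_indexes = [build_index(d) for d in node_docs]
top = broadcast_query(node_indexes, ["sugar", "tariff"], k=2)
print(top)
```

Every node does a lookup for every query term, which is where the per-node disk seeks and the growing overhead with more nodes come from.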

9 Term-wise partitioning Each node stores a subset of a global index

10 Term-wise partitioning
Each query is divided into a number of sub-queries.
Each node fetches its posting data and sends it to another node, which receives and processes all of the posting lists.

11 Term-wise partitioning
Main advantages:
Fewer network messages.
With n >> q, several queries can be executed concurrently.
Up to q nodes are involved.
High throughput and fault tolerance.
q disk seeks in total.

12 Term-wise partitioning
Main problems:
High network load.
All processing is done by one node; the other nodes act as advanced network disks.
Load balancing is critical.
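
A minimal sketch of the non-pipelined, term-partitioned flow from slides 9 to 12. The round-robin term placement, the toy scoring and all names are assumptions made for illustration; the slides do not specify them. The counter `postings_sent` stands in for the network load caused by shipping whole posting lists to the processing node.

```python
from collections import defaultdict


def assign_terms(global_index, num_nodes):
    """Hypothetical placement: spread the sorted terms over the nodes round-robin."""
    nodes = [dict() for _ in range(num_nodes)]
    for i, term in enumerate(sorted(global_index)):
        nodes[i % num_nodes][term] = global_index[term]
    return nodes


def split_query(nodes, query_terms):
    """Divide a query into sub-queries, one per node that holds a query term."""
    sub_queries = defaultdict(list)
    for term in query_terms:
        for node_id, node in enumerate(nodes):
            if term in node:
                sub_queries[node_id].append(term)
                break
    return dict(sub_queries)


def non_pipelined(nodes, query_terms):
    """Each node ships its posting lists to one processing node, which scores them."""
    received = []
    postings_sent = 0
    for node_id, terms in split_query(nodes, query_terms).items():
        for term in terms:
            postings = nodes[node_id][term]
            received.append(postings)
            postings_sent += len(postings)
    scores = defaultdict(int)  # all scoring work lands on the processing node
    for postings in received:
        for doc_id, tf in postings:
            scores[doc_id] += tf
    return dict(scores), postings_sent


# Hypothetical global index: term -> [(doc_id, term_frequency), ...]
global_index = {
    "quota": [(2, 1)],
    "rate": [(2, 1), (3, 1)],
    "sugar": [(1, 1), (2, 1), (4, 2)],
    "tariff": [(1, 1), (3, 1), (4, 1)],
}
nodes = assign_terms(global_index, num_nodes=2)
scores, postings_sent = non_pipelined(nodes, ["sugar", "tariff"])
print(scores, postings_sent)
```

Only the nodes that hold a query term are touched, but every posting of every query term crosses the network.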

13 Pipelined query processing (Moffat et al., 2007)
A query bundle is routed from one node to the next.
Each node fetches its posting data, combines it with the previously accumulated results, and sends these to the next node.
The last node extracts the top results.
The number of accumulators is limited by a target value L (Lester et al., 2005).

14 Pipelined query processing (Moffat et al., 2007)
Main advantages:
Work is distributed between the nodes.
Reduced network load: L limits the transfer size.
Reduced overhead on the last node.

15 Pipelined query processing (Moffat et al., 2007)
Main problem: long query latency!
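
The pipelined flow can be sketched as below. Note the simplification: the accumulator limit is enforced by keeping only the `limit` highest-scoring accumulators after each hop, which is a crude stand-in for the adaptive scheme of Lester et al., not a reproduction of it. The posting data, the toy scoring and all names are hypothetical.

```python
import heapq


def rank_key(item):
    """Order by score, breaking ties in favour of the smaller doc_id."""
    doc_id, score = item
    return (score, -doc_id)


def prune(accumulators, limit):
    """Keep at most `limit` accumulators (simplified stand-in for the target L)."""
    if len(accumulators) <= limit:
        return accumulators
    return dict(heapq.nlargest(limit, accumulators.items(), key=rank_key))


def pipelined(route, node_postings, limit, k):
    """Pass the accumulator set from node to node along `route`."""
    accumulators = {}
    transferred = 0
    for hop, node_id in enumerate(route):
        # Each node fetches its posting lists only when the bundle arrives.
        for postings in node_postings[node_id]:
            for doc_id, tf in postings:
                accumulators[doc_id] = accumulators.get(doc_id, 0) + tf
        accumulators = prune(accumulators, limit)
        if hop < len(route) - 1:
            transferred += len(accumulators)  # accumulators sent to the next node
    top = heapq.nlargest(k, accumulators.items(), key=rank_key)
    return top, transferred


# Hypothetical posting lists: node 0 holds "sugar", node 1 holds "tariff".
node_postings = {
    0: [[(1, 1), (2, 1), (4, 2)]],
    1: [[(1, 1), (3, 1), (4, 1)]],
}
top, transferred = pipelined(route=[0, 1], node_postings=node_postings, limit=2, k=2)
print(top, transferred)
```

Because each hop has to wait for the previous one, including its disk access, the hops add up in the query latency.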

17 Problem definition and motivation
Term-wise partitioning: many interesting properties and good potential for improvement.
Pipelined: higher throughput, but longer latency.
Non-pipelined: shorter latency, but lower throughput.
We want to design an approach that combines the advantages of both methods: short latency AND high throughput.

18 Scope and limitations Disk-based document-ordered inverted index. Index access model and compression methods are based on the Terrier Search Engine. Query processing model is based on the approach by Lester et al.

20 Our observations of pipelined query processing
1. Sequential disk-access and data processing.
2. Accumulators have a worse compression ratio than postings.
3. For some queries, pipelined processing might be worse than non-pipelined.
4. Query route may not minimize the network load.
[Figure: example query with the terms tariff, quota, rate and sugar and their frequencies f(tariff), f(quota), f(rate), f(sugar); the values are not preserved.]

21 Our approach
Semi-Pipelined Query Processing addresses: sequential disk-access and data processing.
Combination Heuristic addresses: accumulators have a worse compression ratio than postings; for some queries, pipelined processing might be worse than non-pipelined.
Alternative Routing Strategy addresses: query route may not minimize the network load.

22 Semi-Pipelined Query Processing

23 Semi-Pipelined Query Processing
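
Slides 22 and 23 are figures, so the mechanism is not spelled out in the text. One plausible reading, taken from the first observation (sequential disk-access and data processing), is that every node on the route starts fetching its posting data as soon as the query is issued, while the accumulators still travel from node to node. The sketch below illustrates that reading only; the thread pool, the `fetch` stand-in and the toy data are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(node_id, node_postings):
    """Stand-in for reading and decompressing posting lists on one node."""
    return node_postings[node_id]


def semi_pipelined(route, node_postings):
    """Fetch on all nodes in parallel, then combine in route order."""
    with ThreadPoolExecutor(max_workers=len(route)) as pool:
        # Index access is started on every node at once (non-pipelined part).
        futures = {
            node_id: pool.submit(fetch, node_id, node_postings) for node_id in route
        }
        accumulators = {}
        # Accumulators are still combined hop by hop (pipelined part).
        for node_id in route:
            for postings in futures[node_id].result():
                for doc_id, tf in postings:
                    accumulators[doc_id] = accumulators.get(doc_id, 0) + tf
    return accumulators


# Hypothetical posting lists: node 0 holds "sugar", node 1 holds "tariff".
node_postings = {
    0: [[(1, 1), (2, 1), (4, 2)]],
    1: [[(1, 1), (3, 1), (4, 1)]],
}
result = semi_pipelined([0, 1], node_postings)
print(result)
```

Under this reading, the disk time of the later nodes overlaps with the processing on the earlier ones instead of being added to it.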

24 Combination/Decision Heuristic
For each query, we want to choose between semi-pipelined and non-pipelined processing.
Our decision depends on the upper bound estimate for the amount of data to be transferred.
We execute a query as non-pipelined when: [condition given as a formula on the slide; not preserved]
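
To make the idea of a transfer-based decision concrete, here is a hypothetical sketch. The actual condition from the slide is not available in this transcript, so the two estimates, the inequality and the role of the threshold `alpha` (a parameter α appears in the evaluation) are all assumptions chosen for illustration.

```python
def estimate_non_pipelined(posting_bytes):
    """Assumed upper bound: every node except the last ships its posting data."""
    return sum(posting_bytes[:-1])


def estimate_semi_pipelined(posting_counts, limit, accumulator_bytes):
    """Assumed upper bound: each hop forwards at most `limit` accumulators."""
    total = 0
    seen = 0
    for count in posting_counts[:-1]:
        seen += count
        total += min(limit, seen) * accumulator_bytes
    return total


def run_as_non_pipelined(posting_counts, posting_bytes, limit, accumulator_bytes, alpha):
    """Hypothetical rule: go non-pipelined when the pipelined transfer looks too large."""
    semi = estimate_semi_pipelined(posting_counts, limit, accumulator_bytes)
    non_pl = estimate_non_pipelined(posting_bytes)
    return semi > alpha * non_pl


# Hypothetical per-node figures, listed in route order.
posting_counts = [1000, 5000, 20000]
posting_bytes = [2000, 10000, 40000]  # compressed postings, assumed 2 bytes each
large_limit = run_as_non_pipelined(posting_counts, posting_bytes, 4000, 8, alpha=0.5)
small_limit = run_as_non_pipelined(posting_counts, posting_bytes, 100, 8, alpha=0.5)
print(large_limit, small_limit)
```

The sketch reflects the second observation: since accumulators compress worse than postings, shipping accumulators can cost more than shipping the posting lists themselves.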

25 Alternative Routing Strategy
Instead of routing by increasing least term frequency, we route by increasing longest posting list length.
[Figure: two routes for the example query terms quota, rate, sugar and tariff, each annotated with the total number of transferred accumulators. Legend: red = posting list length, blue = term frequency, L = accumulator set target value. Surviving values: quota 44395/…, tariff 80017/…]
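
The two orderings can be contrasted with a small sketch. The node assignment and all numbers below are hypothetical (they are not the values from the figure), and the function names are made up for illustration.

```python
def route_by_least_term_frequency(node_terms, term_frequency):
    """Baseline: order nodes by the smallest frequency among their query terms."""
    return sorted(
        node_terms,
        key=lambda node: min(term_frequency[t] for t in node_terms[node]),
    )


def route_by_longest_posting_list(node_terms, posting_list_length):
    """Alternative: order nodes by the longest posting list among their query terms."""
    return sorted(
        node_terms,
        key=lambda node: max(posting_list_length[t] for t in node_terms[node]),
    )


# Hypothetical placement of the query terms on three nodes.
node_terms = {"A": ["quota", "tariff"], "B": ["rate"], "C": ["sugar"]}
# Hypothetical statistics.
term_frequency = {"quota": 100, "tariff": 900, "rate": 300, "sugar": 500}
posting_list_length = {"quota": 120, "tariff": 950, "rate": 310, "sugar": 520}

baseline = route_by_least_term_frequency(node_terms, term_frequency)
alternative = route_by_longest_posting_list(node_terms, posting_list_length)
print(baseline, alternative)
```

In this toy case the baseline visits node A first because it holds the rarest term, even though the same node also holds the longest list; the alternative ordering postpones A to the end, so the large list is merged only at the last hop.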

27 Evaluation
A modified, distributed version of the Terrier Search Engine v2.2.1.
The 426GB TREC GOV2 corpus, 25 million documents.
Queries from the Terabyte Track 05 Efficiency Topics (an initial portion is used as a warm-up).
8 nodes, each with two 2.0GHz Intel Quad-Core processors, 9GB RAM and a 16GB SATA HDD.
Gigabit network.

28 Semi-Pipelined Query Processing
[Plot: throughput (qps) and latency (ms); series: non-pl]

29 Semi-Pipelined Query Processing
[Plot: throughput (qps) and latency (ms); series: pl nocomp, pl comp]

30 Semi-Pipelined Query Processing
[Plot: throughput (qps) and latency (ms); series: semi-pl nocomp, semi-pl comp]

31 Semi-Pipelined Query Processing
[Plot: throughput (qps) and latency (ms); series: non-pl, pl nocomp, semi-pl nocomp]

32 Semi-Pipelined Query Processing
[Plot: throughput (qps) and latency (ms); series: non-pl, pl comp, semi-pl comp]

33 Combination Heuristic
[Plot: throughput (qps) and latency (ms); series: non-pl, pl comp, comb with α = 0.1, 0.2, 0.3, 0.4, 0.5]

34 Alternative Routing Strategy
[Plot: throughput (qps) and latency (ms); series: non-pl, pl comp, semi-pl comp, altroute+semi-pl comp]

35 Combination of the techniques
[Plot: throughput (qps) and latency (ms); series: non-pl, pl comp, altroute+comb; the α value and the percentage annotations are not preserved]

37 Conclusions
We have presented an efficient alternative to the state-of-the-art methods.
Our method combines three techniques that reduce latency and increase throughput.
Our approach outperforms both the pipelined and the non-pipelined method and provides a significant improvement in the overall throughput/latency ratio.

38 Thank you!
