Traditional Scaling. Sharding. Replication.

Edgar Sparks
5 years ago

1 Search in the Cloud

2 Text Retrieval Task
Text is viewed as a sequence of terms in fields
Document and position are indexed for each term
Query is a sequence of terms (typically many more than the user actually types)

3 Text Retrieval
Scores are computed by merging occurrences of the query terms
Only the top-scoring documents are kept
Deletions and document edits are done by adding new documents and keeping a deletion list
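The scoring steps on this slide can be illustrated with a small sketch. It is not the presenter's scoring algorithm: the one-point-per-occurrence score, the function name `top_k`, and the in-memory `index` layout are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def top_k(index, query_terms, deleted, k=2):
    """Merge postings for the query terms and keep only the k best documents.

    index       maps term -> list of (doc_id, position) postings (assumed layout)
    deleted     is the deletion list: doc ids that must never be returned
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, _position in index.get(term, []):
            if doc_id in deleted:
                continue  # deleted or superseded documents are skipped at query time
            scores[doc_id] += 1.0  # toy score: one point per term occurrence
    # Keep only the top-scoring documents; ties go to the smaller doc id.
    return heapq.nlargest(k, scores.items(), key=lambda item: (item[1], -item[0]))


# Hypothetical data: document 2 was edited, so its old version is on the deletion list.
index = {
    "cloud": [(1, 0), (2, 3), (3, 1)],
    "search": [(1, 1), (3, 0), (3, 5)],
}
results = top_k(index, ["cloud", "search"], deleted={2})
```

An edit is modeled as "add the new version under a fresh id, put the old id on the deletion list", so the index itself is never modified in place.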

4 Traditional Scaling
Sharding
Replication
[Figure: documents 1..n partitioned into ranges across shards; shards replicated]

5 Traditional Scaling
[Figure: the same sharding and replication diagram, with added labels "5"]

6 Traditional Scaling
[Figure: the same sharding and replication diagram, with added labels "5" and "?"]

7-9 Consistent Hashing
[Figure: consistent hashing diagram, built up over three slides]
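For readers who have not seen consistent hashing, a minimal sketch follows. The slides show only a diagram, so everything here is an assumption: the class name `HashRing`, the use of MD5, and the 64 virtual nodes per server are illustrative choices, not taken from the talk.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring (MD5 chosen only for determinism)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed at several pseudo-random points on the ring.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _node in self._ring]

    def node_for(self, key: str) -> str:
        # The owner is the first ring point at or after the key's hash, wrapping around.
        i = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"doc-{i}" for i in range(1000)]
before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

When node `d` joins, the only keys that change owner are the ones that now belong to `d`; keys never move between the existing nodes.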

10 Problems
Presumes objects can be moved individually
Has very high insertion/deletion rate
Has disordered access patterns
Often exhibits content/placement correlations

11 Micro Sharding
[Figure: map reduce and hdfs boxes; Retrieval Indexers up to #n; Content Indexers up to #m]
for (t in types) yield [key:(t, h(key)%cnt), value:doc]
n, m >> number of search nodes
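The pseudocode on this slide can be read as a map step that emits one record per index type, keyed by type and micro-shard number. The sketch below is one possible reading, not the presenter's job: the function names, the CRC32 hash, and the per-type shard counts (`counts`, standing in for `cnt`) are assumptions.

```python
import zlib
from collections import defaultdict


def micro_shard_records(doc_key, doc, counts):
    """Yield ((index_type, shard_number), doc) once per index type.

    counts maps index type -> number of micro-shards for that type (assumed;
    the slide uses a single name, cnt).
    """
    for index_type, cnt in counts.items():
        shard = zlib.crc32(doc_key.encode("utf-8")) % cnt  # plays the role of h(key) % cnt
        yield (index_type, shard), doc


def group_into_shards(docs, counts):
    """Simulate the shuffle: collect documents by (index_type, shard_number)."""
    shards = defaultdict(list)
    for doc_key, doc in docs.items():
        for key, value in micro_shard_records(doc_key, doc, counts):
            shards[key].append(value)
    return shards


# Hypothetical input: far more micro-shards than search nodes would be typical.
counts = {"retrieval": 8, "content": 4}
docs = {f"doc-{i}": f"text {i}" for i in range(100)}
shards = group_into_shards(docs, counts)
```

Each group would then be handed to one indexer, which builds a small, immutable index for that micro-shard.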

12 Search Architecture
[Figure: presentation layer; federators; Retrieval Engines up to #n; Content Indexers up to #m]

13 Control Architecture
[Figure: federator; Retrieval Engine; katta master; zookeeper; indexer; HDFS]

14 Quick Results
No deletion/insertion in indexes at runtime
Reloading micro-shards allows large sequential transfers
Random placement guided by balancing policy gives near optimal motion
Node addition and failure are simple, reliable
Random sharding also near optimal
  local = global statistics
  x query time improvement
  load balancing
  uniform management

15 Building Blocks
EC2 - elastic compute
Zookeeper - reliable coordination
Katta - shard and query management
Hadoop - map-reduce, RPC for Katta
Lucene - candidate set retrieval, index file storage
DeepDyve search algorithms - segment scoring

17 Zookeeper
Replicated key-value in-memory store
Minimal semantics
  create, read, replace specified version
  sequential and ephemeral files
  notifications
Very strict correctness guarantees
  strict ordering
  quorum writes
  no blocking operations
High speed
  50,000 updates per second
  00,000 reads per second
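The "replace specified version" semantic is a compare-and-set on a version number. The toy model below illustrates only that idea, plus sequential names. It is a single-process stand-in and does not use or reproduce the real Zookeeper client API; `TinyStore`, `BadVersion`, and the global sequence counter are hypothetical, and replication, ephemeral files, and notifications are left out.

```python
class BadVersion(Exception):
    """Raised when a replace names a version that is no longer current."""


class TinyStore:
    """In-memory stand-in for a versioned key-value store (not replicated)."""

    def __init__(self):
        self._data = {}  # path -> (value, version)
        self._seq = 0    # toy global counter used for sequential names

    def create(self, path, value, sequential=False):
        if sequential:
            path = f"{path}{self._seq:010d}"  # append a monotonically increasing suffix
            self._seq += 1
        if path in self._data:
            raise KeyError(f"{path} already exists")
        self._data[path] = (value, 0)
        return path

    def read(self, path):
        return self._data[path]  # (value, version)

    def replace(self, path, value, expected_version):
        _old, version = self._data[path]
        if version != expected_version:
            raise BadVersion(f"{path}: expected {expected_version}, found {version}")
        self._data[path] = (value, version + 1)
        return version + 1


store = TinyStore()
store.create("/master", "node-1")
_value, version = store.read("/master")
new_version = store.replace("/master", "node-2", expected_version=version)
```

Two clients that both read version 0 cannot both succeed in replacing the value: the second replace fails with `BadVersion` and has to re-read first, which gives coordination without any blocking operation.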

19 Katta Interface
Simple Interface
  Client - horizontal broadcast for query, vertical broadcast for update
  InodeManaged - add/remove shard
Pluggable Application Interface
Pluggable Return Policy
  Given current return state
    return < 0 => done
    return 0 => return result, allow updates
    return n => wait at most n milliseconds
Comprehensive Results
  Results, exceptions, arrival times
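The return-policy contract above (negative means done, zero means return now but allow updates, positive n means wait at most n milliseconds) can be sketched as a plain function. This is not Katta's actual interface: `ReturnState`, `deadline_policy`, and the 200 ms deadline and 90% coverage threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReturnState:
    """Hypothetical snapshot of a broadcast call in progress."""
    expected: int    # number of shards that should answer
    arrived: int     # number of shard results received so far
    elapsed_ms: int  # time since the query was sent


def deadline_policy(state, deadline_ms=200, good_enough=0.9):
    """Return < 0 when done, 0 to return now but allow updates, or n ms to wait."""
    if state.arrived >= state.expected:
        return -1  # every shard answered: done
    if state.elapsed_ms >= deadline_ms:
        return -1  # out of time: close the result with what has arrived
    if state.arrived >= good_enough * state.expected:
        return 0   # hand back a partial result, late answers may still be merged
    return deadline_ms - state.elapsed_ms  # keep waiting, but no longer than the deadline
```

A caller would invoke the policy each time a result arrives or a wait expires, which lets the application trade completeness against latency without changing the broadcast code.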

20-22 Horizontal/Vertical Broadcast
[Figure: sharding and replication diagram illustrating horizontal and vertical broadcast, built up over three slides]
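One way to read the two broadcast directions, given that horizontal is used for queries and vertical for updates, is sketched below. The interpretation is an assumption: horizontal is taken to mean one replica of every shard, vertical to mean every replica of one shard. The function names and the `placement` mapping are hypothetical.

```python
import random


def horizontal_targets(placement, rng=random):
    """Pick one replica per shard, so a query covers the whole index exactly once."""
    return {shard: rng.choice(sorted(nodes)) for shard, nodes in placement.items()}


def vertical_targets(placement, shard):
    """Return every node holding the given shard, so an update reaches all replicas."""
    return sorted(placement[shard])


# Hypothetical placement: shard name -> set of nodes that hold a replica.
placement = {
    "shard-1": {"node-a", "node-b"},
    "shard-2": {"node-b", "node-c"},
    "shard-3": {"node-a", "node-c"},
}
query_plan = horizontal_targets(placement, rng=random.Random(0))
update_plan = vertical_targets(placement, "shard-2")
```

Choosing the replica at random per query spreads load across nodes; a real system would also need to skip nodes that are down.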

23 Operations
[Figure: federator; Retrieval Engine; katta master; zookeeper; indexer; HDFS]

24 Impact of Cloud Approach
Scale-free programming
Deployed in EC2 (test) or in private farm (production)
No single point of failure
Real-time scale up/down
Extensible to real-time index updates

25 Resources
My blog
The web-site
Source code
  Katta (sourceforge)
  Hadoop (Apache)
  Lucene (Apache)
