Euro-Par Pisa - Italy

1 Euro-Par 2004 - Pisa - Italy
Accelerating Apache farms through ad-HOC distributed scalable object repository
Marco Aldinucci, ISTI-CNR, Pisa, Italy
Massimo Torquati, CS dept., Uni. Pisa, Italy

2 Outline
HOC (Herd of Object Caches): motivation, features
Apache + HOC: parallel web server architecture
Experiments (a lot of): HOC, Apache + HOC
Ongoing & future work

3 HOC (Herd of Object Caches)
A very basic storage facility.
No hardwired policies for deployment, allocation, data coherence, ...
Pluggable into different, third-party applications/frameworks, providing data management as an external service for applications.
Implemented as a high-throughput distributed server.
Decouples computational and storage management in (distributed) application design, enforcing a structured development and exploiting persistency, scalability, re-configurability.

4 Permanent, shared storage facility
[Figure: apps, each with a protocol module, connected to a distributed storage facility made of HOC processes P1 to P6]
HOC is a facility (distributed server) providing permanent, shared storage to apps (clients).
Clients may dynamically join/leave the storage facility.
The set of HOCs may be enlarged/reduced at run time on need; the storage room changes accordingly.
Interaction with the HOCs may be delegated to an application-specific protocol.

5 Why using HOC
HOC is efficient (because essential): it provides few primitives and no policies for data integrity (e.g. coherence, consistency, ...). These are application specific and may be deployed upon HOC (at the protocol level).
HOC is a basic building block for a broad class of applications. It may be considered a storage component for massive storage, out-of-core applications, high-throughput data servers, shared memory support.
Extendible with application-specific primitives.
Enhances both memory size and throughput by means of parallelism.

6 Using HOC
[Figure: four apps, each linked to a protocol module that talks to the HOCs through the HOC API]
The protocol enforces application requirements on data integrity, acting as mediator between the application and HOC.
It is linked to the application and uses the HOC API (e.g. an Apache module).
The protocol may itself be a distributed application (e.g. reaching consensus, cache invalidation, ...).
The HOC API may also be easily extended (provided some knowledge of HOC internals).

7 HOC internals
[Figure: HOC components: objects storage, switching engine, home nodes table, allocator, local cache, client and server connection services, client interface, I/O layer with Poller interface (select, poll, RTsig), cache/storage policies (RND, LFU), on top of the O.S. kernel (read(), write(), select(), poll(), ...)]
C++, single-threaded.
Manages concurrent connections using services based on non-blocking I/O (each service is a state machine managing a single connection).
Supports both level-triggered (select, poll, ...) and edge-triggered (RTsignal, kqueue, ...) I/O events.
Object storage may be managed either as a memory or as a cache; remote objects may be cached in a separate write-through cache. Policies are configurable.
Tested on Linux, Mac OS X, and heterogeneous clusters of them.
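The "one state machine per connection, driven by non-blocking I/O events" design described above can be sketched as follows. This is only an illustration in Python, not the C++ code of the server: the class name EchoService, the two states, and the upper-casing "service" are hypothetical, and Python's selectors module stands in for the Poller interface (it reports level-triggered events).

```python
import selectors
import socket


class EchoService:
    """Hypothetical per-connection state machine: READING -> WRITING -> READING."""

    def __init__(self, sock):
        self.sock = sock
        self.state = "READING"
        self.outbuf = b""

    def on_event(self, sel):
        if self.state == "READING":
            data = self.sock.recv(4096)
            if not data:  # peer closed the connection
                sel.unregister(self.sock)
                self.sock.close()
                return
            self.outbuf = data.upper()  # stand-in for real request handling
            self.state = "WRITING"
            sel.modify(self.sock, selectors.EVENT_WRITE, self)
        else:  # WRITING: push as much of the reply as the socket accepts
            sent = self.sock.send(self.outbuf)
            self.outbuf = self.outbuf[sent:]
            if not self.outbuf:
                self.state = "READING"
                sel.modify(self.sock, selectors.EVENT_READ, self)


def run_once(sel, timeout=1.0):
    """One iteration of the single-threaded loop: dispatch every ready service."""
    for key, _mask in sel.select(timeout):
        key.data.on_event(sel)


def demo():
    """Drive one connection through a full read/write cycle over a socket pair."""
    sel = selectors.DefaultSelector()
    server_side, client_side = socket.socketpair()
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, EchoService(server_side))
    client_side.sendall(b"get key42")
    run_once(sel)  # READING -> WRITING
    run_once(sel)  # WRITING -> READING
    reply = client_side.recv(4096)
    sel.unregister(server_side)
    server_side.close()
    client_side.close()
    sel.close()
    return reply
```

A single thread can serve many connections this way because no call blocks: each service only does the work its socket is ready for and then yields back to the loop.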

8 HOC API
"Why does the web work so well? A language with few verbs (get, put, post)", Gannon said (Euro-Par 2004, invited talk).
We also believe in such a philosophy. As a matter of fact, HOC has a four-operation API:
get, put, remove: operate on arbitrary-length objects. Each object is identified by a key and a home node.
execute(key, op, data): remotely execute method op with parameter data on the object identified by key.
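A minimal in-process sketch of the four operations may help fix ideas. Everything here is an assumption made for illustration: the class name ObjectStore, the use of a CRC32 hash to pick the home node (the slide only says that an object is identified by a key and a home node), and the convention that op is a callable applied to the stored object and the data.

```python
import zlib


class ObjectStore:
    """Toy stand-in for a set of storage servers; each dict plays one node."""

    def __init__(self, n_nodes):
        self.nodes = [dict() for _ in range(n_nodes)]

    def home_node(self, key):
        # Assumption: the home node is derived by hashing the key.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key, data):
        self.nodes[self.home_node(key)][key] = bytes(data)

    def get(self, key):
        # Returns None when the object is not stored.
        return self.nodes[self.home_node(key)].get(key)

    def remove(self, key):
        # True if an object was actually removed.
        return self.nodes[self.home_node(key)].pop(key, None) is not None

    def execute(self, key, op, data):
        # Assumption: op runs "at" the home node on (stored object, data),
        # so only the result travels back to the caller.
        obj = self.nodes[self.home_node(key)].get(key)
        return op(obj, data)
```

For example, after store.put("page", b"hello"), the call store.execute("page", lambda obj, d: obj.count(d), b"l") counts occurrences on the home node instead of shipping the whole object to the client.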

9 The Apache Web server
Worldwide most used Web server.
Broadly accepted, well-known, well supported, open source.
MultiThread-MultiProcessor Web server.
Good performance; nevertheless, there have been several attempts to improve it further.
Usually used in farm configurations.
Easy to extend via plug-in modules.
A native memory-based cache module already exists.

10 How to accelerate a web server/service
Farming servers out.
Caching, typically with a reverse proxy (in front of the server): it worsens request latency (on a miss) and is as complex as the web server itself.
We would like to improve web server performance without changing the Apache web server core, thus relying on its correctness, people's expertise, ...
Thus we add an HOC-based distributed cache behind the server (or the server farm).

11 The big picture

12 The HOC plug-in for Apache
Patched mod_mem_cache.
[Figure: four apps, each with a protocol module]

13 The HOC plug-in for Apache
High-level functional behavior of the Apache native cache module (mod_mem_cache):
request -> cache lookup
hit -> fresh object? yes -> reply; no -> cache remove -> read object
miss -> read object -> cacheable object? yes -> cache write -> reply; no -> reply

14 The HOC plug-in for Apache
High-level functional behavior of the protocol for the Apache + HOC architecture (a simple patch to mod_mem_cache):
request -> HOC get
hit -> fresh object? yes -> reply; no -> HOC remove -> read object
miss -> read object -> cacheable object? yes -> HOC put -> reply; no -> reply
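The get / remove / put flow on this slide can be written down as a short function. This is a sketch under stated assumptions, not the patch to mod_mem_cache: a plain dict stands in for the remote store, freshness is decided by a hypothetical max_age in seconds (a real HTTP cache would use the response headers), and read_object and is_cacheable are caller-supplied placeholders.

```python
import time


def serve(url, cache, read_object, is_cacheable, max_age=60.0, now=time.time):
    """Return (body, "hit" | "miss") following the get/remove/put flow."""
    entry = cache.get(url)                 # get
    if entry is not None:
        body, stored_at = entry
        if now() - stored_at <= max_age:   # fresh object? yes -> reply
            return body, "hit"
        cache.pop(url, None)               # stale -> remove, then treat as a miss
    body = read_object(url)                # miss -> read object
    if is_cacheable(url, body):            # cacheable object? yes -> put
        cache[url] = (body, now())
    return body, "miss"
```

Passing the clock as the now parameter keeps the function deterministic under test; in production use the default time.time would apply.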

15 Experiments
RLX blade - 24 P4@800MHz (outside the room ...)
Experimenting HOC
Experimenting Apache + HOC

16 HOC performance figures (1 PE)
[Table: columns are Arch/Net/OS, concurrent connections, msg size (Bytes), replies/sec, net throughput (Bytes/sec), net throughput w.r.t. ideal. Platforms: 512MB memory, Gigabit Ethernet, Linux kernel; 1GB memory, Fast Ethernet, Linux kernel. Net throughput w.r.t. ideal: 96%, 11%, 90%, 90%]

17 HOC speedup (hits per sec vs. n. of servers)
[Figure: hits per second (0K to 16K) against the number of servers; series: 8K objects, 8K objects (perfect), 16K objects, 16K objects (perfect)]

18 Sustained aggregate throughput
[Figure: aggregate throughput (MB/sec) for 8K and 16K objects compared with the asymptotic net throughput; labelled fractions: 85%, 75%, 85%, 86%, 88%, 88%, 88%, 88%, 90%, 90%]

19 Summarizing
HOC is a building block for storage-oriented components: distributed caches, distributed memories, parallel repositories.
Configurable, hot-pluggable.
Very good performance: close-to-ideal net throughput over thousands of concurrent connections, close-to-ideal speedup.

20 Apache + HOC architecture
Three architectures experimented: 1 Apache - 1 HOC, 1 HOC - n Apaches, n HOCs - n Apaches.

21 Experimental environment summary
Raw data set: total size 4GB; n. of files 100K
Access log: n. of requests 250K; data transferred ~9GB; n. of distinct files requested ~75K (2.8GB); avg. file size ~37KB
N. of distinct files < 256 KB: ~100K
Static pages: 100%
Apache MPM worker configuration (hybrid multi-threaded multi-process): StartServers 4; ThreadsPerChild 64; ServerLimit 8; MaxRequestsPerChild 0; MaxClients 512; MinSpareThreads 32; log level notice; access log none
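As a quick consistency check on the worker MPM settings listed on this slide: in Apache's worker MPM the number of simultaneously served requests is bounded by ServerLimit times ThreadsPerChild, and the slide's MaxClients matches that product exactly. The dictionary below simply transcribes the slide's values; the helper names are made up for this sketch.

```python
# Values transcribed from the slide; keys follow the worker MPM directive names.
mpm_worker = {
    "StartServers": 4,
    "ThreadsPerChild": 64,
    "ServerLimit": 8,
    "MaxRequestsPerChild": 0,
    "MaxClients": 512,
    "MinSpareThreads": 32,
}


def thread_capacity(cfg):
    """Upper bound on concurrent worker threads: processes x threads per process."""
    return cfg["ServerLimit"] * cfg["ThreadsPerChild"]


def initial_threads(cfg):
    """Worker threads available right after start-up."""
    return cfg["StartServers"] * cfg["ThreadsPerChild"]
```

With these values the server starts with 256 threads and can grow to 512, which is the MaxClients figure on the slide.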

22 1-1, compared architectures:
1. MPMT Apache with native cache (per process)
2. SPMT Apache with native cache (shared)
3. MPMT Apache with no cache
4. MPMT Apache with HOC cache (on the same PE)
5. MPMT Apache with HOC cache (on different PEs)

23 Comparing reply rate (1-Hoc/1-/k-)
MPMT (native cache): MultiProcessMultiThreaded, 150MB native cache per process.
[Figure: replies plot; testbed: measuring points PE 0 to PE n, switch, server PE with native cache]

24 Comparing reply rate (1-Hoc/1-/k-)
SPMT (native cache): SingleProcessMultiThreaded, 900MB shared native cache.
[Figure: replies plot; testbed: measuring points PE 0 to PE n, switch, server PE with native cache]

25 Comparing reply rate (1-Hoc/1-/k-)
MPMT (FS cache): no cache, MPMT (the file system buffer behaves as a cache).
[Figure: replies plot; testbed: measuring points PE 0 to PE n, switch, server PE with file system cache]

26 Comparing reply rate (1-Hoc/1-/k-)
MPMT with HOC (in separate processes): 450MB HOC on the same box.
[Figure: replies plot; testbed: measuring points PE 0 to PE n, switch, server PE with plug-in]

27 Comparing reply rate (1-Hoc/1-/k-)
MPMT with HOC (2 PEs): 900MB HOC on 2 boxes.
[Figure: replies plot; testbed: measuring points PE 0 to PE n-1 on switch0, server PE with plug-in, PE n on switch1]

28 1 HOC - n Apaches
A single HOC acting as external, shared cache for many Apaches (Apache farm).
Speedup measure.
How many Apaches may a single HOC support?
Does an optimal number n exist?

29 Speedup

30 n HOCs - n Apaches
Many HOCs acting as external, shared cache for many Apaches (Apache farm).
Speedup measure.
How many Apaches may a single HOC support?
Does an optimal number n exist?

31 2n-Apache farm vs. Apache + HOC n-farm
The Apache + HOC farm outperforms the standard Apache farm by 3x with equal HW resources.
[Figure: replies per second (0 to 3,000) against the number of processing elements, 2n vs. n + n; labelled gains include +271%, +268%, +312%. Testbed diagrams: server PEs 0 to 3 with plug-in on eth1/eth2, PEs a to d on eth0]

32 Current & future work
Supporting heterogeneous clusters (done!)
Integration with the ASSIST environment (ongoing): ASSIST has external shared objects at the language level
Supporting dynamic reconfiguration (state migration)
Distributed in-memory file system, PVFS-like (beta)
SMP scalability (multi-threading) (in agenda)
Web-services interface (in agenda)

33 Conclusions
HOC is a fast and scalable storage component: it runs on heterogeneous clusters, is hot-pluggable, sustains thousands of concurrent connections, and is easily adaptable to different I/O-bound applications, e.g. FS, ...
Apache + HOC improves performance without any change to the Apache core code:
in the sequential architecture (20% on the same PE, 100% with an additional PE);
in several flavors of parallel architectures: 1-n, n-n, n-m (with a % gain with equal resource cost).
HOC is open source, and comes with the ASSIST package.
Thanks to Alessandro Petrocelli and the whole Pisa HPClab people.
Thank you! Questions?
