Parallel storage. Toni Cortes Team leader. EXPERTISE training school

1 Parallel storage. Toni Cortes, Team leader. Jan 11th, 2018. EXPERTISE training school

2 Agenda: Introduction; Parallel file systems; Object stores; Back to interfaces; dataClay: next generation object store

3 Agenda: Introduction; Parallel file systems; Object stores; Back to interfaces; dataClay: next generation object store

4 How parallel applications do IO
- What is a parallel application?
  - Any application that concurrently uses several processing units
  - Jose and Daniele have already discussed them
- How do these applications use persistent storage?
  - POSIX
    - Traditional open/close/read/write (more later)
    - Designed for sequential applications
  - MPI IO
    - Extension to IO of what Jose has explained (more later)
    - Designed for parallel applications
  - Object oriented storage
    - Much more in the last half of the talk

5 How these interfaces are implemented
- Both POSIX and MPI are supported by parallel file systems
  - Their native interface is POSIX with extensions
  - Many HPC implementations of MPI run over them
  - Examples: Lustre, GPFS, BeeGFS, OrangeFS, PanFS
- Object stores are gaining momentum
  - Their native interface is HTTP (REST)
  - MPI implementations are being developed
  - Examples: DAOS, Ceph, S3, Swift (OpenStack)
- Object stores are not yet object oriented
  - As in OO programming languages
  - More later

6 Agenda: Introduction; Parallel file systems; Object stores; Back to interfaces; dataClay: next generation object store

7 What is a [parallel] file system
- A file system is a software module intended to
  - Organize and maintain the file namespace
  - Store the contents of the files and their attributes
- Some interesting details
  - Most persistent devices offer big granularities (a few kilobytes)
    - Thus reading a byte implies reading the whole block
    - This will have many implications later
  - All files keep a lot of metadata (information defining the data)
    - The rules for this metadata were designed for sequential systems
- Parallel file system: a file system ready to
  - Serve many clients at the same time
  - Distribute its data to offer high bandwidth and low latency
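The block-granularity remark on this slide can be made concrete with a small sketch. The 4096-byte block size and the function name `blocks_touched` are assumptions chosen for illustration; the talk only says "a few kilobytes".

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, not a value from the talk


def blocks_touched(offset, length, block_size=BLOCK_SIZE):
    """Return (first_block, last_block, bytes_transferred) for a byte-range read."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return first, last, (last - first + 1) * block_size


# Reading a single byte still moves a whole block from the device.
print(blocks_touched(10_000, 1))   # (2, 2, 4096)
# Two bytes that straddle a block boundary cost two blocks.
print(blocks_touched(4095, 2))     # (0, 1, 8192)
```

The same arithmetic is what later makes false sharing possible: unrelated bytes can live in the same block.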

8 Parallel file system architecture [Figure: clients (C), several data servers, and a metadata or file server]

9 Data distribution
- Idea
  - Distribute data blocks among different devices in a deterministic way
- Can be applied to
  - The whole storage space
  - Each file independently
- The most typical distribution is round robin
  - With different fault tolerance levels
  - RAID levels
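A minimal sketch of the round-robin distribution named on this slide, assuming fixed-size blocks and servers numbered from zero. The function names are hypothetical and do not come from any particular file system.

```python
def locate_block(block_index, num_servers):
    """Round-robin striping: return (server, local_block) for a logical block."""
    return block_index % num_servers, block_index // num_servers


def layout(num_blocks, num_servers):
    """List the logical blocks that end up on each server."""
    servers = [[] for _ in range(num_servers)]
    for block in range(num_blocks):
        server, _ = locate_block(block, num_servers)
        servers[server].append(block)
    return servers


print(locate_block(5, 4))  # (1, 1)
print(layout(8, 4))        # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Because the mapping is a pure function of the block index, any client can compute where a block lives without asking a server, which is the point of a deterministic distribution.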

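10 RAID 0 [Figure: blocks striped over the disks]

11 RAID 1 [Figure: blocks mirrored over the disks]

12 RAID 4 [Figure: data disks plus a parity disk; parity blocks P0 to P6 computed with XOR]

13 RAID 4 [Figure: the same layout with one disk marked X]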
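The XOR parity shown in the RAID 4 figures can be sketched as follows. This is an illustration of the parity arithmetic only, assuming equally sized blocks held as `bytes`; the helper names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity(data_blocks):
    """Parity block of one stripe."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity_block):
    """Recover the single missing block of a stripe from the rest plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(stripe)
# Pretend the disk holding stripe[1] has failed.
recovered = rebuild([stripe[0], stripe[2]], p)
print(recovered == stripe[1])  # True
```

XOR is its own inverse, so the same function both computes the parity and reconstructs a lost block. Only one missing block per stripe can be recovered this way.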

14 Agenda: Introduction; Parallel file systems; Object stores; Back to interfaces; dataClay: next generation object store

15 Why object stores
- Parallel file systems are struggling to keep up with the pace of HPC
- Object stores have simplified many issues to be more scalable
  - Clear separation of data and metadata
  - Flat name space
  - Access semantics
- An object store is like a distributed key-value store
  - Stores pairs (ID, object)
  - ID is an element in the flat namespace
  - Object is an arbitrary-size string of bytes
  - Many applications only need this functionality
- Traditional REST API
  - Good for usability, not so good for HPC
- Some file systems are built on top
  - Lustre, Ceph
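The (ID, object) model described on this slide can be sketched with an in-memory stand-in. `TinyObjectStore` and its methods are hypothetical names for illustration; real object stores expose this through a REST API and distribute the map over many servers.

```python
import uuid


class TinyObjectStore:
    """Hypothetical in-memory object store: a flat map from ID to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data, object_id=None):
        """Store a byte string and return its ID (generated when not given)."""
        if not isinstance(data, (bytes, bytearray)):
            raise TypeError("objects are plain byte strings")
        oid = uuid.uuid4().hex if object_id is None else object_id
        self._objects[oid] = bytes(data)
        return oid

    def get(self, object_id):
        return self._objects[object_id]

    def delete(self, object_id):
        del self._objects[object_id]


store = TinyObjectStore()
oid = store.put(b"simulation output")
# The slash is just part of the ID: there are no directories in a flat namespace.
store.put(b"config", object_id="run42/config")
print(store.get(oid))  # b'simulation output'
```

There is no rename, no directory listing and no partial update here, which is the kind of simplification that lets such stores scale.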

16 Object stores allow random placement
- Where should I place the next item?
  - Idea: just take a random position!
- Basic task of balls-into-bins games
  - Assign a set of m balls to n bins
- Motivation
  - Bins = hard disks
  - Balls = data items
  - L = max number of data items on each disk
- Advantages
  - Better load balancing
  - Possibility of rebalancing
- Finding it back
  - New random draw with the same seed
  - Keep a table
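A sketch of the "same seed" way of finding an item back, assuming the item ID itself is used as the seed. The names `place` and `max_load` are hypothetical; production systems use more elaborate schemes, and this only shows the idea.

```python
import random
from collections import Counter


def place(item_id, num_disks):
    """Repeatable pseudo-random placement: seed a private generator with the ID."""
    return random.Random(item_id).randrange(num_disks)


def max_load(item_ids, num_disks):
    """L from the slide: the largest number of items that landed on one disk."""
    counts = Counter(place(item, num_disks) for item in item_ids)
    return max(counts.values())


items = [f"obj-{i}" for i in range(1000)]
disk = place("obj-17", 8)
# Re-running the draw with the same seed finds the item again, with no table.
print(disk == place("obj-17", 8))  # True
print(max_load(items, 8))          # close to 1000 / 8 when the balance is good
```

The alternative on the slide, keeping a table, trades this recomputation for a lookup structure that has to be stored and kept consistent.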

17 Agenda: Introduction; Parallel file systems; Object stores; Back to interfaces; dataClay: next generation object store

18 POSIX: why POSIX is not the answer
- Sequential consistency semantics
  - Once a write completes, any read, from anywhere, must see that write
  - Not always important
  - False sharing problems
- Size of the file must always reflect the actual size
  - Is this always important?
- Access/modification times must be updated on every access/modification
  - When do we check these dates?
  - Can they be updated eventually?
- Implementing such semantics reduces performance significantly
  - Most applications do not even reach 1% of potential IO performance [Bill Gropp 2017]
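The false sharing problem listed above can be sketched as a check on two byte ranges. The 4096-byte granularity and the function names are assumptions for illustration, not the behaviour of a specific file system.

```python
BLOCK_SIZE = 4096  # assumed locking/caching granularity in bytes


def blocks_of(offset, length, block_size=BLOCK_SIZE):
    """Set of block numbers covered by a byte range."""
    return set(range(offset // block_size, (offset + length - 1) // block_size + 1))


def overlaps(a, b):
    """True when two (offset, length) byte ranges share at least one byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def false_sharing(a, b, block_size=BLOCK_SIZE):
    """True when the ranges are disjoint in bytes but still share a block."""
    shared = blocks_of(a[0], a[1], block_size) & blocks_of(b[0], b[1], block_size)
    return bool(shared) and not overlaps(a, b)


# Two processes write different bytes of the same block: they still conflict.
print(false_sharing((0, 100), (100, 100)))   # True
# Writes that fall in different blocks do not interfere.
print(false_sharing((0, 100), (4096, 100)))  # False
```

Under strict consistency the two writers in the first case have to be serialised even though they never touch the same data.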

19 MPI
- Designed for parallel applications
  - Extension to MPI (message passing)
- Better semantics: some examples
  - Each process sees the portion of the file it will use
    - Avoids false sharing
  - Asynchronous operations
    - Enables overlapping computation and IO
  - Offers collective operations
    - Better use of devices by building large operations
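The last point, building large operations out of many small ones, can be illustrated with a request-merging sketch. This is not how any MPI implementation performs collective IO; it only shows, under the assumption that requests are (offset, length) pairs, why knowing all requests at once helps.

```python
def coalesce(requests):
    """Merge (offset, length) requests that touch or overlap into larger ones."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            new_end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, new_end - last_offset)
        else:
            merged.append((offset, length))
    return merged


# Four small requests from different processes become two device operations.
print(coalesce([(8, 4), (0, 4), (4, 4), (100, 10)]))  # [(0, 12), (100, 10)]
```

With individual (non-collective) calls the storage system sees each small request in isolation and cannot perform this merge.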

20 MPI IO: some basic ideas
- What is visible to each task
  - Definition of etype
    - Basic element size
  - Definition of the subfile
    - Bytes that can be used
    - Build the filetype
- Access to data
  - Specify the desired bytes
    - Pattern defined by each read/write operation
  - Specify the location in the buffer
- All operations
  - Synchronous or asynchronous
  - Collective or individual
[Figure: file, etype, filetype, subfile, buftype, buffer]

21 MPI IO example

    MPI_Aint lb, extent;
    MPI_Datatype etype, filetype, contig;
    MPI_Offset disp;
    MPI_File fh;      /* declaration not shown on the slide */
    int buf[1000];    /* declaration not shown on the slide */

    /* filetype: 2 visible ints followed by a hole, 6 ints in total */
    MPI_Type_contiguous(2, MPI_INT, &contig);
    lb = 0;
    extent = 6 * sizeof(int);
    MPI_Type_create_resized(contig, lb, extent, &filetype);
    MPI_Type_commit(&filetype);

    /* the view starts 5 ints into the file */
    disp = 5 * sizeof(int);
    etype = MPI_INT;

    MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, disp, etype, filetype, "native", MPI_INFO_NULL);
    MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);
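To see which bytes of the file this view exposes, the offsets can be worked out by hand. The sketch below is plain Python, not MPI: it assumes 4-byte ints and a filetype made of a visible run of etypes at the start of each tile, which matches the example (2 ints visible in every extent of 6 ints, starting at displacement 5 ints). The function name is hypothetical.

```python
def visible_offsets(count, disp, etype_size, block_len, extent):
    """Byte offsets in the file of the first `count` etypes seen through the view.

    disp and extent are in bytes; block_len is the number of etypes visible
    at the start of each filetype tile.
    """
    offsets = []
    for i in range(count):
        tile, pos = divmod(i, block_len)
        offsets.append(disp + tile * extent + pos * etype_size)
    return offsets


INT = 4  # assumed sizeof(int)
print(visible_offsets(6, disp=5 * INT, etype_size=INT, block_len=2, extent=6 * INT))
# [20, 24, 44, 48, 68, 72]
```

In units of ints these are positions 5, 6, 11, 12, 17 and 18: the write of 1000 ints lands in pairs, skipping 4 ints after each pair.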

22 Main problems
- These interfaces are too low level
- Data semantics are lost and cannot be exploited

23 Agenda: Introduction; Parallel file systems; Object stores; Back to interfaces; dataClay: next generation object store

24 The pillars of dataClay
- Different data models
  - Memory vs. storage
- New devices to come
- Data sharing
- Moving code close to the data

25 Why is persistent data different from volatile data?
- Today
  - We have one data model for volatile data
    - Traditional data structures and/or objects
  - We have a different data model for persistent data
    - Relational database, NoSQL database, files
- Future
  - Store data in the same way as when volatile
  - Store objects and relations

26 Data selection (no more SQL queries)
- In memory
  - Data is never queried using SQL
  - Data is linked according to the needs of the program
  - The next data item is found by following a link, not a query
- Persistent data should behave in a similar way
  - Following a link is faster than a query over the whole dataset
  - Programs do not need to distinguish whether data is in memory or in persistent storage
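The difference between following a link and querying the whole dataset can be sketched with two small classes. `Author` and `Paper` are hypothetical examples, not part of dataClay.

```python
class Author:
    def __init__(self, name):
        self.name = name
        self.papers = []  # links to this author's Paper objects


class Paper:
    def __init__(self, title, author):
        self.title = title
        self.author = author
        author.papers.append(self)  # keep the link in both directions


def titles_by_query(all_papers, author_name):
    """Query style: scan every paper and filter, like a SELECT ... WHERE."""
    return [p.title for p in all_papers if p.author.name == author_name]


def titles_by_link(author):
    """Link style: touch only the objects reachable from the author."""
    return [p.title for p in author.papers]


ada, bob = Author("Ada"), Author("Bob")
papers = [Paper("Storage", ada), Paper("Networks", bob), Paper("Objects", ada)]
print(titles_by_query(papers, "Ada"))  # ['Storage', 'Objects']
print(titles_by_link(ada))             # ['Storage', 'Objects']
```

Both calls return the same titles, but the query cost grows with the size of the whole dataset while the link cost grows only with the number of linked objects.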

27 The pillars of dataClay
- Different data models
- New devices to come
  - Persistent memories (SCM)
  - Byte addressable
- Data sharing
- Moving code close to the data

28 The pillars of dataClay
- Different data models
- New devices to come
- Data sharing
- Moving code close to the data

29 Data sharing
- Transparent access to data from different sources
  - Regardless of its location
  - According to the data owner's permissions

30 The pillars of dataClay
- Different data models
- New devices to come
- Data sharing
- Moving code close to the data
  - We store object methods with data
  - Can be executed where the object is stored/cached

31 Storage API Integration of programming model & storage platforms

32 Storage API: Object Interface

Method                 Objective
constructor(alias)     Retrieve an object with alias (currently only in the Java flavour)
C.getByAlias(alias)    Retrieve an object with alias (currently only in the Python flavour)
o.makepersistent       Persist the object in the datastore (alias, recursive, location)
o.deletepersistent     Delete the object from the datastore
init                   Do any initialization action before starting to execute the application
finish                 Do any finalization action after executing the application

33 Word Count (Python)

    from collections import defaultdict
    from pycompss.api.api import compss_wait_on
    from dataclay import api            # initialization: prepares the dataClay environment
    from my_collections import MyWords  # uses the registered class
    ...

    # PyCOMPSs task decorator truncated in the source: "... data=in)"
    def word_count(data):
        # Unmodified word_count function
        map_count = defaultdict(int)
        for word in data:
            map_count[word] += 1
        return map_count

    if __name__ == "__main__":
        data = MyWords.get_by_alias("experiment1")
        r = defaultdict(int)
        for block in data.blocks:
            partial_result = word_count(block)
            reduce_count(r, partial_result)
        r = compss_wait_on(r)
        print(r)  # the printed expression is lost in the source; r is assumed
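The slide calls `reduce_count` without showing it. A minimal local sketch, assuming it adds the partial counts into the running result in place, runs without dataClay or COMPSs; the two word lists stand in for the blocks of the persistent collection.

```python
from collections import defaultdict


def word_count(data):
    """Count the words of one block (same logic as on the slide)."""
    map_count = defaultdict(int)
    for word in data:
        map_count[word] += 1
    return map_count


def reduce_count(total, partial):
    """Assumed behaviour: add the partial counts into total, in place."""
    for word, count in partial.items():
        total[word] += count


blocks = [["a", "b", "a"], ["b", "c"]]  # stand-in for data.blocks
r = defaultdict(int)
for block in blocks:
    reduce_count(r, word_count(block))
print(dict(r))  # {'a': 2, 'b': 2, 'c': 1}
```

In the version on the slide each `word_count` call can run as a task next to the block it reads, and only the small partial dictionaries travel back to be reduced.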

34 Word Count (v2) (Python)

    from collections import defaultdict
    from pycompss.api.api import compss_wait_on
    from dataclay import api                      # initialization: prepares the dataClay environment
    from my_collections import MyWords, MyResult  # registered classes
    ...

    # PyCOMPSs task decorator truncated in the source: "... data=in)"
    def word_count(data):
        # Unmodified word_count function
        map_count = defaultdict(int)
        for word in data:
            map_count[word] += 1
        return map_count

    if __name__ == "__main__":
        data = MyWords.get_by_alias("experiment1")
        r = MyResult()
        r.make_persistent("result1")
        for block in data.blocks:
            partial_result = word_count(block)
            reduce_count(r, partial_result)
        r = compss_wait_on(r)
        print(r)  # the printed expression is lost in the source; r is assumed

35 Iterating collections in parallel
- Use case: execute a method on the objects in a collection
- COMPSs assigns one worker per object
  - For applications using a standard iterator
  - Executes the method (task) in the node where the object is (getlocations)
- Blocking usually needed
  - Object method granularity may be too small
  - It implies grouping objects in the same backend
- dataClay collections simplify it by offering locality-aware iterators
  - Iterate all the objects in the current node
[Figure: Task 1 to Task 6]
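The blocking idea, grouping objects that live in the same backend so that one task handles the whole group, can be sketched as follows. The placement table and the function name are hypothetical; in the real system the location would come from a call such as getlocations.

```python
from collections import defaultdict


def group_by_location(objects, get_location):
    """Group objects by the backend that stores them."""
    groups = defaultdict(list)
    for obj in objects:
        groups[get_location(obj)].append(obj)
    return dict(groups)


# Hypothetical placement of six objects on two backends.
placement = {"o1": "node-a", "o2": "node-b", "o3": "node-a",
             "o4": "node-b", "o5": "node-a", "o6": "node-b"}
tasks = group_by_location(placement, placement.get)
print(tasks)
# {'node-a': ['o1', 'o3', 'o5'], 'node-b': ['o2', 'o4', 'o6']}
```

Instead of six fine-grained tasks there are two, each running where its objects already are.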

36 Word Count (v3) (Python)

    from collections import defaultdict
    from pycompss.api.api import compss_wait_on
    from dataclay import api
    from dc_classes.contrib.collections import StorageList
    ...

    # PyCOMPSs task decorator truncated in the source: "... data=in)"
    def word_count(data):
        # Unmodified word_count function
        map_count = defaultdict(int)
        for word in data:
            map_count[word] += 1
        return map_count

    if __name__ == "__main__":
        data = StorageList.get_by_alias("experiment1")
        r = defaultdict(int)
        for block in data.split():
            partial_result = word_count(block)
            reduce_count(r, partial_result)
        r = compss_wait_on(r)
        print(r)  # the printed expression is lost in the source; r is assumed

37 Storage API: Runtime Interface

Method                Objective
getlocations          Retrieve the locations where a particular object is
newreplica            Create a new read-only replica of an object in the datastore
getid                 Get the OID of an object
getbyid               Retrieve an object from its identifier
newversion            Create a new version of an object in the datastore (COMPSs)
consolidateversion    Consolidate a version of an object in the datastore (COMPSs)
executetask           Execute the task into the datastore (COMPSs)
getresult             Retrieve the result of the execution into the datastore (COMPSs)
TaskContext           Define a task context (task enter/exit actions) (COMPSs)

38 dataClay: now the pain
- Data API
  - Make object persistent [using alias]
  - Delete persistent object
  - Retrieve objects
    - By alias, if set
    - By OID/reference
  - Execute method
- Management API
  - Register class
  - Get classes (stubs)
  - Manage model/data contracts
- Register a class (original class, Molecule.class)
    ClassRegistrator(BioNamespace, Molecule, classpath)
- Get a registered class (stub class, Molecule.class)
    GetStubs(Molecule, destinationpath)
- Application
    ...
    Molecule m = new Molecule(...);
    ...
    m.makepersistent();
    ...

39 Summary
- If persistent storage is not a problem
  - You are lucky
  - Enjoy it!
- Otherwise
  - Think about us!
  - Think about Pierlauro!

40 Thanks to the dataClay team
- Anna Queralt (PhD)
- Jonathan Martí (PhD)
- Daniel Gasull
- Alex Barceló
- Rizkallah Touma
- Enrico LaSala
- Pierlauro Sciarelli
- Former team members
  - Ernest Artiaga (PhD)
  - Juanjo Costa (PhD)
  - Paola Garfias (PhD)
  - Jaime Ivan Lopez (PhD)

41 Thank you
