Cloud Programming on Java EE Platforms. mgr inż. Piotr Nowak

Distributed data caching environment
- Hadoop
- Apache Ignite

Cache
- What is a cache?
- How is it used?

Cache
- hardware buffer
- temporary storage
- fast I/O
Storage types:
- CPU cache: close to the processor, faster than main memory
- disk cache: pages kept in RAM
- web cache: web servers (hardware and software) and browsers store previous responses

Cache
Caching keeps in memory data that is either slow to calculate or process, or that originates from an underlying backend system. Caching is also used to avoid additional request round trips for frequently used data. In both cases, caching can improve performance or reduce application latency.

Cache
Most caching solutions are built on map-like data structures, and the JCache API tries to standardize the most common use cases. If you have advanced needs, you will probably have to use implementation-specific features, as with JPA, but the standard makes it easier to swap caching libraries in the future. Standardization also makes it easier for developers to move between projects that use different caching libraries.

Cache read/write methods
- Read-Through: client <- cache <- storage
- Write-Through: client -> cache -> storage
- Write-Behind: client -> cache -> storage (the cache-to-storage write is asynchronous)
- Refresh-Ahead: client <- cache <- storage (the reload from storage is asynchronous)

Cache: Read-Through
When an application asks the cache for an entry, for example the key X, and X is not already in the cache, the cache sends a get request to persistent storage to load X from the underlying data source. If X exists in the data source, it is loaded, placed in the cache for future use, and finally returned to the application code that requested it. This is called Read-Through caching. Refresh-Ahead functionality may further improve read performance by reducing perceived latency.
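The read-through flow described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name ReadThroughCache and the loader function standing in for persistent storage are hypothetical, and the sketch is single-threaded and never evicts anything.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Minimal read-through cache: misses are loaded from a backing store. */
class ReadThroughCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final Function<K, V> loader; // hypothetical stand-in for persistent storage
    private int loads = 0;               // how many times storage was consulted

    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, loading and caching it on a miss. */
    V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            loads++;
            value = loader.apply(key);   // null means the key is unknown to storage
            if (value != null) {
                entries.put(key, value); // keep it for future reads
            }
        }
        return value;
    }

    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        Map<String, Integer> dataSource = new HashMap<>();
        dataSource.put("X", 42);
        ReadThroughCache<String, Integer> cache =
                new ReadThroughCache<String, Integer>(dataSource::get);
        System.out.println(cache.get("X")); // miss: loaded from the data source
        System.out.println(cache.get("X")); // hit: served from the cache
        System.out.println("storage reads: " + cache.loads());
    }
}
```

The second read of X does not touch the data source, which is the whole point of the pattern.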

Cache: Write-Through
When the application updates a piece of data in the cache (that is, calls put(...) to change a cache entry), the operation will not complete (that is, the put will not return) until the data has been successfully written to the underlying data source. This does not improve write performance at all, since you are still dealing with the latency of the write to the data source. Improving write performance is the purpose of the Write-Behind functionality.
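A minimal sketch of the write-through behaviour, assuming a hypothetical WriteThroughCache class and a storeWriter callback that stands in for the synchronous write to the data source. It is not thread-safe and is not taken from any caching library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Minimal write-through cache: put() returns only after storage was updated. */
class WriteThroughCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    private final BiConsumer<K, V> storeWriter; // hypothetical synchronous writer

    WriteThroughCache(BiConsumer<K, V> storeWriter) {
        this.storeWriter = storeWriter;
    }

    void put(K key, V value) {
        // The caller pays the full latency of this write.
        // If it throws, the cache entry is left unchanged.
        storeWriter.accept(key, value);
        entries.put(key, value);
    }

    V get(K key) {
        return entries.get(key);
    }

    public static void main(String[] args) {
        Map<String, Integer> dataSource = new HashMap<>();
        WriteThroughCache<String, Integer> cache =
                new WriteThroughCache<String, Integer>((k, v) -> dataSource.put(k, v));
        cache.put("X", 1);
        System.out.println(dataSource.get("X")); // already in the data source
    }
}
```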

Cache: Write-Behind
Modified cache entries are written asynchronously to the data source after a configured delay, whether 10 seconds, 20 minutes, a day, a week or even longer. Note that this applies only to cache inserts and updates; cache entries are removed synchronously from the data source. Write-Behind caching maintains a queue of the data that must be updated in the data source. When the application updates X in the cache, X is added to the write-behind queue (if it is already there, it is replaced), and after the specified write-behind delay persistent storage is called to update the underlying data source with the latest state of X. Note that the write-behind delay is relative to the first of a series of modifications; in other words, the data in the data source will never lag behind the cache by more than the write-behind delay.

Cache: Write-Behind advantages
- the user does not have to wait for data to be written
- if the queue contains several versions of the same variable, only the newest is saved, which reduces the number of write operations (write-combining)
- if a write to persistent storage fails, the data can be re-queued
- the load on the database can be tuned by changing the write-behind interval
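The queueing, write-combining and re-queue behaviour listed above can be sketched as follows. The class WriteBehindCache and its flush() method are hypothetical: a real implementation would call flush from a timer after the write-behind delay, whereas this single-threaded sketch leaves the call to the user so the behaviour stays easy to follow.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Minimal write-behind cache with one pending slot per key. */
class WriteBehindCache<K, V> {
    private final Map<K, V> entries = new HashMap<>();
    // Pending writes. Re-putting a key replaces its value but keeps its
    // original queue position, so the delay counts from the first change.
    private final Map<K, V> queue = new LinkedHashMap<>();
    private final BiConsumer<K, V> storeWriter; // hypothetical storage writer

    WriteBehindCache(BiConsumer<K, V> storeWriter) {
        this.storeWriter = storeWriter;
    }

    /** Returns immediately; the data source is updated later by flush(). */
    void put(K key, V value) {
        entries.put(key, value);
        queue.put(key, value); // write-combining: only the newest version is kept
    }

    V get(K key) {
        return entries.get(key);
    }

    int pending() {
        return queue.size();
    }

    /** Writes all queued entries; failed writes are put back in the queue. */
    int flush() {
        Map<K, V> batch = new LinkedHashMap<>(queue);
        queue.clear();
        int written = 0;
        for (Map.Entry<K, V> e : batch.entrySet()) {
            try {
                storeWriter.accept(e.getKey(), e.getValue());
                written++;
            } catch (RuntimeException ex) {
                queue.putIfAbsent(e.getKey(), e.getValue()); // re-queue
            }
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, Integer> dataSource = new HashMap<>();
        WriteBehindCache<String, Integer> cache =
                new WriteBehindCache<String, Integer>((k, v) -> dataSource.put(k, v));
        cache.put("X", 1);
        cache.put("X", 2); // replaces the queued version of X
        cache.put("Y", 3);
        System.out.println("pending: " + cache.pending());
        System.out.println("written: " + cache.flush());
        System.out.println("X in data source: " + dataSource.get("X"));
    }
}
```

Three puts result in two storage writes, because the two versions of X were combined while they waited in the queue.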

Cache: Refresh-Ahead
The developer configures the cache to automatically and asynchronously reload (refresh) any recently accessed entry from the cache loader before it expires. As a result, once a frequently accessed entry has entered the cache, the application does not feel the impact of a read against a potentially slow cache store when the entry is reloaded due to expiration. The asynchronous refresh is triggered only when an object that is sufficiently close to its expiration time is accessed. If the object is accessed after its expiration time, a synchronous read from the cache store is performed to refresh its value.
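A rough sketch of the refresh-ahead rule, under several assumptions: the class RefreshAheadCache, its constructor parameters (time to live, refresh window, executor, clock) and the loader are all hypothetical names chosen for illustration. The clock and executor are injected so the behaviour can be followed step by step, and duplicate refreshes for the same key are not suppressed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Minimal refresh-ahead cache with a time to live and a refresh window. */
class RefreshAheadCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAt;
        Entry(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Function<K, V> loader;      // hypothetical slow cache store
    private final Executor refresher;         // runs the asynchronous reloads
    private final LongSupplier clock;         // current time in milliseconds
    private final long ttlMillis;
    private final long refreshWindowMillis;

    RefreshAheadCache(Function<K, V> loader, Executor refresher, LongSupplier clock,
                      long ttlMillis, long refreshWindowMillis) {
        this.loader = loader;
        this.refresher = refresher;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
        this.refreshWindowMillis = refreshWindowMillis;
    }

    synchronized V get(K key) {
        long now = clock.getAsLong();
        Entry<V> e = entries.get(key);
        if (e == null || now >= e.expiresAt) {
            // Missing or expired: the caller waits for a synchronous read.
            V fresh = loader.apply(key);
            entries.put(key, new Entry<>(fresh, now + ttlMillis));
            return fresh;
        }
        if (e.expiresAt - now <= refreshWindowMillis) {
            // Close to expiry: reload in the background, serve the current value.
            refresher.execute(() -> reload(key));
        }
        return e.value;
    }

    private void reload(K key) {
        V fresh = loader.apply(key); // slow call, off the caller's path
        synchronized (this) {
            entries.put(key, new Entry<>(fresh, clock.getAsLong() + ttlMillis));
        }
    }

    public static void main(String[] args) {
        long[] now = {0};
        int[] version = {0};
        RefreshAheadCache<String, Integer> cache = new RefreshAheadCache<String, Integer>(
                k -> ++version[0], Runnable::run, () -> now[0], 100, 20);
        System.out.println(cache.get("k")); // synchronous load
        now[0] = 90;
        System.out.println(cache.get("k")); // old value served, refresh triggered
        now[0] = 95;
        System.out.println(cache.get("k")); // refreshed value
    }
}
```

In production code the executor would be a thread pool rather than Runnable::run, so the reload would really happen off the calling thread.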

Distributed data caching: storage type
- on-heap (JVM heap): managed by the Garbage Collector
- off-heap: managed by the OS

Distributed data caching: use case
- rapid increase in the amount of data
- many concurrent users
- JVM heap growth causes application performance to drop
- Garbage Collection load
- off-heap memory for temporary data storage

Distributed data caching
- JVM heap: limited size
- RDBMS: too slow
- disk storage: too slow
- off-heap JVM memory: fast, expandable, does not affect Garbage Collection, managed by the OS
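To make the on-heap versus off-heap distinction concrete, here is a small sketch that keeps values in a direct ByteBuffer, whose contents are allocated outside the garbage-collected Java heap. The class OffHeapLongStore is hypothetical and deliberately simple: it stores only long values, keeps a small on-heap index of offsets, and never frees slots.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Stores long values in a direct (off-heap) buffer, indexed by key. */
class OffHeapLongStore {
    private final ByteBuffer buffer;                               // off-heap data
    private final Map<String, Integer> offsets = new HashMap<>();  // on-heap index

    OffHeapLongStore(int capacityEntries) {
        buffer = ByteBuffer.allocateDirect(capacityEntries * Long.BYTES);
    }

    void put(String key, long value) {
        Integer offset = offsets.get(key);
        if (offset == null) {
            if (buffer.remaining() < Long.BYTES) {
                throw new IllegalStateException("off-heap store is full");
            }
            offset = buffer.position();
            buffer.position(offset + Long.BYTES); // reserve the next slot
            offsets.put(key, offset);
        }
        buffer.putLong(offset, value);            // absolute write into the slot
    }

    Long get(String key) {
        Integer offset = offsets.get(key);
        return offset == null ? null : buffer.getLong(offset);
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    public static void main(String[] args) {
        OffHeapLongStore store = new OffHeapLongStore(1024);
        store.put("hits", 10L);
        store.put("hits", 11L); // overwrites the same slot
        System.out.println(store.get("hits"));
        System.out.println("off-heap: " + store.isOffHeap());
    }
}
```

Only the small index lives on the heap; the stored values themselves add nothing for the Garbage Collector to scan.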

Distributed data caching
- speeds up data read/write operations
- the cache acts as a read/write buffer
- the cache can select which data to keep using an LFU or LRU policy
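As an illustration of the LRU policy mentioned above, the standard java.util.LinkedHashMap can act as a small LRU cache when it is created in access order. The class name LruCache and the maxEntries parameter are hypothetical; LFU would need extra bookkeeping (a use counter per entry) and is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache: the least recently accessed entry is evicted first. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" becomes the most recently used entry
        cache.put("c", 3); // evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```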

Centralized Cache
- the cache is managed by the NameNode
- the user specifies file system paths to be copied to the cache
- cached data is stored by the DataNodes
- cached data is copied from the file system (by default HDFS)
- the benefit of caching grows with how frequently the data is used

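Cache example

1. Copy the requisite files to the FileSystem:

    $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
    $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
    $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
    $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

2. Set up the application's JobConf:

    JobConf job = new JobConf();
    DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

3. Use the cached files in the Mapper or Reducer:

    // Imports for the enclosing source file (old org.apache.hadoop.mapred API).
    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // K and V stand for the job's concrete key and value types.
    public static class MapClass extends MapReduceBase implements Mapper<K, V, K, V> {
        private Path[] localArchives;
        private Path[] localFiles;

        @Override
        public void configure(JobConf job) {
            try {
                // Get the local copies of the cached archives/files
                localArchives = DistributedCache.getLocalCacheArchives(job);
                localFiles = DistributedCache.getLocalCacheFiles(job);
            } catch (IOException e) {
                throw new RuntimeException("Cannot read the distributed cache", e);
            }
        }

        @Override
        public void map(K key, V value, OutputCollector<K, V> output, Reporter reporter)
                throws IOException {
            // Use data from the cached archives/files here
            output.collect(key, value);
        }
    }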

Cache example

    Ignite<String, Integer> cache = Ignite.grid().cache("mycachename");

    // Put operation which returns previous value.
    Integer oldVal = cache.put("hello", 1);

    // Put operation which does not return previous value.
    boolean success = cache.putx("world", 2);

    // Get operation.
    Integer hello = cache.get("hello");

    // Reload entry from persistent store.
    Integer v1 = cache.reload("hello");

    // Remove operation which returns removed value.
    Integer val = cache.remove("hello");

Off-Heap Memory
- eviction of entries to off-heap memory
- GridCacheEvictionPolicy
- several pre-defined eviction policies: LRU, FIFO, Random
- can be activated when the cache size reaches a defined maximum value

Off-Heap Memory
- off-heap is disabled by default
- XML configuration (the offHeapMaxMemory value is 10 GiB expressed in bytes):

    <bean class="org.gridgain.grid.cache.GridCacheConfiguration">
        <property name="name" value="mycache"/>
        ...
        <property name="offHeapMaxMemory" value="#{10 * 1024L * 1024L * 1024L}"/>
    </bean>

Cache Distribution Models
- local
- replicated
- partitioned

Cache Distribution Models: local
- no data is distributed to other nodes
- ideal for read-only data
- good for read-through, where data is loaded from persistent storage on misses
- still has the other features of a distributed cache

Cache Distribution Models: replicated
- data is replicated to all other nodes
- replication affects performance and scalability
- the cache size on each node is limited by the smallest amount of RAM on any cluster node
- best for tasks that need high data availability
- well suited to systems where reads outnumber writes
- best for systems where small changes to stored data must be propagated quickly to all other nodes

Cache Distribution Models: partitioned
- best scalability
- creates a cluster with a huge distributed in-memory store spread across all available cluster memory
- cache data updates are cheaper than in replicated mode:
  - update on the primary node
  - update on the backup node, if configured (default: 1 backup)
- because each entry is stored only on certain nodes, network traffic increases
- to avoid the traffic increase, access data on the node that caches it (affinity colocation)
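The primary/backup placement in partitioned mode can be sketched with a toy in-process model. Everything here is an assumption made for illustration: the class PartitionedCache, the use of plain maps as "nodes", and the modulo hash that picks the primary node. Real data grids use their own affinity functions, and this sketch does not reproduce any of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy partitioned cache: each key lives on one primary node plus backups. */
class PartitionedCache<K, V> {
    private final List<Map<K, V>> nodes = new ArrayList<>(); // one map per node
    private final int backups;

    PartitionedCache(int nodeCount, int backups) {
        if (nodeCount < 1 || backups < 0 || backups >= nodeCount) {
            throw new IllegalArgumentException("need 0 <= backups < nodeCount");
        }
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
        this.backups = backups;
    }

    /** Hypothetical affinity function: hash of the key modulo node count. */
    int primaryNode(K key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    /** Updates the primary node and the following backup nodes only. */
    void put(K key, V value) {
        int primary = primaryNode(key);
        for (int i = 0; i <= backups; i++) {
            nodes.get((primary + i) % nodes.size()).put(key, value);
        }
    }

    /** Reads from the primary node, which is where colocated code should run. */
    V get(K key) {
        return nodes.get(primaryNode(key)).get(key);
    }

    /** Number of nodes that hold a copy of the key. */
    int copies(K key) {
        int count = 0;
        for (Map<K, V> node : nodes) {
            if (node.containsKey(key)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        PartitionedCache<String, Integer> cache = new PartitionedCache<>(4, 1);
        cache.put("hello", 1);
        System.out.println("primary node: " + cache.primaryNode("hello"));
        System.out.println("copies: " + cache.copies("hello"));
    }
}
```

With 4 nodes and 1 backup, a put touches 2 nodes instead of all 4. Setting backups to nodeCount - 1 in this model gives the replicated behaviour, where every node holds every entry.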

Links
apache/hadoop/filecache/distributedcache.html
