Intelligent Caching in Data Virtualization
Recommended Use of Caching Controls in the Denodo Platform

Introduction

Caching is one of the most important capabilities of a data virtualization platform for providing the right combination of high performance, low latency of information, and minimum source impact, while reducing the cost of needless data replication. In fact, the caching options, and how flexibly they can be configured to work in tandem with real-time query optimization and scheduled batch operations, are among the top differentiators between standard federation products and best-of-breed data virtualization platforms such as the Denodo Platform. This is because data virtualization is used today as an integral information fabric or data services platform in different scenarios and to meet different objectives: real-time BI, EDW extension, data abstraction layer, application data access, secure data services, and so on. The caching capabilities must be powerful and flexible enough to meet these needs.

In this article we will first discuss intelligent caching strategies and uses in general, and then focus specifically on caching in the Denodo Platform. Keep in mind, however, that a particular use case may be best served by combining caching with other optimization techniques and features in Denodo, and by considering the overall solution architecture; both are beyond the scope of this article.

Caching Serves Many Purposes

Caching can be useful for several reasons: to manage real-time performance across disparate sources with varying latencies, to minimize data movement based on frequent query patterns, to reduce or manage the impact of data virtualization on source systems, and, finally, to mitigate the problems of a source system being only intermittently available.

Caching for Performance

When using the Denodo Platform to integrate various data sources and publish the derived data entities to consuming applications, you might be faced with the situation where some of your data sources are slower than the others and cause overall performance degradation. This might be because those data sources are inherently slower, or because they are already heavily used, which results in slower response times. For example, getting data from web services, from flat files that must be parsed, or from web sites (using ITPilot) is typically slower than querying data in a relational database or data warehouse. If you are combining data from these different sources into a derived view within the Denodo Platform, the slower sources can reduce the overall performance of queries on the derived view in certain cases.

In these situations, the cache in the Denodo Platform can be used to reduce the performance bottlenecks. You can configure the Denodo Platform to cache data from the slow data sources and use the cached data in response to any queries against those sources. To return to the above example, if data from the web service is cached, and this cached data is used for subsequent queries against the web service (and, by definition, queries against the derived view using the web service data), performance can increase dramatically by removing the latency of the web service invocations from the execution path.
Obviously, the cache in the Denodo Platform should be used judiciously: caching every data source is just another form of data replication, and it also means that the data retrieved for every query is the cached data and not the live data from the originating data source. But using caching for selected base views can dramatically improve the performance of queries on the base view and of queries on any derived views that use the base view. It is important to note that, for this usage pattern of performance improvement, our recommendation is to cache the data from base views, and only when necessary for performance reasons. If you cache data from derived views, you could be caching data not only from the slow data source but also from other data sources that have perfectly acceptable performance characteristics.
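As a minimal sketch of this recommendation, assume a derived view combines a slow web-service base view with a fast data warehouse view. The view names below are hypothetical, the slow base view is assumed to be cache-enabled, and the cache_preload context option used here is described in the Full Mode section later in this article; exact VQL syntax may vary by Denodo Platform version.

    -- Explicitly load (or refresh) the cache of the slow web-service base view only.
    SELECT *
    FROM customer_ws_bv
    CONTEXT ('cache_preload' = 'true');

    -- Queries on the derived view now read the web-service branch from the cache,
    -- while the warehouse branch and the derived view itself stay real time.
    SELECT *
    FROM customer_orders_dv
    WHERE region = 'EMEA';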

Caching to optimize frequent user queries

When there is a pattern of queries with a high frequency of users calling for the same data, these queries can be cached. Subsequent queries that match, or are even a subset of, the original query can be served from the cache using post-processing. The real-time needs of such queries must be analyzed to determine the time-to-live in the cache. The cache patterns may also be regional or departmental in a federated data virtualization deployment with multiple Denodo Platform servers. For example, the retail store inventory status for European, Asian, and American stores may be cached on distributed regional Denodo Platform servers and shared among them. In contrast to the performance improvement scenario, caching for frequent user queries can be applied at a higher-level derived view in the integration tree, and not just at base views.

Caching to minimize source system impact

Organizations that expose their source systems to data virtualization are both excited and alarmed at first. A multitude of worrying questions can spring to mind: what happens if anyone and everyone starts querying my operational systems in real time? What will be the performance impact on my operational users who depend on these systems? This is where intelligent caching combined with role-based security or custom policies can help. While all users can be exposed to consistent canonical views of disparate data, the Denodo Platform can modulate different SLAs for different users. Based on granular user- and role-based security (discussed in other articles), as well as custom policies that can be parameterized based on any external input such as network traffic, source loads, or time of day, the Denodo Platform can serve a real-time view of data to certain priority users and partially cached data to others. A cache refresh can also be triggered by event messages sent to the Denodo Platform when a certain threshold of changes to the sources is reached. In this way, intelligent caching is able to minimize source impact while meeting differentiated user needs.

Caching to protect against intermittent system availability

The Denodo Platform can provide access to a wide variety of data sources and, due to the varied nature of these sources, there will be different availability profiles for them. Even the data sources within the organization will have different availability depending upon the nature of the data source. For example, an operational database might be configured for 24/7 availability with high-availability clustering and redundancy, whereas a data source in a regional sales office might only be available during local office hours. When the data sources are external to the organization, often owned and controlled by a totally different entity, the question of system availability becomes more pressing. Caching data from these sources within the Denodo Platform can help mitigate against the actual source data not being available. If feasible, the data can be cached, and queries for this data can be served from the cache rather than from the actual data source, which may or may not be available. If the data source is available, the cache can be refreshed from the source to keep the cached data up to date.

Denodo Platform 5.0 - Explanation of Caching Modes

While the above patterns are not exhaustive, the Denodo Platform has a very advanced caching system with a number of operating modes and options that allow you to configure the caching to suit your particular needs.
First we'll describe these modes and configuration options, and then we'll discuss when they should be used. It is important to note that these modes and options can be applied individually to each view that is cached, and differently at multiple levels in the integration tree; they are not global in their application. In the Denodo Platform v5.0, there are the following cache modes:

Caching Off

As the mode name indicates, in this mode no data will be cached and all queries will run against the originating data sources. In this example and the ones that follow, "data source" can be taken as either the immediately underlying view or the original data source, depending on context. This is the default cache mode for the Denodo Platform.

Figure 1 - Cache configuration in Denodo Platform

Full Mode

In previous versions of the Denodo Platform, this was called the "preload" mode. When this mode is used for a view, it is assumed that the cache contains all of the data for that view. Therefore, queries on this view will always use the cached data and will never hit the data source. The data source will only be accessed to refresh the data in the cache. Cache loading, and subsequent refreshing, is always explicit, meaning that you need to run special queries (indicating cache_preload = true in the query context) to load or refresh the cache.

The Denodo Platform also allows you to perform incremental refreshes of the cached data by executing queries that include cache_preload = true and cache_invalidate = true in the query context. The cache_preload option means that the data is retrieved from the data source and saved in the cache. The cache_invalidate option means that any existing data in the cache is invalidated and replaced by the new, fresh data. These options, coupled with a query predicate, make incremental cache refreshes relatively simple and avoid having to perform bulk loads into the cache for large data sets. For example, you can preload the cache overnight, ready for the following work day, and then, during the day, incrementally refresh the cache using the previously mentioned context options in a query that also uses a WHERE clause to select a subset of the data set to refresh.

A better way to incrementally refresh the cache involves being able to identify the data that has changed in the source and refreshing only this data. There are multiple ways by which you can identify the changed data, for example:

- Using any created or updated timestamp columns that might be available in the source to identify changes. That is, the WHERE clause of the refresh query can be used to detect data that has a more recent updated or created timestamp than the data currently held in the Denodo Platform cache.
- Creating triggers in the data source and sending change notifications via an asynchronous event mechanism such as JMS. The alert message can contain information that identifies the data that has changed, e.g. the primary key value(s) for the changed data entities.
- Integrating with an existing Change Data Capture system, such as IBM InfoSphere CDC, and using this system to detect data changes and send the appropriate information to the Denodo Platform.
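The following is a minimal sketch of the timestamp-based approach in the first item above. The view and column names are hypothetical, the context options follow the description given above, and the date functions and exact option values may differ by Denodo Platform version.

    -- Refresh only the rows changed recently: the predicate limits the rows read
    -- from the source, cache_preload stores them in the cache, and cache_invalidate
    -- replaces the previously cached copies of those rows.
    SELECT *
    FROM sales_order_bv
    WHERE last_updated > ADDDAY(NOW(), -1)
    CONTEXT ('cache_preload' = 'true', 'cache_invalidate' = 'true');

In practice, a query like this would typically be scheduled (for example, as a recurring job) rather than run by hand, with the cutoff timestamp derived from the time of the previous refresh.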

Partial Mode

Partial caching mode does not require you to have all the data for the view in the cache. When a query is executed against the view, the Denodo Platform will check whether the cache contains the data required to answer that query and, if the data is not in the cache, the query will be made against the data source directly. There are a couple of options for the partial caching mode that determine how the cache is loaded and refreshed, and how and when the cached data is used in response to queries on the view. These options are:

1. Explicit loads. If this option is selected, then loading and refreshing the cache is an explicit operation, in the same way as for the full cache mode. This means that you need to run special queries (using cache_preload = true in the query context) to load or refresh the cache. If this option is not selected, then the cache is automatically loaded with the results of the queries that are executed against the view. Therefore you don't need to perform special queries to load or refresh the cache: the normal operation of the Denodo Platform fills or refreshes the cache as queries are being answered.

2. Match exact queries only. If this option is selected, the Denodo Platform will only access the cache for a given view when the query being executed matches exactly a query previously executed against the view. If this option is not selected, the Denodo Platform will also use the cache for a given view when the query results are expected to be a subset of a previously executed query. For instance, if the query SELECT * FROM V WHERE A=a has already been executed (and the data saved in the cache), and the Denodo Platform receives the query SELECT * FROM V WHERE A=a AND B=b, then the Denodo Platform will respond to the query by applying the filter B=b over the cached results of the first query.

Now that we've described the cache modes and the options for each mode, let's look at when to use these different modes and options.

Figure 2 - Integration tree view showing one cache-enabled node (product) and the rest real time
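As a minimal sketch of the explicit-load variant of partial mode, using hypothetical view and column names: a frequently queried subset is preloaded, and a later, narrower query can then be answered from the cache by post-processing, assuming the "Match exact queries only" option is not selected.

    -- Explicitly preload the subset that is queried most often.
    SELECT *
    FROM product_bv
    WHERE category = 'LAPTOP'
    CONTEXT ('cache_preload' = 'true');

    -- This narrower query is a subset of the preloaded one, so it can be answered
    -- by filtering the cached rows instead of hitting the data source.
    SELECT *
    FROM product_bv
    WHERE category = 'LAPTOP' AND brand = 'ACME';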

Figure 3 - Execution trace of the same view showing the branch (product) coming from cache

When to use Full Cache Mode

The full cache mode forces the loading of all the data for the view into the cache. This provides some significant advantages for slow-performing data sources:

- The applications will never directly hit the data source. Only refresh operations will access the data source.
- If the query latency of the data source is the cause of performance problems, using the cache can significantly improve overall performance.
- Complex operations (such as JOIN, GROUP BY, etc.) involving several views (even views from different data sources) can be delegated to the cache database, so the performance of these operations is significantly improved (see the sketch below).

However, full caching mode must be used with care. There are some circumstances that make it impractical:

- If the volume of cached data is very large, it might be difficult to refresh all the data within an acceptable period. This is especially true for remote data sources (e.g. cloud applications). Incremental caching can help, as mentioned above.
- If the volume of cached data is extremely large, the cache database will need to be sized appropriately to store and efficiently query all of the data.
- The query capabilities of the data source may prevent you from obtaining all the data. This happens frequently with web services and other API-based data sources which do not provide operations to get all of the data. Most web services and APIs provide operations to get a specific data row, e.g. specified customer details or order information. If there is no "get all data" operation, you cannot retrieve all the data to load into the cache.
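Returning to the delegation point above, here is a minimal sketch with hypothetical view and column names: if customer_bv and sales_order_bv are both cached in full mode against the same cache database, a query such as the following can be answered by pushing the join and aggregation down to that database instead of processing the rows in the Denodo server.

    -- Both inputs are served from the cache database, so the JOIN and
    -- GROUP BY can be delegated to it.
    SELECT c.country, SUM(o.amount) AS total_amount
    FROM customer_bv c
    INNER JOIN sales_order_bv o
      ON c.customer_id = o.customer_id
    GROUP BY c.country;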

A typical example of full cache mode is caching data from a particularly slow data source prior to the start of the business day. For example, one customer caches data from their Salesforce CRM system overnight in readiness for the next business day. All queries using data from Salesforce are filled from the cache. As the data is relatively static, the cache is not refreshed during the day; the cached data is invalidated when the next preload of Salesforce data occurs before the start of the following business day.

Another example of full cache mode is protecting operational data sources from additional load. A customer loads data from a PeopleSoft HR system into the cache and runs queries against the cached data rather than delegating queries to the originating data source. This protects the PeopleSoft system from additional load that would degrade its performance. The cache is periodically refreshed throughout the day, but the "get all data" operation has less impact on the PeopleSoft system than thousands of queries hitting it during the business day.

When to use Partial Cache Mode

Partial caching mode does not require you to have all the data for the view in the cache. When a query is executed against the view, the Denodo Platform will check whether the cache contains the data required to answer that query and, if the data is not in the cache, the query will be made against the data source directly. Therefore, the partial cache mode can be used in all the previously mentioned situations where having all of the data for the view in the cache is either impossible or impractical. Using the partial cache mode you can preload the most important or most frequently used data into the cache, and queries needing this data will be fulfilled using the cached data. In fact, you don't even need to know which data is most frequently used: by unselecting the Explicit loads option, the data obtained from the data source by previous queries will be automatically cached by the Denodo Platform. Therefore, over time, the cache will contain the most frequently queried data for a given view.

When to Select the Explicit Loads Option

Explicit loads perform very well when a relatively small subset of the data is queried much more frequently than the rest, and when it is easy to predict which subset of the data is the most frequently queried. Product databases are a typical example of where the explicit load option is useful. In many cases, a few popular products are the subject of the majority of queries, reflecting the popular 80-20 rule. Explicitly preloading the data for these popular products can be very effective.

Implicit loads perform well when the applications show high temporal locality (i.e. recently queried data has a higher probability of being queried again). For example, in most web applications, it is common for a user to visit the same web page several times during the same session. Implicit loads also avoid having to decide in advance what data should be cached when it might not be easy to predict which data items will be popular.

When to Select the Match Exact Queries Only Option

This option should be used when the data source does not return all the results for a certain query. For instance, many websites and web services return only the top n results for queries which have too many results. In those circumstances, unchecking this option could return incomplete results for certain queries.

Other Cache Features

In addition to the different modes to populate, use, and refresh the cache, the Denodo Platform supports the creation, or automatic propagation, of indexes and primary keys in the cache tables. This further improves performance and, in the full mode, allows the cache to be used as a sort of low-cost equivalent to an Operational Data Store (ODS). The Denodo Platform ships with a built-in cache database that can be easily switched to the customer's database software of choice. Most of the popular disk-based, clustered, or in-memory database solutions on the market are supported, including IBM DB2, Microsoft SQL Server, Oracle RAC, Oracle TimesTen, SAP HANA, etc.
Conclusion

The cache within the Denodo Platform serves many purposes: managing real-time performance across disparate data sources with varying latencies, minimizing the movement of data based on frequent query patterns, reducing or managing the impact of data virtualization on source systems, and protecting against intermittent source system availability. Its flexibility in addressing these different requirements comes from the variety of operating modes and options that allow you to configure the caching to suit your needs. Careful use of the cache within the Denodo Platform can dramatically affect the performance and scalability of both the Denodo Platform and your underlying source systems, while greatly reducing the cost, inflexibility, and governance problems created by needless data replication. The guidelines described above should help you to understand how and when to use the cache for maximum effect when fine-tuning your data virtualization solution from Denodo.

Visit www.denodo.com | Email info@denodo.com | twitter.com/denodo
NA & APAC (+1) 877 556 2531 | EMEA (+44) (0) 20 7869 8053 | DACH (+49) (0) 89 203 006 441 | Iberia & Latin America (+34) 912 77 58 55