The Complete Guide to Data Integration 2017
Simplifying data integration for the modern era
E-BOOK

AUDIENCE: BI Managers, Practitioners, Project Managers, Solution Architects

Unlocking the Value of Big Data

Big data is here, and it is transforming the very nature of commerce, enabling new insights and accelerating business decision-making. While the concept of big data is not new, its potential is only now being realized, as powerful tools to organize, manage, and analyze immense volumes of enterprise-generated and third-party data finally become available for mainstream use. However, for many organizations it is not so easy to unlock the value in this data. While data volume (the amount of data) and velocity (the speed at which data is generated) are in part what make it so valuable, volume and velocity also present significant challenges. Still more daunting is the broad variation in the types and sources of data (variety), including highly structured files, semi-structured text, and unstructured video and audio feeds.

Biggest Big Data Challenge for Businesses (chart): variety 49%, volume 35%, velocity 16%

In a recent Gartner study, 49% of organizations reported that they struggled most with the variety of big data, compared to 35% citing volume as their most significant problem and 16% naming velocity as their largest big data challenge.¹ Contending with data from multiple databases and systems has always been a challenge, but now, with increasingly varied types of data, the task has become overwhelming. In addition, with data distributed across disparate systems, sources, and silos, it can seem nearly impossible to obtain a unified, enterprise-wide view of the information available for analysis. For companies attempting to integrate this onslaught of data in the same manner as was popular 20 years ago, with traditional data warehouse approaches, it is indeed impossible, or close to it. To extract real value from data, organizations must ingest and process data from both internal and external sources and perform near real-time analysis, which is not an easy task. Faced with these challenges, traditional data warehouse solutions cannot keep up with rapidly changing data ecosystems.

1 Gartner, 2014, Survey Analysis: Big Data Adoption in 2013

In a typical IT environment, traditional data warehouses ingest, model, and store data through an extract, transform, and load (ETL) process. ETL jobs are used to move large amounts of data in a batch-oriented manner and are most commonly scheduled to run daily. Running these jobs daily means that, at best, the warehoused data is a few hours old, but it is typically a day or more old. Because ETL jobs consume significant CPU, memory, disk space, and network bandwidth, it is difficult to justify running them more than once daily. In a time when APIs were not as prevalent as they are now, ETL tools were the go-to solution for operational use cases. With APIs now in the picture, and the sheer variety of data they represent, the ETL method is becoming impractical.

However, even before the era of APIs and big data, ETL tools posed significant challenges, mainly because they require comprehensive knowledge of each operational database or application. Interconnectivity is complicated and requires thorough knowledge of each data source, all the way down to the field level. The greater the number of interconnected systems to be included in the data warehouse, the more complicated the effort. In this digital era, new requirements arise faster than ever before, and previous requirements change just as quickly, making development agility and responsiveness necessary factors for success. As a result, ETL-based data warehousing projects became infamous for appallingly high failure rates. When these projects do not fail outright, they are frequently plagued by cost overruns and delayed implementations. Great care is needed to conceptualize the database and thoroughly define requirements in order to avoid reworking complicated and brittle connections, since tightly coupled interdependencies often trigger unpredictable and far-reaching impacts even when slight changes are made. Another shortcoming of the ETL data warehouse approach is that business staff rarely get an opportunity to see results until after several months of development work have been completed. By this point it is common that requirements have changed, errors have been discovered, or the objective of the project has shifted. Any of these variables might force IT back to the drawing board to collect new requirements, and in all likelihood months of development effort will be scrapped. In fact, Gartner estimated that between 70 and 80 percent of corporate business intelligence projects failed to deliver the expected outcomes.²

2 Poor communication to blame for business intelligence failure, says Gartner
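To make the nightly batch pattern concrete, here is a minimal sketch of an ETL job in Python, with SQLite standing in for both the operational database and the warehouse. All table and column names (orders, fact_orders) are hypothetical, and a real deployment would run this under a scheduler such as cron.

```python
import sqlite3

# Minimal sketch of a nightly ETL batch job (hypothetical schemas).
# Extract: read yesterday's orders from the operational database.
source = sqlite3.connect("operational.db")
rows = source.execute(
    "SELECT order_id, customer_id, amount, order_date "
    "FROM orders WHERE order_date = date('now', '-1 day')"
).fetchall()
source.close()

# Transform: apply business rules in application code, e.g. adding VAT
# and filtering out rows without a customer.
transformed = [
    (order_id, customer_id, round(amount * 1.19, 2), order_date)
    for (order_id, customer_id, amount, order_date) in rows
    if customer_id is not None
]

# Load: append the cleansed rows into the warehouse fact table.
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_orders "
    "(order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT)"
)
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", transformed)
warehouse.commit()
warehouse.close()
```

Even in this toy form, the drawbacks discussed above are visible: the job moves a full day's data in bulk, consumes resources on both systems while it runs, and anything it loads is already a day old by the time analysts see it.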

Data warehouses were originally built for operational reporting rather than for interactive data analysis, and using a traditional data warehouse for analytic queries requires carefully building just the right structure and performing extensive, specific performance optimization. If you later decide to use the data differently, you must change the data structure and re-optimize, which is a cumbersome and costly process. The inherent problems of the traditional ETL approach are compounded by the sheer number of data sources available and the myriad ways to access data, such as the proliferation of APIs that rely on importing and exporting data, each with its own access protocol. While it is technically possible to implement this sort of connectivity through ETL, the actual implementation would be overly complex, difficult to maintain, and costly to extend, problems that are made worse if the APIs do not use data exchange standards such as ODBC or JDBC. Because of issues like these, traditional data warehouses simply cannot cope with the needs of today's businesses and the broader trend of digital transformation. Out of these shortcomings, new approaches to data processing emerged, and what came next was the multi-dimensional OLAP methodology.

THE TRADITIONAL WAREHOUSE AT A GLANCE
+ Moves large amounts of data
- Built for operational reporting
- Significant consumption of bandwidth, CPU, etc.
- Long development cycles (several months)
- No interactive data analysis
- High complexity due to the number of potential ways to integrate data

OLAP

Online Analytical Processing (OLAP) and cubes are other words for multi-dimensional sets of data that essentially serve as a staging space in which to analyze information. These special online analytical processing databases hold data not in tables but in OLAP cubes, a mechanism used to store and query data in an organized, multi-dimensional structure specifically optimized for analysis. OLAP databases are designed to pre-calculate as many queries and combinations of data fields as possible in order to provide fast query responses. However, while these solutions perform better than classical relational databases, their multi-dimensional structure makes them inflexible and unable to accommodate change easily. In addition, storing large amounts of data in a cube causes a performance bottleneck. While OLAP databases are quite useful for basic use cases, large data sets require using capabilities from additional tools in tandem, which complicates analytical efforts and requires unique skills.

ROLAP

Another way to organize data for multi-dimensional querying is relational online analytical processing (ROLAP). ROLAP is a form of OLAP that performs multi-dimensional analysis of data stored in a relational database rather than in a multi-dimensional database, which is considered the OLAP standard. Although ROLAP technology performs better than OLAP databases when processing large amounts of data, it cannot beat the speed and efficiency of OLAP on smaller amounts of data. ROLAP databases require a great deal of manual maintenance and are difficult for business users to operate, so ROLAP is considered to be less flexible than OLAP cubes. OLAP and ROLAP are both still popular today, but neither technology can keep up with today's demands for near real-time data for analytics, nor can they handle unstructured data.
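The pre-calculation idea behind cubes can be illustrated with a plain aggregate table: combinations of dimension values are computed ahead of time so that queries become cheap lookups. A minimal sketch, reusing the hypothetical fact_orders table from the ETL example:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")

# Pre-calculate one "slice" of a cube: totals per (customer, month).
# A real OLAP engine would materialize many such combinations up front.
con.execute("DROP TABLE IF EXISTS agg_orders_customer_month")
con.execute(
    """
    CREATE TABLE agg_orders_customer_month AS
    SELECT customer_id,
           strftime('%Y-%m', order_date) AS month,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM fact_orders
    GROUP BY customer_id, month
    """
)
con.commit()

# Queries against the pre-aggregated table are now simple, fast lookups.
for row in con.execute(
    "SELECT month, total_amount FROM agg_orders_customer_month "
    "WHERE customer_id = 42 ORDER BY month"
):
    print(row)
```

The flip side is visible too: adding a new dimension or measure later means rebuilding the aggregate, which is exactly the inflexibility described above.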

MULTI-DIMENSIONAL DATABASES (OLAP, ROLAP) AT A GLANCE
+ Store and query data in an organized way
+ Fast query response due to pre-calculation
+ Fast and efficient for small amounts of data
- Problems with large amounts of data
- Inflexibility due to the multi-dimensional structure
- Performance bottleneck due to the storage limitations of cubes
- Need for manual maintenance
- Difficult for business users to use
- Need for additional tools when dealing with high data volumes

Because both the data warehouse and OLAP approaches fall short of business expectations for speedy and comprehensive analytical data access, a new approach surfaced. Self-service business intelligence (SSBI) technologies like Qlik and Tableau introduced an approach to data analytics that enables business users to access and work with corporate information without the IT department's involvement. These SSBI tools can blend, or locally integrate, data from the data warehouse with any other data sources not stored in the data warehouse. This is accomplished by pulling copies of the data sources into a local data store where the analyst can blend or integrate the data as needed. These self-service tools are flexible, relatively easy to implement, and provide a good level of independence for data analysts, but there are clear disadvantages to the approach. The most prominent is that data analysis performed in this manner quickly becomes unmanageable, resulting in redundant work, inconsistent results, and, in short, chaotic reporting practices when used on a broad scale throughout an organization. Since everybody can define their own rules and calculations, it is both possible and likely that different groups and individuals will calculate the same KPIs and metrics in different ways, leading to an array of conflicting results and the publishing of confusing and contradictory information. Because these solutions have no permissions structure, there is no security layer to protect sensitive data, a severe vulnerability since analysts frequently and casually exchange data files. Also, the ability to transform the data is relatively limited in most cases. Further, because many machines are doing the same work for different users in parallel, powerful computing resources are used inefficiently, contributing to rising costs and lower system performance. For all of these reasons, pure SSBI tools can fill a limited and short-term need but fall short of being an end-to-end, enterprise-level analytical solution.

SELF-SERVICE BI TOOLS AT A GLANCE
+ Enable business users to perform analysis without IT support
+ Data blending of external data sources with the data warehouse
+ Flexible and easy to implement
- Different KPI calculations due to decentralized analytics
- No security layer
- Limited data transformation capabilities
- Inefficient use of resources due to parallel usage

As SSBI tools evolved, data scientists were still wrestling with the overall challenge of finding an analytical database as flexible for analytics as relational databases were for transactional data processing. Progressive software vendors sought to overcome the limitations of data warehouses, cubes, and SSBI tools, and began working toward databases that were both flexible and able to process analytical workloads. These analytical databases, or column stores, were the next step in the trend of giving business analysts the tools and flexibility they need. They have since evolved into massively parallel processing (MPP) analytical databases that are more flexible and more performant than cubes, even when large amounts of data are being stored and queried. However, these analytical databases require that data be copied into them using processes very similar to the aforementioned ETL processes, with similar drawbacks. The load processes are typically slower than in a traditional data warehouse based on row-based technology because an extra step is required to optimize the data for quick analytical retrieval: the data must be converted from a row-based format into a columnar format, and field-level data compression must then be applied. Although these extra steps provide significant performance improvements, they also require additional time that delays the analysts' ability to analyze the data. Because of this load-time latency, it is impossible to access real-time data in analytical databases.

ANALYTICAL DATABASES AT A GLANCE
+ Scalable and able to deal with huge workloads
+ Strong parallel processing
+ High scalability
- Slow load processes due to the conversion from row-based to column-based data and data compression
- No real-time data access
- Not agile
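Before moving on, the row-to-column conversion and field-level compression described above can be sketched in a few lines; the data and the two encodings (run-length and dictionary) are illustrative of what column stores do at load time.

```python
# Sketch of the extra load step in a column store: pivot rows into
# per-column arrays, then compress each column independently.
rows = [
    ("2017-01-01", "DE", 19.99),
    ("2017-01-01", "DE", 5.00),
    ("2017-01-02", "US", 19.99),
]

# Row-based -> columnar: one array per field.
dates, countries, amounts = map(list, zip(*rows))

# Field-level compression, e.g. run-length encoding for a sorted column ...
def run_length_encode(values):
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

# ... and dictionary encoding for a low-cardinality column.
def dictionary_encode(values):
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

print(run_length_encode(dates))      # [['2017-01-01', 2], ['2017-01-02', 1]]
print(dictionary_encode(countries))  # ({'DE': 0, 'US': 1}, [0, 0, 1])
```

These extra passes over the data are precisely where the load-time latency comes from, even though they make later scans and aggregations much faster.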

Next came the data lake strategy. Data lakes are storage repositories able to hold vast amounts of raw data in its native format until it is needed. In many cases data lakes are Hadoop-based systems, and they represent the next stage in both power and flexibility. A compelling benefit of the approach is that there is no need to structure (transform) the data before querying it (which would be referred to as schema on write). Instead, you can assign structure to the data at the time it is queried (referred to as schema on read). However, while data lakes can hold large amounts of unstructured data in a cost-effective manner, they are insufficient for interactive analysis when fast query response is required or when access to real-time data is needed.

The proliferation of data lakes enabled the switch from ETL to ELT (extract, load, and transform). Unlike ETL, where data is transformed before it is loaded into the database, ELT significantly accelerates load time by ingesting data in its raw state. The rationale behind this approach is that data lake storage technologies are not picky about the structure of the data, so no development time is required to transform the data into the right structure before it can be accessed for analytics. This means that all data can simply be parked, or dumped, into a data lake, and all further operations and transformations can occur within this repository if and when needed. While it is a tantalizing approach, the data lake falls short of expectations for several reasons. A primary objective of the data lake is to simplify and accelerate; however, the approach often complicates matters with extra steps to prepare data for analytics, and although it provides significant reductions in labor for data loads, it still requires that all data be moved or copied to a single location before it is accessible for analytical purposes. This drawback is shared with the traditional ETL-based data warehouse, since data-load latency cannot be eliminated from the analytical data supply chain, although the load-time latency is greatly reduced for a data lake compared to a data warehouse. Another disadvantage of the data lake is a phenomenon that has come to be known as the data swamp or data graveyard. Because storage is cheap, the data lake approach often leads to dumping and storing much more data than with ETL, but this save-everything approach means loading and storing far more data than businesses are prepared to analyze. Since any data load takes time and consumes disk space and network bandwidth, unnecessary loads can be expensive and cause additional latency that delays other, more analytically valuable data from being analyzed in a timely manner.
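A minimal sketch of the ELT pattern follows: raw records are dumped into the lake unchanged, and structure is imposed only when a question is asked (schema on read). The file name and event fields are hypothetical.

```python
import json

# Load: dump raw events into the lake without any upfront modeling.
raw_events = [
    '{"user": "a", "action": "click", "ts": "2017-03-01T10:00:00"}',
    '{"user": "b", "action": "buy", "amount": 19.99}',   # different shape
    '{"user": "a", "action": "buy", "amount": 5.00}',
]
with open("lake_events.json", "w") as lake:
    lake.write("\n".join(raw_events))

# Transform on read: structure is applied only when the query runs.
with open("lake_events.json") as lake:
    events = [json.loads(line) for line in lake]

revenue_per_user = {}
for event in events:
    if event.get("action") == "buy":          # tolerate missing fields
        user = event["user"]
        revenue_per_user[user] = revenue_per_user.get(user, 0) + event["amount"]

print(revenue_per_user)  # {'b': 19.99, 'a': 5.0}
```

Nothing stopped the mismatched record shapes from landing in the lake, which is both the appeal of the approach and the seed of the data swamp problem described above.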

Although data lakes and ELT bring data together into one place quickly, they cannot provide the fast query responses that analytical databases do, nor can they provide access to data in real time.

DATA LAKES AND ELT AT A GLANCE
+ Hold vast amounts of unstructured data
+ No need to structure data before querying it
+ Efficient data loads
- No real-time analysis possible
- Data needs to be moved to a single location before analysis
- Low costs encourage data graveyards, which decrease performance and increase costs

Looking back at both traditional data warehouses and data lakes, one commonality they share is that they rely on having all data in a physical, central repository. The idea was that before you could work with the data, you had to corral it into a single location. However, this assumption has been a barrier to accelerating data accessibility, and it is what is fundamentally wrong with all of the approaches discussed so far.

While the majority of data analysts were busy exploring the progression from relational databases to cubes, analytical databases, and data lakes, another camp was looking into using data federation to integrate data for analysis. Data federation allows analysts to instantly run queries joining multiple disparate databases without the need to copy or move data from the original operational sources to a central analytical repository. This approach is clearly a significant improvement on all of its predecessors in terms of the immediacy with which data can be analyzed. While the idea is sound and the value self-evident, data federation alone is not scalable for large amounts of data or for large numbers of simultaneous users. In addition, because it relies heavily on the speed and stability of the source systems and the network, its performance commonly suffers, for both data analysis and production operations. So, while data federation is quick and flexible, by itself it is not scalable or particularly dependable. But it was an important step in the right direction. The next stage of evolution was to combine data federation with caching repositories to address these issues. This hybrid approach used big data solutions to complement data warehousing. The result is a combination of repositories, virtualization, and distributed processes for data management that delivers the best capabilities of several technologies but still falls short of the expectation of a robust, agile, performant data warehouse. Caching can be problematic because cache loads must be scheduled around the performance concerns of source systems, and because the cache is loaded into a single repository that may or may not be optimized for different data sets and data types.
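The core idea of data federation, a single query spanning several independent databases without first copying them into a central store, can be sketched with SQLite's ATTACH mechanism; the two database files and their tables are hypothetical stand-ins for separate operational systems.

```python
import sqlite3

# Federation sketch: one query joins two physically separate databases.
# crm.db and shop.db stand in for two independent operational systems.
con = sqlite3.connect("crm.db")
con.execute("ATTACH DATABASE 'shop.db' AS shop")

# The join executes across both sources; no data is copied into a
# central repository beforehand.
query = """
    SELECT c.name, SUM(o.amount) AS revenue
    FROM customers AS c
    JOIN shop.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY revenue DESC
"""
for name, revenue in con.execute(query):
    print(name, revenue)
```

A production federation layer adds query planning, push-down optimization, and connectors for heterogeneous sources, but the principle is the same: the join happens across the sources, not after a copy. The sketch also hints at the weakness noted above, since every query leans directly on the source systems.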

Still, in moving closer to the modern data warehouse, virtual data technology is essential, from simple federation to virtualization, as well as virtual views, indices, and semantics. Developing virtual or logical data views is faster than relocating all data physically and can be done easily through point-and-click operations. In addition, virtual views can be altered without the need to transform and reload data, as in earlier data warehouse integration approaches, meaning that changes can be presented live, immediately, without waiting for the data to populate through an overnight process. It is the virtualization of data integration that enables extreme agility in analytical development and significantly reduces build times and costs, all of which leads us to the next breakthrough in data warehousing.

DATA FEDERATION AT A GLANCE
+ Joins databases without the need to copy them into a central repository
+ Very fast data access
+ Flexible changes to virtual views
+ Virtual data integration enables extreme agility and reduces build times and costs
- Limited scalability (e.g., many simultaneous users)
- Caching repositories cause performance problems
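Continuing the same hypothetical setup, altering a virtual view is a metadata-only operation, which is why changes can go live immediately rather than waiting for an overnight load:

```python
import sqlite3

con = sqlite3.connect("crm.db")
con.execute("ATTACH DATABASE 'shop.db' AS shop")

# A virtual (session-scoped) view: defined as metadata, computed at
# query time, no data copied or moved.
con.execute("""
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.id, c.name, SUM(o.amount) AS revenue
    FROM customers AS c
    JOIN shop.orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
""")

# Changing the definition is drop-and-recreate, touching no data at all.
# The new shape is live for the very next query.
con.execute("DROP VIEW customer_revenue")
con.execute("""
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.id, c.name, SUM(o.amount) AS revenue, COUNT(*) AS order_count
    FROM customers AS c
    JOIN shop.orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
""")
```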

The First Logical Data Warehouse

A modern data integration strategy employs what is known as best-fit engineering, whereby each part of the data management infrastructure uses the most appropriate technology for its role, including data storage determined by business requirements and service-level agreements (SLAs). Unlike a data lake, this new architecture takes a distributed approach, aligning the choice of information storage with the way information is used and leveraging multiple data technologies that are fit for specific purposes. A hybrid approach can also significantly reduce costs and time to delivery when changes or additions to the warehouse are required. One term for this new architecture is the logical data warehouse. Another is the virtual data lake. In either case, the premise is that there is no single data repository. Instead, the logical data warehouse is an ecosystem of multiple fit-for-purpose repositories, technologies, and tools that interact synergistically to manage data storage and provide performant enterprise analytical capabilities. The original unmet analytical requirements of the traditional data warehouse were to retrieve data using a single query language, to get speedy query responses, and to quickly assemble different data models, or views of the data, to meet specific needs. By combining data federation, physical data integration, and a common query language (SQL), the logical data warehouse approach achieves all three of these goals without the need to copy or move all of the data to a central location.

Physical data integration is a robust feature of the logical data warehouse that ensures fast query response while decoupling performance from the source data stores and moving it to the logical data warehouse repository. In this manner, the effort-intensive physical transfer of data is minimized and simplified, effectively removing lengthy data movement delays from the critical path of data integration projects. In Understanding the Logical Data Warehouse: The Emerging Practice, Gartner weighed in on this approach, pointing out that it offers flexibility for companies that have different data requirements at different times. For example, many use cases require a central repository, such as a traditional data warehouse or analytical database, where data that is needed frequently, or with the greatest retrieval speed, can be stored and optimized for performance. Increasingly, data analysts must be able to explore data freely with guaranteed adequate query performance. Frequent use cases along these lines are sentiment analysis and fraud detection. These use cases require a distributed technology such as Hadoop to store the massive amounts of data available through social media feeds and clickstream activity logs. Additionally, they demand direct access to data sources via data federation. As Gartner rightly indicates, a logical layer is needed on top of these technologies in order to unify the architecture and allow queries and processes to operate on all systems concurrently as needed.
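Conceptually, the logical layer sits above the repositories and decides, per query, whether to answer from an optimized local replica or to federate out to the live sources. The sketch below is deliberately simplified and entirely illustrative: the class, the time-to-live rule, and the connection objects are invented for the example, not a description of any product's internals.

```python
import time

# Simplified sketch of a logical layer's routing decision: serve hot,
# recently materialized data from a local replica, and federate
# everything else out to the live source systems.
CACHE_TTL_SECONDS = 24 * 3600

class LogicalLayer:
    def __init__(self, replica, sources):
        self.replica = replica        # e.g. an analytical database connection
        self.sources = sources        # live connections, keyed by table name
        self.replicated_at = {}       # table -> timestamp of last cache load

    def query(self, table, sql):
        loaded = self.replicated_at.get(table)
        if loaded is not None and time.time() - loaded < CACHE_TTL_SECONDS:
            # Hot path: the table was materialized recently, answer locally.
            return self.replica.execute(sql)
        # Cold or real-time path: push the query down to the source system.
        return self.sources[table].execute(sql)
```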

As the first logical data warehouse, Data Virtuality provides this uniform layer over numerous data storage technologies, unifying these data stores and facilitating the use cases suggested above by Gartner. By routing queries among data stores behind the scenes as needed, the Data Virtuality technology offers great benefits to business users. The business can use the same platform for handling a variety of use cases, far more than could be handled by a traditional data warehouse. New approaches to data integration also become possible, enabling users to put business needs first and allow the technology platform to adapt as needed. By decoupling the semantic, unified data access layer with which business users interact from the actual data sources, changes occurring in the original data sources can be isolated from analytical processes. In a profound departure from past data accessibility strategies, business users can interact with data comfortably and easily, focusing on their objective rather than on the technological underpinnings. By consolidating relational and non-relational data sources, including real-time data, Data Virtuality enables immediate analysis via the SQL query language. Data Virtuality provides a central data cockpit, allowing all data sources, whether analytical or operational, to freely interchange data. Integrated connectors allow data to be immediately processed in analysis, planning, or statistics tools, or written back to source systems as needed. In addition, the logical data warehouse automatically adjusts to changes in the IT landscape and in user behavior, offering the highest possible degree of flexibility and speed with little administrative overhead. In a logical data warehouse project, a few clicks can seamlessly connect all data-producing and data-processing systems, including ERP and CRM systems, web shops, social media applications, and just about any SQL and NoSQL data source, all in real time. With instant access to the data, users can begin experimenting with these connections and joins until they achieve the results they want.

In stark contrast to traditional ETL solutions, the key difference with the logical data warehouse is that there is no need to move the data to analyze it. This greatly reduces development and database structuring time and costs. Equally flexible and responsive, the logical data warehouse is a completely different data integration paradigm from the inflexible traditional data warehouse approach. The logical data warehouse works by intelligently marrying two distinct technologies to create an entirely new manner of integrating data. The first is data federation, which connects two or more disparate databases and makes them all appear as if they were a single database. The second is analytical database management, which provides semantic, business-friendly data element naming and modeling, allowing flexible ingestion and modeling options. The results are profound. Data federation alone offers flexibility but cannot scale. Analytical database management scales beautifully but is inflexible. The combination of the two enables breakaway flexibility and performance and represents an entirely new paradigm in the way we think about, manage, and work with data. For example, a logical data warehouse can connect to a variety of data sources simultaneously, including classic relational databases like Oracle and MS SQL; NoSQL databases like MongoDB or Hadoop; column stores like Vertica or SAP HANA; and web services like Google Analytics, AdWords, Facebook, Twitter, and others. Once these have been connected, the resulting integrated, overarching view of the data appears within a data analysis tool as if everything were contained in a single SQL database, accessible with a common query language. Virtually any data analysis tool currently on the market (such as Qlik, Tableau, Aqua Data Studio, etc.) can connect to, query, and analyze data over the virtual layer with no need to pull or copy data from any location. The method offers vast new opportunities for data exploration, data discovery, rapid prototyping, and intuitive experimentation. Business users can get results instantly and can refactor data models just as quickly. Further, building logical data views as shareable components, including common KPIs and metrics, can ensure that every report, every visualization, and every query response conforms to the same corporate standards and definitions. Data Virtuality acts as a central hub connecting all systems and applications within the enterprise, enabling data exchange between systems and ensuring the latest data is available anywhere, at any time.

THE LOGICAL DATA WAREHOUSE AT A GLANCE
+ Consolidation of structured, unstructured, and real-time data by combining data federation and analytical database management
+ No need to move data for analysis
+ Immediate processing (analysis) or write-back to data sources
+ Central hub connecting all systems and applications within the enterprise
+ Ensures the latest data everywhere, at any time
- Needs at least 10 different data sources to show its full efficiency
- No integrated analytical tool
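The shareable-component idea mentioned above can be made concrete with a governed KPI published once as a view; every tool that selects from it inherits the same definition. A sketch, again using the hypothetical warehouse tables from the earlier examples:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")

# One governed definition of a KPI, published as a view. Every report
# and visualization that selects from it shares the same calculation,
# instead of each analyst re-deriving it differently.
con.execute("DROP VIEW IF EXISTS kpi_average_order_value")
con.execute("""
    CREATE VIEW kpi_average_order_value AS
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount) * 1.0 / COUNT(*)  AS average_order_value
    FROM fact_orders
    GROUP BY month
""")

# Any SQL-speaking tool now consumes the KPI rather than recomputing it.
for month, aov in con.execute("SELECT * FROM kpi_average_order_value"):
    print(month, aov)
```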

A MODERN DATA WAREHOUSE

The logical data warehouse is essential for organizations that wish to combine big data and data warehousing in the enterprise.

A VIRTUAL DATA MART

A logical data warehouse makes it easy to create a virtual data mart for expediency. By combining an organization's primary data infrastructure with auxiliary data sources relevant to specific, data-driven business units, initiatives can move forward more quickly than if the data had to be on-boarded into a traditional data warehouse.

AN EVOLVING CORPORATION

Modern data integration allows rapidly changing organizations to quickly combine data from disparate business units and provide BI and analytical transparency to top management. This kind of flexibility is crucial for strategic changes, mergers and acquisitions, and other sensitive operations where there is no time to waste building a central data warehouse.

E-COMMERCE

Modern data integration offers a compelling solution for e-commerce and retail organizations with a great number of different systems in their IT landscape. For example, a typical e-commerce business has an ERP system, a CRM, web and mobile apps, analytics programs, online marketing, social media marketing, and other tools. With a logical data warehouse, all of these data sources can be joined quickly and flexibly to provide 360-degree views of customers, products, and more.

DIGITAL MARKETING

Digital marketing is extremely data-driven, relying on a volatile flow of real-time data. A logical data warehouse offers the only viable way to manage complexity of this kind, easily connecting to a host of digital marketing data providers for affiliate marketing, performance marketing, personalization, and other approaches.

MAKING DATA ACTIONABLE

Modern data integration methods go the extra mile by making data actionable. In addition to receiving data in one direction for analysis, a user can write data back, essentially triggering actions based on the data (see the sketch after this list). For example, the solution can analyze data from an ERP, a CRM, and a web shop simultaneously to trigger marketing campaigns unconstrained by traditional business hours.

REAL-TIME ANALYSIS

The logical data warehouse excels at manipulating real-time data and can flexibly model and re-model the data to fit the latest analytical initiatives.

INTEGRATING BIG DATA

The open-source big data solution Hadoop is adept at analyzing unstructured data and performing batch analysis but performs poorly in interactive situations. To achieve real-time functionality, companies must combine the traditional data warehouse with modern big data tools, often multiple ones, such as an Oracle warehouse with Hadoop and Greenplum. Unifying these data sources into one common view provides instant access to a 360-degree view of your organization.
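A minimal sketch of the write-back idea follows: an analysis across two hypothetical source systems produces rows that are written back into an operational queue to trigger a campaign. The tables, threshold, and campaign name are invented for illustration.

```python
import sqlite3

# Sketch of "actionable data": analyze across sources, then write the
# result back into an operational system. Tables and rule are invented.
con = sqlite3.connect("crm.db")
con.execute("ATTACH DATABASE 'shop.db' AS shop")

# Find customers whose spending crossed a threshold this month.
vip_candidates = con.execute("""
    SELECT c.id, SUM(o.amount) AS monthly_total
    FROM customers AS c
    JOIN shop.orders AS o ON o.customer_id = c.id
    WHERE strftime('%Y-%m', o.order_date) = strftime('%Y-%m', 'now')
    GROUP BY c.id
    HAVING monthly_total > 1000
""").fetchall()

# Write back: queue a campaign in a (hypothetical) marketing system,
# so the analysis directly triggers an action.
con.execute(
    "CREATE TABLE IF NOT EXISTS campaign_queue (customer_id INTEGER, campaign TEXT)"
)
con.executemany(
    "INSERT INTO campaign_queue VALUES (?, 'vip_upgrade_offer')",
    [(cid,) for cid, _ in vip_candidates],
)
con.commit()
```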

In this digital era, harnessing large amounts of data to make astute business decisions and improve operations is an imperative. While our ability to generate data still far outstrips our ability to analyze it effectively, we are making great progress toward closing that gap. Exciting new approaches are merging big data solutions with traditional enterprise data strategies. Without the need for a central repository, logical data warehouses hold enormous promise. By offering an ecosystem of multiple best-fit repositories, technologies, and tools, businesses can now effectively and rapidly analyze real-time data in pursuit of valuable insight. For organizations sifting through reams of data for treasure, these virtual data lakes represent the Holy Grail that can help them tailor products and fulfill desires we haven't yet dreamed of.

HADOOP AND SPARK

Hadoop and Spark work together to offer impressive in-memory data processing for big data applications. Although there has been hope that the in-memory capabilities of Spark would solve many of the latency issues related to Hadoop, both technologies have limitations and fall short of a one-size-fits-all solution. Apache Hadoop is an open-source software framework providing distributed storage and processing of very large data sets, data sets so large that it would not be economical to store them in almost any other data storage technology. Hadoop accomplishes this by using a multi-server clustering approach that removes many earlier constraints on storing and processing large data sets. To process data, Hadoop's MapReduce framework abandons the convention of moving data over a network to an application server for processing. Instead, MapReduce analyzes data on the individual servers where it resides and then compiles the results from those servers into a single response to the query. Hadoop itself is not a single system but rather an ecosystem of numerous interconnected products that allows users to run various types of analytics and operations on any type of data. Hadoop is open source, so it is constantly evolving and improving. While Hadoop is complex to use, startups and established companies alike are quickly creating tools to simplify and expand its use. For example, executing queries within the Hadoop ecosystem originally required extensive knowledge of newer and lesser-known programming frameworks and languages such as MapReduce, Pig, and Python. The result of this custom coding was that queries could be performed on data types previously impossible to query, such as unstructured data, but at the cost of there being fewer programmers available to write and run these queries.
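The MapReduce idea, running the computation next to each shard of the data and then merging the partial results, can be sketched in plain Python. This toy version runs in one process; Hadoop's contribution is executing the same pattern across a cluster of servers that each hold part of the data.

```python
from collections import Counter
from functools import reduce

# Each "server" holds a shard of the data; the map phase runs locally
# per shard, next to where the data lives.
shards = [
    ["click", "buy", "click"],   # data on server 1
    ["click", "click"],          # data on server 2
    ["buy"],                     # data on server 3
]

def map_phase(shard):
    # Count events locally on each server.
    return Counter(shard)

def reduce_phase(left, right):
    # Merge partial counts into a single response to the query.
    return left + right

partials = [map_phase(shard) for shard in shards]
total = reduce(reduce_phase, partials, Counter())
print(total)  # Counter({'click': 4, 'buy': 2})
```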

Today, however, numerous products allow the very popular SQL query language to be used to analyze data stored in Hadoop. Classic Hadoop is in itself batch-oriented and, as such, is capable of analyzing vast amounts of data with relative ease by distributing the work across a number of different Hadoop nodes that act in parallel to produce the results. However, analyzing smaller amounts of data requires just as much complexity and programming as processing large data sets, so overall it is a rather slow way to query data. Apache Spark and related technologies are making an effort to improve Hadoop query performance by adding a fast, in-memory data processing engine with development APIs. The objective is that technologies such as these will eventually allow data workers to execute streaming, machine learning, or SQL workloads on Hadoop in a timely manner and with less custom coding. While almost any analytical task can be undertaken with Hadoop, including analysis of very large amounts of data as in fraud and sentiment analysis, overall it remains a relatively immature technology whose ecosystem is not yet fully integrated and which requires custom coding at several junctures for complete functionality. Because it is highly technical and difficult to use, most often success with Hadoop comes in the form of an inexpensive data archive.
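A sketch of what the SQL layer buys analysts, assuming a PySpark installation; the file path and schema are hypothetical:

```python
# Sketch of SQL-on-Hadoop, assuming PySpark is installed and an
# events.json file exists in HDFS; path and fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop-sketch").getOrCreate()

# Read semi-structured data directly from distributed storage.
events = spark.read.json("hdfs:///data/events.json")
events.createOrReplaceTempView("events")

# Analysts write plain SQL instead of hand-coded MapReduce jobs; the
# engine distributes the work across the cluster.
top_actions = spark.sql("""
    SELECT action, COUNT(*) AS occurrences
    FROM events
    GROUP BY action
    ORDER BY occurrences DESC
""")
top_actions.show()
```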

Data Virtuality GmbH develops and distributes the software DataVirtuality, which affords companies an especially simple means of integrating and connecting a variety of data sources and applications. The solution is revolutionizing the technological concept of data virtualization and generates a data warehouse consisting of relational and non-relational data sources in just a few days. Using integrated connectors, the data can be immediately processed in analysis, planning, or statistics tools, or written back to source systems as needed. The data warehouse also automatically adjusts to changes in the IT landscape and in user behavior, which lends companies using DataVirtuality the highest possible degree of flexibility and swiftness with minimal administrative overhead. Founded in 2012, the Leipzig- and San Francisco-based company originated from a research initiative of the Chair of Information Technology at the Universität Leipzig and is financed by Technologiegründerfonds Sachsen (TGFS) and High-Tech Gründerfonds (HTGF).

COMPANY CONTACT:
Nick Golovin, Ph.D.
Founder and CEO
Data Virtuality GmbH
E-mail: nick.golovin@datavirtuality.com


More information

Hyper-Converged Infrastructure: Providing New Opportunities for Improved Availability

Hyper-Converged Infrastructure: Providing New Opportunities for Improved Availability Hyper-Converged Infrastructure: Providing New Opportunities for Improved Availability IT teams in companies of all sizes face constant pressure to meet the Availability requirements of today s Always-On

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

SAP Agile Data Preparation Simplify the Way You Shape Data PUBLIC

SAP Agile Data Preparation Simplify the Way You Shape Data PUBLIC SAP Agile Data Preparation Simplify the Way You Shape Data Introduction SAP Agile Data Preparation Overview Video SAP Agile Data Preparation is a self-service data preparation application providing data

More information

IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK

IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK IOTA ARCHITECTURE: DATA VIRTUALIZATION AND PROCESSING MEDIUM DR. KONSTANTIN BOUDNIK DR. ALEXANDRE BOUDNIK DR. KONSTANTIN BOUDNIK DR.KONSTANTIN BOUDNIK EPAM SYSTEMS CHIEF TECHNOLOGIST BIGDATA, OPEN SOURCE

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

12 Minute Guide to Archival Search

12 Minute Guide to  Archival Search X1 Technologies, Inc. 130 W. Union Street Pasadena, CA 91103 phone: 626.585.6900 fax: 626.535.2701 www.x1.com June 2008 Foreword Too many whitepapers spend too much time building up to the meat of the

More information

Data 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp.

Data 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp. 17-18 March, 2018 Beijing Data 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp. The world is changing AI increased by 300% in 2017 Data will grow to 44 ZB in 2020 Today, 80% of organizations

More information

Accelerate Your Enterprise Private Cloud Initiative

Accelerate Your Enterprise Private Cloud Initiative Cisco Cloud Comprehensive, enterprise cloud enablement services help you realize a secure, agile, and highly automated infrastructure-as-a-service (IaaS) environment for cost-effective, rapid IT service

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information

Real Time for Big Data: The Next Age of Data Management. Talksum, Inc. Talksum, Inc. 582 Market Street, Suite 1902, San Francisco, CA 94104

Real Time for Big Data: The Next Age of Data Management. Talksum, Inc. Talksum, Inc. 582 Market Street, Suite 1902, San Francisco, CA 94104 Real Time for Big Data: The Next Age of Data Management Talksum, Inc. Talksum, Inc. 582 Market Street, Suite 1902, San Francisco, CA 94104 Real Time for Big Data The Next Age of Data Management Introduction

More information

The Evolution of Big Data Platforms and Data Science

The Evolution of Big Data Platforms and Data Science IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering

More information

Safe Harbor Statement

Safe Harbor Statement Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment

More information

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes AN UNDER THE HOOD LOOK Databricks Delta, a component of the Databricks Unified Analytics Platform*, is a unified

More information

New Approaches to Big Data Processing and Analytics

New Approaches to Big Data Processing and Analytics New Approaches to Big Data Processing and Analytics Contributing authors: David Floyer, David Vellante Original publication date: February 12, 2013 There are number of approaches to processing and analyzing

More information

Full file at

Full file at Chapter 2 Data Warehousing True-False Questions 1. A real-time, enterprise-level data warehouse combined with a strategy for its use in decision support can leverage data to provide massive financial benefits

More information

Oracle Big Data SQL brings SQL and Performance to Hadoop

Oracle Big Data SQL brings SQL and Performance to Hadoop Oracle Big Data SQL brings SQL and Performance to Hadoop Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data SQL, Hadoop, Big Data Appliance, SQL, Oracle, Performance, Smart Scan Introduction

More information

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION The process of planning and executing SQL Server migrations can be complex and risk-prone. This is a case where the right approach and

More information

White Paper: Delivering Enterprise Web Applications on the Curl Platform

White Paper: Delivering Enterprise Web Applications on the Curl Platform White Paper: Delivering Enterprise Web Applications on the Curl Platform Table of Contents Table of Contents Executive Summary... 1 Introduction... 2 Background... 2 Challenges... 2 The Curl Solution...

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

Four Steps to Unleashing The Full Potential of Your Database

Four Steps to Unleashing The Full Potential of Your Database Four Steps to Unleashing The Full Potential of Your Database This insightful technical guide offers recommendations on selecting a platform that helps unleash the performance of your database. What s the

More information

MODERNIZE INFRASTRUCTURE

MODERNIZE INFRASTRUCTURE SOLUTION OVERVIEW MODERNIZE INFRASTRUCTURE Support Digital Evolution in the Multi-Cloud Era Agility and Innovation Are Top of Mind for IT As digital transformation gains momentum, it s making every business

More information

E-Guide DATABASE DESIGN HAS EVERYTHING TO DO WITH PERFORMANCE

E-Guide DATABASE DESIGN HAS EVERYTHING TO DO WITH PERFORMANCE E-Guide DATABASE DESIGN HAS EVERYTHING TO DO WITH PERFORMANCE D atabase performance can be sensitive to the adjustments you make to design. In this e-guide, discover the affects database performance data

More information

Big Data The end of Data Warehousing?

Big Data The end of Data Warehousing? Big Data The end of Data Warehousing? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Big data, data warehousing, advanced analytics, Hadoop, unstructured data Introduction If there was an Unwort

More information

Data Lake Based Systems that Work

Data Lake Based Systems that Work Data Lake Based Systems that Work There are many article and blogs about what works and what does not work when trying to build out a data lake and reporting system. At DesignMind, we have developed a

More information

Enterprise Data Architecture: Why, What and How

Enterprise Data Architecture: Why, What and How Tutorials, G. James, T. Friedman Research Note 3 February 2003 Enterprise Data Architecture: Why, What and How The goal of data architecture is to introduce structure, control and consistency to the fragmented

More information

Big Data Specialized Studies

Big Data Specialized Studies Information Technologies Programs Big Data Specialized Studies Accelerate Your Career extension.uci.edu/bigdata Offered in partnership with University of California, Irvine Extension s professional certificate

More information

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

MAPR DATA GOVERNANCE WITHOUT COMPROMISE MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance

More information

The Business Value of Metadata for Data Governance: The Challenge of Integrating Packaged Applications

The Business Value of Metadata for Data Governance: The Challenge of Integrating Packaged Applications The Business Value of Metadata for Data Governance: The Challenge of Integrating Packaged Applications By Donna Burbank Managing Director, Global Data Strategy, Ltd www.globaldatastrategy.com Sponsored

More information

Lambda Architecture for Batch and Stream Processing. October 2018

Lambda Architecture for Batch and Stream Processing. October 2018 Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.

More information

An Oracle White Paper June Exadata Hybrid Columnar Compression (EHCC)

An Oracle White Paper June Exadata Hybrid Columnar Compression (EHCC) An Oracle White Paper June 2011 (EHCC) Introduction... 3 : Technology Overview... 4 Warehouse Compression... 6 Archive Compression... 7 Conclusion... 9 Introduction enables the highest levels of data compression

More information

Answer: A Reference:http://www.vertica.com/wpcontent/uploads/2012/05/MicroStrategy_Vertica_12.p df(page 1, first para)

Answer: A Reference:http://www.vertica.com/wpcontent/uploads/2012/05/MicroStrategy_Vertica_12.p df(page 1, first para) 1 HP - HP2-N44 Selling HP Vertical Big Data Solutions QUESTION: 1 When is Vertica a better choice than SAP HANA? A. The customer wants a closed ecosystem for BI and analytics, and is unconcerned with support

More information

QLIKVIEW ARCHITECTURAL OVERVIEW

QLIKVIEW ARCHITECTURAL OVERVIEW QLIKVIEW ARCHITECTURAL OVERVIEW A QlikView Technology White Paper Published: October, 2010 qlikview.com Table of Contents Making Sense of the QlikView Platform 3 Most BI Software Is Built on Old Technology

More information

Moving Technology Infrastructure into the Future: Value and Performance through Consolidation

Moving Technology Infrastructure into the Future: Value and Performance through Consolidation Moving Technology Infrastructure into the Future: Value and Performance through Consolidation An AMI-Partners Business Benchmarking White Paper Sponsored by: HP Autumn Watters Ryan Brock January 2014 Introduction

More information