Big data easily, efficiently, affordably. UniConnect 2.1


Connecting Data. Delivering Intelligence

Big data easily, efficiently, affordably

UniConnect 2.1

The UniConnect platform is designed to unify data in a highly scalable and seamless manner by building on an organisation's existing tools, processes and skills. It enables organisations to meet their most pressing data challenges, including those of cost and inefficiency, while ensuring that they are future-proofed for the revolutionary potential that big data can bring.

A Percipient Technology White Paper
Author: Ravi Shankar Nair, Chief Technology Officer
April
Percipient Partners Pte. Ltd., All Rights Reserved. Reproduction Prohibited.


UniConnect 2.1: Big data easily, efficiently, affordably

Data storage challenges

A Data Warehouse (DWH) is one of the most important artifacts in an organisation. It stores transactional information from transactional (OLTP) systems and provides downstream aggregation of data. This aggregated data is then used for MIS, reporting, analytics and discovery. When the active use period is over, old data is archived and, for large data sets, compressed onto tapes to service future regulatory and compliance requirements.

Here is a traditional DWH architecture:

[Figure: traditional DWH architecture]

In some large organisations, it is not unusual to find DWHs separately maintained by different departments or country businesses. In addition, digital advances in recent years have enabled access to many more sources and types of data. This proliferation of data has led data experts to conclude that DWHs alone are no longer sufficient. To cater to exploding data storage needs and the boom in analytics usage, companies are turning to Hadoop-powered big data platforms to augment existing DWH capabilities and reduce costs.

It is estimated that, globally, 2.5 exabytes (2.5 billion gigabytes) of data is now generated every day. The State Bank of India currently generates about four terabytes (4,000 gigabytes) of new data per day, and is expected to soon require storage running into petabytes. Similarly, Walmart's one million online customer transactions every hour have resulted in over 2.5 petabytes (2.5 million gigabytes) of stored data.

2016, Percipient Partners Pte. Ltd., All Rights Reserved. Reproduction Prohibited.

Scalability is key

While all platforms claim to be scalable and sustainable, scaling has traditionally meant scaling up: for every block increase in data, adding more CPUs, hard disks and network cards, and of course paying correspondingly higher licensing costs.

Hadoop-based big data technologies allow data architecture to be scaled out rather than scaled up. This means decoupling the hardware from the software requirements of big data storage. Instead of expensive servers, scaling out can be achieved through massively distributed, MapReduce-based computational power and the use of significantly cheaper commodity machines. Percipient's research suggests that a structured warehouse costs an estimated 1 USD per GB, compared to about 20 cents per GB for big data storage.

A hybrid solution

Not surprisingly, some big data providers suggest that traditional DWHs can and should be replaced altogether. However, most experts now recommend a hybrid of DWH and Hadoop platforms. This is not only because of the administrative challenges of uprooting and migrating processes from one platform to another. Each platform also brings with it specific strengths and weaknesses.

A DWH is generally recommended for data that is actively used and requires a high level of cleansing, organisation, consistency, SQL-language querying and security. Big data platforms, by contrast, are ideal for data that is provisional, requires quick access, contains undetermined relationships, or benefits from unrestricted analytical exploration.

However, organisations choosing a hybrid solution face another problem. When data sets from both platforms are combined, the data needs to be duplicated from one platform to the other, leading to extra network bandwidth demand, time lags, wasted storage and the security risks of multiple data copies.

Percipient's UniConnect offers a unique solution. By deploying a UniConnect access layer, data can be unified directly from data sources, and from both DWH and Hadoop platforms, without copying data.
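The storage cost comparison above can be turned into a rough worked example. The sketch below is only an illustration: the two per-GB figures are the estimates quoted in the text, while the function names, the 10,000 GB data set and the 25% DWH share are hypothetical values chosen for the arithmetic.

```python
# Per-GB estimates quoted in the text (Percipient's research figures).
DWH_COST_PER_GB_USD = 1.00
BIG_DATA_COST_PER_GB_USD = 0.20


def storage_cost_usd(total_gb, dwh_share):
    """Estimated cost of storing total_gb when dwh_share of it lives in the DWH."""
    if not 0.0 <= dwh_share <= 1.0:
        raise ValueError("dwh_share must be between 0 and 1")
    dwh_gb = total_gb * dwh_share
    big_data_gb = total_gb - dwh_gb
    return dwh_gb * DWH_COST_PER_GB_USD + big_data_gb * BIG_DATA_COST_PER_GB_USD


def duplication_overhead_usd(copied_gb):
    """Extra cost of holding a second copy in the DWH of data already on Hadoop."""
    return copied_gb * DWH_COST_PER_GB_USD


# Hypothetical scenario: 10,000 GB in total, 25% kept in the DWH.
all_dwh = storage_cost_usd(10_000, 1.0)
hybrid = storage_cost_usd(10_000, 0.25)
hybrid_with_copy = hybrid + duplication_overhead_usd(1_000)

print(f"All in DWH:            {all_dwh:,.0f} USD")
print(f"Hybrid, no copying:    {hybrid:,.0f} USD")
print(f"Hybrid, 1,000 GB copy: {hybrid_with_copy:,.0f} USD")
```

Under these assumptions the hybrid layout costs well under half of an all-DWH layout, and every gigabyte duplicated into the warehouse erodes that saving, which is the cost that a no-copy access layer is meant to avoid.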


[Figure: UniConnect Easily Slips-in With No Disruption To Existing System/Process]

Below is a simple comparison of the strengths of a traditional DWH vs a Hadoop platform vs UniConnect's capabilities:

Data Requirement               DWH           Hadoop        UniConnect
Analytical processing (OLAP)   Batch-based   Batch-based   Interactive

Further data requirements compared in the table:
• Low latency processing
• ANSI SQL query language
• >1000 concurrent users
• Parallel processing
• System and users governance
• Unrestricted explorations
• Real time data loading
• Real time data querying
• Querying of compressed data
• Real time integration

How does UniConnect work?

Traditional data warehouse technologies offer ACID (Atomicity, Consistency, Isolation and Durability) properties. These require developers to hard-code queries in the business layer. However, in a rapidly growing analytics environment, organisations face serious redeployment effort and downtime, especially as queries evolve and change.

To offer a more flexible approach, new data storage software such as Neo4j, MongoDB and Cassandra has emerged. These systems apply a dynamic schema, an embedded in-memory database and a configurable metadata layer. However, all of these examples continue to rely on data duplication to connect data across disparate sources.

UniConnect overcomes these problems through in-memory processing of structured, unstructured and real-time data sources, which are then brought to a relational database as necessary.

UniConnect also allows for ultra-quick processing and avoids the delays that plague high-end ETL (Extract-Transform-Load) tools. Such tools are used to transform data from multiple sources, including mainframe product processors, in nightly batches. They are unable to process tasks in parallel, and therefore cannot process large-volume data or late files urgently.

UniConnect can be deployed when heavy and time-consuming tasks are required to meet critical deliverables, without replacing existing tools and while preserving institutionalised processes. UniConnect achieves this by providing extreme parallelism, which is fully extensible to an organisation's overall needs.

In addition, as mentioned earlier, many organisations have embraced a Hadoop-based big data solution to save on storage costs. Hadoop relies on MapReduce, a software framework able to process high volumes of data but at the cost of slow processing speeds. When accessing data from a Hadoop platform, UniConnect replaces MapReduce with an innovation called SkipMR, making processing more than 15 times faster for the same volume of data.
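The contrast between a serial nightly batch and parallel processing can be illustrated with a small sketch. This is not UniConnect's implementation, which the paper does not describe; it uses Python's standard `concurrent.futures` module, and the `transform_batch` function and the sample batches are hypothetical stand-ins for real source files.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_batch(batch):
    """Hypothetical transform: total the amounts in one source batch."""
    name, amounts = batch
    return name, sum(amounts)


def run_serial(batches):
    """Process one batch after another, as a sequential batch job would."""
    return dict(transform_batch(batch) for batch in batches)


def run_parallel(batches, workers=4):
    """Process independent batches concurrently on a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(transform_batch, batches))


# Hypothetical source batches, one per upstream system.
batches = [
    ("cards", [120, 80, 40]),
    ("loans", [1000, 250]),
    ("deposits", [500, 500, 500]),
]

totals = run_parallel(batches)
print(totals)
```

Because each batch is independent, a late-arriving file can be handed to a free worker instead of waiting for the whole sequence to finish. In Python, threads mainly help with I/O-bound work; CPU-heavy transforms would normally use a process pool or a distributed engine instead.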

UniConnect for Analytics

Besides offering organisations a simplified access layer, UniConnect also offers a more efficient analytics discovery layer. It does this in a number of ways.

Firstly, UniConnect offers a single window and a single language, simple SQL, by which to query both structured and unstructured data. Simple SQL is traditionally the preserve of structured DWH platforms. Unstructured data stored on a Hadoop platform requires an entirely different Hive Query Language (HQL) capability, but UniConnect dispenses with the need for this.

Secondly, UniConnect facilitates connectivity from R and Weka, using a JDBC interface. This connectivity gives data scientists the ability to use UniConnect to access the full range of R and Weka algorithms and statistical packages, including linear and nonlinear modelling, time-series analysis, classification and clustering.

Thirdly, UniConnect is uniquely integrated with the powerful but flawed Spark cluster computing framework. While Spark is said to run programmes up to 100 times faster in memory than Hadoop MapReduce, its memory utilisation is particularly high when data unification is required. By offloading the data management to UniConnect, this drain can be avoided, while users continue to enjoy Spark's execution engine and stack of libraries, including MLlib for machine learning and GraphX for graph computing.
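The idea of querying structured and semi-structured data through one SQL statement can be sketched as follows. This is a minimal illustration and not UniConnect's API: it uses Python's built-in `sqlite3` and `json` modules as a stand-in engine, the table names and sample records are hypothetical, and unlike the no-copy approach described above it loads the JSON records into the same in-memory engine before querying.

```python
import json
import sqlite3

# In-memory SQLite database standing in for a unified query layer.
conn = sqlite3.connect(":memory:")

# Structured data, as it might sit in a warehouse table (hypothetical).
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])

# Semi-structured records, as they might arrive from a big data store (hypothetical).
raw_events = [
    '{"customer_id": 1, "page": "loans"}',
    '{"customer_id": 1, "page": "cards"}',
    '{"customer_id": 2, "page": "loans"}',
]
conn.execute("CREATE TABLE events (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (:customer_id, :page)",
    [json.loads(line) for line in raw_events],
)

# One plain SQL query spanning both kinds of data.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS visits
    FROM customers AS c
    JOIN events AS e ON e.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

for name, visits in rows:
    print(name, visits)
```

The point of the sketch is the last query: the analyst writes ordinary SQL and does not need a second query language for the event data.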

Here is a summary of UniConnect's key functionalities:

UniConnect provides a high-performing query engine combining both data warehouse and big data platforms, without the need to duplicate data.

Business requirements and the UniConnect core functionalities that address them:

Operational Efficiency
• Unifies data across multiple sources without copying
• Direct access to HDFS/Hive
• Supports in-memory processing
• Data compression and access to compressed data
• Cloud-based deployment for each LOB, if required

Scale-out Data Storage
• Scalable and expandable using commodity machines
• Extensible licensing model

Data Security
• Leverages security model of the underlying platforms
• User-restricted access
• Admin user interface supportive of audit protocols

Real-Time User Engagement
• Able to integrate real-time messages with structured and unstructured data
• Reads real-time URL data (JSON format)

Reporting & Advanced Analytics
• Simple drag and drop or SQL queries
• Exposes APIs for external reporting applications
• Integrated with Spark for processing power, machine learning algorithms and graph computations
• Connectivity with R, access to R statistical packages
• Data retrieval as well as data write-back
