The Emerging Data Lake IT Strategy
An Evolving Approach for Dealing with Big Data & Changing Environments
bit.ly/datalake

Speakers:
Thomas Kelly, Practice Director, Cognizant Technology Solutions
Sean Martin, Founder and CTO, Cambridge Semantics
We're living in an amazing world of information sharing, connecting with family, neighbors, vendors, and customers all over the world.
Telling the world about what we like and don't like: #HIMYMfinale. @MLB is now following Cognizant Technology Solutions and Cambridge Semantics.
What we're doing and how we're succeeding.
We're deciding what advertising we want to see and what we don't ("Unsubscribe"), influencing how businesses and customers engage.
Many businesses have emerged that embrace this model of customer engagement:
- 10 million stays in 2013, without owning a hotel
- Nearly $75B in annual retail revenue in 2013, without opening a storefront
- Over 40 million photos shared each day
And we've said goodbye to businesses that didn't.
Retail: Engaging each customer in a more personalized shopping experience, retailers are building stronger relationships.
Customer Service: Delivering a positive and successful experience for each customer.
Life Sciences and Healthcare: Combining health, genetic, clinical, and public-sciences data to bring effective therapies to patients sooner.
Financial Services: Delivering innovative products and services based on a 360° view of the customer, across all business lines, engaging all available data assets, internal and external.
The Challenges That We're Addressing

Onboarding and Integrating Data Is Slow and Expensive
- Transforming data from a growing variety of technologies
- Custom-coded ETL; existing ETL processes are not reusable
- Optimization for analytics is time-consuming and costly
- Onboarding often waits until there is a defined need for a set of data, delaying benefits realization

Data Provenance Is Often Poorly Recorded
- Data meaning is lost in translation
- Data transformations are tracked in spreadsheets
- Post-onboarding, maintenance and analysis costs for onboarded data are high
- Recreating data lineage is manual, time-consuming, and error-prone
Target Data Is Difficult to Consume
- Optimization favors known analytics but is not well suited to new requirements
- A one-size-fits-all canonical view is used rather than fit-for-purpose views, or a conceptual model for easily consuming the target data is lacking
- It is difficult to identify what data is available, how to get access, and how to integrate the data to answer a question

Industrializing the Big Data Environment Is Difficult to Manage
- Proliferation of data silos leads to inconsistency and syncing issues
- Conflicting objectives: opening access to data assets while managing security and privacy requirements
- The velocity of business change rapidly invalidates data organization and analytics optimizations
- Managing the integration and interaction with the multiple data management technologies that make up the Big Data environment
The Data Lake is made up of four key components: Data Ingestion, Data Management, Query Management, and Data Lake Management. Together they deliver:
- Low-cost, high-performance storage
- Flexible, easy-to-use data organization
- Performance-optimized analytics
- Automation of most manual development and query activities
- Self-service end-user features
- Intelligent processing
Data Ingestion
- Data sources: desktop and mobile, social media and cloud, operational systems, linked data, Internet of Things (IoT)
- Ingestion modes: model-driven semantic tagging, on-demand query, streaming, scheduled batch load, self-service
Data Management
- Data movement and provenance across multiple engines: NoSQL, in-memory, MapReduce, columnar, graph, and semantic stores
- HDFS storage for structured and unstructured data
Data Lake Management
- Data Governance: focus on shared data; standard models; controlled vocabulary; common definitions; standards-based data views (FIBO, CDISC/RDF)
- Data Mappings: source-to-target transformation models
- Business-Focused Models: business-unit data organization and terms, optimized to assist analytics (ontologies, taxonomies, thesauri)
- Data Assets Catalog: internal and external data assets; defined data organizations
- Provenance Capture: processes, schedules, provenance
- Workflow Monitoring: monitor and manage data lake operations
- Access Management: authorization and access rules; rule-based security; group-, role-, and user-level authorization; auditable access
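As a rough illustration of the access-management point above (rule-based security with auditable access), the following is a minimal Python sketch. The rule shape, role names, and asset names are all invented for illustration; a real data lake would enforce this in the platform's security layer.

```python
# Illustrative sketch: rule-based authorization with an audit trail.
# Roles, assets, and the rule format are hypothetical examples.
import datetime

RULES = [
    {"role": "analyst", "asset": "sales_orders", "allow": True},
    {"role": "analyst", "asset": "patient_records", "allow": False},
]
audit_log = []

def authorized(user, role, asset):
    """Apply the first matching rule; deny by default; audit every check."""
    decision = next((r["allow"] for r in RULES
                     if r["role"] == role and r["asset"] == asset), False)
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    audit_log.append((ts, user, asset, decision))  # auditable access
    return decision

print(authorized("jdoe", "analyst", "sales_orders"))     # True
print(authorized("jdoe", "analyst", "patient_records"))  # False
```

Deny-by-default plus a mandatory audit entry on every check reflects the deck's twin goals: opening access to data assets while still meeting security and privacy requirements.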
Query Management
- Semantic search and data discovery
- Analytics directed to the best query engine
- Capture and share analytics expertise
- Query data, metadata, and provenance
Semantic Technology Delivers Smart Data
- Integrates a network of internal and external data assets, insulating end users from the details of the underlying technologies
- Captures expertise (logic, inferencing) and integrates it with the data, delivering smart data to non-expert users
- Manages a comprehensive inventory of the data assets
- Secures access to the right data assets by the right users
Key W3C Standards in Semantic Technology

Resource Description Framework (RDF): A framework for storing and integrating data and data definitions in the form of subject-predicate-object expressions, or triples. Relationships are organized in a logical graph model. Benefit: reduced development time and cost; faster time to business value.

Web Ontology Language (OWL): An ontology is a comprehensive model of data definitions and relationships that is both human- and machine-readable. Ontologies are inheritable and extensible. Benefit: improved application quality; a flexible, iterative, investigative approach that easily adapts to business change.

SPARQL Query Language: A SQL-like query language for semantic data that can leverage ontological relationships and constructs to execute smarter queries. It can access multiple internal and external databases simultaneously in a single query, integrating data across business silos.

Inference: Reasoning over data through business rules. Expertise is captured and embedded in the ontology model, accessible through user queries. This is the "smart" in Smart Data. Benefit: easier end-user access to expertise; intelligent-systems capabilities.

Linked Data: Connects data contained in different databases, allowing queries to find, share, and combine data so insights can be identified across the Web. Disparate databases can be navigated and integrated regardless of location or technology platform.

RDB to RDF Mapping Language (R2RML): Preserving current investments in relational technology, R2RML maps relational data to an ontology, so SPARQL can query RDF and relational databases simultaneously. Benefit: a low cost of entry to use Semantic Technology to deliver high-value solutions.
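To make the triple model and inference concrete, here is a minimal sketch in plain Python, not real RDF tooling. The predicates, entities, and business rule are invented; in practice an RDF store and SPARQL engine (with OWL/rule reasoning) would do this work.

```python
# Illustrative sketch of RDF-style triples, a SPARQL-like pattern match,
# and a simple inference rule. All names are hypothetical examples.

# Triples: (subject, predicate, object)
triples = {
    ("ex:Acme", "ex:isCustomerOf", "ex:OurBank"),
    ("ex:Acme", "ex:hasSubsidiary", "ex:AcmeLabs"),
    ("ex:AcmeLabs", "ex:hasOrder", "ex:Order42"),
}

def match(pattern, graph):
    """Yield variable bindings for triples matching a pattern; '?x' is a variable."""
    for triple in graph:
        binding, ok = {}, True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield binding

# Inference: a business rule embedded with the data adds new "smart" facts --
# here, orders of a subsidiary are also attributed to the parent company.
inferred = set(triples)
for b1 in match(("?parent", "ex:hasSubsidiary", "?sub"), triples):
    for b2 in match((b1["?sub"], "ex:hasOrder", "?order"), triples):
        inferred.add((b1["?parent"], "ex:hasOrder", b2["?order"]))

# A non-expert user's query now benefits from the captured expertise.
results = [b["?order"] for b in match(("ex:Acme", "ex:hasOrder", "?order"), inferred)]
print(results)  # ['ex:Order42']
```

The point mirrors the deck: the rule lives with the data model, so an ordinary query against "Acme" returns the subsidiary's order without the user knowing the rule exists.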
The Common Model Is the "Data Glue"

Source systems: Lead (SFA system), Quote (quote system), Order (OMS system), Contract (CMS system).
- Different business entities in physical systems actually share many of the same concepts, meanings, and relationships
- Semantic data science exposes common business concepts and connects them with their physical expression in production systems
- Data is glued together by its business meaning, rather than by physical structures dictated by the underlying technologies
- The conceptual model can be used directly by both business and IT users to operationalize data services, understand the data landscape, track data lineage, and conduct downstream analytics
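A small sketch of the idea, with invented field names: records from two source systems are translated into shared common-model terms, so downstream code joins on business meaning rather than on each system's physical schema.

```python
# Illustrative sketch: source-to-common mappings act as the "data glue".
# System names, fields, and common-model properties are hypothetical.

sfa_lead = {"lead_nm": "Acme Corp", "lead_stage": "qualified"}   # SFA system
oms_order = {"cust_name": "Acme Corp", "order_total": 1200.0}    # OMS system

# Physical field -> common-model property, per source system
MAPPINGS = {
    "sfa": {"lead_nm": "customerName", "lead_stage": "salesStage"},
    "oms": {"cust_name": "customerName", "order_total": "orderTotal"},
}

def to_common(system, record):
    """Translate a physical record into common-model terms."""
    return {MAPPINGS[system][field]: value for field, value in record.items()}

# Both records now agree on 'customerName', the shared business meaning.
glued = [to_common("sfa", sfa_lead), to_common("oms", oms_order)]
acme = [r for r in glued if r["customerName"] == "Acme Corp"]
print(len(acme))  # 2
```

The mappings themselves are data, which is also what makes lineage trackable: each common-model property records exactly which physical field it came from.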
Semantic Models Relate Data by Business Meaning
A customer is related to concepts such as lifestyle, life events, personal network, entertainment preferences (e.g., music), interests, purchasing, and profession.
Implications for the Existing IT Architecture and Practices
- Manages secure access
- Extends existing investments in IT architecture
- Self-service data feeds and analytics
- Easier access to external data
- Reduction of data-mart silos
- User tools to discover and optimize data relationships
- Structured and unstructured data, voice, and video
- Builds out enterprise data models, with integration-hub capabilities
- Infrastructure capacity elasticity
- Data analysis automation
Data Lake Approach to Meeting Business Needs

Onboard New Data
- Traditional technologies and practices: Comprehensive analysis creates a rigid structure that is difficult to change; or, minimal definition of the data organization requires detailed understanding of the data contents
- Data Lake technologies and practices: A flexible data model can be revised or extended without redesigning the database; agile, evolutionary refinement of the data organization leverages new insights as users work with the data

Connect External Data
- Traditional: External data is collected and loaded into the analytics repository; data is streamed or refreshed on a scheduled frequency
- Data Lake: External data can be sourced from databases, spreadsheets, Web pages, news feeds, and more; data is queried through common methods, without regard to location, with real-time values delivered at query time

Integrate Data between Business Units or Business Partners
- Traditional: Governance activities establish a common vocabulary and data definitions; shared data is copied to an integrated database; organization-specific definitions may require duplicating certain data in marts
- Data Lake: Systems of record publish existing data specifications or an ontology model, and each organization defines data in the manner best suited to its business; federation and virtualization features provide choices in which data to copy and which to retain in the system(s) of record; all models can be supported through a single copy of the data, maintained in the data lake or system of record

Capture and Embed Expertise
- Traditional: Expertise is often captured in the reporting and analytics, creating a change-management challenge when updates are required
- Data Lake: Expertise is captured in the data definitions; a single, shared definition minimizes change-management effort
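The "flexible data model" contrast above can be sketched in a few lines, assuming the same toy triple representation as earlier (all identifiers are invented): onboarding a new attribute is just asserting new facts, with no table redesign.

```python
# Illustrative sketch: extending a triple-based model without redesign.
# Entity IDs and predicates are hypothetical examples.

graph = {
    ("cust:001", "name", "Acme Corp"),
    ("cust:001", "region", "EMEA"),
}

# A new business need arrives: track social-media handles. In a rigid
# relational schema this could mean an ALTER TABLE plus ETL rework; here
# the model is extended by simply adding triples.
graph.add(("cust:001", "twitterHandle", "@acme"))

# Existing queries are unaffected; new ones can use the new predicate.
handles = {s: o for s, p, o in graph if p == "twitterHandle"}
print(handles)  # {'cust:001': '@acme'}
```

This is the mechanism behind "revised or extended without redesign of the database": schema and data share one representation, so change is additive.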
Lessons Learned from Early Adopters
- Prioritize: prioritize data onboarding by the data's ability to contribute to customer engagement
- Onboard: onboard data assets as they become available
- Connect: connect to available internal and external data assets
- Load: load the data unfiltered and untransformed
- Organize: use models to provide organization to the data
- Customize: create models that are tailored to the needs of the business groups
- Search: make it easy to find data
- Secure: manage security and privacy, but make it easy to authorize access to the data users need
Addressing Challenges
- Privacy vs. personal value
- Granularity of customer understanding
- Delivering strategic objectives when projects tend to have a technical focus
- Opening access to data
- Need for executive sponsorship
- Access to external data
- Establishing firewalls
- Persistent, pervasive data quality issues
Clues to better customer engagement will be found in the ever-growing volume of data that we're creating.
A Data Lake strategy helps you create a personalized, engaging experience with each customer: visibility, provenance, agility, Internet scale, open yet secure, smart, self-service, universal data access, adaptable.
Questions?
Thank you!