MarkLogic Technology Briefing Edd Patterson CTO/VP Systems Engineering, Americas Slide 1
Agenda Introductions About MarkLogic MarkLogic Server Deep Dive Slide 2
MarkLogic Overview
Company Highlights
- Headquartered in Silicon Valley
- Founded in 2001
- 40% CAGR revenue growth
- Privately held
- Patented, award-winning technology
Slide 3
Select Customers Financial Services and Other Customers Government Customers Media Customers Slide 4
Inside MarkLogic Server Slide 5
MarkLogic Powers the World's Big Data Applications
Use all your data to make your organization smarter
- Analyze a wide variety of structured, unstructured and semi-structured data together to gain actionable insights
- Bring these insights into operational business processes via real-time Big Data Applications
- A unified Big Data platform for both analytics and applications
- Any data, any volume, any structure, real-time. E.g. derivatives contracts, customer records, social media, medical records, intelligence assets, journal articles, etc.
Slide 6 Copyright 2010-2012 MarkLogic Corporation. All rights reserved.
Elements of a Big Data Platform Tools / APIs Visualization Data Mining / Analytics Event Processing Metadata Search Analytic DB Operational DB Unstructured Content Ingest / Batch Analytics / Enrichment Archive / Warm Long Tail Data Store Slide 7
How is this usually implemented?
Stitched together from multiple technologies: BI Tools, Applications, Stats (SPSS, SAS, R, ...), Stream / Event Processing, Search, Search Index, Analytic DB, Operational DB, Metadata, Unstructured Content Store, Batch Analytics (Hadoop MR), Archive (HDFS)
- Each line (connection between components in the diagram) is an opportunity for latency, ETL bugs, etc.
- Each component is managed, scaled, supported, etc. separately
- Applications interface with many technologies, sometimes managed by different groups
Bottom Line:
- Data governance is compromised
- Impossible to react in real time
- Agility is lost
MarkLogic Unified Platform for Big Data
MarkLogic Server is
- An operational DBMS
- An analytic DBMS
- An unstructured DBMS
- A search engine
- An event processor
all in one technology
[Diagram: the same components as the previous slide (BI Tools, Applications, Stats (SPSS, SAS, R, ...), Analytic DB, Stream / Event Processing, Operational DB, Metadata, Search, Search Index, Unstructured Content Store, Batch Analytics (Hadoop MR), Archive (HDFS)).]
How MarkLogic Server Works Schema-Agnostic Design Slide 10
Data Model
- MarkLogic Server is a document-centric database
- Supports any-structured data via hierarchical (XML) data model
[Diagram: two example document trees. One is a Document with Title, Author (First, Last), Metadata and nested Section nodes; the other is an fpml trade document with Trade, Product, Cashflow, Amount, ID, TradeLeg and Event nodes.]
Slide 11
MarkLogic is Schema Agnostic
XML is self-describing
<article>
  <title>MarkLogic Server:...</title>
  <author>
    <first-name>Dale</first-name>
    <last-name>Kim</last-name>
  </author>
  <abstract>.... <company>MarkLogic</company> </abstract>
  <body>
    <section>
      <section>...</section>
    </section>
    <section>... index... </section>
  </body>
  <copyright>Copyright... </copyright>
</article>
Slide 12
MarkLogic is Schema Agnostic
XML is self-describing
No Schema Needed!
[Figure: the article document from the previous slide shown next to its tree form. The root article node has the children title ("MarkLogic Server:..."), author (first-name "Dale", last-name "Kim"), abstract (text plus company "MarkLogic"), body (nested section nodes, one containing "... index...") and copyright.]
Slide 13
How MarkLogic Server Works Indexing and Query Slide 14
Universal Index
MarkLogic indexes
- Words
- Phrases
- Stemming
- Structure
- Values
- Collections
- Security Permissions
[Figure: UNIVERSAL INDEX. Each term maps to a term list of document references. Example terms: data, base, "data base", stemmed forms (STEM), <article>, <article>/<abstract>, <section>/<product>, <product>ims</product>, <title> contains "data", Collection(Red), Role:Editor + Action:Read. Example term lists: 123, 127, 129, 152, 344, 791...; 122, 125, 126, 129, 130, 167...; 123, 126, 130, 142, 143, 167...; 123, 130, 131, 135, 162, 177...; 126, 130, 167, 212, 219, 377...]
Slide 15
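As a rough illustration of the term-list idea on this slide, the sketch below builds a toy inverted index and resolves a query by intersecting term lists. It is a minimal sketch and not MarkLogic's implementation: the names `build_index`, `search` and `DOCS`, and the sample documents, are assumptions made for this example, and only word and collection terms are modeled (no phrases, stemming, structure or security terms).

```python
from collections import defaultdict

# Hypothetical sample documents; ids and text are made up for illustration.
DOCS = {
    123: {"text": "the data base is relational", "collections": ["Red"]},
    126: {"text": "data models", "collections": ["Red"]},
    130: {"text": "a data base primer", "collections": ["Blue"]},
}

def build_index(docs):
    """Map each term to a sorted list of document ids (its term list)."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in doc["text"].lower().split():
            index[("word", word)].add(doc_id)
        for coll in doc.get("collections", []):
            index[("collection", coll)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """Return ids of documents that appear in every requested term list."""
    lists = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []

index = build_index(DOCS)
print(search(index, ("word", "data"), ("word", "base")))
print(search(index, ("word", "data"), ("collection", "Red")))
```

Because word terms and collection terms live in the same index, a text constraint and a collection constraint are resolved by the same intersection step.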
Collections and Security
- Directories: exclusive, hierarchical, analogous to file system, based on URI
- Collections: set-based, N:N relationship
- Security: invisible to your app
Slide 16
Scalars
How many of the articles that contain "data base" were written in each of the last 5 decades?
[Figure: the UNIVERSAL INDEX term lists from the Universal Index slide, shown alongside scalar values such as YEAR and Volume.]
Slide 17
Range Indexes
- Maps document ids to values, and values to document ids
- In a compact memory representation
DOC ID | VALUE || VALUE | DOC ID
1 | 2009 || 2002 | 3
3 | 2002 || 2003 | 10
4 | 2007 || 2004 | 5
5 | 2004 || 2004 | 11
8 | 2011 || 2007 | 4
10 | 2003 || 2007 | 17
11 | 2004 || 2009 | 1
17 | 2007 || 2011 | 8
...
Slide 18
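The two mappings on this slide can be sketched with a dictionary (document id to value) and a sorted list of (value, document id) pairs. This is a minimal sketch using the DOC ID / VALUE pairs from the slide; the class name `RangeIndex` and its methods are assumptions for illustration, not a MarkLogic API.

```python
from bisect import bisect_left, bisect_right

# DOC ID -> VALUE pairs taken from the slide.
DOC_VALUES = {1: 2009, 3: 2002, 4: 2007, 5: 2004,
              8: 2011, 10: 2003, 11: 2004, 17: 2007}

class RangeIndex:
    def __init__(self, doc_values):
        # Mapping 1: document id -> value.
        self.by_doc = dict(doc_values)
        # Mapping 2: (value, document id) pairs sorted by value.
        self.by_value = sorted((v, d) for d, v in doc_values.items())
        self._values = [v for v, _ in self.by_value]

    def value_of(self, doc_id):
        return self.by_doc[doc_id]

    def docs_in_range(self, low, high):
        """Document ids whose value lies in [low, high], found by binary search."""
        start = bisect_left(self._values, low)
        end = bisect_right(self._values, high)
        return [d for _, d in self.by_value[start:end]]

years = RangeIndex(DOC_VALUES)
print(years.value_of(10))
print(years.docs_in_range(2004, 2007))
```

Keeping the value-ordered pairs sorted is what lets an inequality such as "between 2004 and 2007" be answered with two binary searches instead of a scan.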
Geospatial Index: A 2-Dimensional Range Index
Built-in support for:
- Point
- Box
- Circle
- Polygon
- Complex Polygon
- Polygon Intersection
- Polygon Containment
Fully composable with all other indexes!
Slide 19
How MarkLogic Server Works Event Processing Slide 20
Reverse Indexes (Alerting)
1. Load serialized queries as query documents
2. For a given data document, find all queries that match
- Can provide real-time alerts during loads, with no significant performance impact!
- Can let documents store values as "ranges"
  - Documents about cities self-defining their geo boundaries
  - Person documents defining birthdays as ranges, sequences
- Can power classifiers and "matchmaker" queries
Slide 21
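The two steps on this slide (store queries, then match an incoming document against them) can be sketched as follows. This is a toy model under stated assumptions: a "query" here is just a set of words that must all appear in the document, and `ReverseIndex`, `add_query` and `matching_queries` are hypothetical names, not MarkLogic functions.

```python
from collections import defaultdict

class ReverseIndex:
    def __init__(self):
        self._required = {}               # query id -> number of required terms
        self._by_term = defaultdict(set)  # term -> ids of queries requiring it

    def add_query(self, query_id, terms):
        """Step 1: store a query, indexed by the terms it requires."""
        terms = {t.lower() for t in terms}
        self._required[query_id] = len(terms)
        for term in terms:
            self._by_term[term].add(query_id)

    def matching_queries(self, text):
        """Step 2: for one document, find every stored query it satisfies."""
        hits = defaultdict(int)
        for term in set(text.lower().split()):
            for query_id in self._by_term.get(term, ()):
                hits[query_id] += 1
        # A query matches when all of its required terms were seen.
        return sorted(q for q, n in hits.items() if n == self._required[q])

alerts = ReverseIndex()
alerts.add_query("derivs", ["swap", "default"])
alerts.add_query("fx", ["currency"])
print(alerts.matching_queries("Credit default swap priced today"))
```

The work per document depends on the terms in that document rather than on the total number of stored queries, which is the property that makes alerting during loads plausible.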
How MarkLogic Server Works Scale-out Slide 22
Databases Scale Out Database of documents Stored in partitions Database Partition 1 Partition 2 Partition 3 Slide 23
Shared-Nothing Architecture
[Diagram: a tier of E-Nodes above a tier of D-Nodes (D-Node 1, D-Node 2, D-Node 3, ... D-Node k), which hold the forests (Forest 1, Forest 2, Forest 3, Forest 4, ... Forest m).]
Slide 24
How MarkLogic Server Works Analytics Slide 25
Range Indexes: A Built-In In-Memory Column Store
- Maps document ids to values, and values to document ids
- In a compact memory representation
DOC ID | VALUE || VALUE | DOC ID
1 | 2009 || 2002 | 3
3 | 2002 || 2003 | 10
4 | 2007 || 2004 | 5
5 | 2004 || 2004 | 11
8 | 2011 || 2007 | 4
10 | 2003 || 2007 | 17
11 | 2004 || 2009 | 1
17 | 2007 || 2011 | 8
...
Range Indexes are equivalent to a built-in in-memory column store
Slide 26
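To show why a document-id-to-value mapping behaves like a column store, the sketch below answers a per-decade count (in the spirit of the earlier "each of the last 5 decades" question) from the column alone, without opening any documents. It is a minimal sketch: `count_by_decade` is a hypothetical helper, and the list of matching document ids is assumed to come from some earlier search step.

```python
from collections import Counter

# DOC ID -> year column, using the pairs shown on the slide.
YEAR_BY_DOC = {1: 2009, 3: 2002, 4: 2007, 5: 2004,
               8: 2011, 10: 2003, 11: 2004, 17: 2007}

def count_by_decade(doc_ids, year_by_doc):
    """Group the given documents by decade using only the year column."""
    decades = Counter((year_by_doc[d] // 10) * 10
                      for d in doc_ids if d in year_by_doc)
    return dict(sorted(decades.items()))

# Assume a text search has already returned these matching document ids.
matching = [1, 3, 4, 8]
print(count_by_decade(matching, YEAR_BY_DOC))
```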
Scalar Queries and Aggregation Slide 27
In-Database MapReduce
[Diagram: an E-Node runs start, decode, reduce and finish; each D-Node (D-Node 1, D-Node 2, D-Node 3, ... D-Node k) runs decode, map, reduce and encode over its forests (Forest 1, Forest 2, Forest 3, Forest 4, ... Forest m).]
Slide 28
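The flow in this diagram can be approximated as: map over each forest, reduce the partial results on each D-Node, then reduce the per-node results on the E-Node. The sketch below is an assumption-laden toy: the cluster is a nested Python list, the aggregate is a count of documents per year, and the encode/decode steps from the slide (serializing partial results between nodes) are omitted. All function names are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_forest(forest):
    """Map step: one partial count per forest."""
    return Counter(doc["year"] for doc in forest)

def reduce_counts(left, right):
    """Reduce step: combine two partial counts."""
    return left + right

def run_query(cluster):
    """cluster is a list of D-Nodes; each D-Node is a list of forests."""
    # Each D-Node reduces the results of its own forests.
    per_node = [reduce(reduce_counts, (map_forest(f) for f in node), Counter())
                for node in cluster]
    # The E-Node reduces the per-node partial results into the final answer.
    return dict(reduce(reduce_counts, per_node, Counter()))

cluster = [
    [[{"year": 2004}, {"year": 2007}], [{"year": 2004}]],  # D-Node 1, two forests
    [[{"year": 2011}]],                                    # D-Node 2, one forest
]
print(run_query(cluster))
```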
Hadoop MapReduce via Bi-Directional Hadoop Connector
[Diagram: Raw Data, Hadoop, Intermediate Intelligence and Operational Applications connected through MarkLogic + Connector for Hadoop, with three numbered flows (1, 2, 3) labeled Bulk Loading and Progressive Enhancement.]
Slide 29
Co-Occurrence Slide 30
SQL and BI Tools ODBC SQL Range Indexes Slide 31
How MarkLogic Server Works Transactions Slide 32
MVCC
[Diagram: two versions of the document /articles/codd.xml, each drawn as a tree (Title, Author (First, Last), Metadata, Section nodes); the second version adds a Year node. Each version carries a creation timestamp (c) and a deleted timestamp (d); the values shown are 523, 628 and 628.]
Timestamps can be:
- Increasing integers (before MarkLogic 5)
- Increasing wall time (starting with MarkLogic 5)
Slide 33
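The creation and deleted timestamps on this slide can be illustrated with a small versioned store in which an update never overwrites: it closes the old version and appends a new one. This is a minimal sketch assuming an integer clock; `VersionedStore`, `put` and `get` are hypothetical names, and locking, durability and distribution are out of scope.

```python
class VersionedStore:
    def __init__(self):
        self._versions = []  # every version ever written, oldest first
        self._clock = 0      # increasing integer timestamp

    def put(self, uri, content):
        """Write a new version; the previous live version gets a deleted timestamp."""
        self._clock += 1
        for version in self._versions:
            if version["uri"] == uri and version["deleted"] is None:
                version["deleted"] = self._clock
        self._versions.append({"uri": uri, "content": content,
                               "created": self._clock, "deleted": None})
        return self._clock

    def get(self, uri, at=None):
        """Read the version visible at timestamp `at` (default: now), without locks."""
        ts = self._clock if at is None else at
        for version in self._versions:
            visible = version["created"] <= ts and (
                version["deleted"] is None or ts < version["deleted"])
            if version["uri"] == uri and visible:
                return version["content"]
        return None

store = VersionedStore()
first = store.put("/articles/codd.xml", "v1")
store.put("/articles/codd.xml", "v2")
print(store.get("/articles/codd.xml", at=first))
print(store.get("/articles/codd.xml"))
```

A reader that fixes its timestamp at the start of a query keeps seeing the same versions even while writers add new ones, which is the basis for lock-free, repeatable reads.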
MVCC Benefits
Very High Throughput
- Read queries don't require locks
- Queries and updates do not conflict
ACID Transactions
- Internal 2-phase commit between hosts (forest partitions)
Zero-latency between ingestion and indexing
[Diagram: the /articles/codd.xml document tree (Title, Author (First, Last), Metadata, Year, Section nodes) with timestamp 628.]
Slide 34
The Four Forest Operations
- Create a new document
  - Into the in-memory stand buffer
- Mark a document as expired
  - A memory-mapped timestamps document per stand
- Write buffer out to disk (checkpoint)
  - Our buffers are 100s of megabytes
  - For performance, double buffer
- Merge
  - A background process
  - Optimization: reduces number of stands in forest
Slide 35
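The four operations listed on this slide can be mimicked with a small in-memory model. This is a sketch under heavy simplifying assumptions, not how MarkLogic stores data: "on-disk" stands are plain dictionaries, expiry is tracked per URI rather than per version with timestamps, there is no double buffering, and the class and method names (`Forest`, `insert`, `expire`, `checkpoint`, `merge`) are hypothetical.

```python
class Forest:
    def __init__(self, buffer_limit=2):
        self.buffer = {}        # in-memory stand: uri -> content
        self.stands = []        # stand-ins for on-disk stands, oldest first
        self.expired = set()    # uris marked as expired
        self.buffer_limit = buffer_limit

    def insert(self, uri, content):
        """Operation 1: create a new document in the in-memory stand."""
        self.buffer[uri] = content
        if len(self.buffer) >= self.buffer_limit:
            self.checkpoint()

    def expire(self, uri):
        """Operation 2: mark a document as expired (nothing is rewritten)."""
        self.expired.add(uri)

    def checkpoint(self):
        """Operation 3: write the in-memory buffer out as a new stand."""
        if self.buffer:
            self.stands.append(dict(self.buffer))
            self.buffer = {}

    def merge(self):
        """Operation 4: combine stands into one and drop expired documents."""
        merged = {}
        for stand in self.stands:
            merged.update(stand)
        for uri in self.expired:
            merged.pop(uri, None)
        self.stands = [merged] if merged else []
        # Keep expiry marks only for documents still sitting in the buffer.
        self.expired = {u for u in self.expired if u in self.buffer}

forest = Forest(buffer_limit=2)
for name in ["a", "b", "c", "d"]:
    forest.insert(name, "content of " + name)
forest.expire("a")
print(len(forest.stands))
forest.merge()
print(len(forest.stands), sorted(forest.stands[0]))
```

In this model an expire never touches stored data; space is reclaimed only when a merge rewrites the stands, which mirrors the slide's framing of merge as a background optimization.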
Consistency And Throughput
- 2-phase commit
  - Transactions span forests
- Recovery
  - Forest Journals
- Lock-free read queries
  - Query at a point-in-time
  - Repeatable reads
  - Increased throughput
  - Time travel (and near-instant DB rollback)
Slide 36
HA/DR Features of MarkLogic
Feature | Function | Benefit | Use Case
Database backup/restore | Make a backup of your database, then restore it | Recover from complete data loss | Disaster Recovery
Journal Archiving/Point-In-Time Recovery | Make a continuous backup; restore to a point in time, or to the point of failure | Recover from complete data loss; recover all your data, or to just before a Bad Event | Disaster Recovery
Snapshot backup | Very fast backup using mirrored disk | Recover from complete data loss; take a backup in seconds | Disaster Recovery, High Availability
Database rollback | Roll back to a point in time before a Bad Thing happened | Recover in seconds from human error or a rogue application | Disaster Recovery, High Availability
Automatic Failover Using Shared-Disk / Local-Disk | If a node fails, automatically failover to another node | Recover from failure of a data node in a cluster | High Availability
Flexible Replication (part of Replication option) | Maintain a hot copy of (part of) a database in another data center | Move parts of a database, parts of documents, closer to users for improved performance | Information Sharing
Database Replication (part of Replication option) | Maintain a hot copy of a database in another data center | Recover from loss of a Data Center | Disaster Recovery, High Availability
Distributed Transactions | XA support for transactions that go across MarkLogic and other repositories | Keep an exact (synchronous) copy of your data in more than one place | Disaster Recovery, High Availability, Information Sharing
Slide 37
OTC: Derivatives and Exotics
Repository for derivatives and other exotic products (trades, options, swaps, etc.).
Key Requirements
- Native JSON support
- Real-time queries on semi-structured data
- 7-year retention
- Replication
- BAR
Customers in Production
- JP Morgan Chase: Derivatives Core Processing Platform enables risk management for $78 trillion in derivatives
Relevant Features
- Native JSON Support
- Value-based Lookups
- Tiered Storage
- Fine-Grained Partition Management
- Enterprise-Class Backup and Restore Features
- Replication
- Clustering
Slide 38
Equity Risk Systems
Currently using traditional RDBMS servers and file systems to store intraday and time series data.
Customers in Production Where MarkLogic is Providing Similar Solutions
Requirements
- Scale out on commodity servers
- Ease of data modeling for BLOBs (unstructured data)
- Handle complex data
Relevant Features
- Schema Agnostic w/ Optional Validation
- Bi-temporal API
- User Defined Functions
- Linear Scalability on commodity hardware
- Clustering
- Binary Support
- Fast native XPath
- Result and Data Caching
Slide 39
Document Modeling
Enterprise Data Group (EDG) is revamping the workflow for the creation and management of the negotiated documents. Instead of capturing the end image and some metadata, we are modeling the creation of the document through templates, XML or similar documents. The business groups need to dynamically change the agreements, and constantly add new information to meet the day-to-day needs.
Customers in Production
- MorganStanley
- Citi
- Docgenix
- JetBlue
- LexisNexis
Relevant Features
- Native support for XML, JSON and Binary
- Schema Agnostic w/ Optional Validation
- ACID Compliant CRUD
- Value-based lookups
- Document Library Services
- Search Indexing
- LDAP and Kerberos Integration
- Clustering
- Range Indexes
Slide 40
Log Analysis
Enable analysis of system and user logs to evaluate user behavior and provide BI.
- Automatically capture and analyze logs from many sources (web, DB, DW)
- Perform correlation of events and performance within time-slices
- Normalize and enhance data with metadata from applications
- Analytics pipeline to compute aggregates and statistical data
- Dashboards for each source
High Data Volumes
- 1TB/day for 45 days, ~45TB of total data
- 20M requests/day
Customers in Production
- Bank of America: enabled Bank of America to map their internal reference data architecture through log analysis
Relevant Features
- Schema Agnostic Data Model
- Hadoop Integration
- BI Tool Integration
- Range Indexes
- User Defined Functions
- Transformation Capabilities
- Processing Framework
- Visualization Framework
- Application Server
- Linear Scalability
Slide 41
In Conclusion Slide 42
MarkLogic Server is
- An operational DBMS with MVCC-based transaction model, with high throughput
- An analytic DBMS with in-memory column store and in-database MapReduce
- An unstructured DBMS with XML data model and ad-hoc schema
- A high-performance search engine with transactional universal index
- An event processor with serialized queries and alerting
A unified Big Data platform
Slide 43
Questions?? Slide 44