Scaling Massive Content Stores in the Cloud
CloudExpo New York, June 2016
@johnnewton, Alfresco Founder & CTO
Alfresco Customers: Government, Financial Services, Healthcare, Manufacturing, Corporate
Somewhere in a secret underground location
someone is trying to store One Billion Documents!!! http://www.warnerbros.com/austin-powers-international-man-mystery
Some have attempted before and failed
Content Use Cases at Scale: Enterprise Document Library, Medical & Personnel Records, Transaction & Logistics Records, Government Records & Archives, Claims & Case Processing, Research & Analysis, Real-time Video, Internet of Things, Discovery & Litigation, Loans & Policies
Content Management Applications: Document Library, Search & Retrieval, File Sync & Share, Business Process Management, Image Management, Records Management, Media Management, Case Management, Information Archiving
Content vs. Data vs Files vs. EFSS Data Files EFSS Content and ECM
Content Architecture as a Big Data Problem Context Access Create Manage Distribute Use Activities Directory Relationships Categories Types Search Security Content Object People Processes / Tasks Rules APIs Indexes Metadata Files / Renditions Semantics Solr / ElasticSearch Database Distributed FS Database
Content at Scale in the Enterprise Users at Scale Geographic Distribution Read/Write Throughput Concurrency Content Count Volume Size
The Problem with Traditional Approaches: Lack of Redundancy, Lack of Elasticity, Provisioning and Administration, Geographic Distribution, Lack of Agility
Content Management Architecture Alfresco Share Alfresco Repository Activiti Workflow Engine Database RDS FS Content Store EBS or Ephemeral S3 (or Glacier) Alfresco SOLR Indexes EC2 PIOPS EBS
Scaling in Tiers Alfresco Share Alfresco Share Alfresco Activiti Suite Alfresco Transformation Server Alfresco Repository Alfresco Repository Alfresco Activiti Suite Alfresco Transformation Server Alfresco Local Repo (Index Tracking) Alfresco Local Repo (Index Tracking) Alfresco Solr Alfresco Solr
Data Meta-Model Model Metadata Organization Class Property Association Constraint Type Folder Property name Folder A Type Aspect Child Association 1 Billion 15 Billion Child Association contains Type Document Association rendition Property name content Folder B Doc C Aspect Auditable Property who by when Doc D rendition
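The instance side of this meta-model can be illustrated with a minimal sketch. It assumes, as one reading of the diagram, that Folder A contains Folder B and Doc C, and that Doc D is a rendition of Doc C. The `Node` and `Association` classes and the `children` helper are hypothetical names for illustration, not Alfresco's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A content object: one type, optional aspects, and properties."""
    name: str
    type_name: str                                   # "Folder" or "Document"
    aspects: set = field(default_factory=set)        # e.g. {"Auditable"}
    properties: dict = field(default_factory=dict)


@dataclass
class Association:
    """A typed link between two nodes."""
    kind: str        # "contains" (child association) or "rendition"
    source: Node
    target: Node


def children(parent, associations):
    """Follow 'contains' child associations from a parent node."""
    return [a.target for a in associations
            if a.kind == "contains" and a.source is parent]


# Assumed layout of the example instances on the slide.
folder_a = Node("Folder A", "Folder")
folder_b = Node("Folder B", "Folder")
doc_c = Node("Doc C", "Document", aspects={"Auditable"})
doc_d = Node("Doc D", "Document")

associations = [
    Association("contains", folder_a, folder_b),
    Association("contains", folder_a, doc_c),
    Association("rendition", doc_c, doc_d),
]

print([n.name for n in children(folder_a, associations)])
```

At the scale discussed here, each node and association would be a database row rather than an in-memory object. The sketch only shows how types, aspects, and associations relate.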
Next Generation Relational Architectures MySQL with standby Next Generation DBMS AZ 1 AZ 2 AZ 1 AZ 2 AZ 3 Primary Instance Standby Instance Primary Instance Replica Instance async 4/6 quorum PiTR Amazon Elastic Block Store (EBS) EBS EBS mirror Sequential write EBS mirror Sequential write Distributed writes Amazon S3 Amazon S3 Highly-available synchronous vs. asynchronous replication Significantly more efficient use of network I/O Self-healing, Fault-tolerant, Instant crash recovery
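The "4/6 quorum" on this slide can be sketched as a small availability check. The sketch assumes six storage copies spread as two per Availability Zone across three AZs, with a write counted as durable once four copies acknowledge it. The function names are hypothetical, and this is an illustration of the quorum rule, not Aurora's implementation.

```python
AZS = ("AZ 1", "AZ 2", "AZ 3")
COPIES_PER_AZ = 2      # assumption: two storage copies in each AZ
WRITE_QUORUM = 4       # "4/6 quorum" from the slide


def acknowledgements(failed_azs=(), extra_failed_copies=0):
    """Return one True/False acknowledgement per storage copy."""
    acks = []
    for az in AZS:
        for _ in range(COPIES_PER_AZ):
            acks.append(az not in failed_azs)
    # Optionally fail some additional copies that are still healthy.
    for i, ok in enumerate(acks):
        if extra_failed_copies and ok:
            acks[i] = False
            extra_failed_copies -= 1
    return acks


def write_is_durable(acks):
    """A write is durable once at least WRITE_QUORUM copies acknowledge it."""
    return sum(acks) >= WRITE_QUORUM


print(write_is_durable(acknowledgements()))
print(write_is_durable(acknowledgements(failed_azs=("AZ 2",))))
print(write_is_durable(acknowledgements(failed_azs=("AZ 2",), extra_failed_copies=1)))
```

Under these assumptions, losing a whole AZ still leaves four copies, so writes continue. Losing one more copy on top of that drops below the write quorum.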
Index and Search Architecture Full-Text Query x 20 instances Text Extraction Term-hit Highlighting Results Process Results Processing Metadata Query Facets & Buckets Security Filters Credit: Ryan Tobora ThinkBig, Teradata http://thinkbig.teradata.com/solrcloud-terminology/ Metadata Injection & Path Processing Shingles ACL Processing
File Storage Architecture APIs File System Protocols Direct Streaming Metadata Metadata Content Content Metadata Aurora S3 EBS Storage Layer Amazon Glacier Archive Layer In Place Content AWS Import/Export
BM4 Test Execution Environment (1.2B Docs)
UI Test x 20 m3.2xlarge: Simulate 500 Users, Selenium / Firefox, 1 hour constant load, 10 sec think time
ELB
Alfresco x 10 c3.2xlarge: Alfresco with Share and Repo
Solr x 20 m3.2xlarge: Sharded Solr Cloud
Aurora x 1 db.r3.xlarge
Simulate AWS Import/Export (in place)

sites     folders      files            transactions   dbsize GB
10,804    1,168,206    1,168,206,000    15,475,064     3,185
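A few ratios follow directly from the dataset figures above, and a short script makes them explicit. The variable names are illustrative, and the bytes-per-file figure assumes that "GB" means 10^9 bytes, which the slide does not state.

```python
# Dataset figures as reported on the slide.
sites = 10_804
folders = 1_168_206
files = 1_168_206_000
transactions = 15_475_064
dbsize_gb = 3_185

files_per_folder = files / folders
folders_per_site = folders / sites
files_per_transaction = files / transactions

# Assumption: 1 GB = 10**9 bytes.
db_bytes_per_file = dbsize_gb * 10**9 / files

print(f"files per folder:      {files_per_folder:.0f}")
print(f"folders per site:      {folders_per_site:.1f}")
print(f"files per transaction: {files_per_transaction:.1f}")
print(f"DB bytes per file:     {db_bytes_per_file:.0f}")
```

The files-per-folder ratio is exactly 1000, and the database holds a little under 3 KB of metadata per document under the stated unit assumption.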
Benchmark Results
Document load rate: 1000 documents per second (with 10 nodes), 3 Million per Hour!
Load rate stayed consistent even after passing the 1B-document mark
Sub-second login times and good responses for other actions: Open Library 4.5s, Page Results 1s, Navigate to Site 2.3s
Aurora indexes used efficiently at 3.2TB
No indications of any size-related bottlenecks with 1.1 Billion Documents
CPU loads: Database 8-10%, Alfresco (each of 10 nodes) 25-30%
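The load-rate figures can be checked with simple arithmetic. This sketch takes the reported 1000 documents per second across 10 nodes and the 1,168,206,000 files from the test dataset. It assumes the rate is sustained for the whole load and is split evenly across nodes.

```python
docs_per_second = 1000          # reported aggregate load rate
nodes = 10                      # Alfresco nodes in the benchmark
total_files = 1_168_206_000     # file count from the test dataset

docs_per_hour = docs_per_second * 3600
docs_per_node_per_second = docs_per_second / nodes   # assumes an even split

load_seconds = total_files / docs_per_second
load_days = load_seconds / 86_400

print(f"documents per hour:  {docs_per_hour:,}")
print(f"per node per second: {docs_per_node_per_second:.0f}")
print(f"days to load all:    {load_days:.1f}")
```

At a steady 1000 documents per second the hourly figure works out to 3.6 million, so the slide's "3 Million per Hour" reads as a conservative round number. Loading the full dataset at that rate takes about two weeks.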
What a Difference
Load Balancer, ECM, Search, FS, HSM, Hardware, DR Plan: 3-6 Months, Questionable Scale, Little Redundancy, Lots of $$$
ELB, Alfresco, Solr, EBS, S3, EC2 (AZ1, AZ2, AZ3): < 30 mins, 10x Faster, Fault-Tolerant, Open, Cost Effective
Well, what am I supposed to do with all this frickin hardware?!!
Thank you john.newton@alfresco.com @johnnewton