WHITEPAPER. The Lambda Architecture Simplified

Isabel Cooper
DATE: April 2016

A Brief History of the Lambda Architecture

The surest sign you have invented something worthwhile is when several other people invent it too. That means the creative pressure that gave birth to the idea is more general than your particular situation.

Even when faced with the same pressures, people will approach an idea in different ways. When Jay Kreps was developing Kafka at LinkedIn, he called it The Log. Facebook (being Facebook) created several independent implementations of stream-oriented processing, including Puma and TailerSwift. Twitter has the adorably named Summingbird. The jargon we seem to be converging on for these kinds of systems is the Lambda Architecture.

Lambda Origin

In his book, Big Data: Principles and best practices of scalable real-time data systems, Nathan Marz coined the term Lambda Architecture to describe a generic, scalable, and fault-tolerant data processing architecture, based on his experience working on distributed systems at BackType and Twitter.

Lambda in a Nutshell

The gist of the Lambda Architecture is to model everything that goes on in a complex computing system as an ordered, immutable log of events. Processing the data (say, totaling up the number of website visitors) is done as a series of transformations that output to new tables or streams. It is important to keep the input unchanged.

By breaking data processing into independent pieces, each with a defined input and output, you get closer to the ideal of purely functional programming. Each piece becomes simpler to write and test, and parallelization can be automated. Parts of the dataflow can be replayed (say, when code changes or machines fail) and combined with other flows.
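The idea of an immutable event log plus replayable transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PageView` record, the `count_visitors` function, and the sample events are hypothetical names invented for this sketch, not part of any system named in this paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified once written
class PageView:
    seq: int          # position in the ordered log (hypothetical field)
    visitor_id: str
    page: str


def count_visitors(log: Iterable[PageView]) -> Dict[str, int]:
    """Pure transformation: reads the log and returns a new table.

    The input log is never modified, so the same call can be replayed
    after a code change or a machine failure.
    """
    seen: Dict[str, Set[str]] = {}
    for event in log:
        seen.setdefault(event.page, set()).add(event.visitor_id)
    return {page: len(visitors) for page, visitors in seen.items()}


# The log is a tuple, so it is ordered and cannot be appended to in place.
EVENT_LOG: Tuple[PageView, ...] = (
    PageView(1, "alice", "/home"),
    PageView(2, "bob", "/home"),
    PageView(3, "alice", "/pricing"),
    PageView(4, "alice", "/home"),
)

visitors_by_page = count_visitors(EVENT_LOG)
replayed = count_visitors(EVENT_LOG)  # replaying yields the same table
```

Because `count_visitors` depends only on its input, replaying the log always rebuilds the same output table, which is the property the section describes.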

This sequenced approach is a nice property to have, as it retains data integrity and simplifies troubleshooting.

A long time ago, people who did 3D modeling would carve digital blocks into the shapes they wanted. If they wanted to undo something 10 steps back, they were largely out of luck. Then 3DStudio introduced a brilliant feature it called the transform stack. The stack records every change to an object separately and applies them in real time. This allows the modeler to modify, add, remove, and even reorder their changes on the fly. A sequenced approach to data pipelines is similar, providing a nifty solution for data reprocessing when changes to code occur.

[Figure: Autodesk 3DS Max Taper Modifier]

So far, this is simply good data engineering hygiene. Any well-run batch processing or map/reduce system will follow the same principles. There's nothing special about stream processing that makes immutable data flows work better.

Writing Data in Two Places

The special trick that makes Lambda Lambda is the technique of writing data to two places. That's one reason why the logo is the symbol λ. In effect, one half of a Lambda system optimizes for space and the other optimizes for time.

Lambda systems incorporate a slower, high-capacity batch-processing system and a faster stream-processing track. This allows existing map/reduce systems to be upgraded with a new fast track. It also leaves the system of record untouched, which is the main selling point for data teams looking to improve the responsiveness of their data flows.
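As a rough sketch of the double-write technique, the Python below sends every incoming event to two destinations. Both classes are hypothetical in-memory stand-ins invented for illustration: `BatchStore` plays the role of the high-capacity system of record, and `SpeedStore` plays the role of the fast track. A real deployment would write to actual storage systems instead.

```python
from typing import Dict, List


class BatchStore:
    """Hypothetical stand-in for the slow, high-capacity system of record."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def append(self, event: dict) -> None:
        # Store a copy so later changes by the caller cannot alter history.
        self.events.append(dict(event))


class SpeedStore:
    """Hypothetical stand-in for the fast track: running totals in memory."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}

    def append(self, event: dict) -> None:
        page = event["page"]
        self.totals[page] = self.totals.get(page, 0) + 1


def ingest(event: dict, batch: BatchStore, speed: SpeedStore) -> None:
    """Write the same event to both paths (the 'two places')."""
    batch.append(event)  # optimized for space: keeps every raw event
    speed.append(event)  # optimized for time: answer is ready immediately


batch, speed = BatchStore(), SpeedStore()
for page in ["/home", "/pricing", "/home"]:
    ingest({"page": page}, batch, speed)
```

After the loop, `batch.events` holds all three raw events for later reprocessing, while `speed.totals` already contains the per-page counts. Note that this sketch ignores failure handling; keeping the two writes consistent when one of them fails is a real concern that the sketch does not address.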

[Figure: Lambda Architecture Diagram]

Lambda is an old and venerable technique. Document search engines of a certain age (e.g., Yahoo's Vespa) often have a slow index that is compact but difficult to update. To compensate, they will also have a fast index, perhaps in memory, where changes are cached until the next index rebuild. Under the hood, a search will consult both indexes and merge the results.

The problem is that the Lambda Architecture was an evolution on top of the slower batched index. It is not certain that you would do it that way if you were building from scratch. Lucene, for example, uses an incremental index for everything.

Jay Kreps, in a thoughtful critique of Lambda, points out that you need two implementations of the same queries and data flow. And of course, you need two copies of the data. If you had a better streaming system, one that could read a table simply by replaying a stream, why would you need both kinds of system?

The Lambda Architecture Isn't

The Lambda Architecture isn't. What it is, is a sensible set of data engineering practices, which you should be applying anyway, plus a clever (but transitional) double-write approach to add a low-latency fast track to existing big data systems.

Throughout the rest of this guide, we will detail the technologies and data processing requirements that will help you implement a simplified Lambda Architecture.
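The query-time merge described for older search engines (consult the slow index and the fast index, then combine the results) can be sketched as follows. The function name `merged_counts` and the sample data are assumptions made up for this illustration; the sketch also assumes the recent events are exactly those not yet included in the batch view.

```python
from collections import Counter
from typing import Dict, Iterable


def merged_counts(batch_view: Dict[str, int],
                  recent_events: Iterable[str]) -> Dict[str, int]:
    """Answer a query by combining the precomputed batch view with
    events that arrived after the last batch rebuild."""
    total = Counter(batch_view)   # copy, so the batch view stays unchanged
    for page in recent_events:    # second implementation of the same count
        total[page] += 1
    return dict(total)


# Batch view: compact, rebuilt periodically (hypothetical numbers).
batch_view = {"/home": 100, "/pricing": 40}
# Fast path: events cached since the last rebuild.
recent_events = ["/home", "/signup", "/home"]

result = merged_counts(batch_view, recent_events)
```

Even in this tiny sketch, the counting logic exists twice: once wherever `batch_view` was built and once in the loop over `recent_events`. That duplication is the cost Kreps's critique points to.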

Rethinking the Lambda Architecture

Most companies have responded to the influx of data by adapting their data management strategy. However, managing streaming data still poses challenges for many enterprises. Complicating the matter further, most enterprises need instant access to both historical and real-time data, each of which requires specific considerations and solutions. Of the many approaches to managing real-time and historical data concurrently, the Lambda Architecture is by far the most talked about and widely accepted today.

A Fork in the Road

Like the shape of the Greek letter, the Lambda Architecture forks into two paths: one is a streaming (real-time) path, the other a batch path. Thus, it accommodates a real-time, high-speed data service along with an immutable data lake. Oftentimes a serving layer sits on top of the streaming path to power applications or dashboards.

Many Internet-scale companies, like Pinterest, Zynga, Akamai, and Comcast, are using a memory-optimized database to achieve the high-speed data component of the Lambda Architecture. These companies are splitting the input stream to push data into both an in-memory database and a data lake, like HDFS, in parallel.

In this era of ubiquitous big data, it is not enough for companies to merely process data. Analyzing data to detect patterns, which can be immediately applied to maximizing operational efficiency, is the real driver of business value.

MemSQL: A Complete Solution for Lambda

MemSQL delivers real-time analytics on a rapidly changing data set, making it an ideal match for the characteristics of the Lambda Architecture speed service. Other data stores have limitations that inhibit high-speed data ingestion, lack analytical capabilities, or cannot scale affordably. MemSQL offers a complete solution: the ability to handle millions of transactions per second while performing complex multi-table join queries.

Let's dig into some of the key innovations that make MemSQL an ideal solution for simplifying the Lambda Architecture.

Scalability

MemSQL uses a distributed, shared-nothing architecture that scales on commodity hardware and local storage, supporting petabytes of data. MemSQL is a memory-first relational database that also offers a disk-based columnstore. In-memory optimization provides high-speed data ingestion while simultaneously delivering analytics on the changing data set. The disk-based columnstore provides historical data management and access to historical data trends, which can be combined with the hot data to deliver real-time analytics.

Multi-model, Multi-mode

MemSQL supports the ingestion of unstructured, structured, and semi-structured data. The flexibility to align a structure to the data in support of analytics helps meet the business requirements of the operation.

Real-time analytics requires a real-time data structure, which MemSQL supports through a fully relational model. Furthermore, MemSQL supports the ingestion of unstructured and semi-structured (JSON) data into key-value pairs.

Full ANSI SQL support makes MemSQL readily accessible to data analysts, business analysts, and data scientists, reducing application code requirements. Plugging data visualization and query tools into the analytics architecture delivers immediate value from data to the business. MemSQL also has extended SQL, including JSON support. Traversing a JSON document is similar to SQL, with extensions to traverse the key-value pairs.

Open Source Connectors

MemSQL offers several connectors for smooth integration with additional data sources. One example is MemSQL Streamliner, an integrated Apache Spark solution. Streamliner provides easy deployment of Apache Spark, a critical component for building real-time data pipelines that delivers advanced data enrichment and transformation. Another important connector is the MemSQL Loader, which can import data from HDFS, as well as import and synchronize data from Amazon S3.

Lambda In Production

In this section, we will take a look at examples from innovative companies using a Lambda Architecture built for real-time data processing and exploration.

Real-Time Analytics at Comcast

Our first example comes from the Comcast Xfinity data team, who built a data processing infrastructure that focuses on real-time operational analytics. Using a combination of MemSQL and Hadoop, Comcast can proactively diagnose potential issues in an instant and deliver the best possible video experience. The Comcast architecture writes one copy of data to a MemSQL instance and a separate copy to Hadoop.

[Figure: Comcast data flow. Labels: Log Collection; Real-Time Analytics; ~1 second; ~30 minutes; Analysts query live data; Alerts on complex objects; Optimize CDN efficiency]

This enables Comcast to run real-time analytics on massive, ever-changing datasets, while also making their analytics infrastructure more performant. Instead of just logging all Xfinity data and analyzing it hours or days later, Comcast has the power to get both viewership and infrastructure monitoring metrics the moment they occur. HDFS provides a quasi-infinite data store where they can run machine learning jobs and other offline analytics.

Watch the Comcast team's recorded session from Strata+Hadoop World to learn how Comcast architected their Xfinity platform to work with millions of users, process enormous volumes of data and, at the same time, perform advanced real-time analytics.

Tapjoy Powers its Mobile Ad Platform

Tapjoy, the mobile app industry's leading mobile marketing automation and monetization platform, is processing and analyzing real-time and historical data concurrently to power its ad platform. Tapjoy optimizes ad performance by taking advantage of the speed and scalability of in-memory computing. With the processing power to run 60,000 queries at a response time of less than ten milliseconds, Tapjoy is able to cross-reference user data and serve higher-performing ads to more than 500 million global users.

Above is a diagram of Tapjoy's database architecture. For a more detailed look and explanation, watch the session by David Abercrombie, Principal Data Analytics Engineer at Tapjoy, at the In-Memory Computing Summit.

Conclusion

The pace of data is not slowing. Applications of today are built with infinite data sets in mind. As these real-time applications become the norm, and batch processing becomes a relic of the past, digital enterprises will implement memory-optimized, distributed data systems to simplify Lambda Architectures for real-time data processing and exploration.

What should I do?

Start by asking questions. What data systems do you currently have in place? Are you complicating matters with database infrastructure that can be consolidated? What applications do you plan to build in the next week/month/year? How much data will be streaming into those applications? How quickly will you need answers from your data set?

By answering questions like these, you will have a clear starting point for where to improve your existing data management system, and how to prepare for the applications you plan to build. From there, you can narrow down which technologies to try for a proof of concept.

If you need help along the way, we would love to hear from you. Send us an email at info@memsql.com or give us a call at (855)