Text transcript of show #280. August 18, Microsoft Research: Trinity is a Graph Database and a Distributed Parallel Platform for Graph Data


Hanselminutes is a weekly audio talk show with noted web developer and technologist Scott Hanselman, hosted by Carl Franklin. Scott discusses utilities and tools, gives practical how-to advice, and discusses ASP.NET or Windows issues and workarounds.

Text transcript of show #280: Microsoft Research: Trinity is a Graph Database and a Distributed Parallel Platform for Graph Data. Scott talks via Skype to Haixun Wang at Microsoft Research Asia about Trinity: a distributed graph database and computing platform. What is a GraphDB? How is it different from a traditional Relational DB, a Document DB, or even just a naive in-memory distributed data structure? Will your next database be a graph database?

(Transcription services provided by PWOP Productions)

Our Sponsors: http://www.telerik.com

Copyright PWOP Productions Inc.

Lawrence Ryan: From hanselminutes.com, it's Hanselminutes, a weekly discussion with web developer and technologist, Scott Hanselman. This is Lawrence Ryan, announcing show #280, recorded live Thursday. Support for Hanselminutes is provided by Telerik RadControls, the most comprehensive suite of components for Windows Forms and ASP.NET web applications, online at www.telerik.com. In this episode, Scott talks with Haixun Wang from Microsoft Research Asia about Trinity.

Scott Hanselman: Hi, this is Scott Hanselman and this is another episode of Hanselminutes. Today I am skyped in with Haixun Wang from Microsoft Research Asia, calling in all the way from China right now?

Haixun Wang: Yes.

Scott Hanselman: Fantastic. Thank you for taking the time to chat with me. I wanted to talk to you about your project called Trinity. You do a lot of work in Microsoft Research. You've got a number of projects. You've got Trinity, one called WebQ, and Probase, but Trinity is a data structure that thinks about things in terms of a graph. Give me some background on what Trinity is.

Haixun Wang: Well, Trinity is actually two projects. Trinity first is a graph database which handles graphs - very, very large graph structures - and supports online query processing. It's just like a database which answers queries posted by the users. On the other hand, Trinity is also a graph computation platform for large-scale offline analytics. So if you want to do very, very large jobs on graph data, just like what you would do for text data or other kinds of data using, for example, a MapReduce mechanism, you can use Trinity to process very large data in an offline manner.

Scott Hanselman: Okay. So for a developer who might be listening who is familiar with object relational data or document databases like Mongo, what is a graph database?

Haixun Wang: The graph database is special in the sense that it makes graph exploration much easier than a traditional database.
So you can use traditional databases, for example a relational database, to support graph data, but then you will run into a very, very big problem, which is that graph exploration will correspond to join operations on the underlying data, and the join is a very, very expensive operation. Usually, for graph data you would like to explore the graph freely, and then each exploration, every step you want to explore, will correspond to a join operation. So relational databases apparently cannot handle that, and the other current NoSQL approaches including, for example, Hadoop and other key/value pair stores, won't be able to do that either, because they cannot handle that many joins when they want to process the queries in an online fashion. So a graph database is just special. I mean, it supports online graph exploration without using joins. That's the most important thing to know about a graph database.

Scott Hanselman: Okay. So let me think if I understand this correctly, because I do not have a PhD, and I'm not sure how many of our listeners do. I know that when I've given interview questions before, I've asked junior engineers to, say, express a tree structure in an object relational database. Often very beginner people will make multiple tables, and a more advanced person will make a tree structure that has a single self-referential table, and then the tree structure becomes extremely complex as the joins to the table itself kind of start to recur, and it becomes pretty hairy to make these kinds of infinitely deep self-referential tables. Is that what you're describing as the trouble in creating a graph in an object relational manner?

Haixun Wang: Yes. So imagine you want to, for example, represent a tree or a graph using multiple tables or a single table with self-referencing pointers. So basically what you're doing is that you are using keys to reference, for example, the neighbors of a particular node, or the child node, or a parent node in your tree structure.
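The representation Haixun is describing can be sketched as a single self-referential table, where every hop of exploration costs another index lookup and self-join. A minimal illustration in SQLite (a hypothetical toy schema, not Trinity's):

```python
import sqlite3

# A tree stored as a single self-referential table, as described above:
# each row points at its parent by key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES node(id))"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
)

# Even a two-hop exploration (grandchildren of node 1) needs a self-join;
# a k-hop exploration needs k of them, each one going through the index.
grandchildren = conn.execute(
    """
    SELECT c2.id
    FROM node AS c1
    JOIN node AS c2 ON c2.parent_id = c1.id
    WHERE c1.parent_id = 1
    """
).fetchall()
print(sorted(row[0] for row in grandchildren))  # [4, 5, 6]
```

Each extra hop adds another `JOIN node AS ...` clause, which is exactly the per-step join cost the conversation calls out.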
So in order to find your child node or your neighbors in a graph, you basically need to search through those keys, right? You need to use an index structure to help you find those keys immediately, and then it will retrieve those records from the database, from the table, and then you'll find your neighbors or your child nodes. So you need to go through this process. This is going to be very, very slow, because basically you are accessing the index and you are doing a join operation. By doing a join operation, I mean those values are not connected directly; you need to go through the index. So a graph database can be defined as one that can find its neighbors or its child nodes without using the index. If a database system can do that, then we can call it a native graph database system. Otherwise, it's just using, for example, multiple tables or a relational database to simulate a graph database.

Scott Hanselman: I see. So would it be fair to say that often people have data that is graph in shape and then they use the database that they're the most comfortable with and begin to run into trouble, and this is kind of a specific kind of database for a specific kind of data?

Haixun Wang: Yes. So of course a lot of transaction data is not really in the form of a graph. But more and more applications right now are working on very complex data, and this kind of data can be nicely represented by graphs, because graphs have a lot of flexibility in representing those kinds of structures. Just like you said, people would naturally use traditional data management systems, key/value stores or relational databases, to support these kinds of new applications. Whenever the data gets larger and the operations get more complex, they will run into problems where the response time is just too long for their online query processing.

Scott Hanselman: If I were to create a naive in-memory graph structure myself, and maybe I said I'm going to have a database of a couple of gigabytes in memory and I'll make just a big giant bunch of C# objects, what kinds of issues would I run into with a naive implementation that would be solved by a formal implementation like Trinity?

Haixun Wang: So basically, first of all you would need the data structure to be stored in memory, because typically graph data does not have locality, which means you cannot really use sequential access to get your information. Usually the neighbors of a particular node are all over the place. So you'd like all that kind of information to reside in the main memory instead of on the disk. So if you only have a small graph, then you can do that using all kinds of runtime structures to manage your graph. That is fine. But usually we need to handle very, very large graphs. Many applications need to handle very large graphs like a social network, or in bioinformatics, where you need to handle the gene sequence of a human being, and for those kinds of data you cannot use a single machine because the memory is not enough. One approach actually used by a lot of current applications is to buy a very, very powerful machine which can have up to, say, one terabyte of memory. So this is sort of the approach people are taking right now, but of course those machines are very costly, and it still has the scalability problem because data is getting larger and larger. So that's the major challenge. For example, if you're managing a large social network like Facebook.
So Facebook has probably close to one billion, let's say 800 million users, and each user on average will have like 150 friends. So for such a graph, we're talking about over one terabyte of information just to represent the graph topology and nothing else, just the graph topology. So you probably cannot use a single machine to do that. And even if you could, a single machine probably does not have that much parallel capability to support so many concurrent online operations. So you want to scale out that system into, for example, a cluster of 15 machines or 200 machines, things like that, and that is actually what we are doing. We're providing this scalability through a cluster of maybe hundreds or even thousands of machines to support many concurrent operations at the same time. In this way, of course, we can support a very, very large graph. We can support, for example, the entire web graph, which will take probably 300 to 500 machines. For the user, we provide a very nice interface. It seems to the user that everything is in the main memory of a single machine, so he does not have to worry about where to find a node's neighbors or how the graph is distributed across so many machines in the cluster. So that's what Trinity is doing.

Scott Hanselman: So to understand, you just said the entire web graph.

Haixun Wang: Yes.

Scott Hanselman: You're talking of like a representation of the web and all of its interlinking nodes, feeling as if it is in memory and available to you like Google or Bing, it's just there and it's instantaneously available.

Haixun Wang: Right, right. I'm only talking about the topology of the graph.

Scott Hanselman: Right.

Haixun Wang: So I'm only talking about, for example, a web page that has on average maybe 50 links to other web pages. I'm only storing the nodes, which is basically a URL, and also its outgoing links in Trinity's memory structure, but the content of a particular web page is not actually stored in memory.
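The one-terabyte figure above is easy to sanity-check. Assuming, purely for illustration, that each neighbor is stored once as an 8-byte node identifier (Trinity's actual encoding is not described here):

```python
# Back-of-envelope check of the topology-size claim: 800 million users,
# ~150 friends each, each neighbor stored as an 8-byte id (an assumption).
users = 800_000_000
avg_friends = 150
bytes_per_neighbor_id = 8  # assumed, not Trinity's real encoding

topology_bytes = users * avg_friends * bytes_per_neighbor_id
print(topology_bytes / 10**12)  # roughly 0.96 TB for the topology alone
```

Store each edge in both directions, or add any per-node overhead, and the total comfortably passes one terabyte, which is the point being made.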
Scott Hanselman: Of course.

Haixun Wang: They're of course in a database. So only the topology of the graph is in main memory.

Scott Hanselman: Right, and that's why I think that the social networking example is such a good one. On your research site, you give the example that a person on average has 120 friends on Facebook, so then if you want to search me, my friends, and my friends' friends, a three-hop search, while that seems quite easy. You know, it's easy to say you're looking at 120, plus 120 squared, plus 120 cubed. That would be quite an amazing SQL query and no small feat to represent in a traditional object relational database, and you're able to do that in just milliseconds.

Haixun Wang: Right. So this is actually a very, very important operation on a social network graph. For example, if someone searches for something on Bing or Google, and assuming Bing and Google have the social networking information, then the search engine would like to use the information of your friend, your friend's friend, and your friend's friend's friend to sort of provide relevance for the search, so this operation has to be done in a very, very small amount of time so that the search engine can take that into consideration in providing answers to the user. So currently Trinity can do that, assuming the data has the same distribution as the Facebook data. Trinity can do this three-hop search within a hundred milliseconds, and if you use key/value pairs or any other approach, I think the time is at least a hundred times or even more than what we can do on Trinity.

Scott Hanselman: Hi, this is Scott coming to you from another place and time. Are you using Agile practices to manage your software development? There are lots of tools in the market that manage the steps of a project, but most of them focus on individual roles. Get ready for a solution that caters for the success of the whole team. The guys at Telerik introduced TeamPulse, the Agile project management tool that will help you gather ideas, estimate, plan, and track progress in a common workspace. Finally, companies, regardless of their size, can use a lightweight and convenient tool that makes all the stakeholders work as a united team even if they're in different countries. By combining an intuitive user interface and the power of Silverlight, TeamPulse removes the roadblocks that you typically face in applying Agile in an effective manner. No more lost data, no disparate systems, no lack of critical analytics regarding the health and velocity of your project. See for yourself, get a free copy for five users in one project at telerik.com/teampulse, and please do thank Telerik for supporting Hanselminutes on their Facebook fan page: facebook.com/telerik. We do appreciate it. There wouldn't be a Hanselminutes if there wasn't Telerik helping us.

Now, a lot of times people accuse Microsoft of reinventing the wheel, although less so with Microsoft Research. Sometimes with Microsoft, people say why don't you use this open source project or why don't you use that open source project. I know that there are other graph databases like Neo4j, and Google has, is it Pregle or Pregel?

Haixun Wang: Pregel.

Scott Hanselman: Pregel. So there are other large-scale graph systems out there. What is unique about Trinity vis-à-vis the other examples?

Haixun Wang: Yeah.
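The three-hop search discussed before the break can be sketched as an ordinary breadth-first search over an in-memory adjacency list (a toy single-machine illustration; Trinity distributes this across a cluster):

```python
from collections import deque

def k_hop_neighbors(adj, start, k):
    """All nodes reachable from start in at most k hops (excluding start),
    found by breadth-first search over an in-memory adjacency list."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past k hops
        for neighbor in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return found

# Toy graph standing in for a social network.
adj = {"me": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"], "d": []}
print(sorted(k_hop_neighbors(adj, "me", 3)))  # ['a', 'b', 'c', 'd', 'e']

# With ~120 friends per user, a three-hop search touches on the order of
# 120 + 120**2 + 120**3 candidate nodes, which is why latency matters:
print(120 + 120**2 + 120**3)  # 1742520
```

On a sharded graph, each frontier expansion becomes a round of lookups against remote machines, which is where the key/value approaches mentioned above lose their hundred-fold factor.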
So like I've mentioned, Trinity is both an online query processing system and an offline batch processing, or analytics, system. In this sense you can think of Trinity as a combination of Neo4j and Pregel. Pregel is for offline processing, and Neo4j is for online query processing, but Trinity can provide both capabilities using a single platform. So that's basically the major difference. Also, there's another thing, which is the scalability. We actually compared with Neo4j when we started this project, and Neo4j - I mean, they have a lot of versions and we have tried a few of them - it seems to me they are first of all disk-resident, and second of all, not really very easy to distribute among a set of machines in a cluster. So mainly people are using Neo4j for smaller graph applications that can be hosted on one machine, but Neo4j does provide a very, very nice user interface and programming interface to support user applications. It's just the scalability that we're trying to improve over Neo4j in this aspect.

Scott Hanselman: I see, I see. I know that the way that you currently access the data that's inside of Trinity is with a C# API, and most of the samples around asynchrony use that API along with the Task Parallel Library that .NET uses for things like Parallel.ForEach. So it seems like you get some free multi-threadedness with the parallel libraries. Do you do additional things within your own libraries to utilize the processors as efficiently as possible?

Haixun Wang: Yes. We actually did a lot of low-level hacking in order to have more parallel capability, but in our latest release we actually wrapped everything, all those parallel computing capabilities, into the API, so the user does not have to explicitly deal with those parallel constructs.
He can just submit his queries as if they were running on a single machine with a single thread, and the system will automatically take care of that and parallelize the computation.

Scott Hanselman: Interesting. So the data could be across one machine or hundreds and hundreds of machines, it could be across 400 processors or 400 x 24 processors, and they don't have to think about it. It's as if they're dealing with an in-memory structure themselves.

Haixun Wang: Yes, exactly.

Scott Hanselman: That's a very comfortable interface. That seems very pleasant. Is it always going to be C# though?

Haixun Wang: Yes. Currently we only provide a C# interface, but of course, in the future we can look into the possibility of providing interfaces for other programming languages.

Scott Hanselman: What about a DSL? What about a Domain Specific Language or query languages? Is there a standard around graph databases as there is around object relational databases?

Haixun Wang: That's a very good question. Actually, we don't have anything like that right now, although we have a side project which is using Trinity as a platform. We're building an RDF data store on top of Trinity, and for that project we have the traditional SPARQL as the query language to query the RDF data, but of course SPARQL is quite limited. The expressive power of SPARQL is quite limited. It's okay for various standard RDF queries, but a lot of operations cannot be expressed using SPARQL. At this moment, we actually don't have a Domain Specific Language for querying graph data. It's actually one of our research focuses, and we're thinking about designing a more flexible query platform, a programming platform, for graph data. But on the other hand, we do have a programming model which is close to the Pregel model implemented at Google. So for example, the users will just write a very, very simple script specifying what kinds of things a single node will do in each round of computation: it will accept some messages from its neighbors, perform some actions on those messages, and then send out messages to its neighbors. So the users will just provide a script to describe these kinds of actions, and the system will take care of their parallel execution. This is how the offline analytics is implemented in Trinity. Of course, this is just one scenario. In another scenario, for example, we also provide breadth-first search, and as we visit every node during the breadth-first search, we will execute scripts specified by the user on each node. That's another programming model.

Scott Hanselman: Do you see graph databases being something that the average person is going to know about? Like, I wouldn't have said 5 years ago or 10 years ago that there would be much interest in document databases, but they seem to have really come of age in the last few years. We've got object relational databases. There were, in fact, a lot of object-oriented databases over the last 20 years, although I don't think they've necessarily broken into the mainstream. Is graph data something that will be kind of obscure and on the edge, or do you think it will be something that the average programmer will be using in 5 or 10 years?

Haixun Wang: I think there's a possibility that the graph is going to be adopted as the major, most important data structure in our future applications.
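The per-node scripting model Haixun describes - each node consumes messages from its neighbors, updates itself, and sends messages back out, round after round - can be sketched like this. The framework and function names here are hypothetical, not Trinity's actual API:

```python
def run_supersteps(adj, vertex_fn, init):
    """Run a vertex-centric computation in synchronized rounds until no
    vertex changes state. vertex_fn(v, state, messages) returns the new
    state; any change is broadcast to the vertex's neighbors."""
    state = {v: init(v) for v in adj}
    # Superstep 0: every vertex announces its initial state to its neighbors.
    inbox = {v: [] for v in adj}
    for v in adj:
        for neighbor in adj[v]:
            inbox[neighbor].append(state[v])
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v in adj:
            new_state = vertex_fn(v, state[v], inbox[v])
            if new_state != state[v]:
                state[v] = new_state
                for neighbor in adj[v]:
                    outbox[neighbor].append(new_state)
        inbox = outbox
    return state

# Per-vertex "script": adopt the smallest label heard so far and pass it on.
# Iterated to convergence, this labels each connected component.
def min_label(v, label, messages):
    return min([label] + messages)

# Two components: {1, 2, 3} and {4, 5} (edges listed in both directions).
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(run_supersteps(adj, min_label, init=lambda v: v))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

In a real system each round's per-vertex calls run in parallel across the machines of the cluster; the sequential loop here is only to show the programming model.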
So if you think about it, the data is getting more and more complex. If the complexity of the data is very limited and the size of the data is also limited, then we can very comfortably use a relational database like SQL Server to provide support for these kinds of applications. But the reality is that data is getting more and more complex in almost every domain: in search, in scientific computing, for example bioinformatics, in social networks, and everything. Data is getting more and more complex, and you cannot use tables to represent that data and hope you will model the data very, very nicely. On the other hand, of course, the size of the data is getting bigger and bigger. So this actually opens up a huge space which a traditional database cannot serve, because it is only good for small, not-that-complex data, and graph data will be a very important player in this very, very big space. A lot of applications will rely on graph data. We have actually recently got many, many requests about what Trinity can do for some specific applications in the game domain, in standard computing, in power applications, and things like that. So I can foresee many, many interesting applications that will require a very, very flexible system that can support graph data.

Scott Hanselman: What do you think about some of the newer graph data stores? I wouldn't say necessarily formal databases, but implementations of graph-like databases using distributed memory structures, like for example redis_graph and some naive implementations of a social graph using Redis, or FlockDB, what Twitter used for their graph database. Do you think that these are fully-fledged graph databases that could compete, or are these just naive implementations over a standard distributed in-memory data structure?

Haixun Wang: I think they can support certain kinds of graph operations, but definitely not all ranges of graph operations.
So for example, in many social networks including Facebook, the user does not have the power to actually go through the huge social networking graph to find information about his neighbors, and the neighbors' neighbors, and so on and so forth. What social networking sites, what Facebook provides on their website is basically, when a user logs in, it provides information about his friends, his direct friends. So there's really just one step, not multiple hops, on this huge graph. If this is the case, then the key/value pair is a very, very nice solution. The key/value pairs will provide the information; you can think of it as providing information about its direct neighbors. But as graph operations get more complex, and some applications, usually analytics applications, require exploring this huge graph and running those processes at a very large scale, then the key/value pair store will not function very well, because fundamentally they still require a lot of joins, a lot of index accesses, for exploration, for going from one node to other nodes. So that's the fundamental problem with those kinds of approaches.

Scott Hanselman: In conclusion, where do you see your project going? Is it always the goal of a researcher to get a project to become a product, or is it simply your goal to expand the knowledge in the space and leave it up to the product people to figure out if they'll sell it or not?

Haixun Wang: So of course we are still a research prototype, but we're actually working with a few product teams within Microsoft to support their applications. So for example, within Microsoft we're dealing with the web graph, we're dealing with the search log and the query graph, and we're dealing with the social graphs. So we're working with the product teams to use Trinity to support those kinds of applications. Our vision is to provide a general purpose graph computation platform.
So using these applications, we're basically trying to see what the priorities are for improving our system, and eventually we would like to provide the system as a general purpose system for a new kind of data, which is graph data.

Scott Hanselman: Actually, one last question, just to get a sense of size so the listeners who are going to go off and research this themselves and learn more about your stuff can understand. I know that as of mid-2010, Twitter's graph database, called FlockDB, had about 13 billion edges, and they were doing traffic of about 20,000 writes a second and 100,000 reads a second. So about 13 billion edges, and I'm sure they've increased since then. Do you have a sense of how big Trinity can get?

Haixun Wang: So we haven't tried. Well, we're still working with the product team to try to use Trinity for the web graph, so we're in the process of it, but 13 billion edges is actually not that big, because for example we are dealing with the web graph, which has, for example, 200 billion nodes, where each node is basically a web page, and each node on average will have 40 to 50 edges. So if you multiply these two together, then it's much bigger than the social graph. So basically the web graph is still the largest man-made graph on earth. It's much, much larger than the social graph, so we're trying to use Trinity to support that.

Scott Hanselman: Wow. It's really comforting to know that you can think about edges in terms of billions and billions and not think of it as being too large. That speaks really well for the work that you guys are doing. Well, thanks so much for chatting with me today. I really appreciate it, giving people outside the research community an opportunity to learn more about what you're doing in Microsoft Research.

Haixun Wang: Thanks for having me.

Scott Hanselman: This has been another episode of Hanselminutes. A big thanks to Haixun Wang and the folks at Microsoft Research Asia. We will see you again next week.