November 24, 2013 1 Comment
Titan is a distributed graph database capable of supporting graphs on the order of 100 billion edges and sustaining on the order of 1 billion transactions a day (see Educating the Planet with Pearson). Software architectures that leverage such Big Graph Data typically have 100s of application servers traversing a distributed graph represented across a multi-machine cluster. These architectures are not common in that perhaps only 1% of applications written today require that level of software/machine power to function. The other 99% of applications may only require a single machine to store and query their data (with a few extra nodes for high availability). Such boutique graph applications, which typically maintain on the order of 100 million edges, are more elegantly served by Titan 0.4.1+. In Titan 0.4.1, the in-memory caches have been advanced to support faster traversals which makes Titan’s single-machine performance comparable to other single machine-oriented graph databases. Moreover, as the application scales beyond the confines of a single machine, simply adding more nodes to the Titan cluster allows boutique graph applications to seamlessly grow to become Big Graph Data applications (see Single Server to Highly Available Cluster).
Graph Representations On-Disk and In-Memory
When representing a graph within a computer, the design of both its on-disk and in-memory representations effect its read/write performance. It is important that designs are sympathetic to the graph usage patterns of the application (see A Letter Regarding Native Graph Databases). For instance, when on-disk, it is best that co-retrieved graph data be co-located else costly random disk seeks will be required. Titan leverages an adjacency list representation on-disk where a vertex and its incident edges are co-located. Titan also allows for the sorting of a vertex’s incident edges according to application-specific edge-comparators. These vertex-centric indices support fine-grained retrieval of a vertex’s incident edges from disk and (hopefully) from the same page of disk (see A Solution to the Supernode Problem).
Once the graph data is retrieved from disk, it is placed into memory. Titan 0.4.1+ maintains two levels of in-memory caching: warm and hot. Warm caches store the hotspots of the global adjacency list and hot caches store the locally traversed sub-graph of a transaction. Both caches keep their data elements (vertices and edges) as byte-buffers. For traversals that move from vertex-to-vertex using edge labels and properties as filters, edge object creation can be avoided by leveraging the aforementioned vertex-centric sort orders and classic binary search. With typical natural graphs having multiple orders of magnitude more edges than vertices, this greatly reduces the number of objects created and the amount of memory used. Moreover, this allows memory to be accessed sequentially which increases the chance of a CPU cache hit (see Random vs. Sequential Memory Access).
In sum, the updates in Titan 0.4.1+ address multiple levels of the memory hierarchy in that the disk, the global cache, and the local transactions each have appropriate data structures respective of their ultimate role in graph processing. The benefits of these designs are demonstrated in the sections to follow.
Titan Graph Traversals on a Boutique Graph
In order to test both Titan’s on-disk and in-memory representations at boutique scale, a graph version of Amazon’s web rating data was generated. The benefit of this data set is two fold: 1.) it is a good size (37 million edges) to demonstrate a single machine use case and 2.) it supports edge properties which demonstrate the benefit of Titan’s vertex-centric indices in graph processing. The schema for the graph is represented on the left where there are two types of vertices: users and products. The graph is bipartite in that users link to products by reviewed-edges. The reviewed-edges maintain two properties: score (a float between 1-5) and time (a long in Unix time).
The benchmark to follow was executed on two types of Amazon EC2 instances:
m2.xlarge (hard disk drive — HDD) and
c3.xlarge (solid state drive — SSD). To demonstrate Titan 0.4.1′s relative performance benefits, Titan 0.4.0 and a popular 3rd party graph database were tested as well. The same code was run on all systems by leveraging TinkerPop‘s Blueprints graph API. The JVM’s heap settings were
-Xms3G -Xmx3G and default options for all systems were used — except for a warm cache being explicitly enabled for Titan 0.4.1:
cache.db-cache=true cache.db-cache-size=0.5 cache.db-cache-time=0
The benchmark was executed as enumerated below. Note that for cold cache times, the query runtime for the first query was taken. For warm cache times, the transaction is rolled back and the query is executed again. For hot cache times, the query is not rolled back and the query is re-executed. The actual Gremlin traversals are provided in the code block below minus a
.count() which was used to ensure all databases returned the same result set. Finally, note that Titan/Cassandra was used for this experiment though the recent cache developments are available to any of Titan’s underlying storage systems.
- 300 random user ids were gathered from the raw Amazon web review data.
- 100 users were queried with the “user preference”-traversal.
- 100 users were queried with the “user similarity”-traversal.
- 100 users were queried with the “product recommendation”-traversal.
// what products has the user liked in the last year? // user preference user.outE('reviewed').has('score',5f).has('time',T.gte,ONE_YEAR_AGO).inV // what users have liked the same products that the user liked within the last year? // user similarity user.outE('reviewed').has('score',5f).has('time',T.gte,ONE_YEAR_AGO).inV .inE('reviewed').has('score',5f).has('time',T.gte,ONE_YEAR_AGO).outV() .except([user]) // what products are liked by the users that like the same products as the user within the last year? // product recommendation user.outE('reviewed').has('score',5f).has('time',T.gte,ONE_YEAR_AGO).inV.aggregate(x) .inE('reviewed').has('score',5f).has('time',T.gte,ONE_YEAR_AGO).outV .except([user])[0.<1000] .outE('reviewed').has('score',5f).has('time',T.gte,ONE_YEAR_AGO).inV .except(x).groupCount(m)
The benchmark results are available for exploratory viewing at the following URLs (made possible by HighCharts JS).
The plots have more data points on the left-end of the x-axis. To examine the results of these smaller result sets, the plots can be zoomed in by dragging a square over that region of the chart. Each plot line can be activated (or deactivated) by clicking on the respective line’s name in the legend of the plot.
It is important to stress that the results of any benchmark should be understood in the context in which they were executed. This benchmark demonstrates relative performance
- on a boutique graph dataset (i.e. a relatively small graph),
- using default settings for the graph databases,
- against two different long term storage mediums (HDD and SSD),
- within the resource constraints of a single machine (
- executing 3 particular read-based queries,
- and with only a single-user/thread querying the database at any one time.
Cold Caches Demonstrate On-Disk Performance
The diagram above plots the number of vertices returned by the traversal on the x-axis against the time it took for the traversal to complete in milliseconds on the y-axis. Titan’s cold cache performance numbers are due in large part to its sequentialization of the graph by co-locating edges with their vertices on disk (see Scalable Graph Computing: Der Gekrümmte Graph). Furthermore, with vertex-centric indices, edges are not only co-located with their vertices but also sorted such that “a score equal to 5″ is quickly fetched using an offset index. When edges are not co-located with their vertices, then the graph is not laid out sequentially. This can lead to poor fetch times as random disk seeks are required (see purple line). The effect of random disk access is made more apparent on the hard disk drive plot (as opposed to the solid state drive plot above). Note that pure sequentialization is difficult to achieve in any multi-user system which allows for random writes. For this reason, various compaction processes occur to combat defragmentation of the data on the disk.
Due to graph compression techniques based on data co-location and logical adjacency, Titan’s disk footprint remains relatively low.
- Titan/Cassandra — 2.9 gigabytes
- Popular Graph Database — 15.0 gigabytes
Sequentialization and a compressed data format are useful when executing distributed OLAP queries with Faunus. Rows of data (i.e. a vertex and its incident edges) can be sequentially pulled off the disk allowing for global parallel scans of the graph to execute more efficiently (see Faunus Provides Big Graph Data Analytics). Moreover, it is faster to move 2.9 gigabytes of data than 15.0 gigabytes.
Warm Caches Demonstrate Off-Heap Cache Performance
Once data has been pulled off of disk, it is put into memory. Titan has always performed well in pulling graph data off of the underlying storage medium (as seen in the previous section with both HDD and SSD). In terms of in-memory performance, Titan 0.4.1 now boasts a warm cache system. The benefit of this addition is made apparent in the plot above when comparing Titan 0.4.0 and Titan 0.4.1. With Titan 0.4.1, heavily accessed areas of the graph are stored in a global cache composed primarily of byte-buffers. If the data is not in the cache (a cache miss), then Cassandra’s caching mechanisms are leveraged (typically a key-cache and an OS-based page cache). It is important to emphasize that a warm cache miss in Titan 0.4.1 yields runtimes equivalent to those seen in Titan 0.4.0 (which for the complex traversals presented, remain sub-second — see the gray line). These sub-second cache miss times are possible because of Titan’s on-disk representation. Graph sequentialization enables effective use of the OS’s page cache, where “going to disk” will typically not yield random seeks.
Hot Caches Demonstrate On-Heap Transaction Performance
To answer a query, the traversal primitives first check for data in the current transaction’s hot cache, then the global warm cache, and then on a particular page of disk (which again, may be in memory because of the operating system’s page cache). Wherever the data is ultimately found, it is computed on in the local transaction’s hot cache. For traversals that move from vertex-to-vertex by filtering paths using edge properties and labels, Titan does not need to deserialize the edges into objects. Instead, it operates directly on the byte-buffers stored in the warm cache. This greatly reduces the number of objects created (
|V| << |E|), the amount of time spent garbage collecting the young generation heap, and stages Titan nicely for speedy, recursive graph algorithms (to be explored in a future blog post).
Aurelius has consistently touted Titan as a Big Graph Data solution for OLTP graph applications that leverage graphs and transactional loads so large that they require more resources than a single machine can provide. However, for organizations looking to use a graph database, 100s of billions of edges is typically not their starting point. Titan 0.4.1+ better supports boutique shops beginning with graphs and wanting to ultimately procure a massive, universal graph of their domain. Moreover, all this graph processing functionality is provided under the commercial friendly, Apache2, free software license. With Titan 0.4.1+, boutique graph applications can freely move from a single machine, to a highly available single node (e.g. a 3 machine cluster with replication factor 3), to a fully distributed graph represented across an arbitrary number of machines.
This post was embarked on by Dr. Rodriguez after he wrote an email to the Aurelius team stating that Titan’s in-memory cache should be optimized for boutique graph use cases. Pavel Yaskevich took up the challenge and was able to increase traversal speeds using software hotspot profiling. After which, Dr. Bröcheler yielded further optimizations by cleverly leveraging the sequential in-memory byte-buffer representations of the graph data. Dr. Julian McAuley provided Aurelius a pre-processed version of the Amazon web review data under the BSD license and Daniel Kuppitz leveraged that data set in order to painstakingly stage and execute the benchmark presented.