Titan Provides Real-Time Big Graph Data

Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. To substantiate this claim, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users interact with 40 m1.small Amazon EC2 servers, which in turn transact with a 6-machine Amazon EC2 cc1.4xl Titan/Cassandra cluster.

The presentation to follow discusses the simulation’s social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In summary, Titan performs well under substantial load on a relatively inexpensive cluster and, as such, is capable of backing online services requiring real-time Big Graph Data.

The Social Graph’s Structure and Processes

An online social networking service like Twitter typically supports the 5 operations enumerated below.

  1. create an account: create a new user with the provided handle.
  2. publish a tweet: disseminate a <140 character message.
  3. read stream: get a time-ordered list of 10 tweets from the followed users.
  4. follow a user: subscribe to the tweets of another user.
  5. get a recommendation: receive a solicitation of potentially interesting users to follow.

These operations lead to the emergence of a property graph structure epimorphic to the schema diagrammed on the right. In this schema, there are user vertices and tweet vertices. When a user tweets, a tweets edge connects the user to their tweet. Moreover, all of the followers of that user (i.e. subscribers) have a timestamped outgoing stream edge attaching their vertex to the tweet. For each user vertex, the stream edges are sorted by time as, in this system, time is a declared primary key. Titan supports vertex-centric indices which ensure O(log(n)) lookups of adjacent vertices based on the incident edge labels and properties, where n is the number of edges emanating from the vertex. For the sake of simulation, the artificially generated tweets are randomly selected snippets from Homer’s The Odyssey (as provided by Project Gutenberg), where the length is sampled from a Gaussian distribution with a mean of 70 characters.
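To make this write path concrete, below is a minimal sketch of the publish-a-tweet operation expressed in Gremlin-Groovy against the Blueprints API. The helper name publishTweet is hypothetical; the vertex and edge labels match the schema just described.

def publishTweet(graph, user, message) {
  def tweet = graph.addVertex(null)
  tweet.setProperty('text', message)
  graph.addEdge(null, user, tweet, 'tweets')
  // fan the tweet out: each follower receives a timestamped stream edge
  user.inE('follows').outV.each { follower ->
    def stream = graph.addEdge(null, follower, tweet, 'stream')
    stream.setProperty('time', System.currentTimeMillis())
  }
}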

To provide a foundational layer of data, the Twitter graph as of 2009 was first loaded into the Titan cluster. This data comprises 41.7 million user vertices and 1.47 billion follows edges (see Kwak et al., 2010). After loading, the 40 m1.small machines were put into a “while(true) loop” (more precisely, 10 concurrent threads on each worker ran 125,000 iterations each). During each iteration of the loop, a worker selects an operation to enact using a biased coin toss (see the diagram on the left). The distribution heavily favors stream reading, as this is typically the most prevalent operation in such online social systems. Additionally, whenever a recommendation is provided, there is a 30% chance that the user will follow one of the recommended users; this is how new follows edges are added to the graph.
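The sketch below illustrates one such worker thread. The helpers pickRandomUser, readStream, publishTweet, getRecommendation, followOneOf, createAccount, and odysseySnippet are hypothetical stand-ins, and the operation weights are inferred from the per-operation transaction counts reported in the tables that follow.

def random = new Random()
125000.times {
  def user = pickRandomUser()
  def coin = random.nextDouble()
  if (coin < 0.76) {
    readStream(user)                             // ~76% of transactions
  } else if (coin < 0.91) {
    publishTweet(graph, user, odysseySnippet())  // ~15%
  } else if (coin < 0.99) {
    def recs = getRecommendation(user)           // ~8%
    if (!recs.isEmpty() && random.nextDouble() < 0.3)
      followOneOf(user, recs)                    // 30% chance to follow a recommendation
  } else {
    createAccount()                              // ~1%
  }
}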

A follows recommendation (e.g. “who to follow”) makes use of the existing follows edges to determine, for a particular user, other users that they might find interesting. Typically, some variant of a triangle closure is computed in such situations. In plain English: if the users that user A follows tend to follow user B, then user B is likely a good user for user A to follow. To capture this notion as a real-time graph algorithm, the Gremlin graph traversal language is used (in the traversal below, random is an instance of java.util.Random).

follows = g.V('name',name).out('follows').toList()
follows20 = follows[(0..19).collect{random.nextInt(follows.size)}]
m = [:]
follows20.each { it.outE('follows')[0..29].inV.except(follows).groupCount(m).iterate() }
m.sort{a,b -> b.value <=> a.value}[0..4]
  1. Retrieve all the users that the user follows, where name is the user’s unique Twitter handle.
  2. Randomly select 20 of those followed users (the random sampling provides variation on each invocation).
  3. Create an empty associative array/map that will be populated with recommendation rankings.
  4. For each of the 20 random followed users, get their 30 most recently followed users that are not already followed, and score them in the map.
  5. Reverse sort the map and return the top 5 users as recommendations.
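In the worst case, then, a single recommendation reads about 20 × 30 = 600 follows edges (plus one index lookup to find the user), independent of the total size of the graph; this constant bound is what keeps the traversal real-time.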

Note that vertex-centric indices come into play again in line 4, where follows edges (like stream edges) have a primary key of time and are thus chronologically ordered. Retrieving the 30 most recently followed users is therefore a single O(log(n)) lookup, where again, n is the number of edges emanating from the vertex.
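For reference, such time-ordered labels might be declared as sketched below. This assumes the TypeMaker interface of the 0.x Titan releases, so treat the exact method names as illustrative rather than definitive.

// declare a functional time property and use it as the primary key
// of the stream and follows edge labels (assumed 0.x TypeMaker API)
time = graph.makeType().name('time').dataType(Long.class).functional().makePropertyKey()
stream = graph.makeType().name('stream').primaryKey(time).makeEdgeLabel()
follows = graph.makeType().name('follows').primaryKey(time).makeEdgeLabel()

With time declared as a primary key, Titan keeps each vertex’s stream and follows edges sorted on disk, which is what makes the [0..29] range in line 4 an index read rather than a scan over all incident edges.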

Titan Serving 50,000–100,000 Concurrent Users

Titan is an OLTP graph database. It is designed to handle numerous short, concurrent transactions like the ones discussed previously. In this section, Titan’s performance under normal (5,900 transactions per second) and peak (10,200 transactions per second) load is presented. We consider what follows to be a reasonable benchmark: no specialized hardware is required (standard EC2 machines), no complex configuration or tuning of either Cassandra or Titan was performed, and all worker code goes through the standard Blueprints API.
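As a sketch of what worker code looks like at the Blueprints level, the following opens Titan against a Cassandra cluster and demarcates one short transaction. The hostname is hypothetical, and the configuration keys assume the standard Titan-on-Cassandra settings of the era.

conf = new org.apache.commons.configuration.BaseConfiguration()
conf.setProperty('storage.backend', 'cassandra')
conf.setProperty('storage.hostname', '10.0.0.1')  // hypothetical seed node
graph = com.thinkaurelius.titan.core.TitanFactory.open(conf)

graph.startTransaction()
// ... one short unit of work, e.g. publishTweet(graph, user, message) ...
graph.stopTransaction(com.tinkerpop.blueprints.TransactionalGraph.Conclusion.SUCCESS)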

Normal Load

The normal load simulation ran for 2.3 hours, during which 49 million transactions occurred, or approximately 5,900 transactions a second. Assuming that a human user executes a transaction every 5–10 seconds (e.g. reads their stream and then publishes a tweet), this Titan cluster is supporting approximately 50,000 concurrent users. The table below presents the number of transactions per operation, the mean transaction time, the standard deviation of those times, and the 3 sigma time. 3 sigma is 3 standard deviations above the mean and approximates the worst-case time that roughly 0.1% of transactions will exceed. Finally, note that creating an account is a slower transaction because it is a locking operation that ensures no two users have the same username (i.e. Twitter handle).
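As a quick sanity check on the arithmetic: 49 million transactions over 2.3 hours (8,280 seconds) works out to about 5,900 transactions a second; at one transaction per user every 8.5 seconds (the midpoint of the 5–10 second range), that throughput supports 5,900 × 8.5 ≈ 50,000 concurrent users.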

action              number of tx    mean tx time    std of tx time    3 sigma tx time
create an account   379,019         115.15 ms       5.88 ms           132.79 ms
publish a tweet     7,580,995       18.45 ms        6.34 ms           37.48 ms
read stream         37,936,184      6.29 ms         1.62 ms           11.15 ms
get recommendation  3,793,863       67.65 ms        13.89 ms          109.33 ms
total               49,690,061

After 2.3 hours of the aforementioned transactions, the following types of vertices and edges were added to the pre-existing 2009 Twitter graph. The second list extrapolates this behavior to a full day.

In 2.3 hours:
  • 361,000 user vertices
  • 7.58 million tweets (tweet vertices)
  • 7.58 million tweets (tweets edges)
  • 150 million stream edges
  • 1.12 million follows edges
  • total: 166.6 million elements

Extrapolated to 1 day:
  • 3.78 million user vertices
  • 79.33 million tweets (tweet vertices)
  • 79.33 million tweets (tweets edges)
  • 1.57 billion stream edges
  • 11.79 million follows edges
  • total: 1.75 billion elements

Peak Load

To determine how Titan would perform in a peak load environment, the 40 worker machines together executed 10,200 transactions a second for 1.3 hours (49 million total transactions). This simulates approximately 100,000 concurrent users. Transaction numbers and timing statistics are provided in the table below. Higher latencies are expected under the heavier load; even so, the transaction times remain acceptable for a real-time online service.

action              number of tx    mean tx time    std of tx time    3 sigma tx time
create an account   374,860         172.74 ms       10.52 ms          204.29 ms
publish a tweet     7,517,667       70.07 ms        19.43 ms          128.35 ms
read stream         37,618,648      24.40 ms        3.18 ms           33.93 ms
get recommendation  3,758,266       229.83 ms       29.08 ms          317.06 ms
total               49,269,441

Amazon EC2 Machinery and Costs

The simulation presented was executed on Amazon EC2. The software infrastructure to run this simulation made use of CloudFormation. In terms of the hardware infrastructure, this section discusses the instance types, their physical statistics during the experiment, and the cost of running this architecture in a production environment.

The 40 workers were m1.small Amazon EC2 instances (1.7 GB of memory with 1 virtual core). The Titan/Cassandra cluster was composed of 6 machines each with the following specification.

  • 23 GB of memory
  • 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem” architecture)
  • 1,690 GB of storage
  • 64-bit platform
  • 10 Gigabit Ethernet
  • EC2 API name: cc1.4xlarge

Under the normal load simulation, the 6-machine Titan cluster experienced the following CPU utilization, disk reads (in bytes), and disk writes (in bytes), where each colored line represents 1 of the 6 cc1.4xlarge machines. Note that the disk read chart is a 1-hour snapshot from the middle of the experiment, and the caches are therefore warm. In summary, Titan consistently, and without strain, maintains the normal transactional load.

The cost of running all these machines is provided in the table below. Note that in a production environment (non-simulation), the 40 workers can be interpreted as web servers taking user requests and processing results returned from the Titan cluster.

instance      cost per hour    cost per day    cost per year
6 cc1.4xl     $7.80            $187.20         $68,328
40 m1.small   $3.20            $76.80          $28,032
total         $11.00           $264.00         $96,360
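The hourly figures break down per machine as 6 × $1.30 = $7.80 for the cc1.4xl instances and 40 × $0.08 = $3.20 for the m1.small workers; the yearly total follows as $11.00/hour × 24 × 365 = $96,360.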

For serving 50,000–100,000 concurrent users, $96,360 a year is inexpensive relative to the revenue expected from a user base of that size (assuming 5% of the user base is concurrent, that is roughly 2 million registered users). Moreover, Titan can be deployed over an arbitrary number of machines and dynamically scaled to meet load requirements (see The Benefits of Titan). Therefore, this 6 cc1.4xl architecture is not a necessity, but one particular configuration explored for the purpose of the presented social simulation. For environments with less load, a smaller cluster can and should be used.

Conclusion

Titan has been in research and development for the past 4 years. In Spring 2012, Titan was made freely available by Aurelius under the liberal Apache 2 license. It is currently distributed as a 0.1 alpha, with a 0.1 release planned for the end of Summer 2012.

Note that Titan is but one piece of the larger graph puzzle. Titan serves the OLTP aspect of graph processing. By the middle of Fall 2012, Aurelius will release a collection of OLAP graph technologies to support global graph processing and analytics. All of the Aurelius technologies will integrate with one another as well as with the suite of open source, BSD licensed graph technologies provided by TinkerPop. By standing on the shoulders of giants (e.g. Cassandra, TinkerPop, Amazon EC2), great leaps and bounds in applied graph theory and network science are possible.

References

Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.

Rodriguez, M.A., Broecheler, M., “Titan: The Rise of Big Graph Data,” Public Lecture at Jive Software, Palo Alto, 2012.

Broecheler, M., LaRocque, D., Rodriguez, M.A., “Titan: A Highly Scalable, Distributed Graph Database,” GraphLab Workshop 2012, San Francisco, 2012.

Authors

Matthias Broecheler, Dan LaRocque, Marko A. Rodriguez
