Educating the Planet with Pearson
May 13, 2013 3 Comments
Pearson is striving to accomplish the ambitious goal of providing an education to anyone, anywhere on the planet. New data processing technologies and theories in education are moving much of the learning experience into the digital space — into massive open online courses (MOOCs). Two years ago Pearson contacted Aurelius about applying graph theory and network science to this burgeoning space. A prototype proved promising in that it added novel, automated intelligence to the online education experience. However, at the time, there did not exist scalable, open-source graph database technology in the market. It was then that Titan was forged in order to meet the requirement of representing all universities, students, their resources, courses, etc. within a single, unified graph. Moreover, beyond representation, the graph needed to be able to support sub-second, complex graph traversals (i.e. queries) while sustaining at least 1 billion transactions a day. Pearson asked Aurelius a simple question: “Can Titan be used to educate the planet?” This post is Aurelius’ answer.
Data Loading Benchmark
Pearson provides free online education through its OpenClass platform. OpenClass is currently in beta with adoption by ~7000 institutions. To meet the expected growth beyond beta, it is necessary to build the platform on a scalable database system.
Moreover, it is important to build the platform on a database system that can support advanced algorithms beyond simple get/put-semantics. The latter was demonstrated via the initial prototype. To the former, a simulation of a worldwide education environment was created in Titan to alleviate Pearson’s scalability concerns.
| Number of students | 3.47 billion |
| Number of teachers | 183 million |
| Number of courses | 788 million |
| Number of universities | 1.2 million |
| Number of concepts | 9 thousand |
| Number of educational artifacts | ~1.5 billion |
| Total number of vertices | 6.24 billion |
| Total number of edges | 121 billion |
The simulated world was a graph containing 6.24 billion vertices and 121 billion edges. The edges represent students enrolled in courses, people discussing content, content referencing concepts, teachers teaching courses, material contained in activity streams, universities offering courses, and so forth. Various techniques were leveraged to ensure that the generated data was consistent with a real-world instance. For example, people names were generated from sampling the cross product of the first and last names in the US Census Bureau dataset. Gaussian distributions were applied to determine how many courses a student should be enrolled in (mean of 8) and how many courses a teacher should teach (mean of 4). The course names and descriptions were drawn from the raw MIT OpenCourseWare data dumps. Course names were appended with tokens such as “101,” “1B,” “Advanced,” etc. in order to increase the diversity of the offerings. Student comments in discussions were sampled snippets of text from the electronic books provided by Project Gutenberg. University names were generated from publicly available city name and location datasets in the CommonHubData project. Finally, concepts were linked to materials using OpenCalais. The final raw education dataset was 10 terabytes in size.
A 121 billion edge graph is too large to fit within the confines of a single machine. Fortunately, Titan/Cassandra is a distributed graph database able to represent a graph across a multi-machine cluster. The Amazon EC2 cluster utilized for the simulation was composed of 16 hi1.4xlarge machines. The specification of the machines is itemized below.
- 60.5 GiB of memory
- 35 EC2 Compute Units (16 virtual cores and 64-bit)
- 2 SSD-based volumes each with 1024 GB of instance storage
- I/O Performance: Very High (10 Gigabit Ethernet)
The 10 terabyte, 121 billion edge graph was loaded into the cluster in 1.48 days at a rate of approximately 1.2 million edges a second with 0 failed transactions. These numbers were possible due to new developments in Titan 0.3.0 whereby graph partitioning is achieved using a domain-based byte order partitioner. With respect to education, the dataset maintains a structure where most edges are inter-university rather than intra-university (i.e. students and teachers typically interact with others at their own university, and even more so within their own courses). As such, domain partitioning is possible where vertices within the university (i.e. graph community) are more likely to be co-located on the same physical machine.
Once in Titan, the raw 10 terabyte dataset was transformed to 5 terabytes due to data agnostic Snappy compression and the use of Titan-specific graph compression techniques (e.g. “id deltas,” type definitions, and Kryo serialization). After the data was loaded, it was immediately backed up to Amazon S3. This process took 1.4 hours using Titan’s parallel backup strategy. The nodetool statistics for the Titan/Cassandra cluster are provided below.
Address Rack Status State Load Token 10.244.196.10 rack1 Up Normal 329.44 GB Token(bytes[c000000000000000]) 10.244.194.142 rack1 Up Normal 348.62 GB Token(bytes[3000000000000000]) 10.244.196.111 rack1 Up Normal 330.86 GB Token(bytes[b000000000000000]) 10.244.195.155 rack1 Up Normal 333.57 GB Token(bytes[a000000000000000]) 10.244.195.243 rack1 Up Normal 330.91 GB Token(bytes[9000000000000000]) 10.244.195.209 rack1 Up Normal 326.57 GB Token(bytes[f000000000000000]) 10.244.195.93 rack1 Up Normal 355.26 GB Token(bytes[4000000000000000]) 10.244.195.179 rack1 Up Normal 325.73 GB Token(bytes[e000000000000000]) 10.244.195.57 rack1 Up Normal 351.47 GB Token(bytes[1000000000000000]) 10.244.196.27 rack1 Up Normal 332.87 GB Token(bytes[d000000000000000]) 10.244.196.93 rack1 Up Normal 351.81 GB Token(bytes[2000000000000000]) 10.244.195.6 rack1 Up Normal 331.56 GB Token(bytes[8000000000000000]) 10.244.195.7 rack1 Up Normal 327.55 GB Token(bytes[0000000000000000]) 10.244.196.84 rack1 Up Normal 345.2 GB Token(bytes[5000000000000000]) 10.244.195.8 rack1 Up Normal 351.26 GB Token(bytes[6000000000000000]) 10.244.194.178 rack1 Up Normal 338.07 GB Token(bytes[7000000000000000])
Transactional Benchmark
The purpose of the second half of the experiment was to subject the 121 billion edge education graph to numerous concurrent transactions. These transactions simulate users interacting with the graph — solving educational problems, adding more content, discussing ideas with one another, etc. To put the 16 hi1.4xlarge cluster under heavy load, 80 m1.medium machines were spawned. These 80 machines simulate the application servers querying the graph and providing users the front-end experience. Each machine maintained 30 threads in a “while(true)-loop,” randomly selecting 1 of 16 transactional templates below and executing it. Thus, 2,400 concurrent threads communicated with the 16 hi1.4xlarge Titan cluster. A review of the Gremlin queries and their mean runtimes with standard deviations are presented in the table below. Note that many of the transactions are “complex” in that they embody a series of actions taken by a user and thus, in a real-world setting, these would typically be broken up into smaller behavioral units (i.e. multiple individual transactions).
name # of tx avg (ms) std dev description scommentread 25550909 211.07 45.56 student reads the most recent comments for their courses reccourse 5149469 467.37 178.20 students gets recommended courses to take scommentshare 12825909 394.15 57.98 student reads comments in courses and shares a comment scontent 20567687 279.32 81.83 student retrieves all content for a single course in their course list saddfollow 12826912 193.72 22.77 student follows another student scourses 7720965 233.38 79.44 student retrieves a list of all their courses with description classmates 12849769 96.962 22.27 student retrieves a list of all their classmates sprofile 7689669 53.740 22.61 student retrieves their profile recfriend2 5178945 155.75 44.60 student is recommended people to follow (version 1) scourseactivity 10371133 565.76 189.80 student retrieves the top 10 most recent activities in their courses scommentadd 5182558 281.90 44.326 student reads course comments and comments at some depth in the discussion tree recfriend1 5189435 241.33 256.48 student is recommended people to follow (version 2) sreshare 12850473 284.07 68.20 student reads their stream and shares an item with followers ssharelink 5140363 261.58 35.75 student shares a link with followers sdiscussadd 2604696 246.35 34.64 student browses courses and then adds a new discussion topic to a course sstream 76301001 224.93 84.48 student reads their personal stream
The transactional benchmark ran for 6.25 hours and executed 228 million complex transactions at a rate of 10,267 tx/sec. Provided the consistent behavior over those 6.25 hours, it is inferred that Titan can serve ~887 million transactions a day. Given the complex nature of the transactions, a real-world production system of this form should be able to sustain greater than 1 billion transactions a day.
Conclusion
A first attempt at this benchmark was done at the turn of the year 2012. That benchmark was for a 260 billion edge graph (7 billion students) constructed using the same graph generation techniques described in this post. The graph data was loaded, but the transactional benchmark was never executed because the requisite cluster architecture could not be fulfilled by Amazon EC2 (all 32 high I/O machines in the same placement group). Thus, the smaller-scale benchmark presented here was what was possible given EC2′s resources at the time.
While this benchmark demonstrates that Titan can handle the data scale and transactional load of “99%” of applications developed today, Aurelius is not sufficiently satisfied with the presented transactional results. By reducing the transactional load to ~5,000 tx/sec, the runtimes of the queries dropped to the 100 millisecond range. The desired result was to achieve ~100ms query times at 10k tx/sec and 250ms at 25k tx/sec. Fortunately, much was learned about Titan/Cassandra during this exploration. Advances in both inter-cluster connection management and machine-aware traversal forwarding are currently in development in order to reach the desired scaling goals.
A scalable graph database is only half the story — the story presented here. Once scaling has been accomplished, the next problem is yielding knowledge from data. For the last 2 years, Aurelius has been working with Pearson to develop novel algorithms for the education space that move beyond the typical activities seen in current web-systems: profiles, streams, recommendations, etc. Online education is ripe for automating much of the intelligence currently contributed by students and teachers. With a collective graphical model, Aurelius’ algorithms move beyond creating a dynamic interface to providing computational human augmentation. These techniques generalize and should prove fruitful to other domains.
Acknowledgements
This report was made possible due to funding provided by Pearson Education. Over the 5 months that this benchmark was developed, the following individuals provided support: Pavel Yaskevich (Cassandra tuning and Titan development) and Stephen Mallette/Blake Eggleston (RexPro development and testing). Various issues in Cassandra were identified at scale and the Cassandra community readily accepted patches and quickly released newer versions in support of the effort. Note that Titan is an open source, Apache2 licensed graph database. The Titan community has supported the project via patch submissions and testing across use cases and cluster configurations. Finally, Steve Hill of Pearson has done much to ensure Titan’s success — pushing Aurelius to answer the question: “Can Titan educate the planet?”




The team updates the default
With a
The following summer, the prophecy of the Oracle of Delphi comes true. An announcement is made in the
To remedy the situation, 6 more Titan Server machines are added to the cluster for a total of 9 machines. The 























The first step to any repeated analyses of a graph using Faunus is to pull the requisite data from a source location. For the examples in this post, the graph source is Titan/HBase. In the code snippet above, the 
Faunus has two parallel data flows: graph and sideeffect. Each MapReduce job reads the graph, mutates it in some way, and then writes it back to HDFS as 
If the number of traversers at a particular element are required (i.e. a count — above) as oppposed to the specific traverser instances themselves (and their respective path histories — below), then the time it takes to compute a 
While this result might seem outlandish, it is possible to analytically estimate the empirically derived path counts. The average degree of the vertices in the graph is 6, but the total number of 5-step paths is much more sensitive to the connectivity of high degree vertices. When analyzing only the top 25% most connected vertices — the 200k vertices shown in red below the blue line — the average degree is 260. This yields an estimated path count of:
In
President 
Suppose there is a vertex named Dan. Incident to Dan are 1,110 edges. These edges denote the people Dan knows (10 edges), the things he likes (100 edges), and the tweets he has tweeted (1000 edges). If Dan wants a list of all the people he knows and incident edges are not indexed by label, then Dan would have to iterate through all 1,110 edges to find the 10 people he knew. However, if Dan’s edges are indexed by edge label, then a lookup into a hash on 


The plot on the right displays the number of vertices returned by each query over each graph size. As expected, the number of
The runtimes compound in a 





The deployed cluster is diagrammed on the right where each machine maintains its respective software services. The sections to follow will demonstrate how to load and then process graph data within the cluster. Titan will serve as the data source for Faunus’ batch analytic jobs.


