Faunus Provides Big Graph Data Analytics

Faunus is an Apache 2 licensed, distributed graph analytics engine optimized for batch processing graphs represented across a multi-machine cluster. Faunus makes global graph scans efficient by leveraging sequential disk reads/writes in concert with various on-disk compression techniques. Moreover, for non-enumerative calculations, Faunus scales linearly in the face of combinatorial explosions. To substantiate these claims, this post presents a series of analyses over a graph representation of Wikipedia (as provided by DBpedia version 3.7). The DBpedia knowledge graph is stored in a 7-machine m1.xlarge Titan/HBase cluster on Amazon EC2 and then batch processed using Faunus/Hadoop. Within the Aurelius Graph Cluster, Faunus provides Big Graph Data analytics.

Ingesting DBpedia into Titan

The DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases. (via DBpedia.org)

DBpedia is a Linked Data effort focused on providing a machine-consumable representation of Wikipedia. The N-Triples format distributed by DBpedia maps easily to the property graph model supported by many graph computing systems, including Faunus. The data is ingested into the 7-machine m1.xlarge Titan/HBase cluster on Amazon EC2 using the BatchGraph wrapper of the Blueprints graph API.
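
To give a flavor of that ingestion pattern, here is a minimal Gremlin/Groovy sketch, assuming a Titan properties file on disk. The vertex identifiers, buffer size, and edge label are illustrative stand-ins; the actual loader streams DBpedia's N-Triples dumps triple by triple.

import com.thinkaurelius.titan.core.TitanFactory
import com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph
import com.tinkerpop.blueprints.util.wrappers.batch.VertexIDType

// BatchGraph buffers writes and commits every 10000 mutations, which keeps
// transactions small and memory bounded during a bulk load of this scale
g  = TitanFactory.open('bin/titan-hbase.properties')
bg = new BatchGraph(g, VertexIDType.STRING, 10000L)

// each triple (subject, predicate, object) maps to two vertices and an edge
s = bg.addVertex('http://dbpedia.org/resource/Jerry_Garcia')
o = bg.addVertex('http://dbpedia.org/resource/Grateful_Dead')
bg.addEdge(null, s, o, 'subject')

bg.shutdown() // commits any remaining buffered mutations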

Faunus’ Integration with Titan

On each region server in the Titan/HBase cluster there exists a Hadoop DataNode and TaskTracker. Faunus uses Hadoop to execute breadth-first representations of Gremlin queries/traversals by compiling them down to a chain of MapReduce jobs. Hadoop's SequenceFile format serves as the intermediate HDFS data format between jobs (i.e. traversal steps). Within the SequenceFile, Faunus leverages compression techniques such as variable-width encoding and prefix compression schemes to ensure a small HDFS footprint. Global analyses of the graph can execute more quickly than is possible from a graph database such as Titan because the SequenceFile format does not maintain the data structures necessary for random read/write access and, being immutable, can more easily be laid out sequentially on disk.
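
As a rough illustration of the variable-width idea (a toy sketch, not Faunus' actual serializer; writeVarLong is a hypothetical helper), the Groovy closure below encodes a long in 7-bit chunks so that small values, such as most element identifiers, occupy only a byte or two:

// encode a long as a sequence of 7-bit chunks with continuation bits
def writeVarLong = { long v ->
    def bytes = []
    while ((v & ~0x7FL) != 0) {              // more than 7 bits remain
        bytes << (byte) ((v & 0x7F) | 0x80)  // set the continuation bit
        v >>>= 7
    }
    bytes << (byte) v                        // final chunk, bit clear
    bytes
}

assert writeVarLong(5L).size() == 1          // a small id costs one byte
assert writeVarLong(30962172L).size() == 4   // a DBpedia-scale id costs four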

ubuntu@ip-10-140-13-228:~/faunus$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')
==>faunusgraph[titanhbaseinputformat]
gremlin> g.getProperties()
==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.output.location=dbpedia
==>faunus.output.location.overwrite=true
gremlin> g._() 
12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]
12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003
...
gremlin> hdfs.ls()
==>rwxr-xr-x ubuntu supergroup 0 (D) dbpedia
gremlin>

The first step in any repeated analysis of a graph using Faunus is to pull the requisite data from its source location. For the examples in this post, the graph source is Titan/HBase. In the Gremlin session above, the identity function (g._()) is evaluated, which simply maps the Titan/HBase representation of DBpedia over to an HDFS SequenceFile. This process takes approximately 16 minutes. The chart below presents the average number of bytes per minute written to and from the cluster's disks during two distinct phases of processing:

  1. On the left is the ingestion of the raw DBpedia data into Titan via BatchGraph. Numerous low-volume writes occur over a long period of time.
  2. On the right is Faunus’ mapping of the Titan DBpedia graph to a SequenceFile in HDFS. Fewer, high-volume reads/writes occur over a shorter period of time.

The plot reiterates the known result that sequential reads from disk are nearly 1.5x faster than random reads from memory and 4-5 orders of magnitude faster than random reads from disk (see The Pathologies of Big Data). Faunus capitalizes on these features of the memory hierarchy to ensure rapid full-graph scans.

Faunus’ Dataflows within HDFS: Graph and SideEffect

Faunus has two parallel data flows: graph and sideeffect. Each MapReduce job reads the graph, mutates it in some way, and then writes it back to HDFS as graph* (or to its ultimate sink location). The most prevalent mutation to graph* is the propagation of traversers (i.e. the state of the computation). The graph SequenceFile encodes not only the graph data, but also computational metadata such as which traversers are at which elements (vertices/edges). Other mutations are more structural in nature, such as property updates and/or edge creation (e.g. graph rewriting). The second data flow captures step-specific statistics about the graph, which are stored in sideeffect*. Side-effects include, for example:

  • aggregates: counts, groups, sets, etc.
  • graph data: element identifiers, properties, labels, etc.
  • traversal data: enumeration of paths.
  • derivations: functional transformations of graph data.

The Gremlin session below demonstrates both flows: the graph produced by the previous job is read from dbpedia/job-0, and the side-effect of an edge-label distribution (g.E.label.groupCount()) is written as text to output/job-0.
gremlin> g.getProperties()
==>faunus.graph.input.format=org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
==>faunus.input.location=dbpedia/job-0
==>faunus.graph.output.format=com.thinkaurelius.faunus.formats.noop.NoOpOutputFormat
==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
==>faunus.output.location=output
==>faunus.output.location.overwrite=true
gremlin> hdfs.ls('dbpedia/job-0')
==>rw-r--r-- ubuntu supergroup 426590846 graph-m-00000
==>rw-r--r-- ubuntu supergroup 160159134 graph-m-00001
...
gremlin> g.E.label.groupCount()
...
gremlin> hdfs.ls('output/job-0')
==>rw-r--r-- ubuntu supergroup 37 sideeffect-r-00000
==>rw-r--r-- ubuntu supergroup 18 sideeffect-r-00001
...
gremlin> hdfs.head('output/job-0')
==>deathplace	144374
==>hasBroader	1463237
==>birthplace	561837
==>page	8824974
==>primarytopic	8824974
==>subject	13610094
==>wikipageredirects	5074113
==>wikiPageExternalLink	6319697
==>wikipagedisambiguates	1004742
==>hasRelated	28748
==>wikipagewikilink	145877010

The Traversal Mechanics of Faunus

It is important to understand how Faunus stores computation within the SequenceFile. When the step g.V is evaluated, a single traverser (a long value of 1) is placed on each vertex in the graph. When count() is evaluated, the number of traversers in the graph is summed and returned. A similar process occurs for g.E, save that a single traverser is placed on each edge in the graph.

gremlin> g.V.count()
==>30962172
gremlin> g.E.count()
==>191733800
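
To make this bookkeeping concrete, the toy Groovy model below mimics the counting semantics on a three-vertex graph (illustrative only; Faunus performs this propagation as MapReduce jobs over the SequenceFile). Each vertex carries a long counter, and an out step forwards that counter along every outgoing edge.

// adjacency: vertex id -> out-neighbor ids (a toy graph, not DBpedia)
def adj = [1: [2, 3], 2: [3], 3: [1]]

// g.V places a single traverser (the long value 1) on every vertex
def traversers = adj.keySet().collectEntries { [(it): 1L] }

// an out step forwards each vertex's count to its out-neighbors; counts
// are summed per vertex instead of enumerating individual paths
def out = { Map t ->
    def next = [:].withDefault { 0L }
    t.each { v, c -> adj[v].each { n -> next[n] += c } }
    next
}

// g.V.out.out.count() -- the number of 2-step paths in the toy graph
assert out(out(traversers)).values().sum() == 5L

Because only per-vertex counts are carried forward, each additional out step costs a single pass over the edges, no matter how large the counts grow.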

If only the number of traversers at a particular element is required (i.e. a count — above) as opposed to the specific traverser instances themselves (and their respective path histories — below), then the runtime of a combinatorial computation can scale linearly with the number of MapReduce iterations. The Faunus/Gremlin traversals below count (not enumerate) the number of 0-, 1-, 2-, 3-, 4-, and 5-step paths in the DBpedia graph. Note that the runtimes scale linearly, at approximately 15 minutes per traversal step, even though the results compound exponentially such that, in the last example, it is determined that there are 251 quadrillion length-5 paths in the DBpedia graph.

gremlin> g.V.count() // 2.5 minutes
==>30962172
gremlin> g.V.out.count() // 17 minutes
==>191733800
gremlin> g.V.out.out.count() // 35 minutes
==>27327666320
gremlin> g.V.out.out.out.count() // 50 minutes
==>5429258407462
gremlin> g.V.out.out.out.out.count() // 70 minutes 
==>1148261617434916
gremlin> g.V.out.out.out.out.out.count() // 85 minutes
==>251818304970074185

While this result might seem outlandish, it is possible to analytically estimate the empirically derived path counts. The average degree of the vertices in the graph is 6, but the total number of 5-step paths is much more sensitive to the connectivity of high-degree vertices. When analyzing only the top 25% most connected vertices — the 200k vertices shown in red below the blue line — the average degree is 260. This yields an estimated path count of:

200,000 × 260^5 ≈ 2.4 × 10^17

This number is consistent with the actual 5-path count calculated by Faunus. Both the computed and the analytic results demonstrate a feature of natural graphs that all graph analysts should be aware of — combinatorial explosions abound (see Loopy Lattices). The out-degree distribution used for this estimate is computed with the traversal below.

gremlin> g.V.sideEffect('{it.outDegree = it.outE.count()}').outDegree.groupCount()
12/11/11 18:36:16 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
...
==>1001	6
==>101	4547
==>1016	10
==>1022	5
==>1037	9
gremlin>

Conclusion

Faunus is a freely available, Apache 2 licensed, distributed graph analytics engine. It is currently in its 0.1-alpha stage with a 0.1 release planned for Winter 2012/2013. Faunus serves as one of the OLAP components of the Aurelius Graph Cluster.

In the world of graph computing, no one solution will meet all computing needs. Titan supports use cases in which thousands of concurrent users execute short, ego-centric traversals over a single massive-scale graph. Faunus, on the other hand, supports global traversals of the graph in use cases such as offline data science and/or production-oriented batch processing. Finally, Fulgora will serve as an in-memory graph processor for heavily threaded, iterative graph and machine learning algorithms. Together, the components of the Aurelius Graph Cluster provide integrated coverage of a wide range of graph computing problems.

Related Material

Jacobs, A., “The Pathologies of Big Data,” ACM Queue, 7(6), July 2009.

Norton, B., Rodriguez, M.A., “Loopy Lattices,” Aurelius Blog, April 2012.

Ho, R., “Graph Processing in Map Reduce,” Pragmatic Programming Techniques Blog, July 2010.

Lin, J., Schatz, M., “Design Patterns for Efficient Graph Algorithms in MapReduce,” Mining and Learning with Graphs Proceedings, 2010.

Authors


Marko A. Rodriguez, Stephen Mallette, Vadas Gintautas, Matthias Broecheler
