Polyglot Persistence and Query with Gremlin


Complex data storage architectures are rarely grounded in a single database. In these environments, data is highly disparate: it exists in many forms, is aggregated and duplicated at different levels, and, in the worst case, its meaning is not clearly understood. Environments featuring disparate data can present challenges to those seeking to integrate it for analytics, ETL (Extract-Transform-Load) and other business services. Having easy ways to work with data across these types of environments enables the rapid engineering of data solutions.

Some causes of data disparity arise from the need to store data in different database types, so as to take advantage of the specific benefits that each type exposes. Some examples of different database types include (please see Getting and Putting Data from a Database):

  • Relational Database: A relational database, such as MySQL, Oracle or Microsoft SQL Server, organizes data into tables with rows and columns, using a schema to help govern data integrity.
  • Document Store: A document-oriented database such as MongoDB, CouchDB, or RavenDB, organizes data into the concept of a document, which is typically semi-structured as nested maps and encoded to some format such as JSON.
  • Graph Database: A graph is a data structure that organizes data into the concepts of vertices and edges. Vertices might be thought of as “dots” and edges might be thought of as “lines”, where the lines connect those dots via some relationship. Graphs represent a very natural way to model real-world relationships between different entities. Examples of graph databases are Titan, Neo4j, OrientDB, Dex and InfiniteGraph.

Gremlin is a domain-specific language (DSL) for traversing graphs. It is built using the metaprogramming facilities of Groovy, a dynamic programming language for the Java Virtual Machine (JVM). In the same way that Gremlin builds upon Groovy, Groovy builds upon Java by providing an extended API and programmatic shortcuts that cut down on the verbosity of Java itself.
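As a small illustration of the brevity Groovy brings over plain Java, the following snippet (a sketch using only standard Groovy; the list contents are arbitrary) sums the even numbers in a list with a closure, where Java would require an explicit loop and accumulator:

```groovy
// Groovy collection methods and closures replace Java's explicit loops
def numbers = [1, 2, 3, 4, 5, 6]

// filter and sum in one line where Java would need a for-loop
def evenSum = numbers.findAll { it % 2 == 0 }.sum()

assert evenSum == 12
```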

Gremlin comes equipped with a terminal, also known as a REPL or CLI, which provides an interface through which the programmer can interactively traverse the graph. Given Gremlin’s role as a DSL for graphs, interacting with a graph represents the typical usage of the terminal. However, because the Gremlin terminal is actually a Groovy terminal, the full power of Groovy is available as well:

  • Access to the full APIs for Java and Groovy
  • Access to external JARs (i.e. 3rd party libraries)
  • Gremlin and Groovy’s syntactic sugar
  • An extensible programming environment via metaprogramming

With these capabilities in hand, Gremlin presents a way to interact with a multi-database environment with great efficiency. The following sections detail two different use cases, where Gremlin acts as an ad-hoc data workbench for rapid development of integrated database solutions centered around a graph.

Polyglot Persistence


Loading data into a graph from a different data source might take some careful planning. The formation of a load strategy is highly dependent on the size of the data, its source format, the complexity of the graph schema and other environmental factors. In cases where the complexity of the load is low, such as scenarios where the data set is small and the graph schema is simple, the load strategy might be to utilize the Gremlin terminal to load the data.

MongoDB as a Data Source

Consider a scenario where the source data resides in MongoDB. The source data itself contains information which indicates a “follows” relationship between two users, similar to the concept of a user following another user on Twitter. Unlike graphs, document stores, such as MongoDB, do not maintain a notion of linked objects and therefore make it difficult to represent the network of users for analytical purposes.

The MongoDB data model consists of databases and collections, where a database is a set of collections and a collection contains a set of documents. The data for these “follows” relationships resides in a database called “network” and is in a collection called “follows.” The individual documents in that collection look like this:

{ "_id" : ObjectId("4ff74c4ae4b01be7d54cb2d3"), "followed" : "1", "followedBy" : "3", "createdAt" : ISODate("2013-01-01T20:36:26.804Z") }
{ "_id" : ObjectId("4ff74c58e4b01be7d54cb2d4"), "followed" : "2", "followedBy" : "3", "createdAt" : ISODate("2013-01-15T20:36:40.211Z") }
{ "_id" : ObjectId("4ff74d13e4b01be7d54cb2dd"), "followed" : "1", "followedBy" : "2", "createdAt" : ISODate("2013-01-07T20:39:47.283Z") }

This kind of data set translates easily to a graph structure. The following diagram shows how the document data in MongoDB would be expressed as a graph.

[Figure: Follows Graph]

To begin the graph loading process, the Gremlin terminal needs to have access to a client library for MongoDB. GMongo is just such a library and provides an expressive syntax for working with MongoDB in Groovy. The GMongo jar file and its dependency, the Mongo Java Driver jar, must be placed in the GREMLIN_HOME/lib directory. With those files in place, start Gremlin with:

GREMLIN_HOME/bin/gremlin.sh

Gremlin automatically imports a number of classes during its initialization process. The GMongo classes will not be part of those default imports. Classes from external libraries must be explicitly imported before they can be utilized. The following code demonstrates the import of GMongo into the terminal session and then the initialization of connectivity to the running MongoDB “network” database.

gremlin> import com.gmongo.GMongo
==>import com.tinkerpop.gremlin.*
...
==>import com.gmongo.GMongo
gremlin> mongo = new GMongo()    
==>com.gmongo.GMongo@6d1e7cc6
gremlin> db = mongo.getDB("network")
==>network

At this point, it is possible to issue any number of MongoDB commands to bring that data into the terminal.

gremlin> db.follows.findOne().followed
==>followed=1
gremlin> db.follows.find().limit(1)         
==>{ "_id" : { "$oid" : "4ff74c4ae4b01be7d54cb2d3"} , "followed" : "1" , "followedBy" : "3" , "createdAt" : { "$date" : "2013-01-01T20:36:26.804Z"}}

The steps for loading the data to a Blueprints-enabled graph (in this case, a local Titan instance) are as follows.

gremlin> g = TitanFactory.open('/tmp/titan')              
==>titangraph[local:/tmp/titan]
gremlin> // first grab the unique list of user identifiers
gremlin> x=[] as Set; db.follows.find().each{x.add(it.followed); x.add(it.followedBy)}
gremlin> x
==>1
==>3
==>2
gremlin> // create a vertex for the unique list of users
gremlin> x.each{g.addVertex(it)}
==>1
==>3
==>2
gremlin> // load the edges
gremlin> db.follows.find().each{g.addEdge(g.v(it.followedBy),g.v(it.followed),'follows',[followsTime:it.createdAt.getTime()])} 
gremlin> g.V
==>v[1]
==>v[3]
==>v[2]
gremlin> g.E
==>e[2][2-follows->1]
==>e[1][3-follows->2]
==>e[0][3-follows->1]
gremlin> g.e(2).map
==>{followsTime=1341607187283} 

This method for graph-related ETL is lightweight and low-effort, making it a fit for a variety of use cases that stem from the need to quickly get data into a graph for ad-hoc analysis.
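The interactive steps above could also be consolidated into a single reusable Groovy script. The following is a sketch assuming the same GMongo and Titan setup as the terminal session; it has not been run against a live database:

```groovy
import com.gmongo.GMongo

// connect to the MongoDB "network" database and the target graph
db = new GMongo().getDB("network")
g  = TitanFactory.open('/tmp/titan')

// pass 1: collect the unique user identifiers and create a vertex for each
users = [] as Set
db.follows.find().each { users << it.followed; users << it.followedBy }
users.each { g.addVertex(it) }

// pass 2: create a "follows" edge per document, keeping the timestamp
db.follows.find().each {
  g.addEdge(g.v(it.followedBy), g.v(it.followed), 'follows',
            [followsTime: it.createdAt.getTime()])
}
```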

MySQL as a Data Source


The process for extracting data from MySQL is not so different from MongoDB. Assume that the same “follows” data is in MySQL in a four-column table called “follows.”

id     followed  followed_by  created_at
10001  1         3            2013-01-01T20:36:26.804Z
10002  2         3            2013-01-15T20:36:40.211Z
10003  1         2            2013-01-07T20:39:47.283Z

Aside from some field name formatting changes and the “id” column being a long value as opposed to a MongoDB identifier, the data is the same as the previous example and has the same problems for network analytics as MongoDB did.

Groovy SQL is straightforward in its approach to accessing data over JDBC. To make use of it inside of the Gremlin terminal, the MySQL JDBC driver jar file must be placed in the GREMLIN_HOME/lib directory. Once that file is in place, start the Gremlin terminal and execute the following commands:

gremlin> import groovy.sql.Sql
...
gremlin> sql = Sql.newInstance("jdbc:mysql://localhost/network", "username","password", "com.mysql.jdbc.Driver")
...
gremlin> g = TitanFactory.open('/tmp/titan')              
==>titangraph[local:/tmp/titan]
gremlin> // first grab the unique list of user identifiers
gremlin> x=[] as Set; sql.eachRow("select * from follows"){x.add(it.followed); x.add(it.followed_by)}
gremlin> x
==>1
==>3
==>2
gremlin> // create a vertex for the unique list of users
gremlin> x.each{g.addVertex(it)}
==>1
==>3
==>2
gremlin> // load the edges
gremlin> sql.eachRow("select * from follows"){g.addEdge(g.v(it.followed_by),g.v(it.followed),'follows',[followsTime:it.created_at.getTime()])} 
gremlin> g.V
==>v[1]
==>v[3]
==>v[2]
gremlin> g.E
==>e[2][2-follows->1]
==>e[1][3-follows->2]
==>e[0][3-follows->1]
gremlin> g.e(2).map
==>{followsTime=1341607187283}

Aside from some data access API differences, there is little separating the script to load the data from MongoDB and the script to load data from MySQL. Both examples demonstrate options for data integration that carry little cost and effort.

Polyglot Queries

A graph database is likely accompanied by other data sources, which together represent the total data strategy for an organization. With a graph established and populated with data, engineers and scientists can utilize the Gremlin terminal to query the graph and develop algorithms that will become the basis for future application services. An issue arises when the graph does not contain all the data that the Gremlin user needs to do their work.

In these cases, it is possible to use the Gremlin terminal to execute what can be thought of as a polyglot query. A polyglot query blends data together from a variety of data sources and data storage types to produce a single result set. The concept of the polyglot query can be demonstrated by extending upon the last scenario where “follows” data was migrated to a graph from MongoDB. Assume that there is another collection in MongoDB called “profiles”, which contains the user demographics data, such as name, age, etc. Using the Gremlin terminal, this “missing data” can be made part of the analysis.

gremlin> // a simple query within the graph
gremlin> g.v(1).in    
==>v[3]
==>v[2]
gremlin> // a polyglot query that incorporates data from the graph and MongoDB
gremlin> g.v(1).in.transform{[userId:it.id,userName:db.profiles.findOne(uid:it.id).name]}
==>{userId=3, userName=willis}
==>{userId=2, userName=arnold}

The first Gremlin statement above represents a one-step traversal, which simply asks for the users who follow vertex “1.” Although it is now clear how many users follow this vertex, the results are not terribly meaningful. They are only a list of vertex identifiers, and given the example thus far, there is no way to expand those results, as those identifiers represent the total data in the graph. To really understand these results, it would be good to grab the name of each user from the “profiles” collection in MongoDB and blend that attribute into the output. The second line of Gremlin, the polyglot query, does just that. It expands that limited view of the data by performing the same traversal and then reaching out to MongoDB to find each user’s name in the “profiles” collection.


The anatomy of the polyglot query is as such:

  • g.v(1).in – get the incoming vertices to vertex 1
  • transform{...} – process each incoming vertex with a closure that produces a map (i.e. a set of key/value pairs)
  • [userId:it.id, – use the “id” of the vertex as the value of the “userId” key in the map
  • userName:db.profiles.findOne(uid:it.id).name] – blend in the user’s name by using findOne() to look up the matching document in the MongoDB “profiles” collection, grabbing the value of the “name” key from that document and making it the value of the “userName” field in the output

With the name of the users included in the results, the final output becomes more user friendly, perhaps allowing greater insights to surface.
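The same pattern extends beyond transforming results. For example, assuming the “profiles” documents also carried an “age” field (a hypothetical attribute that does not appear in the examples above), a traversal could filter followers by demographic data held only in MongoDB:

```groovy
// keep only the followers of vertex 1 whose MongoDB profile
// reports an age over 30 ("age" is an assumed field)
g.v(1).in.filter { db.profiles.findOne(uid: it.id).age > 30 }
```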

Conclusion

Loading data to the graph and gathering data not available in the graph itself are two examples of the flexibility of the Gremlin terminal, but other use cases exist.

  • Write the output of an algorithm to a file or database for ad-hoc analysis in other tools like Microsoft Excel, R or Business Intelligence reporting tools.
  • Read text-based data files from the file system (e.g. CSV files) to generate graph data.
  • Traversals that build in-memory maps of significant size could benefit from using MapDB, which has Map implementations backed by disk or off-heap memory.
  • Validate traversals and algorithms before committing to a particular design, by building a small “throwaway” graph from a subset of external data that is relevant to what will be tested. This approach is also relevant to basic ad-hoc analysis of data that may not yet be in a graph, but would benefit from a graph data structure and the related toolsets available.
  • Not all graph data requires a graph database. Gremlin supports GraphML, GraphSON, and GML as file-based graph formats. They can be readily inserted into an in-memory TinkerGraph. Utilize Gremlin to analyze these graphs using path expressions in ways not possible with typical graph analysis tools like iGraph, NetworkX, JUNG, etc.
  • “Data debugging” is possible given Gremlin’s rapid turnaround between query and result. Traversing the graph from the Gremlin terminal to make sure the data was loaded correctly is important for ensuring that the data is properly curated.
  • Access to data need not be limited to locally accessible files and databases. The same techniques for writing and reading data to and from those resources can be applied to third-party web services and other APIs, using Groovy’s HTTPBuilder.
  • Pull data into a graph to output as GraphML or other format, which can be visualized in Cytoscape, Gephi or other graph visualization tools.
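The CSV case from the list above can be sketched as follows, using an in-memory TinkerGraph. The file path, column order and lack of a header row are all assumptions for illustration:

```groovy
// read a hypothetical follows.csv where each line is: followed,followedBy
g = new TinkerGraph()
seen = [] as Set
new File('/tmp/follows.csv').splitEachLine(',') { followed, followedBy ->
  // create each user vertex once, then add the "follows" edge
  [followed, followedBy].each { if (seen.add(it)) g.addVertex(it) }
  g.addEdge(g.v(followedBy), g.v(followed), 'follows')
}
```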

The power and flexibility of Gremlin and Groovy make it possible to seamlessly interact with disparate data. This capability enables analysts, engineers and scientists to utilize the Gremlin terminal as a lightweight workbench in a lab of data, making it possible to do rapid, ad-hoc analysis centered around graph data structures. Moreover, as algorithms are discovered, designed and tested, those Gremlin traversals can ultimately be deployed into the production system.

Authors


Stephen Mallette
Marko A. Rodriguez
