April 21, 2014
The web is composed of numerous web sites tailored to meet the information, consumption, and social needs of its users. Within many of these sites, references are made to the same platonic “thing” though different facets of the thing are expressed. For example, in the movie industry, there is a movie called John Carter by Disney. While the movie is an abstract concept, it has numerous identities on the web (which are technically referenced by a URI).
- The Internet Movie Database (IMDb): Production information
- Rotten Tomatoes: Ratings and reviews
- Netflix: Online streaming
- Amazon.com Instant Video: Online streaming
- Amazon.com Products: DVD purchase
- So forth and so on…
Who is John Carter?
Social sites represent the identities of people. The “thing” known as Marko A. Rodriguez has an identity on Twitter, LinkedIn, and WordPress. Each identity refers to the same platonic concept (noumenon), but in doing so, provides a different interpretation of its essence (phenomenon). The question then becomes: “What is Marko A. Rodriguez?” Each website presents only a particular view of Marko, which is ultimately a lossy representation. For Marko is embedded within the larger world of yet more “things” (i.e., his artifacts of creation, his institutions of study, his social collaborations) and their embeddings, ad infinitum. Therefore, no single site with finite storage will be able to fully grasp Marko. To approach a full understanding of the platonic concept of Marko, one must bear witness to all the phenomenological aspects that are directly and indirectly related to Marko. In doing so, one realizes the platonic Marko A. Rodriguez, which no single concept can bear.
A Unified Graph of People, Institutions, Artifacts, and Concepts
Aurelius collaborated with the Digital Library Research and Prototyping Group of the Los Alamos National Laboratory (LANL) to develop EgoSystem atop the distributed graph database Titan. The purpose of this system is best described by the introductory paragraph of the April 2014 publication on EgoSystem.
EgoSystem was developed to support outreach to former Los Alamos National Laboratory (LANL) employees. As a publicly funded research organization, there are many reasons why LANL would want to maintain contact with its “alumni.” Scientific research is often collaborative. Former employees know the Lab and its work, and often have colleagues who remain employed at LANL. These relationships fuel intra- and interdisciplinary projects. Government research agencies are also encouraged to partner with the private sector. Productizing LANL output becomes an opportunity for a public-private or commercial entity via a technology transfer process. Some small businesses (and jobs) owe their existence to ideas that were first developed at LANL. Public support for the ongoing research at LANL plays a role in ensuring support for adequate funding levels for that work. Outreach to alumni can encourage them to serve as ambassadors and advocates for LANL and its mission.
From a technical standpoint, EgoSystem discovers and aggregates information on LANL employees (past and present), their created artifacts, institutions, and associated concepts/tags/keywords throughout the web. Moreover, as employees relate to other individuals (e.g. coauthors, followers), their respective identities are aggregated as well. This automated discovery process yields a social graph that presents EgoSystem users with a consolidated view of the multiple facets of an individual as portrayed by the numerous websites they use. Simple questions can be answered: What do they tweet about? What are their publications? What research concepts do they like? Where have they been employed? Who are their coauthors? In essence, EgoSystem removes the manual process of having to iteratively use a search engine to find Marko A. Rodriguez on LinkedIn, Twitter, ArXiv, etc. and presents the information in an integrated manner along with novel social-based statistics only possible when the information in these disparate data silos is linked and processed as a whole (see A Reflection on the Structure and Process of the Web of Data). The following two sections will describe EgoSystem’s data model (its graph schema) and its data processes (its graph traversals).
The EgoSystem Graph Data
There are two categories of vertices in EgoSystem.
- Platonic: Denotes an abstract concept devoid of interpretation.
- Identity: Denotes a particular interpretation of a platonic.
Every platonic vertex is of a particular type: a person, institution, artifact, or concept. Next, every platonic has one or more identities as referenced by a URL on the web. The platonic types and the location of their web identities are itemized below. As of EgoSystem 1.0, these are the only sources from which data is aggregated, though extending it to support more services (e.g. Facebook, Quorum, etc.) is feasible given the system’s modular architecture.
- Person: Microsoft Academic, LinkedIn, Twitter, SlideShare, Mendeley, Homepage
- Institution: Wikipedia, LinkedIn, Homepage
- Artifact: Microsoft Academic, ArXiv, SlideShare, Mendeley
- Concept: Microsoft Academic, LinkedIn, SlideShare, Wikipedia
The graph diagrammed below demonstrates the four platonic types, their realization as identities in the various online web systems mentioned above, and how the vertices semantically relate to one another. Note that the properties of the vertices and edges are not diagrammed for the sake of diagram clarity, but include properties such as logoUrl and various publication statistics.
At the top of the diagram there is a yellow colored, platonic person vertex. That person has an email address, MS Academic page, Mendeley page, two institutional affiliations, and a Twitter account. These vertices are known as the identities of the platonic person. The Twitter identity follows another Twitter identity that is associated with yet another person platonic. In this person’s MS Academic identity, an article was authored. That article is not the platonic article, but the identity as represented by MS Academic (as referenced by an MS Academic URL). The platonic article has another identity represented in Mendeley. That is, both MS Academic and Mendeley make reference to the same platonic article. The MS Academic article identity relates to a concept identity (i.e. a tag or keyword). Such concepts are bundled by a platonic representing the abstract notion of that concept irrespective of how it is portrayed by the particular websites. In summary, identities relate to other identities via semantically-typed edges and all identities referring to the same platonic are bundled together under the respective platonic.
The use of a platonic is a way of yielding an n-ary relationship in a binary graph. This relates to concepts in the Semantic Web community around blank nodes, owl:sameAs, and named graphs. The platonic makes it easy to partition the data sets whereby the Twitter subgraph can be analyzed irrespective of the MS Academic authorship subgraph. However, the benefit of having all this data aggregated (by means of the platonics) and stored in the Titan graph database is that it is possible to perform novel traversals/queries that span the once isolated data silos with limited scalability concerns (see Faith in the Algorithm, Part 2: Computational Eudaemonics).
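To make the platonic/identity split concrete, the following is a minimal in-memory sketch in plain Python, not EgoSystem’s actual Titan schema. Only the hasIdentity edge label, the service property, and the uuid:123-456 identifier appear in this post; the Graph class, the kind and type properties, the authored label, and all other vertex ids are assumptions made for illustration.

```python
class Graph:
    """Tiny in-memory property graph: vertices with properties, labeled directed edges."""

    def __init__(self):
        self.vertices = {}   # vertex id -> property dict
        self.edges = {}      # (source id, edge label) -> list of target ids

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        return vid

    def add_edge(self, src, label, dst):
        self.edges.setdefault((src, label), []).append(dst)

    def out(self, vid, label):
        """Follow outgoing edges with the given label from one vertex."""
        return list(self.edges.get((src_key := (vid, label)), []))


g = Graph()

# A platonic person bundled with two identities (ids other than the uuid are hypothetical).
person = g.add_vertex("uuid:123-456", kind="platonic", type="person")
g.add_vertex("tw:marko", kind="identity", service="twitter")
g.add_vertex("msa:marko", kind="identity", service="msacademic")
g.add_edge(person, "hasIdentity", "tw:marko")
g.add_edge(person, "hasIdentity", "msa:marko")

# A platonic article with one identity per service that describes it.
article = g.add_vertex("uuid:article-1", kind="platonic", type="artifact")
g.add_vertex("msa:article", kind="identity", service="msacademic")
g.add_vertex("men:article", kind="identity", service="mendeley")
g.add_edge(article, "hasIdentity", "msa:article")
g.add_edge(article, "hasIdentity", "men:article")

# Identities relate to identities; "authored" is an assumed label.
g.add_edge("msa:marko", "authored", "msa:article")


def identities(graph, platonic_id):
    """One-step hop from a platonic to every identity bundled under it."""
    return graph.out(platonic_id, "hasIdentity")


def service_subgraph(graph, service):
    """Partition the data by service, e.g. only the MS Academic identities."""
    return sorted(v for v, p in graph.vertices.items() if p.get("service") == service)


person_identities = identities(g, person)
msacademic_only = service_subgraph(g, "msacademic")
```

Because every identity carries its service, the same structure supports both views described above: a per-service partition, and a cross-silo bundle reached through the platonic.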
The EgoSystem Graph Processes
There are two general processes that occur in EgoSystem:
- Discovery processes: data aggregation and validation
- User processes: graph analysis, ranking, scoring, and presentation
The graph schema presented previously is populated using EgoSystem’s “discovery process.” Given some basic information about an individual (their name, place of work, educational institution), EgoSystem makes use of search APIs and its existing graph structure to find matches in the various web systems discussed. Suppose there is a person named Jie Bao and all that is known is that his research institution is Rensselaer Polytechnic Institute (RPI). Unfortunately, there are numerous people with the name Jie Bao and thus, numerous Twitter accounts as well. However, if we know that Jie Bao is at RPI and there is a person named Li Ding at RPI as well (for whom EgoSystem has a Twitter account), and one particular Jie Bao Twitter account follows Li Ding’s Twitter account, then there is some certainty that that Jie Bao Twitter identity is the one EgoSystem is looking for. Furthermore, if we know Li Ding’s MS Academic identity, then with great certainty we can assume his Jie Bao coauthor is in fact referring to the same platonic Jie Bao currently being discovered. In this way, external web service API calls along with internal graph searches provide the means by which EgoSystem is able to continuously iterate over its data to further populate and refine the graph data structure. Moreover, this process is non-linear in that sometimes a homepage web-scrape provides a person’s Twitter account which is assumed to be 100% accurate. In summary, a collection of such rules, ranked by their accuracy, continually fires as background processes in EgoSystem to help update and refine the social graph.
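The Jie Bao example can be sketched as a single disambiguation rule. This is an illustrative Python sketch under stated assumptions, not EgoSystem’s discovery code: the function name, the account handles, and the follower data are all hypothetical, and a real system would combine several such rules weighted by their accuracy.

```python
def score_candidates(candidates, follows, known_colleague_accounts):
    """Rank candidate accounts by how many already-known colleague accounts they follow.

    candidates: account handles returned by a name search (hypothetical input)
    follows: handle -> set of handles that account follows
    known_colleague_accounts: handles already linked to colleagues at the same institution
    """
    scored = []
    for account in candidates:
        evidence = len(follows.get(account, set()) & known_colleague_accounts)
        scored.append((account, evidence))
    # Most evidence first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Three accounts match the name "Jie Bao"; only one follows Li Ding's known account.
candidates = ["@jiebao_a", "@jiebao_b", "@jiebao_c"]
follows = {
    "@jiebao_a": {"@someone_else"},
    "@jiebao_b": {"@liding", "@someone_else"},
    "@jiebao_c": set(),
}
known_colleague_accounts = {"@liding"}

ranking = score_candidates(candidates, follows, known_colleague_accounts)
best_account, best_evidence = ranking[0]
```

The score is evidence rather than proof: an account with a score of zero is not ruled out, it simply has nothing in the existing graph to support it yet.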
When a user interacts with EgoSystem, there are more salient processes that are ultimately encoded as Gremlin graph traversals. The simplest graph traversal is a one-step hop from the platonic vertex to its identities. Such walks are used to populate a splash page about that platonic. A screenshot of the Marko A. Rodriguez (uuid:123-456) splash page is presented below.
Notice the tag cloud associated with Marko in the screenshot below. Note that no person identities have direct references to concepts. The diagrammed tag cloud is derived from the concepts associated with Marko’s MS Academic and SlideShare artifacts. This inference is defined using the following multi-step Gremlin traversal that ultimately generates a frequency distribution. The returned distribution is used to determine the font-size of the tags in the tag cloud.
g.V('uuid','123-456').out('hasIdentity')
  .has('service',T.in,["msacademic","slideshare"]).out
  .out('hasConcept').name.groupCount
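For readers unfamiliar with Gremlin, the same walk can be approximated in plain Python over a toy graph. This is a sketch under assumptions, not the Titan traversal itself: the dictionaries, vertex ids, and concept names are invented, the unlabeled out step is modeled as following only identity-to-artifact edges, and the linear font scaling with its pixel bounds is a guess at how the distribution might be used.

```python
from collections import Counter

# Toy data (hypothetical): platonic -> identities -> artifacts -> concept names.
has_identity = {"uuid:123-456": ["msa:marko", "ss:marko", "tw:marko"]}
service = {"msa:marko": "msacademic", "ss:marko": "slideshare", "tw:marko": "twitter"}
artifacts = {"msa:marko": ["msa:paper1", "msa:paper2"], "ss:marko": ["ss:deck1"]}
has_concept = {
    "msa:paper1": ["graph theory", "semantic web"],
    "msa:paper2": ["graph theory"],
    "ss:deck1": ["graph databases", "graph theory"],
}


def tag_distribution(platonic_id, services=("msacademic", "slideshare")):
    """Count concept names reachable from the platonic via the selected services."""
    counts = Counter()
    for identity in has_identity.get(platonic_id, []):
        if service.get(identity) not in services:
            continue  # mirrors the has('service', T.in, [...]) filter
        for artifact in artifacts.get(identity, []):
            counts.update(has_concept.get(artifact, []))  # mirrors name.groupCount
    return counts


def font_sizes(counts, smallest=12, largest=32):
    """Scale counts linearly into a font-size range (the bounds are assumptions)."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1
    return {tag: smallest + (largest - smallest) * (n - low) // span
            for tag, n in counts.items()}


distribution = tag_distribution("uuid:123-456")
sizes = font_sizes(distribution)
```

Note that the Twitter identity contributes nothing here, matching the observation that person identities have no direct references to concepts.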
EgoSystem’s processes get more interesting when doing path analysis in the graph. In the interface above Aric Hagberg is “saved” (top right of the interface). With source and sink vertices, common path problems can be readily solved using EgoSystem.
- Who do Marko and Aric know in common? (coauthors, Twitter followers, etc.)
- Is there a path from me to Marko via Aric?
- Which of Marko and Aric’s shared social connections currently work at LANL?
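The three questions above reduce to set intersection and path search. The following Python sketch runs them over a small invented social graph; the adjacency data, the employer table, and the function names are hypothetical, and the real system would express these as Gremlin traversals over Titan.

```python
from collections import deque

# Undirected "knows" relation and current employers (toy data, not real connections).
knows = {
    "me": {"aric"},
    "aric": {"me", "marko", "carol"},
    "marko": {"aric", "carol", "dave"},
    "carol": {"aric", "marko"},
    "dave": {"marko"},
}
works_at = {"aric": "LANL", "carol": "LANL", "dave": "Elsewhere"}


def common_connections(a, b):
    """People connected to both a and b."""
    return knows.get(a, set()) & knows.get(b, set())


def shortest_path(src, dst):
    """Breadth-first search; returns a list of vertices or None if unreachable."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        current = queue.popleft()
        if current == dst:
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in sorted(knows.get(current, ())):
            if neighbor not in parents:
                parents[neighbor] = current
                queue.append(neighbor)
    return None


def path_via(src, via, dst):
    """Join two shortest paths at an intermediate vertex (may revisit a vertex)."""
    first, second = shortest_path(src, via), shortest_path(via, dst)
    if first is None or second is None:
        return None
    return first + second[1:]


shared = common_connections("marko", "aric")
route = path_via("me", "aric", "marko")
shared_at_lanl = {p for p in shared if works_at.get(p) == "LANL"}
```

Joining two shortest paths is the simplest way to force a route through a saved vertex, though it does not guarantee a simple path when the two legs overlap.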
A collection of other complex traversals provided by EgoSystem is presented below.
- Which paths exist from Marko to current LANL employees?
- Does Marko have any collaborators that work at LANL and are currently in or near Washington DC? (uses Titan’s Elasticsearch geo-index/search capabilities)
- Which of Marko’s coauthors does he not follow on Twitter?
- Provide a histogram of the concepts of the authored publications of all LANL employees. (uses Faunus for OLAP graph analytics)
- Which of my collaborators has worked with someone with graph theory experience?
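Several of these questions only become answerable once silos are joined through the platonic. As one hedged illustration, the coauthor-versus-Twitter question can be sketched in Python as a set difference across two data sources; the names, handles, and dictionaries below are invented, and treating a coauthor with no known Twitter identity as "not followed" is a design assumption.

```python
# Authorship silo: platonic person -> platonic coauthors (toy data).
coauthors = {"marko": {"alice", "bob", "carol"}}

# Twitter silo: platonic person -> Twitter identity, and who each identity follows.
twitter_of = {"marko": "@marko", "alice": "@alice", "bob": "@bob"}  # carol has no known account
follows = {"@marko": {"@bob"}}


def coauthors_not_followed(person):
    """Coauthors whose Twitter identity the person does not follow.

    Coauthors with no known Twitter identity are included, since they
    cannot be followed (an assumption of this sketch).
    """
    followed = follows.get(twitter_of.get(person), set())
    return sorted(
        coauthor
        for coauthor in coauthors.get(person, ())
        if twitter_of.get(coauthor) not in followed
    )


unfollowed = coauthors_not_followed("marko")
```

Neither silo alone can answer the question: the authorship data knows nothing about follows, and Twitter knows nothing about coauthorship.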
The questions that can be asked of such a graph are bound only by the data that is aggregated, and when numerous sources of variegated data are integrated into a universal social graph, the queries quickly become boundless.
EgoSystem was designed and developed by the Los Alamos National Laboratory and Aurelius to solve problems related to social search over the lab’s employees and alumni. The principles are general and can be applied to any organization wishing to understand the greater social context in which it is embedded. No organization is an island unto itself, and realizing how its employees and artifacts affect and are affected by others provides an organization a bird’s-eye view of its surrounding world.
EgoSystem was designed and developed by James Powell (LANL), Harihar Shankar (LANL), Marko A. Rodriguez (Aurelius), and Herbert Van de Sompel (LANL). Joshua Shinavier contributed ideas to the textual content of this blog post. The story behind EgoSystem’s development is a testament to its mission statement of helping LANL to maintain connections with its alumni. Marko A. Rodriguez was a graduate researcher at the Digital Library Research and Prototyping Team in 2004 under the tutelage of Johan Bollen and Herbert Van de Sompel (in association with Alberto Pepe). While in this group, he studied information science from a graph theory perspective. Since then, Marko went on to create Aurelius and through his lasting ties with LANL, was invited to collaborate on the research and development of EgoSystem.
Powell, J., Shankar, H., Rodriguez, M.A., Van de Sompel, H., “EgoSystem: Where are our Alumni?,” Code4Lib, issue 24, ISSN:1940-5758, April 2014.