Using Pail with a Graph Schema.

In a previous post I wrote about using Thrift, Pail and Cascalog. In this post I'll expand on that, or rather simplify that example, with an extension library, Pail-Graph. More specifically, this post is about using Pail with a Graph Schema, which is a bit more specialized than just using Thrift and Pail. The idea of a Graph Schema, and the shape of the example schema, come from Nathan Marz's book on Big Data. Like everything else I've done lately, this is all based on David Cuddeback's clj-pail, clj-thrift, pail-thrift and pail-cascalog libraries. Pail-Graph wraps and extends each of these libraries, and mostly simplifies using a Graph Schema with Pail and Cascalog. Pail-Graph is available on Clojars. All of the code in this post is available in example.clj in the library. Just like last time, clone my repository, fire up a REPL and follow along!

Graph Schema

This is the easy part. I'm using the same graph schema as before, one that is somewhat similar to what Nathan Marz describes in his Big Data book. You may also want to read this post about Graph Schema and Thrift. The gist of this schema is that there is a single Data Unit that is the entire database. The Data Unit is a union of possible values, one of which is a PersonProperty. A PersonProperty is a structure that contains a person id and a property. The property is a union of values, which can be simple values or structures: first name and last name are simple values, while Location is a structure with any or all of the following fields: address, city, county, state, country and zip code.

The graph schema looks like this.

Creating Thrift objects

This part is just like the other example. We just need to build some DataUnits with the build function from clj-thrift.

Now we have a list of thrift objects. Opening a pail and writing them is easy.

Here's what the pail looks like on disk.

$ tree example_output
├── friendshipedge
│   ├── 636155fb-7126-4d78-b977-cc90daee62ed.pailfile
│   └── 8dadaae2-8602-499f-a6f4-339b909712a0.pailfile
├── pail.meta
└── property
    ├── first_name
    │   ├── 636155fb-7126-4d78-b977-cc90daee62ed.pailfile
    │   └── 8dadaae2-8602-499f-a6f4-339b909712a0.pailfile
    ├── last_name
    │   ├── 636155fb-7126-4d78-b977-cc90daee62ed.pailfile
    │   └── 8dadaae2-8602-499f-a6f4-339b909712a0.pailfile
    └── location
        ├── 636155fb-7126-4d78-b977-cc90daee62ed.pailfile
        └── 8dadaae2-8602-499f-a6f4-339b909712a0.pailfile

That's about it for getting data into a pail. What I skipped was setting up a PailStructure, which defines the partitioning. In Pail-Graph, unlike Pail-Thrift, there is the additional work of defining a Tap Mapper. This is where I make you read my previous posts if you haven't already, because I don't want to explain PailStructures and partitioning again. The first is my 'Thrift-pail-cascalog and clojure' post; the second is my 'Pail-Fressian' post. You'll understand and appreciate all of this that much more if you read them.

Pail-Graph adds a tap mapper setting to the PailStructure that was defined in clj-pail. A tap mapper takes a list of property-paths and processes them in a way that correlates with the behavior of the partitioner. The key is in the property-paths. A property path is a vector of field ids and names leading to a final property within the Thrift object, which in this case is DataUnit. Here is how we get the property-paths for a DataUnit.
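To make the shape of a property path concrete, here is a rough sketch in plain Clojure. It is not Pail-Graph's implementation: the `schema` map is a hypothetical stand-in for the Thrift metadata, the field ids are made up for illustration, and the `property-paths` function defined here is my own toy version that only mimics the idea of walking from DataUnit down to each final property.

```clojure
;; Hypothetical stand-in for Thrift metadata. Field ids are assumptions.
;; Types without an entry (e.g. :string, :Location) are treated as leaves.
(def schema
  {:DataUnit            {:kind   :union
                         :fields [[1 "property" :PersonProperty]
                                  [2 "friendshipedge" :FriendshipEdge]]}
   ;; A property structure holds an id plus a union of property values;
   ;; only the value union contributes to the path in this sketch.
   :PersonProperty      {:kind       :property-struct
                         :value-type :PersonPropertyValue}
   :PersonPropertyValue {:kind   :union
                         :fields [[1 "first_name" :string]
                                  [2 "last_name" :string]
                                  [3 "location" :Location]]}})

(defn property-paths
  "Toy walker: returns one path per final property, where a path is a
   vector of [field-id field-name] pairs."
  ([type] (property-paths type []))
  ([type prefix]
   (let [{:keys [kind fields value-type]} (schema type)]
     (case kind
       :union           (mapcat (fn [[id fname ftype]]
                                  (property-paths ftype (conj prefix [id fname])))
                                fields)
       :property-struct (property-paths value-type prefix)
       ;; anything else is a leaf: the path ends here
       [prefix]))))

(property-paths :DataUnit)
;; => ([[1 "property"] [1 "first_name"]]
;;     [[1 "property"] [2 "last_name"]]
;;     [[1 "property"] [3 "location"]]
;;     [[2 "friendshipedge"]])
```

Each path names every union field that has to be traversed to reach the property, which is the same information a partitioner uses to pick a directory.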

The tap mapper function only needs to receive one of these entries and return the path to that property as the partitioner would have defined it. Once the tap mapper function is defined and assigned to the PailStructure, we can get a list of the taps from a pail connection or from the PailStructure. The tap map for DataUnit, with the current PailStructure that created the pail above, looks like this.

There are tap mapper functions defined for all four partitioners supplied in the Pail-Thrift and Pail-Graph libraries. There is an additional null tap mapper, which is the default value for any PailStructure. The tap mapper for the Union-name-property partitioner looks like this. The most complicated part is creating a reasonable key name for a given property path. The second function is the one used in the PailStructure definition; it returns the mapper to the PailStructure.
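As an illustration of the idea only, a tap mapper can be sketched in plain Clojure as a function from one property path to a key and a directory path. Everything here is an assumption made for the sketch: the function names `tap-key`, `union-name-property-tap` and `tap-map`, the choice of the last field name as the key, and the field ids in `paths`. The directory names are the ones visible in the pail listing above.

```clojure
;; Property paths as vectors of [field-id field-name] pairs.
;; The field ids are illustrative assumptions.
(def paths
  [[[1 "property"] [1 "first_name"]]
   [[1 "property"] [2 "last_name"]]
   [[1 "property"] [3 "location"]]
   [[2 "friendshipedge"]]])

(defn tap-key
  "Hypothetical key naming: use the name of the last field in the path."
  [path]
  (keyword (second (last path))))

(defn union-name-property-tap
  "Takes one property path and returns [key directory-path], where the
   directory path is the sequence of field names, mirroring how a
   partitioner keyed on union and property names lays out the pail."
  [path]
  [(tap-key path) (mapv second path)])

(defn tap-map
  "Builds a lookup table from tap key to directory path."
  [property-paths]
  (into {} (map union-name-property-tap property-paths)))

(tap-map paths)
;; => {:first_name     ["property" "first_name"]
;;     :last_name      ["property" "last_name"]
;;     :location       ["property" "location"]
;;     :friendshipedge ["friendshipedge"]}
```

With a table like this, asking for a tap by name reduces to a map lookup, for example `(get (tap-map paths) :location)`, instead of rebuilding the partition path by hand.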

Now that all of that is done, getting Cascalog taps is easy. We just ask for the property we want with get-tap.

Get-tap takes all the work out of getting taps from a partitioned Pail. We no longer need to remember what our partitioner is doing every time we want to create a tap. If there is ever any question about the taps available for a Pail, we can list them with list-taps.

Besides getting taps, another somewhat painful part of using Thrift with Cascalog is getting the values back out of the taps. For simple properties like first name, last name and age this is no problem. But for more complex properties, which are structures of optional values, it is more of a problem. Location is an example of this type of structure. In my previous post the solution was this function.

This is fine, but it is not generic, which means that for each structure like Location there needs to be another Cascalog operator to pull it apart. Pail-Graph provides a solution to this in its field-keys function. Field-keys returns a list of the field keys of any structure or union, ordered by field id. Using field-keys allows the creation of a more generic Cascalog operator.
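The core of such a generic operator can be sketched without Cascalog or Thrift at all. In this sketch `location-keys` is a hypothetical stand-in for what a field-keys style call might return for Location, a plain map stands in for the Thrift object, and `ordered-values` is a name I made up for illustration.

```clojure
;; Hypothetical result of asking for the field keys of Location,
;; ordered by field id. The actual ids and order are assumptions.
(def location-keys [:address :city :county :state :country :zip])

(defn ordered-values
  "Returns the values of m in the order given by ks, with nil for any
   optional field that is not set. Works for any structure, as long as
   its ordered keys are supplied."
  [ks m]
  (mapv #(get m %) ks))

(ordered-values location-keys {:city "Paris" :country "France"})
;; => [nil "Paris" nil nil "France" nil]
```

Because the output always has one slot per field, in a fixed order, it lines up with a fixed list of output variables in a query, which is what makes a single operator usable for every structure.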

Now we have a Cascalog function that can deconstruct any structure that might come along in the PersonPropertyValue union. Using all of this together is very easy and nice compared to the version in the 'Thrift, Pail, Cascalog and Clojure' post. Compared to that example, getting taps and using operators to extract the data values is much easier here with Pail-Graph.

Using Pail with a Graph Schema and Cascalog can be a fairly streamlined experience with just a little introspection of the datatypes being written to any given pail. Graph Schema provides some structure where there could be none, and at the same time provides some infrastructure to make life easier when it comes to actually working with the data. If you have any ideas on how to improve this further, please fork me. Leave any comments below.
