Thrift, Pail, Cascalog and Clojure!

Thrift, Pail, Cascalog and Clojure have been consuming my work for the last 2 months. None of it is that hard; there just isn't much in the way of documentation. In this post I'll show how to get all of these things working together. It's actually pretty easy, and it's very slick once it's working. Many thanks to David Cuddeback for his clj-thrift, clj-pail, pail-thrift and pail-cascalog libraries.

At some point I read this post about Thrift and graph schemas by Nathan Marz. Then I started reading his book 'Big Data', in which Nathan outlines the use of Pail for managing data. Pail simplifies things dramatically compared with the pains I've seen with other tools or, worse, with code written from scratch in Java. Like many other big data technologies, Pail is Java. Thankfully we have Clojure to fit everything together seamlessly: Cascalog can handle the map reduce, while the graph schema, Thrift and Pail manage the data. It was looking like a fun project.

So let's get all of this working in Clojure. Like many other things, it's not all that hard once you know how. If you want to follow along, this project is available on my GitHub.

The Graph Schema

The graph schema is where it all starts. Using a union for properties allows the schema to adapt and change without impacting code, while still enforcing a consistent shape on the data. In this example we have a union called PersonPropertyValue. This is where all properties go in this simple example. Location is the only complex property; it is a struct with several optional values. At the intermediate level is a structure called PersonProperty, which serves to connect an id with a property. Finally we have the DataUnit union. This is the Thrift object that we will be storing. Everything in the database is a DataUnit.
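To make the nesting concrete, here is a minimal sketch that models the shape of the schema with plain Clojure maps. It is not the generated Thrift code: a union is modelled as a map with exactly one entry, the validation functions are illustrative helpers, and the ids, the names and the id1/id2 values are made-up sample data.

```clojure
;; Plain-Clojure model of the schema's shape (not generated Thrift code).
;; A union is modelled as a map with exactly one entry: {field-name value}.

(def union-fields
  "Allowed fields per union, using the field names from the post."
  {:DataUnit            #{:property :friendshipedge}
   :PersonPropertyValue #{:first_name :last_name :location}})

(defn valid-union?
  "True when m sets exactly one field and that field belongs to the union."
  [union-name m]
  (and (map? m)
       (= 1 (count m))
       (contains? (union-fields union-name) (key (first m)))))

(defn valid-data-unit?
  "Checks the top-level union and, for a property, the nested union too."
  [du]
  (and (valid-union? :DataUnit du)
       (if-let [person-property (:property du)]
         (and (contains? person-property :id)
              (valid-union? :PersonPropertyValue (:property person-property)))
         true)))

;; Hypothetical sample data units.
(def first-name-unit
  {:property {:id "p1" :property {:first_name "Alice"}}})

(def friendship-unit
  {:friendshipedge {:id1 "p1" :id2 "p2"}})
```

Adding a new property in this model only means adding a name to the PersonPropertyValue set; the outer shape of a data unit stays the same, which is the point of putting properties in a union.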

Now to compile this thing and create some Java code.

There is a plugin for Leiningen called thriftc that can automate this. Take a look at this project's project.clj. If you use thriftc, Leiningen will take care of it all.

lein clean; lein compile

Underneath it all, this command is pretty much all there is to it.

thrift -v --gen java -out ./ people.thrift


Now we just need to create some thrift objects.  That turns out to be pretty easy.

After a couple of these it becomes clear how this maps directly to the schema. It's also pretty obvious that it would be easy to automate.

Now that we have some Thrift objects, Thrift and Pail can work together to serialize these into the database. But it might be good to try looking inside these things first. This is pretty easy with clj-thrift.

Getting the current value of the DataUnit union gives us a PersonProperty structure, which has an id and a PersonPropertyValue named property.

We just need to go one level deeper to get to the id or the property. To get the value from a structure we need to give thrift/value a key.

Asking for the property gives us the PersonPropertyValue union.

To get the value of the PersonPropertyValue union we ask for the current value.

If we try the exact same code with a location data unit, we see that yet another level is needed to get all the key values from the structure.

Now we know enough to wrap these ideas up into functions: one function for the top-level structure values, like id and property in the PersonProperty structure or id1 and id2 in the friendshipedge structure; another function for union values contained in a secondary union (PersonPropertyValue), like 'name'; and finally a special function to extract the values from the lowest-level structure, such as 'city' in the location structure.

Here is how they all work.
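As a rough stand-in for the clj-thrift versions, the three kinds of accessor can be sketched over plain nested maps. Everything here is an assumption made for illustration: the function names, the map-based model (a union is a one-entry map, a struct is an ordinary map), the sample values and the :state key are not taken from the project.

```clojure
(defn union-value
  "Current value of a union, i.e. the value of its only entry."
  [u]
  (val (first u)))

(defn struct-value
  "Top-level structure value of a data unit, e.g. :id or :property in
  PersonProperty, or :id1 / :id2 in the friendship edge."
  [du k]
  (get (union-value du) k))

(defn property-value
  "Value held by the nested PersonPropertyValue union, e.g. a first name."
  [du]
  (union-value (struct-value du :property)))

(defn location-value
  "Value of one key in the lowest-level location structure, e.g. :city."
  [du k]
  (get (property-value du) k))

;; Hypothetical sample data units.
(def name-unit     {:property {:id "p1" :property {:first_name "Alice"}}})
(def location-unit {:property {:id "p1"
                               :property {:location {:city "Springfield"
                                                     :state "IL"}}}})
(def edge-unit     {:friendshipedge {:id1 "p1" :id2 "p2"}})

(struct-value name-unit :id)          ;; => "p1"
(struct-value edge-unit :id2)         ;; => "p2"
(property-value name-unit)            ;; => "Alice"
(location-value location-unit :city)  ;; => "Springfield"
```

Each function goes exactly one level deeper than the previous one, which mirrors the step-by-step exploration above.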


Pail turns out to be not all that hard to use, but it does take some setup. First we need a Pail Structure, which tells Pail how to behave. It gives Pail the serializer, which we get from Thrift, and our partitioner, which tells Pail how to partition the data. Here is our Pail Structure.

The important bits are the type, serializer and partitioner. For now we are using the union-partitioner that came with pail-thrift. The only other special piece is that this file must be precompiled, so in your project.clj you'll need something like this.
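In Leiningen, precompiling a namespace is normally requested with the :aot key. The fragment below is only a sketch of that idea: the project name, version and namespace are placeholders, not the ones used in the project.

```clojure
;; project.clj (fragment). Project and namespace names are placeholders.
(defproject example-project "0.1.0-SNAPSHOT"
  ;; Ahead-of-time compile the namespace that defines the Pail Structure,
  ;; so that its class exists before Pail tries to load it by name.
  :aot [example-project.data-unit-pail-structure])
```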

You'll want to do a 'lein clean; lein compile' anytime you change this.

The partitioner is fairly straightforward as well. It has two primary methods, make-partition and validate. Make-partition returns a vector of folder names. Validate checks the first entry in that same list for validity and returns the vector [true (rest dirs)] if it's OK, and false otherwise. This partitioner only looks at the top-level union and returns the field id of the current field as the directory name. Validate checks whether the number given is in the list of field ids.
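The logic of such a partitioner can be modelled in a few lines of plain Clojure. This is not the pail-thrift code: data units are modelled as one-entry maps, the field ids 1 (property) and 2 (friendshipedge) are the ones from this post's schema, and returning [false (rest dirs)] in the failing case is one reading of the description above.

```clojure
;; Plain-Clojure model of a union partitioner (not the pail-thrift code).
(def field-ids
  "Field ids of the top-level DataUnit union."
  {:property 1 :friendshipedge 2})

(defn make-partition
  "Vector of directory names for a data unit: the id of the field that is
  currently set in the top-level union, as a string."
  [du]
  [(str (field-ids (key (first du))))])

(defn validate
  "Checks the first directory name against the known field ids and returns
  [valid? remaining-dirs]."
  [dirs]
  (let [known (set (map str (vals field-ids)))]
    [(contains? known (first dirs)) (rest dirs)]))

(make-partition {:friendshipedge {:id1 "p1" :id2 "p2"}})  ;; => ["2"]
```

Returning the remaining directories lets a nested partitioner validate the next level of the path in the same way.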

Using this pail is pretty simple. First we create a Pail Spec from our DataUnitPailStructure, then we create the pail from the PailSpec.

Now we write to it. We've only got seven objects so this will do.

If we look in our example_output directory we will see that we have two directories, 1 and 2. Looking at the schema shows that 1 is the field id for property and 2 is the field id for friendship edges.

tree example_output
├── 1
│   └── 74fe8e95-cadb-47da-801d-3ff898edfc12.pailfile
├── 2
│   └── 74fe8e95-cadb-47da-801d-3ff898edfc12.pailfile
└── pail.meta

That's great, but it's not what we need or want: every property still lands in a single directory. There are many reasons to vertically partition data, and on top of all of those, vertical partitioning also makes using Cascalog with this data much easier. While we're at it, let's change the directory names to the field names so we can see what is going on. Nathan argues well for using field ids, and it makes sense to divorce the field names from the database structure. But when it comes to understanding how this stuff works, names are good.

Back to the partitioner

We need a partitioner that will not only give back names, but also drill deeper into the object if the name contains 'property'. If we do just that much we will have a much more powerful generic partitioner.

Here's our new partitioner. You can see that all we do is look for the name and go one level deeper if the name is "property" and that second-tier structure has a :property field.
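The drill-down rule can be sketched with plain Clojure maps. Again this is a model and not the project's partitioner: data units are one-entry maps keyed by field name, and the sample values are made up.

```clojure
(defn make-partition
  "Directory names for a data unit: the top-level field name, plus the name
  of the nested property when the field is \"property\" and the structure
  it holds has a :property field."
  [du]
  (let [[field inner] (first du)
        top           (name field)]
    (if (and (= top "property") (contains? inner :property))
      [top (name (key (first (:property inner))))]
      [top])))

(make-partition {:property {:id "p1" :property {:first_name "Alice"}}})
;; => ["property" "first_name"]

(make-partition {:friendshipedge {:id1 "p1" :id2 "p2"}})
;; => ["friendshipedge"]
```

Because the rule only asks whether the nested structure has a :property field, it would keep working for any new entry added to the property union.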

Now all we need to do is change our Pail Structure to point at this new partitioner.

Don't forget to recompile.

lein clean; lein compile

Reconnect to our pail and write our objects.

Now let's check our output file tree.

tree example_output
├── friendshipedge
│   └── dd5208a2-1cf1-4185-930f-2cb0ecc1e837.pailfile
├── pail.meta
└── property
    ├── first_name
    │   └── dd5208a2-1cf1-4185-930f-2cb0ecc1e837.pailfile
    ├── last_name
    │   └── dd5208a2-1cf1-4185-930f-2cb0ecc1e837.pailfile
    └── location
        └── dd5208a2-1cf1-4185-930f-2cb0ecc1e837.pailfile

Now that's more like it! This behavior is important for many reasons, not the least of which is ease of use with Cascalog.


When using Cascalog, the first thing you need is a generator. There are three flavors of generators, but what we want is a tap. To be specific, a Pail tap. These are easy to get because we just use our pail connection. All we have to do is create a connection, either by connecting to a pail using the PailStructure and a path or by opening an existing pail.

Here's a basic pail tap.

The problem with this tap is that it will bring back all the thrift objects it can find without any idea of what is in them. It could be handled but it's very messy and complicated. What we really want to do is leverage the vertical partitioning that is built into our data with Pail.  This turns out to be very easy.  One of the arguments to create a pail tap can be a vector of paths where each path is a vector. That tap then only brings back objects from those paths. Here is a way to create all the taps we need.
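To illustrate what restricting a tap to a vector of paths buys us, here is a small plain-Clojure model. It does not use pail-cascalog: the pail is modelled as a map from partition path to the objects stored there, and tap-paths, objects-under and the sample data are all hypothetical names and values.

```clojure
;; Model of a vertically partitioned pail: partition path -> stored objects.
(def example-pail
  {["property" "first_name"] [{:id "p1" :first_name "Alice"}
                              {:id "p2" :first_name "Bob"}]
   ["property" "last_name"]  [{:id "p1" :last_name "Smith"}]
   ["friendshipedge"]        [{:id1 "p1" :id2 "p2"}]})

;; One entry per tap: a vector of paths, where each path is a vector.
(def tap-paths
  {:first-name [["property" "first_name"]]
   :last-name  [["property" "last_name"]]
   :location   [["property" "location"]]
   :friendship [["friendshipedge"]]
   :properties [["property"]]})

(defn starts-with-path?
  "True when the partition's directory vector begins with path."
  [partition path]
  (= path (take (count path) partition)))

(defn objects-under
  "All objects stored under any of the given paths."
  [pail paths]
  (for [[partition objects] pail
        :when (some #(starts-with-path? partition %) paths)
        object objects]
    object))

(objects-under example-pail (:friendship tap-paths))
;; => ({:id1 "p1", :id2 "p2"})
```

A query that only needs first names never sees the other partitions, so it knows exactly what kind of object it will get back.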

Let's try out the first name tap and see what happens.

That worked pretty well, but everything is still inside a Thrift object, which isn't very useful. We need an operator to deconstruct it. All we need is a defmapfn. These have been renamed in Cascalog 2: everything that was def???op is now def???fn. Additionally, these guys are now real functions, so we can call them like functions to test them out. We need two operators, and we can use the functions we created before.
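The shape of the two operators can be sketched as ordinary functions that turn a data unit into a tuple. This sketch uses plain defn and the nested-map model of a data unit instead of defmapfn and real Thrift objects; the function names, the :state key and the sample data are assumptions for illustration.

```clojure
(defn property-tuple
  "[id value] for a simple property such as first_name or last_name."
  [du]
  (let [person-property (val (first du))]
    [(:id person-property)
     (val (first (:property person-property)))]))

(defn location-tuple
  "[id v1 v2 ...] for a location property, one value per requested key.
  Optional fields that are not set come back as nil."
  [du ks]
  (let [[id location] (property-tuple du)]
    (into [id] (map #(get location %) ks))))

;; Hypothetical sample data units.
(def name-unit     {:property {:id "p1" :property {:first_name "Alice"}}})
(def location-unit {:property {:id "p1"
                               :property {:location {:city "Springfield"}}}})

(property-tuple name-unit)                     ;; => ["p1" "Alice"]
(location-tuple location-unit [:city :state])  ;; => ["p1" "Springfield" nil]
```

Because they are plain functions, they can be tried at the REPL before they are ever used inside a query.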

These work as advertised.

Now let's wrap up a full query using our taps and operators in a function.

Here it goes.

Now that's cool.

Thrift, Pail, Cascalog and Clojure, all together!

Now that these are all working together it's time to explore.  All of this code is available and ready to run on my GitHub.

One thing to consider is using field ids instead of field names for your vertical partitioning. It will mean that you can change a property name without impacting your existing data. Just don't change the field numbers in your schema. Cascalog taps could easily be generated from a Thrift data type. A generic defmapfn that could deconstruct any structure could also be created. So whether the data is stored in directories of field ids or names makes no difference to us. With just a little bit of code, all of this could disappear under the covers.

Now I'm wondering what it might be like to use Fressian objects instead of Thrift objects.

Please leave comments below.
