Fressian, Pail and Cascalog

In my previous post I wrote about using Thrift, Pail and Cascalog. In this post I'll replace Thrift and the graph schema with Fressian and native data types. It turns out that Fressian, Pail, and Cascalog go together like peanut butter and jelly. As before, this is based on David Cuddeback's clj-pail and pail-cascalog libraries. Instead of pail-thrift I have a new Pail-Fressian library, which handles the details of using Fressian with Pail. Pail-Fressian is available on Clojars. All of the code in this post is available in example.clj in the library. Clone my repository, fire up a REPL and follow along!

Leaving out Thrift greatly simplifies everything. In fact, if you haven't read my previous post, you should go do that so you can fully appreciate the simplicity of using Fressian instead of Thrift. You can learn more about Fressian by watching Stuart Halloway's presentation on Fressian at Clojure Conj last year. Fressian describes itself as an extensible binary data notation rather than a serializer, but it does a really good job as one.


Look ma! No Schema!

We don't need a schema for this, although I did make some simple types to make things easier. How you do it is totally up to you; this is an incredibly flexible system. These types are roughly modeled after the Thrift objects I used before, but they are simpler while retaining the same flexibility as the graph schema unions and structures. There is a PersonProperty that holds an id and a property, and there are three properties: FirstName, LastName and Location.
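As a rough illustration, types like these could be written as plain Clojure records. This is only a sketch: the record and field names below are assumptions chosen for the example, not the definitions from example.clj, and it does not show how the library hands the values to Fressian.

```clojure
;; Hypothetical property types; the field names are assumptions.
(defrecord FirstName [first-name])
(defrecord LastName [last-name])
(defrecord Location [city state])

;; A PersonProperty pairs a person id with exactly one property value,
;; which plays the role the union type played in the graph schema.
(defrecord PersonProperty [id property])
```

Because the property slot can hold any of the three property types, adding a new kind of property only means adding one more small type.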


Creating Some Data

The code to create some of these is very straightforward. Since we have Fressian, there's no need to build any special objects. We just put the data together how we want it.
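A minimal sketch of putting a few such values together, assuming hypothetical record types named after the ones described earlier. The field names, the person id and the sample values are all made up for illustration.

```clojure
;; Hypothetical types; names and fields are assumptions.
(defrecord FirstName [first-name])
(defrecord LastName [last-name])
(defrecord Location [city state])
(defrecord PersonProperty [id property])

;; Three facts about the same made-up person, id 1.
(def sample-data
  [(->PersonProperty 1 (->FirstName "Alice"))
   (->PersonProperty 1 (->LastName "Smith"))
   (->PersonProperty 1 (->Location "Denver" "CO"))])

;; Records behave like maps, so ordinary keyword lookup works.
(map :id sample-data)
(get-in (nth sample-data 2) [:property :city])
```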


These data objects look as simple as you would expect.



The Pail Partitioner

The Pail partitioner is also fairly straightforward. The partitioner has no problem inspecting these objects, and because we've given them types, we can control pretty much anything we want based on the type name alone. This partitioner uses the type name as the directory name. If the type name ends in [Pp]roperty, it looks up the type of the :property field, and that becomes a second-level directory. Here's the make-partition function from the partitioner. You will want to make a partitioner to fit your data, but this one might be a good place to start.
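The partitioning rule described above can be sketched in plain Clojure. This is not the library's make-partition: the function names, the record types and the use of the Java class name as the "type name" are all assumptions made for the example.

```clojure
;; Hypothetical types; names and fields are assumptions.
(defrecord FirstName [first-name])
(defrecord Location [city state])
(defrecord PersonProperty [id property])

;; Assumption: the "type name" is the simple name of the object's class.
(defn type-name [obj]
  (.getSimpleName (class obj)))

;; Returns the directory path for an object as a vector of names.
;; Types ending in "Property" or "property" get a second level
;; named after the type held in the :property field.
(defn partition-path [obj]
  (let [top (type-name obj)]
    (if (re-find #"[Pp]roperty$" top)
      [top (type-name (:property obj))]
      [top])))

(partition-path (->PersonProperty 1 (->FirstName "Alice")))
```

Anything that is not a property type ends up in a single top-level directory, which matches the two-level layout for PersonProperty and the one-level layout for edges shown in the pail listing.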


Now that we have some data and a partitioner, we need a Pail structure. It looks just like the others; it just has Fressian written all over it instead of Thrift. This is a gen-class, so remember to recompile and restart your REPL when you change anything. Thankfully there's not much to change.



Create a Pail

Now we need a pail so we can write some data. This is the same as the Thrift example.


Wow, that was easy. Here's how the pail looks.

example_pail
├── PersonProperty
│   ├── FirstName
│   │   └── be3242ba-2922-427a-9d72-109b6c5ed9fb.pailfile
│   ├── LastName
│   │   └── be3242ba-2922-427a-9d72-109b6c5ed9fb.pailfile
│   └── Location
│       └── be3242ba-2922-427a-9d72-109b6c5ed9fb.pailfile
├── friendshipedge
│   └── be3242ba-2922-427a-9d72-109b6c5ed9fb.pailfile
└── pail.meta


Cascalog

Now let's get some data back out. We can get a basic tap right at PersonProperty and take a look at what we have.


These look a lot different from the raw Thrift objects we got back in the Thrift example. Because they are native Clojure data types, they are a lot easier on the eyes. We only need one function to deconstruct these three data types, and it's an easy one. Because defmapfns are functions, we can try it out without Cascalog.
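A sketch of what a single deconstructing function might look like, written with plain defn so it runs without any dependencies. In a real query it would be declared with Cascalog's defmapfn instead. The function name, the record types and the choice to return the property as a map are assumptions for illustration, not the code from example.clj.

```clojure
;; Hypothetical types; names and fields are assumptions.
(defrecord FirstName [first-name])
(defrecord Location [city state])
(defrecord PersonProperty [id property])

;; Hypothetical helper: returns [id fields] for any PersonProperty,
;; regardless of which property type it carries. The result always has
;; two elements, so it maps cleanly onto two output variables.
(defn deconstruct-property [pp]
  [(:id pp) (into {} (:property pp))])

(deconstruct-property (->PersonProperty 1 (->Location "Denver" "CO")))
```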


Let's put this all together into a query! We need some taps for our pail partitions and some queries to use them.


And here it goes.


Using Fressian instead of Thrift makes almost everything easier. Even though Fressian is not billed as a serializer, it makes a great one, and it works beautifully with Pail. The simplicity of the data objects in this example versus the Thrift and graph schema example simplifies everything from beginning to end. Fressian, Pail and Cascalog make a very flexible and powerful system.

3 Comments

  1. Great articles on Clojure and Pail – really helps to see things tied together after reading Big Data. How would this work with consolidation? Pre-consolidation, this is my directory structure:


    $ tree
    .
    ├── PersonProperty
    │   ├── FirstName
    │   │   ├── 5ed33c10-aabc-4c26-919f-6e10169bb170.pailfile
    │   │   └── e335fffe-8974-4655-a44e-013b9cd4290a.pailfile
    │   ├── LastName
    │   │   └── 5ed33c10-aabc-4c26-919f-6e10169bb170.pailfile
    │   └── Location
    │       └── 5ed33c10-aabc-4c26-919f-6e10169bb170.pailfile
    ├── friendshipedge
    │   └── 5ed33c10-aabc-4c26-919f-6e10169bb170.pailfile
    └── pail.meta

    After consolidation, it looks like this:


    $ tree
    .
    ├── PersonProperty
    │   ├── FirstName
    │   ├── LastName
    │   └── Location
    ├── e6
    │   └── conse6fb31d4-bf02-477c-ab20-c825d775ea40.pailfile
    ├── friendshipedge
    └── pail.meta

    When I run the cascalog queries, I get no results…any thoughts?

    • After digging more, the key is the validate function. Rather than validating anything, this version will maintain vertical partitioning (and query-ability) after consolidation.


      (p/validate
        [this dirs]
        (let [[entity attribute & others] dirs]
          [(boolean (and entity attribute)) others]))

      Thanks again for the great articles!
