Fresh data with Prismatic Schema, Fressian, Pail and Cascalog.

This is the fourth post in a series on using Pail and Cascalog in Clojure.

In the first post I wrote about using Thrift, Pail and Cascalog. My initial goal was to explore Nathan Marz's Lambda Architecture from Clojure. In the second post I used Fressian instead of Thrift to serialize Clojure data to and from a pail. The simplicity of using Fressian with native Clojure data types was really nice.
In the last post, Using Pail with Graph Schema, I expanded on the use of Graph Schema, Thrift and Pail by adding a tap mapping abstraction on top of Pail and Graph Schema, which makes it easy to create taps from vertically partitioned pails.

This post is about doing the same thing, but with Prismatic Schema instead of Graph Schema, Clojure data types rather than Thrift objects, and Fressian as the serializer instead of Thrift. Replacing Graph Schema and Thrift with Prismatic Schema and Clojure data types simplifies everything and gives a few benefits over using Thrift objects. In the process of creating this example, the tap mapping abstraction has found its way into clj-pail-tap, an extension library for David Cuddeback's clj-pail library. The end result is the Pail-Schema library, which is much simpler than the Thrift-based libraries. For the most part, pail-schema simply combines clj-pail-tap and pail-fressian, so that all that remains is to create a pail structure which uses them.

All of the code in this post is available in the pail-schema-example repository. Just like the last time, clone my repository, fire up a REPL and follow along!


Prismatic Schema

Using Prismatic Schema to define and enforce the shape of data in a database does not seem to be exactly what Prismatic had in mind when they created it. However, it works very well as a Graph Schema replacement. Instead of just throwing exceptions when data does not fit the schema, Prismatic Schema prints mostly reasonable messages that tell you why your data doesn't fit. In addition, as of Schema 0.2.0, Prismatic Schema can coerce your data. These two things alone make Prismatic Schema a strong competitor to Graph Schema. Add in that everything, including the schema and the data, is native Clojure, and things are really looking good. There are a lot of advantages, and I'm having a hard time finding any disadvantages. Writing code is really simple when apples are just apples.

  • Better validation messages
  • Coercion
  • Schema is clojure code
  • Data is clojure data

As with the other examples, I've recreated the same schema here using Prismatic Schema. It defines a Data Unit, which is the thing all database entities are made of. A Data Unit can be a Person Property or a Friendship Edge. A Person Property can be either first-name, last-name, age or location. Location is a map which contains :address, :city, :county, :state, :country and :zip, any of which can be nil.
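A minimal sketch of a schema with this shape follows. It is not the listing from the pail-schema-example repository: the wrapper keys :person-property and :friendshipedge are inferred from the pail directory names shown later, the :id, :id1 and :id2 fields are assumptions, and s/conditional is simply one way to express the union.

```clojure
(require '[schema.core :as s])

;; Every Location key must be present, but any value may be nil.
(def Location
  {:address (s/maybe s/Str)
   :city    (s/maybe s/Str)
   :county  (s/maybe s/Str)
   :state   (s/maybe s/Str)
   :country (s/maybe s/Str)
   :zip     (s/maybe s/Str)})

;; Hypothetical helper: predicate that is true for maps containing key k.
(defn has-key? [k]
  (fn [m] (and (map? m) (contains? m k))))

;; Simple properties are defined inline; Location is defined separately.
(def PersonProperty
  {:id s/Str
   :property (s/conditional
               (has-key? :first-name) {:first-name s/Str}
               (has-key? :last-name)  {:last-name s/Str}
               (has-key? :age)        {:age s/Int}
               (has-key? :location)   {:location Location})})

;; Assumed shape for an edge between two people.
(def FriendshipEdge
  {:id1 s/Str
   :id2 s/Str})

;; A Data Unit is either a person property or a friendship edge.
(def DataUnit
  (s/conditional
    (has-key? :person-property) {:person-property PersonProperty}
    (has-key? :friendshipedge)  {:friendshipedge FriendshipEdge}))
```

The predicates in s/conditional look only at which key is present, so they pick the same branch before and after coercion.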


This schema maps almost exactly to the graph schemas used in the previous examples. As before, this schema consists of a single Data Unit which is a union of properties. Each property for a person becomes a single Data Unit. Simple properties like first name, last name and age are defined inline as a part of the Person Properties union, although they could also be defined separately in the same way that the Location property is defined.


Constructors

In addition to creating the schema, it is also helpful to create some constructors that will make creating a Data Unit a simple task. These constructors are similar to those used in the Pail-Fressian example. In this example I also retained the type hints, although they provide no real value since they do not persist through coercion or when read back from a pail.

The very first function, 'master-schema' provides a way for any functions to get a handle to the Data Unit schema. Everything else is just to help make it easy to create the various parts of a Data Unit.
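As a rough illustration, constructors of this kind can be plain functions that return maps. The function names and the map shape below are assumptions made for this sketch, not the repository's code, and master-schema is left out because it would presumably just return the schema value.

```clojure
;; Hypothetical constructors; names and map shape are assumptions.
(defn property
  "Wrap one property map for person `id` as a Data Unit."
  [id prop]
  {:person-property {:id id :property prop}})

(defn first-name [id v] (property id {:first-name v}))
(defn last-name  [id v] (property id {:last-name v}))
(defn age        [id v] (property id {:age v}))

(defn location
  "Build a location property; fields that are not supplied become nil."
  [id {:keys [address city county state country zip]}]
  (property id {:location {:address address :city city :county county
                           :state state :country country :zip zip}}))

(defn friendshipedge
  "Build an edge between two person ids."
  [id1 id2]
  {:friendshipedge {:id1 id1 :id2 id2}})

;; Sample values are made up for illustration.
(def du1-1 (first-name "123" "Tom"))
(def du1-4 (age "123" "30")) ; age given as a string, so not yet valid
```

Because the results are ordinary maps, they can be printed, compared and passed around at the REPL like any other Clojure data.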



Creating some Data Units

Creating our Data Units is as straightforward as can be, and what we end up with is plain old Clojure data.



Validation

All we've done so far is create some basic clojure data maps. Now we can validate them with our schema. This is all we need to validate the du1-1 data unit. On success we get the original du1-1 as a return.
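For reference, a single validation call looks roughly like this. The FirstNameUnit schema and the contents of du1-1 are small stand-ins invented for the sketch, not the post's full Data Unit schema.

```clojure
(require '[schema.core :as s])

;; Stand-in schema covering only a first-name Data Unit (assumed shape).
(def FirstNameUnit
  {:person-property {:id s/Str :property {:first-name s/Str}}})

(def du1-1 {:person-property {:id "123" :property {:first-name "Tom"}}})

;; s/validate returns its input when it matches the schema
;; and throws an exception when it does not.
(s/validate FirstNameUnit du1-1)
```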


Here is a not so friendly way to validate everything in our list. If anything fails, you won't really know which one it is.
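One way to sketch both the all-or-nothing check and a friendlier alternative is shown below. AgeUnit, validate-all and failures are hypothetical names, and the schema is a reduced stand-in for the Data Unit schema.

```clojure
(require '[schema.core :as s])

;; Stand-in schema covering only an age Data Unit (assumed shape).
(def AgeUnit
  {:person-property {:id s/Str :property {:age s/Int}}})

(def objects
  [{:person-property {:id "123" :property {:age 30}}}
   {:person-property {:id "456" :property {:age "30"}}}]) ; invalid: age is a string

;; All-or-nothing: a vector schema checks every element and
;; throws if any of them is invalid.
(defn validate-all [objs]
  (s/validate [AgeUnit] objs))

;; s/check returns nil for valid data and an error description otherwise,
;; so pairing each object with its result shows which ones failed.
(defn failures [objs]
  (for [o objs
        :let [err (s/check AgeUnit o)]
        :when err]
    {:object o :error err}))
```

Calling (failures objects) here yields one entry, for the unit whose id is "456".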



Coercion

Of all the Data Units defined, only one is invalid: du1-4 has age as a string rather than an integer. The error message is only slightly better than the exception we would get from Thrift.


But we can coerce du1-4 into the shape we want. Prismatic Schema currently provides two coercion matchers, and you can also write your own. The two provided are 'json-coercion-matcher' and 'string-coercion-matcher'. One thing to be aware of if you are counting on type hints: metadata does not survive coercion, nor does it survive the round trip from a pail. You can see that by examining the types before and after a coercion, or after returning from a query.

From the code:

  • json-coercion-matcher

    "A matcher that coerces keywords and keyword enums from strings, and longs and doubles
    from numbers on the JVM (without losing precision)"

  • string-coercion-matcher

    "A matcher that coerces keywords, keyword enums, s/Num and s/Int,
    and long and doubles (JVM only) from strings."

Prismatic's coercer takes a schema and coercion matcher and returns a coercion function. Here are two simple wrappers for using both coercers with our Data Unit schema.
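A sketch of such wrappers, using a reduced stand-in schema, might look like the following. The names AgeUnit, coerce-strings, coerce-json and good-du1-4 are hypothetical, as are the sample values.

```clojure
(require '[schema.core :as s]
         '[schema.coerce :as coerce])

;; Stand-in schema covering only an age Data Unit (assumed shape).
(def AgeUnit
  {:person-property {:id s/Str :property {:age s/Int}}})

;; coerce/coercer takes a schema and a matcher and returns a function.
;; Building each coercer once and reusing it avoids repeating that work.
(def coerce-strings
  (coerce/coercer AgeUnit coerce/string-coercion-matcher))

(def coerce-json
  (coerce/coercer AgeUnit coerce/json-coercion-matcher))

;; du1-4 carries its age as a string.
(def du1-4 {:person-property {:id "123" :property {:age "30"}}})

;; The string coercer reads the age as an integer and validates the result.
(def good-du1-4 (coerce-strings du1-4))

;; The coerced unit can now join the list destined for the pail.
(def objects (conj [] good-du1-4))
```

When coercion cannot produce valid data, the coercer returns an error value rather than throwing, so it is worth checking the result before writing it anywhere.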


To get a good version of the du1-4 Data Unit, all we have to do is call the coercer on it. Then we can add it to the object list for insertion to the database.



Defining a Pail

Defining a pail is a little different now, since clj-pail-tap adds some extra functionality over the old Pail Structure definition found in clj-pail. There is now the option of a Schema rather than a data Type, and there are also Tap Mapper and Property Path Generator entries. Otherwise it is still rather straightforward. We need to specify the Fressian serializer and a partitioner that knows how to look at native Clojure data rather than dissecting Thrift objects. Overall this part is not so different from before.



The Partitioner

We've already seen people/master-schema and the fressian-serializer is the one we get from pail-fressian. The rest is code we'll need. Pail-Schema provides a fairly generic partitioner, tapmapper and property-path generator. I experimented with type/meta information, and named schemas, all of which seem like they might make things simpler and more flexible but in the end were not that helpful or persistent.

The easiest thing is still the way that things work with Thrift: look at the property names/keys in the data and use those to create partitions. This also means that the tap mapper can use the schema to do its work and everything will be consistent. The partitioner doesn't look too different from the two-level property name partitioners in the other posts. It looks for anything ending in [Pp]roperty and looks for :property inside of that for a second-level directory.
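The partitioning rule itself can be sketched as a plain function. This is only the path logic under an assumed Data Unit shape; it does not implement the partitioner interface that clj-pail expects, and property-unit? and partition-path are hypothetical names.

```clojure
;; True when a top-level key's name ends in "property" or "Property".
(defn property-unit? [k]
  (boolean (re-find #"[Pp]roperty$" (name k))))

(defn partition-path
  "Return the directory segments for a single-entry Data Unit map.
  Property units get a second level named after the key inside :property."
  [du]
  (let [[k v] (first du)
        top   (name k)]
    (if (property-unit? k)
      [top (name (first (keys (:property v))))]
      [top])))
```

For a unit such as {:person-property {:id "123" :property {:age 30}}} this gives ["person-property" "age"], which lines up with the directory layout shown further down.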



The Tap Mapper

The Tap Mapper code is supposed to take the output of the property path generator and return a map of property paths where the compounded property name is the key. It is totally up to you how to construct the keys, but the results should match the paths that the partitioner creates.
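As an illustration only, a tap mapper over a list of property paths could be as small as the function below. The name tap-map and the choice of key are assumptions: here the key is simply the last path segment as a keyword, and a compound key such as :person-property-first-name would work just as well, provided the paths match what the partitioner writes.

```clojure
(defn tap-map
  "Turn a list of property paths into a map from keyword to path.
  The key is the last segment of each path (an assumption of this sketch)."
  [paths]
  (into {}
        (map (fn [path] [(keyword (last path)) path]))
        paths))

;; Paths in the form a property path generator might produce.
(def example-paths
  [["person-property" "first-name"]
   ["person-property" "last-name"]
   ["person-property" "age"]
   ["person-property" "location"]
   ["friendshipedge"]])
```

Looking up (:age (tap-map example-paths)) then returns ["person-property" "age"], the same segments the partitioner would use for an age unit.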


It's quite possible that the code could be shorter, but it does what it should. Now the Tap Mapper needs to be fed a list of property paths. With Thrift it was easy to traverse the Java objects and return a list of property paths. With Prismatic Schema it's about the same. Pail-Schema provides this functionality with the property-paths function. Property-paths is a descent parser which could use some fleshing out. It currently does not support named schemas, and there is the possibility of other problems as well. It does support this simple schema, which is good enough for now.

That finishes up our Pail Structure, now we just need to use it. Remember your Pail Structure is a gen-class so it needs to be :aot compiled. To play with the pail structure we just need an instance. Then we can ask it all sorts of things. To see everything it can do take a look at the defrecord. One of the easier things we can do is ask the partitioner for the partition target of one of our data units. We can also get the tap mapper function.



Tap Maps

Part of the core functionality of clj-pail-tap is to provide easy ways to get to the tap maps. These functions will use a pail structure or a pail connection, and work with whatever you have set up, whether it is a Thrift type or a Prismatic schema. So in addition to the functions in the Pail Structure, we can also do things like this.



Using the Pail

Our Pail Structure seems to be working fine. But we haven't written anything to the pail yet. As with the previous examples this part is pretty simple.


The pail now has some data and the entire pail looks like this.

$ tree example_output
example_output
├── friendshipedge
│   └── 23c58dc8-def8-4613-a11e-8101cddf4432.pailfile
├── pail.meta
└── person-property
    ├── age
    │   └── 23c58dc8-def8-4613-a11e-8101cddf4432.pailfile
    ├── first-name
    │   └── 23c58dc8-def8-4613-a11e-8101cddf4432.pailfile
    ├── last-name
    │   └── 23c58dc8-def8-4613-a11e-8101cddf4432.pailfile
    └── location
        └── 23c58dc8-def8-4613-a11e-8101cddf4432.pailfile


The simplest Cascalog Query

This is where it starts to get fun. With thrift, we always had a thrift object to deconstruct. With these data objects there is no real need. We can look at them and use them as they are. Deconstructing the data for cascalog is simpler, and more flexible than with thrift data. We can do a raw query with no deconstruction and still see what we've got.



Partial Deconstruction

It could be that what we want back from cascalog is not so deconstructed at all. Maybe all we want is to deconstruct it far enough to be joined by the query. In this case that means getting two values, id and Person Property, whatever it is. The defmap for that is really simple, and it works for every kind of Person Property we have.
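A sketch of that idea, assuming Cascalog 2.x (where defmapfn defines a map operation) and an assumed Data Unit shape, is shown below. The names id-and-property, sprop, units and run-query are hypothetical, and an in-memory vector of 1-tuples stands in for a pail tap.

```clojure
(require '[cascalog.api :refer [defmapfn ??<-]])

;; Plain function: pull out the id and the untouched property map.
(defn id-and-property [du]
  (let [{:keys [id property]} (:person-property du)]
    [id property]))

;; Cascalog map operation wrapping the plain function.
;; Returning a two-element vector emits two output fields.
(defmapfn sprop [du]
  (id-and-property du))

;; In-memory stand-in for a pail tap: a vector of 1-tuples.
(def units
  [[{:person-property {:id "123" :property {:first-name "Tom"}}}]
   [{:person-property {:id "123" :property {:age 30}}}]])

;; Running this needs Cascalog and Hadoop on the classpath.
(defn run-query []
  (??<- [?id ?prop]
        (units ?du)
        (sprop ?du :> ?id ?prop)))
```

Keeping the logic in an ordinary function makes it easy to try at the REPL without running a query.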


Using this defmap with location gives us location as the map it was originally created as. This is potentially a much more useful format downstream.



Getting Pail Taps

Notice that getting taps is very easy with the tap mapper doing all the work underneath. It's also possible to leverage the tap-mapper in other ways. We could create a tap that includes all person properties but not friendship edges. That person property defmap would be very unhappy if it encountered a friendship edge. In this particular case there is an easy way to make sure we only get person properties, (pail/get-tap mypail :person-property) would do the trick. But if the schema were more complex that might not get it. It is possible to create a tap from multiple paths. Leveraging the tap mapper we can select the paths explicitly by keyword. The following function uses the base pail->tap function to create a custom tap which explicitly lists first-name, last-name, location and age.



Full Deconstruction

Of course it's also very easy to deconstruct the data units when we query them. There are really only two flavors of person property, Location and the other simple properties. The defmap functions for both of them are much simpler than those created for thrift and it definitely seems like these could be improved upon.
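The two flavors might be deconstructed along these lines, again assuming Cascalog 2.x and an assumed Data Unit shape. All names here are hypothetical. Since any location field may be nil, the query sketch uses nullable !vars for them, because Cascalog drops tuples that bind nil to a ?var.

```clojure
(require '[cascalog.api :refer [defmapfn <-]])

;; Simple properties: emit the id, the property name and its value.
(defn simple-fields [du]
  (let [{:keys [id property]} (:person-property du)
        [k v] (first property)]
    [id (name k) v]))

;; Location: emit the id followed by each location field in a fixed order.
(defn location-fields [du]
  (let [{:keys [id property]} (:person-property du)
        {:keys [address city county state country zip]} (:location property)]
    [id address city county state country zip]))

(defmapfn simple-property [du] (simple-fields du))
(defmapfn location-property [du] (location-fields du))

;; Builds (but does not run) a query over any tap of location Data Units.
(defn location-query [tap]
  (<- [?id !address !city !county !state !country !zip]
      (tap ?du)
      (location-property ?du :> ?id !address !city !county !state !country !zip)))
```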


Now we can query for and deconstruct everything. Here's what that looks like.



Conclusion

I've been working on various forms of using Pail within Clojure for several months now. Using the Fressian Pail with no schema, just constructors, was really nice, but seemed a bit loose in some ways. Maybe that is OK, but I definitely prefer having some sort of schema to keep the shape of the data consistent. Prismatic Schema does that by offering validation and coercion, both of which are nicer than what Graph Schema and Thrift offer. Keeping all the data in native data formats has a transparency that is refreshing. Using Prismatic Schema with Pail-Fressian and clj-pail-tap has a simplicity and power that is hard to dismiss. I'll be continuing to use this framework and see where it leads.
