Turning on useful output for Gradle test targets.

By default, Gradle’s output while running tests tends to be quite terse:

com.comp.package.SomeTest > shouldDoThis PASSED

com.comp.package.SomeTest > shouldNotFail FAILED
   java.lang.AssertionError at SomeTest.java:53

This is not exactly helpful when tracking down what went wrong! If you'd like to get more info out of Gradle, then here are a couple of suggestions to help you along the way.

Command line tweak

If all you're looking for is a simple tweak to use from the command line, then the -i (or --info) option might be just what you need:

gradle -i test

This raises the log level from the default LIFECYCLE to the more useful INFO, meaning the tests will happily pump out reams of standard out & error text for you to stare at. Moving up one more level, you can also use -d (or --debug) if you're after even more gumpf to scroll through.

Oh, and by the way, when you just want to re-run a single test while tracking down an issue, use the test.single system property, which takes the name of the test class minus the final 'Test', e.g. 'MyFirstTest' becomes:

 gradlew -i -Dtest.single=MyFirst test

Gradle build.gradle tweaks

If you’re looking for a more permanent change, then there are two options that I use regularly. To get these into your project add the following snippet to your build.gradle file:

test {
    testLogging {
        exceptionFormat = 'full'
        //showStandardStreams = true
    }
}

Output full exception info

I generally have exceptionFormat set to 'full' for all projects, as this ensures any exception that causes a test to fail will have its full details emitted. So, for example, if you're a fan of Hamcrest or JUnit style asserts, this setting will give you both the assertion message that caused the test to fail and the stack trace, e.g.

com.comp.package.SomeTest > shouldNotFail FAILED
 java.lang.AssertionError:
 Expected: is "the value I expected"
 but: was "actual value"
 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
 at com.comp.package.SomeTest.shouldNotFail(SomeTest.java:53)

For me, this should be the default in Gradle!

Output full stdout & stderr streams

Something I generally don't leave on, but which I like to switch on while debugging any particularly stubborn 'only failing under Gradle' style tests, is the showStandardStreams option. This does what it says on the tin: it's like having the '--info' switch on all the time. Be warned though, it generally spams your output a lot (depending on your log level), so it's not recommended for general configuration, as it makes finding which tests have failed trickier than it needs to be.
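
If you want to keep it close to hand without leaving it on, one approach (just a sketch, using a hypothetical project property called 'stdout') is to toggle it from the command line, e.g. gradle test -Pstdout:

test {
    testLogging {
        exceptionFormat = 'full'
        // only emit stdout/stderr when explicitly asked for with -Pstdout
        showStandardStreams = project.hasProperty('stdout')
    }
}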

That’s it for now. Over and out.

DL

 

Improved quality of serialisation code

In our world of highly distributed applications, tiered architectures and a myriad of different storage technologies, we often need to ship objects across the wire. As projects evolve and new serialisation techniques are brought to bear, test coverage often suffers, especially when new techniques leverage existing metadata or mark-up. In this post I'll be introducing Populace: a small test library that can be used, and customised, to walk and fully populate object graphs, giving you much higher confidence in your serialisation code.

In the last year I've worked on at least three sizeable projects that involved the migration of complex domain models between serialisation methods (e.g. between XML using JAXB and JSON using Jackson), between technologies (e.g. between traditional RDBMSs, in-memory grids and the new breeds of NoSql DBs), or between data-centres. Each of these projects required a high level of test coverage to be certain that data wasn't being lost, corrupted or created.

While it is fairly straightforward to effectively test serialisers and de-serialisers via unit tests, what often complicates matters is when the serialisation code relies on annotations on the domain types themselves, or some other form of metadata. Fully testing serialisation in these situations requires tests that cover every field on every domain type; otherwise a single missing or bad annotation can spell disaster. Hand-crafting fully populated instances of your complex domain models is time-consuming, error-prone and brittle, requiring developers to enhance tests whenever new fields are added. A much better approach is to use reflection to recursively walk the fields of your domain types, ensuring each field is neither null nor the default. It is then fairly trivial to test that objects can round-trip through the serialisation code unchanged.

Having encountered this challenge on more than one occasion, and when a quick Google didn’t find a candidate open-source project to help, I decided to write one, and Populace was born. The goal of Populace is to do the heavy lifting of graph walking and population, requiring the user only to provide code for special types or special requirements, while still allowing users complete flexibility should they need or want it.

A quick example

OBJECT GRAPH WALKING MADE SIMPLE

So as a quick example of how easily Populace can be used for walking an object graph, let’s look at some example code that outputs the type of each object it encounters while recursively walking an object graph:

GraphWalker walker = GraphWalker.newBuilder().build();
walker.walk(thingToWalk,
    f -> System.out.println("Field: " + f.getName() + " - " + f.getGenericType()),
    e -> System.out.println("Element: " + e.getGenericType()));

The walk method simply takes the instance, or type, to walk, and two functions that will be used to visit each encountered field and element along the way. That's pretty simple, right?

To help understand the difference between fields and elements, and to help visualise what the output from this might be, consider an object graph that represents a simplistic phone book: a top-level book object has a field called contacts that is a collection of elements whose type is Contact. Each Contact instance has a field called name and another called phoneNumber. If GraphWalker were to walk this object graph it would visit the contacts field of the top-level book instance, then visit each of the child elements of contacts. For each child Contact element, it would visit the name and phoneNumber fields. As GraphWalker takes a depth-first approach, the output might look something like this (I've indented the output to make it more readable):

Field: contacts - java.util.HashSet<Contact>
  Element: your.package.Contact
    Field: name - java.lang.String
    Field: phoneNumber -  java.lang.String
  Element: your.package.Contact
    Field: name - java.lang.String
    Field: phoneNumber -  java.lang.String
  Element: your.package.Contact
  ...

The eagle-eyed among you might wonder if there is a typo in the above output: while the runtime type of the value held in the contacts field might be HashSet, surely the field would most likely have a compile-time type of Set<Contact>, neither of which match the HashSet<Contact> shown above. Well, there is no typo: Populace will combine the compile-time and runtime type information available to produce the most specialised and concrete type it can, parameterising types and resolving type variables as needed.
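
For reference, the phone book example above assumes domain types along these lines (these classes are hypothetical, included just to make the output concrete):

public class PhoneBook {
    private Set<Contact> contacts = new HashSet<>();
}

public class Contact {
    private String name;
    private String phoneNumber;
}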

RECURSIVE POPULATION OF AN OBJECT GRAPH

Populating an object graph that uses standard Java types and types with a default constructor is similarly simple:

GraphPopulator populator = GraphPopulator.newBuilder().build();
PhoneBook phoneBook = populator.populate(PhoneBook.class);

This time the populate method takes the type, or instance, to populate and returns a populated instance, where each field is non-null and each container type has at least one element in it. This is the default behaviour, but Populace also allows you to customise all aspects of its operation, so that you can tweak it to meet the requirements of your project, your use-case and your code base.
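
As a taste of how this supports the serialisation testing described earlier, here's a sketch of a round-trip test. The PhoneBook type, its equals() implementation and the use of Jackson's ObjectMapper are assumptions for illustration; only the GraphPopulator calls come from Populace itself:

// inside a JUnit test method
GraphPopulator populator = GraphPopulator.newBuilder().build();
PhoneBook original = populator.populate(PhoneBook.class);   // every field populated

ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(original);
PhoneBook roundTripped = mapper.readValue(json, PhoneBook.class);

// fails if a missing or bad annotation causes any field to be dropped
assertEquals(original, roundTripped);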

Shaping Populace to meet your needs

In this introductory post I'm not going to go into the customisations in any great detail – I'll do that in follow-on posts. Instead, to give you a feel for the flexibility Populace offers, I'll list off the types of customisation it supports:

WALKING THE WAY YOU WANT

First, let's look at how you can customise the object graph walking. Filters and Inspectors can be used to control which fields and elements are walked, while Visitors are how you pass Populace the logic you want to run against any unfiltered fields and elements.

Filters

Filters can be installed to stop specific fields or elements from being walked and visited, i.e. they stop the walker from recursing any further down into the object graph. Field and element filters are installed when building a walker instance, as shown below:

GraphWalker walker = GraphWalker.newBuilder()
    .withFieldFilter(field -> field.getName().contains("includeMe"))
    .withElementFilter(element -> element.getValue() != null)
    .build();

The FieldFilters class defines some common field filters. By default the walker excludes static and transient fields, though this can be overridden. Filters can be combined using the standard `Predicate` logical operators, e.g. AND, OR and NOT.

Inspectors

Inspectors are similar to filters in some ways. Whereas filters, given the list of fields and elements a type exposes, control which of them should be walked, inspectors actually define the set of fields/elements that a type exposes. Inspectors are registered against types, whereas filters work across all types.

Populace defines inspectors to cover most common types. You may need to write your own inspectors, especially if you have custom container types that are not derived from the standard `List`, `Set` or `Map` interfaces.

Populace provides one special inspector, the TerminalInspector, which makes a type opaque to Populace, i.e. the walker will not recurse into the internals of the type. This inspector is installed by default for some standard packages, e.g. java.util, though any defaults can be overridden.

Inspectors can be installed to handle a specific type, but they can also be installed so that they handle any types derived from a particular super-type, or any types that belong to a specific package. Populace also supports default inspectors for array and non-array types. Inspectors are installed when building a walker:

final GraphWalker.Builder builder = GraphWalker.newBuilder();
final GraphWalker walker = builder.withInspectors(builder.inspectors()
        .withPackageInspector("java.util", TerminalInspector.INSTANCE)  // Don't look inside java.util types
        .build())
    .build();

Visitors

The visitors you pass to the `walk` function will be called back for each non-filtered field and element exposed by the inspectors of the types encountered. They are where you define the logic for what you want to do with each field and element. A visitor can extract information from the field, and it can also mutate the value of the field or element if required.
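
As a small sketch, using only the accessors shown in the earlier snippet (getName() and getGenericType()), the following visitor collects the names of any String-typed fields encountered, which a test could then assert against:

Set<String> stringFields = new HashSet<>();
GraphWalker walker = GraphWalker.newBuilder().build();
walker.walk(thingToWalk,
    f -> {
        // record every String field the walker exposes
        if (String.class.equals(f.getGenericType())) {
            stringFields.add(f.getName());
        }
    },
    e -> { /* elements are not needed for this particular check */ });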

BENDING THE POPULATION TO YOUR WILL

The population / initialisation side of Populace offers a similar degree of flexibility to the walking side. Mutators allow you to control how the instances encountered are mutated and InstanceFactories give you the same control over how instances of a type are created. Finally, a NullObjectStrategy can be registered for those rare occasions where Populace simply does not have enough information to populate a field.

Mutators

Mutators allow you to install custom code to control how a type is mutated, giving you complete control over what data is pushed into your object graphs as they are populated. You can build different sets of mutators to fulfil different use-cases, e.g. one set may ensure all fields have a value, to ensure no data is lost during (de)serialisation, another set may set all fields to null to ensure no NPEs are thrown during (de)serialisation.

Populace comes complete with standard Mutators that can be composed to achieve many tasks, and you are free to implement your own as needed. By default, a new GraphPopulator comes pre-configured with a sensible set of mutators that will populate most common types, though these defaults can be overridden as needed.

Mutators can be installed to mutate specific types, subtypes of some known super-type, or any type belonging to a package, plus default mutators for array and non-array types can also be configured. See below for a simple example of configuring mutators:

GraphPopulator.Builder builder = GraphPopulator.newBuilder();
GraphPopulator populator = builder
    .withMutators(builder.mutators()
        .withSpecificMutator(String.class, new CustomStringMutator())
        .build())
    .build();

Instance Factories

InstanceFactories allow you to install custom code to control how a type is created. Populace needs these to populate fields or elements that are currently null, and container types that are currently empty.

Populace comes complete with factories for standard types and for any type that can be constructed using its default constructor; for any other type you will need to implement and register an instance factory.

Instance factories can be installed for specific types, subtypes of some super-type, any type belonging to a package, and default factories for array and non-array types can also be configured.

GraphPopulator.Builder builder = GraphPopulator.newBuilder();
GraphPopulator populator = builder
   .withInstanceFactories(builder.instanceFactories()
        .withSpecificFactory(MyType.class, new MyTypeFactory())
        .build())
   .build();

Null Object Strategy

A NullObjectStrategy can be installed to handle any null objects encountered, i.e. null fields or elements where no type information is available because their compile-time type is either Object or an unresolvable TypeVariable. In such situations Populace does not have enough information to populate the element, but you may choose to handle this however you like by installing a custom NullObjectStrategy. By default, Populace will log a warning.

GraphPopulator.Builder builder = GraphPopulator.newBuilder();
GraphPopulator populator = builder
   .withInstanceFactories(builder.instanceFactories()
        .withNullObjectStrategy(new CustomNullObjectStrategy())
        .build())
   .build();

Wrap up

My main driver for writing Populace was so that I could reuse it next time I need it, but I'm hoping it can be of use to others too. I'm also already working on enhancements and a version 2.0, which will clean up some of the interface anomalies and inconsistencies that have been introduced along the way. So please add it to your toolbox, use it where appropriate, and let me know of your experience with it.

While I failed to find an open-source library that fitted the bill, there may be some out there – there generally is! – so if you know of one, then please let me know.

Finally, if you feel like giving any feedback, suggestions, criticism, etc. on Populace, then please post a comment. I'm thick-skinned 😉

Over and out.


Extracting Index content from Coherence

When storing semi-structured data within a Coherence cache it’s common to add appropriate indexes on key fields to speed up queries and value extraction. Coherence indexes work in a similar way to those of a more traditional database, in that they allow queries to more efficiently target interesting entries, (or rows as they would be in the database world), than full cache (or table) scans.
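
As a reminder, such an index is added against a ValueExtractor, and the same (or an equal) extractor is what we'll later hand to the aggregator. A minimal sketch, assuming a NamedCache called cache holding values with a hypothetical getSymbol accessor:

// Hypothetical: index the 'symbol' property of the cached values
ValueExtractor indexExtractor = new ReflectionExtractor("getSymbol");
cache.addIndex(indexExtractor, false /* unordered */, null /* no comparator */);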

It's not uncommon to want to be able to see just what values are within the index, either as a debugging tool or to help build responsive screens and pick-lists in some interactive UI. Below you'll find a handy aggregator capable of pain-free index content extraction.

The full code is at the end of the post (stripped down to keep things uncluttered), or if you're after the full unadulterated version, then take a look at my GitHub account. For those wanting more of a walk-through of the code, read on…

Walking through the code

So the first thing we're going to need is some logic that runs on the storage-enabled nodes within the Coherence cluster. If it is to gain access to the cache indexes, then it will need to run on the service the cache belongs to. The obvious candidates are Coherence aggregators and entry processors. The latter lock the cache entries and support mutation of the data, which we don't want or need, so we're going to build this using an aggregator.

Introducing the IndexContentExtractor – as recommended by dentists

The aggregator takes as constructor arguments the ValueExtractor that was used to create the index and, optionally, a Filter that can be applied server-side to filter the index content. Such filtering can be useful if, for example, you're building a user-filterable pick-list. The two main methods that Coherence is going to call on the class are the aggregate() and aggregateResults() methods.

The former, aggregate(), is normally called by Coherence to aggregate a set of entries. In our case we're not interested in the entries – we just need a single entry to get at the index content. Once we have the index we extract the set of unique values it holds, optionally filter them, and then return them as the result of the call. Coherence will normally call aggregate() once on each storage-enabled node, though this isn't guaranteed and may change in future versions. See the notes on performance below on how this can affect the functionality.

The latter, aggregateResults(), is called by Coherence to reduce the results from all the aggregate() calls into a single result. This is run either on the cluster member node instigating the aggregate call, or on the proxy node if the call was instigated by an Extend client. In our case the aggregateResults() call just needs to build a 'master set' from all the individual result sets returned from each aggregate() invocation.

@Portable
public class IndexContentExtractor implements InvocableMap.ParallelAwareAggregator, Serializable {

    @PortableProperty(value = 1)
    private ValueExtractor indexExtractor;

    @PortableProperty(value = 2)
    private Filter filter;

    public IndexContentExtractor(ValueExtractor indexExtractor) {
        this(indexExtractor, null);
    }

    public IndexContentExtractor(ValueExtractor indexExtractor, Filter filter) {
        this.indexExtractor = indexExtractor;
        this.filter = filter;
    }

    @Override
    public Object aggregate(Set set) {
        if (set.isEmpty()) {
            return Collections.emptySet();
        }
        final Map indexMap = getIndexMap(set);
        final Set values = getIndexedValues(indexMap);
        filterValues(values);
        return values;
    }

    @Override
    public Object aggregateResults(Collection results) {
        Set values = new HashSet();
        for (Collection s : (Collection<Collection>)results){
            values.addAll(s);
        }
        return values;
    }

    ...
}

Getting hold of the elusive index

When Coherence calls our aggregate(Set) method it passes in a set of cache entries. We don't really care about them, as we're not actually aggregating from them, except that they give us a way of getting hold of the map of indexes defined on the cache. That's assuming they are BinaryEntry instances; otherwise we fail gracefully. This means the extractor won't work on some cache types, such as replicated caches. However, it will work with both POF and standard Java serialisation.

    private static Map getIndexMap(Set set) {
        final Object entry = set.iterator().next();
        if (!(entry instanceof BinaryEntry)) {
            throw new UnsupportedOperationException("Only supports binary caches");
        }

        final BinaryEntry binaryEntry = (BinaryEntry) entry;
        return binaryEntry.getBackingMapContext().getIndexMap();
    }

Extracting the index content

Once we have the map of indexes, finding the index we're interested in and grabbing its content is child's play:

    private Set getIndexedValues(Map indexMap) {
        final MapIndex index = (MapIndex) indexMap.get(indexExtractor);
        final Map contents = index.getIndexContents();
        return contents == null ? Collections.emptySet() : new HashSet<Object>(contents.keySet());
    }

Filtering the values server side

Given the index values we can now apply our optional filter. Again, the code is pretty straightforward:

    private void filterValues(Set values) {
        if (filter == null) {
            return;
        }
        for (final Iterator it = values.iterator(); it.hasNext(); ) {
            if (!filter.evaluate(it.next())) {
                it.remove();
            }
        }
    }

Bringing it all together

If we bring all of that together we end up with the following:

@Portable
public class IndexContentExtractor implements InvocableMap.ParallelAwareAggregator, Serializable {

    @PortableProperty(value = 1)
    private ValueExtractor indexExtractor;

    @PortableProperty(value = 2)
    private Filter filter;

    public IndexContentExtractor(ValueExtractor indexExtractor) {
        this(indexExtractor, null);
    }

    public IndexContentExtractor(ValueExtractor indexExtractor, Filter filter) {
        this.indexExtractor = indexExtractor;
        this.filter = filter;
    }

    @Override
    public Object aggregate(Set set) {
        if (set.isEmpty()) {
            return Collections.emptySet();
        }
        final Map indexMap = getIndexMap(set);
        final Set values = getIndexedValues(indexMap);
        filterValues(values);
        return values;
    }

    @Override
    public Object aggregateResults(Collection results) {
        Set values = new HashSet();
        for (Collection s : (Collection<Collection>)results){
            values.addAll(s);
        }
        return values;
    }

    private Set getIndexedValues(Map indexMap) {
        final MapIndex index = (MapIndex) indexMap.get(indexExtractor);
        final Map contents = index.getIndexContents();
        return contents == null ? Collections.emptySet() : new HashSet<Object>(contents.keySet());
    }

    private void filterValues(Set values) {
        if (filter == null) {
            return;
        }
        for (final Iterator it = values.iterator(); it.hasNext(); ) {
            if (!filter.evaluate(it.next())) {
                it.remove();
            }
        }
    }

    private static Map getIndexMap(Set set) {
        final Object entry = set.iterator().next();
        if (!(entry instanceof BinaryEntry)) {
            throw new UnsupportedOperationException("Only supports binary caches");
        }

        final BinaryEntry binaryEntry = (BinaryEntry) entry;
        return binaryEntry.getBackingMapContext().getIndexMap();
    }
}

Using the extractor

To use the extractor from client code you just need to call aggregate() on the NamedCache instance. Set the filter parameter to AlwaysFilter.INSTANCE to ensure that the aggregator runs at least once on each storage-enabled node. Then just create an instance of the IndexContentExtractor, passing through the extractor used to create the index and, optionally, any filters you might want to apply.

If you do use a filter then it will be passed the contents of the index, i.e. whatever the extractor used to build the index returns from its extract() call. If you want the filter to operate on the value as a whole, you'll need to pass IdentityExtractor.INSTANCE as the extractor in the filter's constructor. Alternatively, where the value held in the index is more complex, it is possible to pass POF or reflection-based value extractors to the filter to pick out sub-fields, if needed.

In the example below, the index in question stores String values and the code is pulling back any values stored in the index that contain a specific sub-string entered by a user:

    Filter filter = new LikeFilter(IdentityExtractor.INSTANCE, "%" + searchTerm + "%", false);
    Set<String> results = (Set<String>)cache.aggregate(AlwaysFilter.INSTANCE, new IndexContentExtractor(indexExtractor, filter));

You can find some examples in the tests on my GitHub account.

Notes on performance

Coherence can invoke aggregate() more than once per storage-enabled node if it so chooses, e.g. to process different partitions of data in parallel. The criteria it uses to determine how many times to call seem to vary, and I've not had the time to investigate the exact algorithm; even if I did, it may well change in a later release. When Coherence does call the aggregator more than once per node, each call will produce the same results. While this is not ideal from a performance point of view, the reduce phase, i.e. aggregateResults(), will remove these duplicates, and so functionality is not affected.

I’ve not found a way around this potential duplication, but then again I’ve never seen it be a performance problem, so haven’t felt the need to invest a great deal of time thinking about it. It would be possible to iterate through the key set of the cache to find a key that is owned by each storage node. But such an approach may well be slower due to having to pull the keys from the storage nodes, meaning more machine hops.

That's it. As ever, if you have any questions or comments then please holla!

Over and out.

Using custom cache implementations in Coherence 12

In this post I’ll be explaining how to implement custom schemes and caches using Coherence 12’s custom configuration namespaces. This is a powerful pattern, which allows you to override, extend or even replace functionality within Coherence, and you can do this right from the standard configuration file.

The Background: custom cache implementations in Coherence 3.7

I've just been through the endearing process of upgrading a large application built on top of 3.7.1 to the latest 12.1.2 version of Coherence and, at the same time, moving from Incubator 11 to version 12. There were several sticking points along the way, one being that we are using a custom implementation of LocalCache that provides a truly synchronous replicated cache (Coherence's own replicated-scheme being mostly asynchronous in its replication pattern). Said SynchronousReplicatedCache needs to know which InvocationService it should use to replicate changes, and the name of the cache itself. With 3.7.1 this could easily be achieved without the use of a custom namespace: all that was required was to pass the name of the invocation service and cache via init-params, e.g.

<local-scheme>
   <scheme-name>cluster-time-scheme</scheme-name>
   <class-name>fully.qualified.path.to.SynchronousReplicatedCache</class-name>
   <service-name>MyFirstSynchronousReplicatedService</service-name>
   <init-params>
      <init-param>
         <param-name>cache-name</param-name>
         <param-value>{cache-name}</param-value>
      </init-param>
      <init-param>
         <param-name>invocation-service-name</param-name>
         <param-value>TheInvocationService</param-value>
      </init-param>
    </init-params>
</local-scheme>

As you can see, this is pretty straightforward, simple and intuitive. Our SynchronousReplicatedCache extended Coherence's LocalCache and, because it also implemented the XmlConfigurable interface, Coherence passed the <init-params> XmlElement through to the replicated cache instance via its setConfig(XmlElement) method, from which we could extract the cache and invocation service names. Simples.
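
For those who haven't used it, here's a hedged sketch of what that 3.7-style setConfig(XmlElement) might look like; the element names match the snippet above, but the parsing code (and the cacheName/invocationServiceName fields) is illustrative rather than the exact original:

@Override
public void setConfig(XmlElement xmlConfig) {
    for (Object o : xmlConfig.getElementList()) {
        XmlElement param = (XmlElement) o;
        String name = param.getSafeElement("param-name").getString();
        String value = param.getSafeElement("param-value").getString();
        if ("cache-name".equals(name)) {
            this.cacheName = value;                  // '{cache-name}' already substituted by Coherence
        } else if ("invocation-service-name".equals(name)) {
            this.invocationServiceName = value;
        }
    }
}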

You may notice that the cache name is provided using the Coherence macro ‘{cache-name}’. In 3.7 Coherence automatically substitutes ‘{cache-name}’ with the name of the cache being instantiated, prior to calling setConfig on the cache.

This all worked well and life was good. Then along came Coherence 12.1.2, which completely changed the way the configuration file was handled…

The Problem: custom <init-param>’s are not supported in Coherence 12.1.2

My personal opinion is that this is a bug or an oversight on the part of the Tangosol team, and may potentially be fixed in a later version. The above XML configuration snippet is clean, simple and understandable. It's just unfortunate that it doesn't work anymore! Luckily for me, this gave me cause to look into new ways of doing things: custom namespaces…

The Solution: use a custom namespace and the new @Injectable attribute to provide your own scheme implementation

Custom namespaces have been around in the Incubator project for a while, but with the latest version they have moved into the core product. While it was certainly a bit more work to get this working with custom namespaces as opposed to the old <init-params>, the new way of working is far more powerful.

Custom namespaces work well with the new Injectable annotation. This pairing provides support not just for the old '{cache-name}' style macros and your own arbitrary parameters, but also for injecting any custom resource you fancy, assuming you've registered said resource with the ResourceRegistry. This is indeed a very useful pattern.
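
As a sketch (my assumption of typical usage, not code lifted from the migration itself), registering such a resource might look something like the following, after which an @Injectable setter taking a MyCustomResource could receive it:

// MyCustomResource is a hypothetical type used only for illustration
ConfigurableCacheFactory ccf = CacheFactory.getConfigurableCacheFactory();
ccf.getResourceRegistry().registerResource(MyCustomResource.class, new MyCustomResource());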

Unfortunately, at the time of writing, the documentation around custom namespaces, schemes, injectables etc. is all a little thin on the ground, which is the main reason for me blogging about it. Luckily, JK (a.k.a. TheGridMan) was available to give me some pointers and example code to get me started (thanks JK!). Indeed, JK has previously covered custom namespaces in his posts on Oracle NoSql and extending Coherence Services, though those posts don't cover how to create new schemes or pass custom arguments to custom cache implementations, as we'll do here.

What we’re trying to achieve is support for something like the following in our cache config:

<caching-schemes>

   <acc:synchronous-replicated-scheme>
      <acc:scheme-name>synchronousReplicated</acc:scheme-name>
      <acc:service-name>SynchronousReplicatedService</acc:service-name>
      <acc:invocation-service-name>MyInvocationService</acc:invocation-service-name>
   </acc:synchronous-replicated-scheme>

</caching-schemes>

The custom namespace will introduce a new 'synchronous-replicated-scheme', which will subclass Coherence's LocalScheme. It will need to support the standard <scheme-name> and <service-name> elements if it's to work with the scheme mappings, plus the custom <invocation-service-name> element that it needs. You may notice there is no <cache-name> element, and that's because it's not needed: thanks to the magic of @Injectable we can get that anyway, as we'll see later.

This same pattern can also be used to customise the backing map scheme of other caches (though probably not with a synchronous replicated implementation!). Such a customisation might look something like:

<caching-schemes>

   <distributed-scheme>
      <scheme-name>custom-distributed</scheme-name>
      <service-name>DistributedService</service-name>
      <backing-map-scheme>
         <acc:custom-local-scheme>
            <acc:someCustomParam>SomeValue</acc:someCustomParam>
         </acc:custom-local-scheme>
      </backing-map-scheme>
   </distributed-scheme>

</caching-schemes>

So let's take a look at what's needed. We'll start with an example cache implementation itself, then write a custom Coherence Scheme capable of instantiating our cache, and finally plug the scheme into the cache configuration by way of a custom namespace.

  1. Custom SynchronousReplicatedCache
    Some of you may be disappointed to hear that I'm not going to be covering how to implement a truly synchronous replicated cache in Coherence here. This post is not about that, though I can cover it later if enough people shout. Suffice to say that the implementation extends Coherence's in-built LocalCache, makes use of an invocation service to replicate its changes, and that it needs to know its own cache name. Apart from that, this implementation is only of use as an example within the context of this post.

    public class SynchronousReplicatedCache extends LocalCache {
       private String cacheName;
       private final String invocationServiceName;
    
       public SynchronousReplicatedCache(String invocationServiceName) {
          this.invocationServiceName = invocationServiceName;
       } 
    
       @Injectable(value = "high-units")
       public void setHighUnits(int highUnits) {
          super.setHighUnits(highUnits);
       }
    
       @Injectable(value = "expiry-delay")
       public void setExpiryDelay(int delayMs) {
          super.setExpiryDelay(delayMs);
       }
    
       @Injectable(value = "cache-name")
       public void setCacheName(String cacheName) {
          this.cacheName = cacheName;
       }
    }
    

    As you can see, the class accepts four different parameters: one through its constructor and three via injection. First, it accepts the parameter that was previously provided by the custom init-param element, the name of the invocation service to use, through its constructor. This will be extracted from the config file by our custom scheme (see below) and passed through as a constructor parameter.

    Next, there are two unexpected injectables: setHighUnits and setExpiryDelay. These may look superfluous, as they only call their parent class, but they add the Injectable annotation to the method, which is missing from LocalCache itself. With the annotations in place Coherence will inject the standard expiry-delay and/or high-units settings, if they are defined in the cache configuration.

    Finally, we have the setCacheName method, which Coherence will call with the appropriate name.

    At the time of writing there was little documentation covering which macro names are supported by @Injectable, and what is supported varies depending on where they are used. Be warned: if you add an injectable for something that is not supported then Coherence will still call your method, but will pass null! So, for now it's a matter of trial and error.

  2. Custom SynchronousReplicatedScheme
    The scheme class is the de-serialised form of the scheme's element(s) found within the Coherence cache config file. Coherence 12 provides in-built implementations for all in-built scheme types, e.g. DistributedScheme, ReplicatedScheme, InvocationScheme, etc. It is the job of the scheme to hold all the configuration parameters defined within the scheme's XML config and to instantiate the appropriate cache or services when required, e.g. when a cache is requested for the first time. For this example we're going to extend the in-built LocalScheme, as our example cache implementation extends LocalCache.

    public class SynchronousReplicatedScheme
       extends LocalScheme
       implements ParameterizedBuilder<LocalCache> { 
    
       private String invocationServiceName;
    
       @Injectable(value = "invocation-service-name")
       public void setInvocationServiceName(String name) {
          this.invocationServiceName = name;
       }
    
       @Override
       public ParameterizedBuilder<LocalCache> getCustomBuilder() { return this; }
    
       @Override
       public LocalCache realize(ParameterResolver resolver, ClassLoader loader, ParameterList params) {
          final SynchronousReplicatedCache cache = new SynchronousReplicatedCache(invocationServiceName);
          initialCacheWithStandardInjectables(cache, resolver, params);
          return cache;
       }
    
       private void initialCacheWithStandardInjectables(LocalCache cache, ParameterResolver paramResolver, ParameterList params) {
          final ChainedParameterResolver chainedResolver = new ChainedParameterResolver(paramResolver, new ResolvableParameterList(params));
          final ResourceResolver resourceResolver = ResourceResolverHelper.resourceResolverFrom(chainedResolver, new NullParameterResolver());
          final Injector injector = new SimpleInjector();
          injector.inject(cache, resourceResolver);
       }
    }
    

    The scheme itself is pretty straightforward. First there's the injectable for the invocation service name, which Coherence will call at run time.

    Next, we override LocalScheme's getCustomBuilder(). This method is called by Coherence to obtain the ParameterizedBuilder capable of instantiating caches that belong to this scheme. This could be a separate class, though in this case the scheme implements the ParameterizedBuilder interface itself, and hence returns itself.

    The realize() call from the builder interface creates a new instance of our replicated cache, passing in the invocation service name. Before returning the shiny new cache instance to Coherence there's a little bit of magic code (i.e. code which is not really covered within the Coherence docs), which ensures the standard high-units and expiry-delay injectables on the instance are called if such elements exist in the config. Strangely, the cache-name injectable would be called even without this call. It's all a bit voodoo…

    That's it – that's your custom scheme and builder made. There's not a lot of code to it, is there? Next, we need to plug this into the cache config via a custom namespace handler:

  3. Custom namespace handler
    The final piece of the puzzle is to introduce the custom namespace in the config file and our own namespace handler in the code.

    <?xml version="1.0"?>
    <cache-config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns="http://xmlns.oracle.com/coherence/coherence-cache-config"
       xmlns:acc="class://org.acc.coherence.config.example.custom.scheme.CustomNamespaceHandler">
     ...
    </cache-config>
    

    On line 4 of the config file above you can see the introduction of a new namespace ‘acc’ and its corresponding handler class CustomNamespaceHandler. The handler class is simplicity itself, as all it is required to do is to register a handler for our custom synchronous replicated cache scheme:

    public class CustomNamespaceHandler extends AbstractNamespaceHandler {
        public CustomNamespaceHandler() {
            registerProcessor("synchronous-replicated-scheme", new CustomizableBuilderProcessor<>(SynchronousReplicatedScheme.class)); 
        }
    }
    

And that's it done. This may not be as quick to get up and running as the previous custom init-params, but nor is it rocket science. What's more, the above example really only uses primitive types as injectables, but the framework supports custom types (which are picked up from the resource registry, providing a very powerful configuration and dependency injection framework) and macros (macros being things like {cache-name}, {backing-map-context}, etc.). Examples of all of this are available on my GitHub project here.

Personally, I'd have liked to see support for custom init-params continue, if just to smooth the migration, but I also think the work the Coherence team has done on the configuration improvements has really paid off, resulting in a very clean and highly customisable system.

That's it for another post. Thanks for reading! If you have any questions on the above then please post them in a comment. Otherwise, until the next time…

Over and Out.


Temporal Versioning in Coherence

Storing the versions of an entity that make up its history within a Coherence cache is a common pattern. Extending the pattern to support temporal queries, i.e. queries evaluated against, and returning, the different versions of entities, as they existed at previous points in time, requires a little more work and has generally relied on the mutation of metadata. Such data mutations are generally a bad thing, as they increase code complexity and the chance of data corruption. In this post I’ll propose an alternative approach that makes use of custom Coherence indexes in place of the mutations, thereby allowing the data to remain immutable and taming complexity.

Update: While writing this post and its associated code, I was building on top of the patterns documented by Ben Stopford in his blog post on versioning in Coherence. Unbeknownst to me at the time, Alexey Ragozin had previously covered what looks to be a very similar concept to what I'm about to share, in his post on time-series indexes for managing versioned data. Hats off to Alexey. What is it they say about great minds and idiots? Still, as I've pretty much finished the post I thought I may as well put it out there.

First, let’s start with a little background on versioning…

Versioning Patterns

Using Coherence to cache not only the latest version of an entity, but its history as well, is a powerful pattern. It’s the first step to building an immutable data fabric within Coherence. It allows not only the current state of an entity to be queried and retrieved, but also the state of the entity at any previous point in time.

By way of an example, take a warehouse stock control system, where each item in the warehouse is assigned a unique stockId, and which has a requirement to generate a weekly stock level report and to allow people to view historic reports.

In a traditional system, without versioning, data about the stock might be stored in a database or key-value store by its stockId. As stock levels changed, the corresponding entries would be updated, effectively erasing any history of previous states. Each week the system would build up a snapshot of the current state of the stock, generate the report and archive it off somewhere. The report is likely to be relatively large as it must make a copy of the data, so that users can view historic reports as the system keeps running and stock levels change.

What is more, any mistakes found in the data after a report has been generated will probably need a great deal of manual intervention to correct, with any reports generated after the mistake was made having to be re-run. Such corrections are generally laborious and error-prone. The further back the error, the more manual effort is required to correct it.

So let’s look at some ways in which we can improve the design of the system…

Simple Versioning

The simplest versioning pattern is usually just to add a version number to the business key used to store the entity. In our example the data would be keyed against the combination of a stockId and a version number. Each time the data associated with the stock needs to be updated, e.g. as new stock comes in, or stock is sent out to fulfil a customer's order, the new stock level is stored as a new entry with an incremented version number. In this way the system builds up a series of versions for each item under stock control as the data changes over time.

simple versioned entry

The weekly stock report now needs to store nothing more than the full set of versioned keys for the stock items. As updates to the stock come in, new versions will be created, so the report does not need to take a full snapshot of the data: any previous report can be fully rebuilt from just this set of versioned keys, as the underlying data is immutable.

Unfortunately, managing the correction of reports when errors in the stock levels are corrected will still be largely a manual process.

Note: To still allow key based access to an entity using the business key alone, (which is a common requirement and one you’d want to keep very quick and efficient), then one of the patterns outlined in Ben’s post on the subject can, and should, be used.

Temporal Versioning

With simple versioning it's not possible to query the system for the stock levels at a specific point in time, as the system doesn't track when each version was the live version. Temporal versioning fixes this shortcoming by storing the timestamp at which each version arrived. As new versions arrive and supersede previous versions, the system builds up a time-line of versions, with each version having a time window in which it was considered the latest, or live, version. This makes it possible to retrospectively ask which version of an entity was live at any particular point in time.

temporal versioned entry

Our stock report now needs to keep nothing more than the timestamp at which the report was generated. From this, the system is able to gather the state of the system at that point in time and rebuild the report. In fact, the system can now generate the report for any arbitrary previous point in time too.

Still, even with temporal versioning the system doesn't elegantly deal with retrospective 'corrections' to the stock levels as it stands. To solve that side of the problem a different design change needs to be made. All we've effectively changed so far is the key by which data entries can be retrieved. To enhance the system to elegantly handle data corrections we'd probably look to change what's in the value of the entry from a snapshot of the current state, i.e. the current stock level, to a delta, i.e. a positive or negative adjustment to the previous stock level. So, for example, when new stock comes in a new entry is added with a positive stock adjustment record, and when stock goes out to customers a negative adjustment record is recorded. Such records often reference the related delivery report or customer order that was the source of the change. What we're discussing here is effectively an event sourcing model, which Martin Fowler covers in his own blog and, as it's pretty much off topic for this post, I'll leave you to read Martin's posts on the subject if you so please. In addition to event sourcing, a system that needs to support changes to historical data may well also benefit from support for bi-temporal data, which I'll leave as another exercise for the reader – though as a hint, the custom indexes covered below can be used to index as many temporal dimensions as you fancy having (or can get your head around!).

As you can see, storing changes to systems as a time-series of facts about how the system changes is a powerful pattern. Now let’s take a look at how you might implement this on top of Oracle Coherence.

Implementing versioning within Coherence

Simple versioning in Coherence

The implementation details of simple versioning in Coherence are nothing new. I learned about them from posts such as Ben Stopford’s Coherence Versioning, and I won’t go into any detail here. Suffice to say that all flavors of the pattern use key affinity to ensure all versions of an entity reside on the same Coherence node, new versions are routed through a single latest entry to remove race conditions, and a trigger is used to set the version of entries. The types used as the key and the value within the grid would look something like this:

@Portable
public class VKey {
    @PortableProperty(value = 1)
    private int version;

    @PortableProperty(value = 2)
    private Object domainKey;

    @Override
    public int getVersion() { return version; }

    @Override
    public Object getDomainObject() { return domainKey; }

    ...
}

@Portable
public class VValue {
    @PortableProperty(value = 1)
    private int version;

    @PortableProperty(value = 2)
    private Object domainValue;

    @Override
    public int getVersion() { return version; }

    @Override
    public Object getDomainObject() { return domainValue; }

    ...
}

If you’re not familiar with these patterns, then I strongly suggest you read Ben’s post before carrying on.

Temporal versioning in Coherence

Extending the model to include a temporal dimension requires an arrived timestamp to be captured as part of each value's metadata. A first attempt at this might look like:

@Portable
public class VValue {
   @PortableProperty(value = 1)
   private int version;

   @PortableProperty(value = 2)
   private Object domainValue;

   @PortableProperty(value = 3)
   private Date arrived;

   @Override
   public int getVersion() { return version; }

   @Override
   public Object getDomainObject() { return domainValue; }

   @Override
   public Date getArrived() { return arrived; }

...
}

However, there is a problem with this basic approach: while capturing the timestamp at the point of entry into the system is simple enough (problems of distributed clocks aside), implementing temporal queries on top of the data is problematic. It is easy to add the lower-bound constraint, retrieving all entries whose arrived timestamp is less-than-or-equal-to the required snapshot. Unfortunately, adding the upper bound is not so easy, as the time at which a version is superseded by another is not stored on the entry itself, but is implied by the arrived timestamp of the subsequent version or, in the case of the currently live version, is open-ended.

The problem of no upper bound for temporal queries is most commonly resolved by adding a superseded timestamp to each entry and setting it to match the arrived timestamp of the superseding version, when that arrives. With this approach it is now trivial to add an upper-bound constraint alongside the lower bound, so that only objects that were live at the specified snapshot time are retained within the data set, i.e.:

arrived <= snapshot && snapshot < superseded

To improve the performance of such a common filter operation, indexes would typically be added to both the arrived and superseded fields.

The problem with the above approach is that it copies data, i.e. the arrived timestamp from one entry to the superseded field of another, which can become out of sync, and it mutates an existing entry (the previous version) when adding a new one, which opens up many more possibilities for data corruption and increases code complexity.

The solution I'm proposing is to not store the superseded timestamp on the entity itself, but to make use of Coherence's custom index support to build a time-line index designed to support temporal queries. The index will store a time-line of versions per business key, e.g. the stockId in our example above. The time-line will consist of the set of versioned keys, ordered by their arrived timestamp (or indeed any other timestamp). This is relatively simple, requiring only a custom index, extractor and filter. So let's see some code…

1. Create an IndexAwareExtractor to install your custom index

Indexes in Coherence are tied to an Extractor. For normal indexes this needs to be the same extractor that is used by any filter that you expect to make use of the index. It is possible to add custom indexes and extractors that are compatible with the in-built Coherence filters, but that is not what we are trying to achieve here, as no in-built filters are going to be able to understand the time-line stored within the index, and so the role of the extractor is just to install and un-install the index.

@Portable
public class TemporalExtractor implements IndexAwareExtractor, Serializable {

   @PortableProperty(value = 1)
   private ValueExtractor businessKeyExtractor;

   @PortableProperty(value = 2)
   private ValueExtractor timestampExtractor;

   @PortableProperty(value = 3)
   private ValueExtractor versionExtractor;

   public TemporalExtractor(ValueExtractor businessKeyExtractor, ValueExtractor timestampExtractor, ValueExtractor versionExtractor) {
      this.businessKeyExtractor = businessKeyExtractor;
      this.timestampExtractor = timestampExtractor;
      this.versionExtractor = versionExtractor;
   }

   @Override
   public MapIndex createIndex(boolean ordered, Comparator comparator, Map mapIndex, BackingMapContext context) {
      MapIndex index = new TemporalIndex(this, context);
      mapIndex.put(this, index);
      return index;
   }

   @Override
   public MapIndex destroyIndex(Map mapIndex) {
      return (MapIndex) mapIndex.remove(this);
   }

   // Getters for the extractors, plus equals() and hashCode() (required, as the
   // extractor is used as the key into the index map), are omitted for brevity.
}

The extractor takes the ValueExtractors that the index will need to extract the business key from the entry’s key, and both the version and timestamp from the entry’s value. These are made available to our custom TemporalIndex which is covered below.

The only methods of note within the extractor itself are createIndex() and destroyIndex(), which are called by Coherence as our index is added and removed, respectively (or on new nodes as they join the cluster). The create method adds an instance of our custom index to the cache's index map, keyed off the extractor, while the destroy method removes that same entry. Because the extractor is used as the key into the index map, it is important that both equals() and hashCode() are implemented correctly.

2. Create a custom index that tracks the timeline of each object

The index will hold a map of business key to the time line of versions of that key. The time line stores the temporal points, i.e. the arrived time stamps, at which each new version arrives and the corresponding entry’s key.

Temporal reverse index

So let's first create something to hold the time line. The time line holds an ordered map of timestamps to their keys. However, things are slightly complicated by the fact that it is possible to have two versions of an entity at the same timestamp, so the class needs to be able to accommodate this, which it does by storing not a single entry against each timestamp, but a set of entries, ordered by version.

public class TimeLine {
   private final Comparator comparator;
   private final TreeMap<Long, TreeSet<Object>> timeLine;

   public TimeLine() {
      this(null);   // null comparator => natural ordering of keys sharing a timestamp
   }

   public TimeLine(Comparator comparator) {
      this.comparator = comparator;
      this.timeLine = new TreeMap<Long, TreeSet<Object>>();
   }

   public void insert(Object key, long timestamp) {
      Set<Object> entries = getEntriesAtTimestamp(timestamp, true);
      entries.add(key);
   }

   public boolean remove(Object key, long timestamp) {
      Set<Object> keys = getEntriesAtTimestamp(timestamp, false);
      if (keys == null || !keys.remove(key)) {
         return false;
      }

      if (keys.isEmpty()) {
         timeLine.remove(timestamp);
      }

      return true;
   }

   public boolean isEmpty() {
      return timeLine.isEmpty();
   }

   public Object get(long snapshot) {
      Map.Entry<Long, TreeSet<Object>> floor = timeLine.floorEntry(snapshot);
      return floor == null ? null : floor.getValue().last();
   }

   private Set<Object> getEntriesAtTimestamp(long timestamp, boolean createIfNecessary) {
      TreeSet<Object> keys = timeLine.get(timestamp);
      if (keys == null && createIfNecessary) {
         keys = new TreeSet<Object>(comparator);
         timeLine.put(timestamp, keys);
      }
      return keys;
   }
}

Next, we'll need the custom reverse index itself. To conform to what Coherence expects, the index needs to implement the MapIndex interface. However, most of the methods on the interface will never be called, as the index will only be used from our custom filter(s), so the methods used by the standard Coherence filters can throw an UnsupportedOperationException (and have been excluded from the code below to improve readability).

public class TemporalIndex implements MapIndex {
   private final TemporalExtractor extractor;
   private final Serializer serialiser;
   private final Map<Object, TimeLine> timeLineIndex;

   public TemporalIndex(TemporalExtractor extractor, BackingMapContext ctx) {
      this.extractor = extractor;
      this.serialiser = ctx.getManagerContext().getCacheService().getSerializer();
      this.timeLineIndex = new HashMap<Object, TimeLine>();
   }

   @Override
   public void insert(Map.Entry entry) {
      TimeLine timeLine = getTimeLine(entry, true);
      addToTimeLine(entry, timeLine);
   }

   @Override
   public void delete(Map.Entry entry) {
      TimeLine timeLine = getTimeLine(entry, true);
      removeFromTimeLine(entry, timeLine);
   }

   public TimeLine getTimeLine(Object fullKey) {
      Object businessKey = extractBusinessKeyFromKey(fullKey);
      return getTimeLine(businessKey, false);
   }

   private TimeLine getTimeLine(Map.Entry entry, boolean createIfNeeded) {
      Object businessKey = extractBusinessKeyFromEntry(entry);
      return getTimeLine(businessKey, createIfNeeded);
   }

   private TimeLine getTimeLine(Object businessKey, boolean createIfNeeded) {
      TimeLine timeLine = timeLineIndex.get(businessKey);
      if (timeLine == null && createIfNeeded) {
         timeLine = new TimeLine();
         timeLineIndex.put(businessKey, timeLine);
      }

      return timeLine;
   }

   private void addToTimeLine(Map.Entry entry, TimeLine timeLine) {
      Long arrived = (Long) InvocableMapHelper.extractFromEntry(extractor.getTimestampExtractor(), entry);
      timeLine.insert(getCoherenceKey(entry), arrived);
   }

   private void removeFromTimeLine(Map.Entry entry, TimeLine timeLine) {
      Object arrived = extractFromEntry(extractor.getTimestampExtractor(), entry);
      timeLine.remove(getCoherenceKey(entry), arrived);
      if (timeLine.isEmpty()) {
         timeLineIndex.remove(extractBusinessKeyFromEntry(entry));
      }
   }

   private Object extractBusinessKeyFromEntry(Map.Entry entry) {
      return InvocableMapHelper.extractFromEntry(extractor.getKeyExtractor(), entry);
   }

   private Object extractBusinessKeyFromKey(Object fullKey) {
      if (extractor.getKeyExtractor() instanceof PofExtractor) {
         PofExtractor keyExtractor = (PofExtractor) extractor.getKeyExtractor();
         PofNavigator navigator = keyExtractor.getNavigator();
         PofValue pofValue = PofValueParser.parse((Binary) fullKey, (PofContext) serialiser);
         return navigator.navigate(pofValue).getValue();
      }
      return extractor.getKeyExtractor().extract(fullKey);
   }

   private static Object getCoherenceKey(Map.Entry entry) {
      if (entry == null) {
         return null;
      }

      return entry instanceof BinaryEntry ? ((BinaryEntry) entry).getBinaryKey() : entry.getKey();
   }
}

So there’s a bit of complexity in there around handling both standard Java serialisation and Coherence’s POF, (which can be removed if you only use one), but the crux of the class is the implementation of the MapIndex insert() and delete() methods, which update the index as the state of the cache changes. By the way, update() throws an UnsupportedOperationException in my implementation, as the data is immutable, right!? But it’s easy enough to enhance it to support mutation should you want.
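
For completeness, here’s a rough sketch of how a few of those omitted members might look. Exactly which methods you need to stub out depends on your Coherence version, so treat this as an illustration rather than a definitive list.

   // Illustrative only: these MapIndex methods are never invoked by our custom
   // filters, so they simply refuse to do any work.
   @Override
   public void update(Map.Entry entry) {
      throw new UnsupportedOperationException("Entries in this cache are immutable");
   }

   @Override
   public Map getIndexContents() {
      throw new UnsupportedOperationException();
   }

   @Override
   public ValueExtractor getValueExtractor() {
      throw new UnsupportedOperationException();
   }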

The class also has a public getTimeLine(Object fullKey) method, which will be used later by our custom filters.

3. Create some custom filters to use the new temporal index

The custom filters that will use our custom index must implement IndexAwareFilter. Below is a simple SnapshotFilter class, which limits the result set of an operation to the versions of the entities that were live at a particular point in time; it is easy enough to add other filters, for example one that brings back a range of versions. (As with the index, the interface methods not exercised by this flow, such as evaluate() and calculateEffectiveness(), are omitted for readability.)

@Portable
public class SnapshotFilter implements IndexAwareFilter, Serializable {
   @PortableProperty(value = 1)
   private TemporalExtractor extractor;

   @PortableProperty(value = 2)
   private long snapshot;

   // A public no-arg constructor is required if you're using POF annotation serialisation.
   public SnapshotFilter() {
   }

   public SnapshotFilter(TemporalExtractor extractor, long snapshot) {
      this.extractor = extractor;
      this.snapshot = snapshot;
   }

   @Override
   public Filter applyIndex(Map indexMap, Set keys) {
      TemporalIndex index = getIndex(indexMap);
      // Remove any key whose entity was not the live version at the snapshot time.
      Iterator it = keys.iterator();
      while (it.hasNext()) {
         Object key = it.next();
         if (!matches(key, index)) {
            it.remove();
         }
      }
      // Returning null tells Coherence no further filtering of the remaining keys is needed.
      return null;
   }

   private TemporalIndex getIndex(Map indexMap) {
      Object index = indexMap.get(extractor);
      if (index instanceof TemporalIndex) {
         return (TemporalIndex) index;
      }

      throw new UnsupportedOperationException(...);
   }

   private boolean matches(Object fullKey, TemporalIndex index) {
      TimeLine timeLine = index.getTimeLine(fullKey);
      return timeLine != null && fullKey.equals(timeLine.get(snapshot));
   }
}

And that’s it. Oh, mind you, it might be handy if you install your index…

4. Installing and using your shiny new temporal index

Is as easy as…

   // Create it:
   TemporalExtractor extractor = new TemporalExtractor(businessKeyExtractor, createdTimestampExtractor, versionExtractor);
   cache.addIndex(extractor, false, null);

   // Use it:
   Set results = cache.keySet(new SnapshotFilter(extractor, timestamp));
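
If you’re wondering where those extractors come from, here’s a hedged sketch of how they might be built for a POF-serialised entity. The POF indexes, and the assumption that the business key lives in the cache key while the timestamp and version live in the value, are purely for illustration; adjust them to match your own domain model.

   // Illustration only - the POF indexes below are assumptions:
   ValueExtractor businessKeyExtractor =
         new PofExtractor(null, new SimplePofPath(1), AbstractExtractor.KEY);  // business key, read from the cache key
   ValueExtractor createdTimestampExtractor = new PofExtractor(Long.class, 2); // created timestamp, read from the value
   ValueExtractor versionExtractor = new PofExtractor(Integer.class, 3);       // version, read from the value

These three would then be the constructor arguments passed to the TemporalExtractor in the snippet above.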

Congratulations, you now have a temporal index that doesn’t rely on mutating the entries within your cache, and take it from me: such mutations are a constant source of bugs.

The full source code for this is available on my GitHub account.

What is not covered above is how to change the system to use an event-sourcing pattern or bi-temporal queries, which was the second part of the design changes we discussed. However, event-sourcing and all of its benefits are covered in great detail on Martin’s blog, and adding support for bi-temporal queries using the custom index is actually fairly trivial.

This has been a fairly lengthy post, so I’ll leave it there. Please feel free to ask any questions, (or point out any mistakes!). Happy coding all.

Over and out.

(I’d like to say thanks to Harvey Raja from Oracle for the discussions we had on this subject, which led to the design going down the routes it did – thanks Harvey!).

More BAM! for your buck

Integrating a classic business application monitoring solution from one of the major players involves a considerable commitment from companies and SMTs, requiring hefty infrastructure and personnel budgets, a drawn-out implementation phase, and not inconsiderable ongoing OpEx costs. In this post I’ll be putting forward an alternative lightweight BAM solution, one that also leverages full stack monitoring to allow root cause analysis when your KPIs drop.

This post follows on from my previous post ‘The truth about your logs’ and looks at ways in which you can ingrain monitoring and supportability into the mainstay of your development life cycle.

But first… A question

Before we get down to it, I’m interested to hear how many of you work in organisations that produce applications with little or no monitoring, so let’s start with a poll. (While I realise this fledgling blog has few readers, I’m hoping in time we’ll rack up enough votes to see a trend).

 

Accessible Business Application Monitoring

In some respects, as the title of this post suggests, part of what I’m about to talk about fits into the field of ‘Business Application Monitoring’, or BAM for short, but I’m not going to be using a dedicated off-the-shelf BAM solution, such as those available from the major IT suppliers like Oracle, IBM and WSO2. When I looked into such products, admittedly a year or two ago, they involved setting up some serious infrastructure and were generally centred around a large bottleneck of a database. Some vendors were looking at next gen solutions built around more scalable tech such as Cassandra, MongoDB and Hadoop, but these still required infrastructure for your monitoring solution to scale up alongside your production environment, and a team of specialists to manage them.

BAM!

What I ended up choosing at the time, and for several clients since, was to leverage a monitoring application the client already had, Logscape, and use it to build a more lightweight BAM solution on top of their log data, (though as the BBC like to say, ‘Other distributed log analytics tools do exist’). This may lack some of the enterprise level functionality on offer from the top vendors, but it required a much smaller commitment from the business to get up and running, and hence was more likely to float and get funding.

Logscape didn’t require extensive infrastructure to be purchased and racked, as it made use of existing infrastructure; it didn’t require specialist knowledge in support staff for products such as Hadoop and Cassandra, which this client didn’t have at the time; it automatically scaled with the company, as data was left, and searched, in-place on individual servers; and, perhaps most importantly, it was quick to get up, running and returning value to the business. Additionally, because Logscape is not a specialist BAM solution, but a full stack monitoring tool, it could offer some things a specialist solution couldn’t, as you’ll see.

10 steps to improve your applications’ NFRs

Hang on a minute, I thought we were talking about BAM, monitoring and ‘support-ability’, not all NFRs? Well, yes we are, but in my experience better monitoring, looked at by more people, inevitably starts to improve your application quality across the board, as whole categories of previously unnoticed issues are brought into the light for everyone to see.

So, if you would like to see the benefit proper monitoring can bring, have a read of the steps I’ve listed below and look to introduce monitoring into your organisation and start seeing the value it brings. Once you’re up and running, you’ll be amazed how quickly you start getting buy-in from others within IT, followed quickly by other areas of the business. Plus, your Operations team will love you!

If, while reading through some of the points, you find yourself thinking ‘great, but how do I do this?’, then fear not: I’ll likely be covering some of the points in more detail in future posts, (so let me know if there are any you’re particularly interested in seeing first).

  • Build an ‘environment health’ page – Probably the simplest first step is to create an ‘environment health overview’ page within your monitoring tool of choice. I would suggest this should at least include graphs showing warnings & errors by type and by component. Likewise, detect and graph exceptions by type and component. If you can, add in any appropriate component life cycle events, such as things starting, stopping or changing mode. This allows users to quickly correlate issues being reported in one component with another changing state.
    Finally, think about getting something on there to highlight struggling infrastructure, e.g. low memory, high CPU, swap utilisation or servers running out of disk. This combined view will give a lot of different people a quick landing page, which they can use to check the health of any environment, and something you can add to as you go on. If you can, get each environment’s health page up on a big monitor that everyone can see, so that issues have even less chance of going unnoticed.
  • Don’t release with errors or exceptions being reported – What?! Seriously?? Yes! Now you’ve got your health page, make it part of QA’s test/release checklist. Releases should not go ahead while any component is spitting out errors or exceptions.

    Does this sound too strict? Yes, it may take you a little effort to tidy things up, but try it and see the quality of your product sky rocket! QA should regularly be checking your env health page during testing and raising tickets against dev to clean up issues. Believe me, this does work. The bottom line is: if an application is logging an error, then either something is wrong and needs to be fixed, or it’s not, in which case why was it logging an error? Either way, it should be fixed before release.

    Once you’ve got errors and exceptions in check, bring warnings into the fold and start cleaning these up too. At the end of your initial clean up process you will have succeeded in removing a lot of the noise from your logs, which makes them all the more valuable. Cleaning up your logs stops ‘can’t see the wood for the trees’ issues in prod, where true errors are hidden among known, expected & ignored ones.

  • Turn on garbage collection logs – assuming you’re running applications that use garbage collection and you haven’t already… Turn them on and configure them to be fairly verbose, (there’s an example set of JVM flags after this list). Make this data available in your log analytics tool, so that it’s available for Developers and DevOps teams to use.
    Being able to aggregate the GC statistics of 60 web servers under load is a wonderfully powerful thing! Developers that make use of such data will gain fresh insight into how their applications are performing, and only good things can come of that.
  • Build an ‘Application Memory’ page – To really leverage the information in those hot-off-the-press GC logs, build one or more pages to slice and dice the information being output. Go for at least a single overview page covering every component in your estate, (you may be surprised when this immediately flags up several components constantly running at high occupancy and spending most of their time GC’ing!). Next, look to build specific pages with more customisation around particular groupings of components, e.g. a page covering all your web servers, another for your data grid, or service layer, etc.
  • Add instrumentation logging – you’ve now covered the basics, so it’s time to start turbo charging your logs and adding the foundations to your BAM solution: take your key business operations, e.g. ‘create user’, ‘process payment’, etc, and high level technical tasks, e.g. ‘save x’, ‘load y’, or ‘transform z’ and put machine friendly instrumentation logging around these. Instrumentation logging should have entries for when an operation is starting, any key steps along the way, and when it is finished.
    Each log entry should include a short name for the task being performed, optionally one or more levels of categorisation, and key information about the request, e.g. which user / component instigated the request, what were the parameters of the request, how long did it take, whether it was successful or not, and if not, was there an error code or description? Then list any pertinent stats and KPIs about the operation, e.g. number of objects, throughputs, rates, volumes, ratios, etc. Standardise the format of these logs across all your applications, e.g. by building a shared instrumentation library, (there’s a sketch of such a helper after this list). Done right, instrumentation logging can provide unparalleled insight into what large distributed applications are doing, and it’s this logging that is key to enabling you to build a lightweight BAM solution on top of your log data. This information can also be invaluable for spotting issues and investigating them when they happen.
  • Build a ‘tasks’ page – Really harness those instrumentation logs by building a generic cross-estate monitoring page that visualises the tasks/business operations that are running across your estate. You can visualise business and technical KPIs such as sales per hour, transaction throughputs, request latencies, aggregate processing times, success/failure ratios, failure reasons, etc – all sliced and diced by component, time and/or operation. Once up and running you may soon identify where you want to create other more specialised pages to drill into certain types of tasks, business units, features, etc, or to provide different views of the data to different types of users. After all, Operations probably wants a different view to your business stakeholders.
    Development teams can use these views to understand how load is being distributed across a server farm, how the profile of the KPIs changes under different scenarios, etc. QA teams can use such views to compare response times, throughput, failure rates, etc with previous runs and releases, allowing you to more easily spot performance issues early. Operations teams will be thankful, and your stakeholders and managers will be amazed and transfixed by their real-time business process dashboards. Having the ability to aggregate processing time for tasks and operations can even be used to build up reports on the monetary cost of business operations or features: now wouldn’t that be a powerful thing!
  • Cross pollinate your monitoring pages – Now you’ve got pages covering environmental issues, garbage collection, distributed processing, business operations, etc, start to take high-level overviews from one area into another, to allow users to readily and easily correlate cause and effect. For example, bring the CPU and GC load graphs into your ‘tasks’ page, so that users can correlate their effect on latency, or indeed whether specific tasks unduly affect the CPU load or GC activity. Now questions such as ‘Do we have one or two specific tasks that really hammer the GC and need profiling?’, or ‘Are one or more users hogging system resources to the detriment of others?’ can readily be answered.
  • Make your logs machine friendly – No one was looking at those application logs anyway, right? But now they are constantly being read and searched by your monitoring tool, so why not make it easier for your tools to do that? First, separate your logs: GC, instrumentation, 3rd-party library logs and your application’s own logs should all go to separate files. You can always use your monitoring tool to splice these back together into a single timeline; in the meantime it means your monitoring tool needs to search less data, as it knows where the data it wants is located.

    Standardise on certain conventions when logging data, both in the format used for things such as key-value pairs, e.g. key=value or key:value, and in the names of those common keys. Better still, knock up some simple logging wrapper libraries and APIs to make it easy for developers to follow these conventions. This is particularly true of instrumentation logs, which should rely heavily on such conventions and patterns.

  • Keep your logs – Don’t delete them! They now hold useful information… finally! Archive them somewhere where they can still be searched and indexed. Ideally you’ll be able to compress them, depending on your tool set. Such history can come in handy when trying to track down when a specific issue was introduced, comparing performance runs, or for capacity planning and management.

    Aim to keep at least 18 months of log data. Yes, you read that right! You wouldn’t throw away historic customer data, so why throw away metrics on what your applications are or were doing? Storage is cheap in comparison to the man days this data can save. If storage is an issue, look to thin out your logs as they age e.g. if you have a pool of identical servers, only keep logs from one or two, or, keep logs from alternate days, etc. At least until you get a chance to buy some more storage that is…

  • Configure alerting & reports – Now you have this wealth of information at your fingertips, don’t wait for someone to go and look at it! Set up appropriate alerts and reports, integrated with whatever notification channels you use, be it email, instant message, or some incident management software. But remember the golden rule with alerting: less is more! Only send alerts to a person if they’ve either asked for them or they need to help resolve the issue. Don’t spam your users!
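
For the ‘turn on garbage collection logs’ step, a reasonable starting point on a HotSpot JVM of this era (Java 7/8) is something like the flags below. The log path is just an example, and you should check the exact flags against your own JVM version and vendor:

java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -XX:+PrintTenuringDistribution \
     -Xloggc:/var/log/myapp/gc.log \
     -XX:+UseGCLogFileRotation \
     -XX:NumberOfGCLogFiles=5 \
     -XX:GCLogFileSize=10M \
     ...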
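
And for the instrumentation logging and machine-friendly convention steps, here’s a minimal sketch of the sort of shared helper I have in mind. It’s written against SLF4J, and the class, method and key=value field names are illustrative assumptions rather than a prescribed standard:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal sketch of a shared instrumentation helper that emits key=value entries.
public final class Instrumentation {
   private static final Logger LOG = LoggerFactory.getLogger("instrumentation");

   private final String task;
   private final long startNanos = System.nanoTime();

   private Instrumentation(String task) {
      this.task = task;
      LOG.info("Task={}, Event=started", task);
   }

   public static Instrumentation start(String task) {
      return new Instrumentation(task);
   }

   public void succeeded(String kpis) {
      LOG.info("Task={}, Event=finished, Result=Success, Duration={}ms, {}", task, elapsedMs(), kpis);
   }

   public void failed(String errorCode, String description) {
      LOG.error("Task={}, Event=finished, Result=Failed, Duration={}ms, Error_code={}, Error_desc={}",
            task, elapsedMs(), errorCode, description);
   }

   private long elapsedMs() {
      return (System.nanoTime() - startNanos) / 1000000;
   }
}

Used something like this:

Instrumentation op = Instrumentation.start("Persist_trade");
try {
   // ... do the actual work ...
   op.succeeded("Rows=1");
} catch (RuntimeException e) {
   op.failed("1239", e.getMessage());
   throw e;
}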

A word about log volumes

The natural reaction of many people upon reading the above advice will be to worry that so much logging is going to hinder application performance. It’s been my experience that this is simply not the case. Yes, you need to ensure you’re logging asynchronously, (though errors and above should obviously be synchronous and flush any pending logs), and of course you’re going to cause problems if you put oodles of logging around some critical small piece of logic that’s called thousands of times a second – but that is not what’s being suggested: instrumentation logging is about higher level operations, ones that generally involve systems collaborating and that, by their very nature, tend to take on the order of milliseconds to complete, or thereabouts. Such operations can easily support a level of instrumentation logging without fear of hurting performance.

Of course, that’s not to say you shouldn’t measure the impact it has, so that you can make an informed decision and work around any issues or tune your logging as needed. Though personally, if I had the choice between having to add another server to my pool in order to maintain throughput, and not having any idea what my application was up to, I know which I’d choose.

That’s it

It really is possible to change the direction of even the most stuck-in-its-ways development team using these steps, because your developers will no longer see monitoring as a hassle or a ‘nice to have’, but as an essential and useful tool. Don’t shy away from making the effort to add instrumentation logging either: start small by adding it to new code and functionality, and you’ll soon see its worth and be itching to retrofit it into key business operations. With its introduction, there are a lot of previously unanswerable questions whose answers you’re only a few minutes away from. But what’s even cooler is that, as you build up experience of both your tools and your data, you’ll find yourself thinking up new and wonderful questions and extracting answers that will add value or save money in ways you never thought possible – and that’s when the penny will really have dropped!

Well, that’s it for another post. I hope you’ve found it useful and not too dry. If you have any questions, then please ask them via the comments below, and don’t forget to share the love, (and this post).

Over and out.

The truth about your logs

The simple truth is that most developers don’t think about how their applications will be monitored or supported. They fail to add sufficient or useful logging to the applications they write. Most log entries they do add are little more than the log equivalent of spam emails. This makes the life of your Operations team much harder than it need be. In this post I look at why this is, what the impact of this can be and some ways of improving the situation.

Developers don’t write good logs

Take a look at any application’s log files, be it one written in house or one bought off the shelf, and you’ll clearly see it was written by humans, for humans: each log line is a little sentence that tries to impart some information about what the application is about to do, or has just done. Unfortunately, in general, it fails. More often than not, logs don’t deliver a cohesive description of what your application is doing any more than a jumbled up, shredded copy of Hamlet is a masterpiece. Logs invariably fail to deliver the important information and/or context of what those threads of execution are up to. Worse, such shortcomings often only come to light when the missing information is crucial to fixing a production incident. Let’s take a simple example:

2014-06-19T08:51:02,365 ERROR Failed to update the database. Error code 1239.


Are your drives full of spam?

Look familiar? Well done to the developer: at least they included the error code! But such a log line is somewhat lacking in information. How about which database? Even if the application only uses a single database, outputting the connection string would mean that a simple misconfiguration could be spotted quickly, or at least staff wouldn’t need to dig through config files to find connection strings. What about a little more context? What operation failed? Maybe it was executing some SQL…? If so, what was it? Finally, what about the bigger picture? Sure, we know we failed to update the db: but what does that mean? Were we trying to update a trade or delete a user? What was the business operation this failure related to, and what’s the fallout from this failure? It’s this kind of information that can be the difference between a five minute production outage and a protracted support call, one that eventually escalates to someone having to wake up some lucky developer at 3 am so they can look through the source code.

The common pattern of logging call stacks can go some way to improving things. Unfortunately, a call stack only gives half the picture at best: it may give a history of which way this thread of execution went at each junction, but it doesn’t provide any insight into the information it gathered along the way.

Enriching your logs with this additional information can make any investigation or root cause analysis so much smoother. What’s more, by making your log entries more machine, rather than human, friendly, you can really start to harness the power of your logs through the use of distributed log analytics tools, such as Splunk, Logscape or Kibana/Logstash, which can pick out and aggregate fields:

2014-06-19T08:51:02,365 ERROR Task: Persist_trade, Result: Failed, Duration: 39ms, database:jdbc:oracle:bob:@builder.com:1632:MainDB, sql:setTradePrice($0, $1), params:$0="1237";$1="12.398", Error_code:1239, Error_desc:trade does not exist

Why developers don’t write supportable code


What stops developers writing supportable applications?

The problem stems from the fact that developers don’t care about monitoring and logging. Most put it in either as a debugging aid, with each log line little more than an ‘I’m here’ marker, as an afterthought, or because they know they should and producing code without it would be frowned upon. They pay little heed to what information might be useful when someone else is later analysing the logs.

Most log entries are little more than the log equivalent of spam emails.

This is perpetuated by the simple truth that no one looks at logs until something goes wrong. By which point, the code is in production and it’s too late or costly to fix.  After all, who’s ever heard of QA checking application logs? Let alone raising a bug if those logs didn’t contain the right information.

The bottom line is that over 80% of the lines in most application logs are of little or no use.

OK, so I made that last figure up! But experience tells me I’m probably being overly generous.

Talking of things going wrong: another factor is that a lot of logging code is only executed under failure conditions, conditions that often aren’t covered by developer or QA testing scenarios and rarely crop up of their own accord. Even assuming the error handling code doesn’t cause its own problems, if the logs don’t provide enough information to resolve the issue, you’ve got a big problem, and that lucky support dev is getting an early morning call!

The failure of development teams to release a product that your Operations team can monitor and fix will generally result in a demoralised Ops team and a nice bit of ‘us and them’ mentality between teams. In a more experienced Operations team, in a more mature company, such shortcomings may get kicked back to the dev team to be resolved, but this is rarely the case. Even in organisations where this does happen, it’s now more costly to add, and bolting on monitoring after the fact is second rate at best.

This space between your Dev and Operations teams is where your DevOps team should sit. DevOps teams are growing in popularity, and with good reason. They help to bridge the gap between the two teams, and allow the development team to concentrate more on features while they cover more of the non-functionals, like monitoring and stability. Good as DevOps teams can be though, even an experienced one can’t turn an unsupportable pile of features into a production ready app.

The future’s orange: building supportable applications

“So given my development and ops teams hate each other, and throwing money at QA doesn’t seem to help: what can I do?”

I’ve no silver bullet: there is nothing I can write here that will ensure error handling code will contain all the necessary information come a Production failure, for example. That comes from an experienced dev team, who ensure test coverage of error handling code and understand the benefits of logging out sufficient context. Such wisdom has generally been gleaned at 3 am! What you can improve is the baseline quality and usefulness of your logs and the monitoring that sits on top. You can achieve this only by ensuring that it is used on a daily basis, across all your environments, by as wide a variety of people as possible.

Forming a DevOps team, if you don’t already have one, is a good first step. A proper DevOps team is built up of developers with a passion for NFRs and automation, who will likely evangelise about the importance of monitoring. Correctly integrated early in the development cycle, a good DevOps team can be an effective stick with which to beat your Dev team into meeting your NFRs. Sticks are great, but only to a point: we all know carrots work better…

For your developers to fall in love with monitoring there has to be something in it for them. Developers, and I’m talking as one, are lazy beasts by nature: only doing what’s necessary to get the job done – often rightly so. I’ve found that the most effective carrot for a Developer is something that makes their life easier. So, the key is to make your monitoring solution as useful to your development & QA teams as it is to your Ops team. Dev teams need tools to help them work out what applications are doing, when things are going wrong, how one version of the software compares to another, and where the bottlenecks are. QA teams need tools to tell them the general health of an environment, and to alert them when it’s deteriorated, both before and during testing cycles. In this world of increasingly distributed software systems, the best tools for this are no longer debuggers or grepping a single log file, but tools that allow them to collect, aggregate and visualise the KPIs and metrics from across an entire server farm: distributed log analytics tools. These same tools, KPIs, metrics, alerts, reports and visualisations are what your Ops team would love too.

By embracing monitoring right at the start, it costs less, is used by more people, and invariably results in better monitoring and a better product.

So, we’ve looked at why developers don’t write supportable components, how this can cause friction in your organisation, and some high level waffle about how to improve things. It’s all very well me saying ‘make your developers love monitoring’, but I’m sure you’re now wondering erm… how? Well, it’s lucky that you asked, as that’s the topic of my next post! I’ll be looking at how you can get monitoring ingrained as a first class citizen within your requirements & processes. This will be heavily based on distributed log analytics, so if you don’t have a decent tool in this space yet, why not download one and see what it has to offer? I’ll be using Logscape, though the post will be equally valid if you’re using something else.

That’s it for this post. If you’re interested in hearing more please subscribe, and if you’ve felt a certain kinship with what you’ve read, or just want to help me raise the profile of my fledgling blog, then please share this post or give it a +1 using the buttons below. If, on the other hand, you think I’m talking rubbish – get commenting!

Over and out.