More BAM! for your buck

Integrating a classic business application monitoring solution from one of the major players involves a considerable commitment from companies and their SMTs, requiring hefty infrastructure and personnel budgets, a drawn-out implementation phase, and not inconsiderable ongoing OpEx costs. In this post I’ll be putting forward an alternative lightweight BAM solution, one that also leverages full-stack monitoring to allow root-cause analysis when your KPIs drop.

This post follows on from my previous post ‘The truth about your logs’ and looks at ways in which you can ingrain monitoring and supportability into the heart of your development life cycle.

But first… A question

Before we get down to it, I’m interested to hear how many of you work in organisations that produce applications with little or no monitoring, so let’s start with a poll. (While I realise this fledgling blog has few readers, I’m hoping in time we’ll rack up enough votes to see a trend).

 

Accessible Business Application Monitoring

In some respects, as the title of this post suggests, part of what I’m about to talk about fits into the field of ‘Business Application Monitoring’, or BAM for short, but I’m not going to be using a dedicated off-the-shelf BAM solution, such as those available from major IT suppliers like Oracle, IBM and WSO2. When I looked into such products, admittedly a year or two ago, they involved setting up some serious infrastructure and were generally centred around a large bottleneck of a database. Some vendors were looking at next-gen solutions built around more scalable tech such as Cassandra, MongoDB and Hadoop, but these still required infrastructure for your monitoring solution to scale up alongside your production environment, and a team of specialists to manage it all.

BAM!

What I ended up choosing at the time, and for several clients since, was to leverage a monitoring application the client already had – Logscape – and use it to build a more lightweight BAM solution on top of their log data, (though, as the BBC like to say, ‘other distributed log analytics tools do exist’). This may lack some of the enterprise-level functionality on offer from the top vendors, but it required a much smaller commitment from the business to get up and running, and hence was more likely to float and get funding.

Logscape didn’t require extensive infrastructure to be purchased and racked, as it made use of existing infrastructure; it didn’t require support staff with specialist knowledge of products such as Hadoop and Cassandra, which this client didn’t have at the time; it automatically scaled with the company, as data was left, and searched, in-place on individual servers; and, perhaps most importantly, it was quick to get up, running and returning value to the business. Additionally, because Logscape is not a specialist BAM solution but a full-stack monitoring tool, it could offer some things a specialist solution couldn’t, as you’ll see.

10 steps to improve your applications’ NFRs

Hang on a minute, I thought we were talking about BAM, monitoring and ‘support-ability’, not NFRs in general? Well, yes we are, but in my experience better monitoring, looked at by more people, inevitably starts to improve your application quality across the board, as whole categories of previously unnoticed issues find themselves in the light for everyone to see.

So, if you would like to see the benefit proper monitoring can bring, have a read of the steps I’ve listed below, look to introduce monitoring into your organisation, and start seeing the value it brings. Once you’re up and running, you’ll be amazed how quickly you start getting buy-in from others within IT, followed quickly by other areas of the business. Plus, your Operations team will love you!

If, while reading through some of the points, you find yourself thinking ‘great, but how do I do this?’, fear not! I’ll likely be covering some of the points in more detail in future posts, (so let me know if there are any you’re particularly interested in seeing first), and I’ve included a few sketches after the list to get you started.

  • Build an ‘environment health’ page – Probably the simplest first step is to create an ‘environment health overview’ page within your monitoring tool of choice. I would suggest this should at least include graphs showing warnings & errors by type and by component. Likewise, detect and graph exceptions by type and component. If you can, add any appropriate component life-cycle events, such as things starting, stopping or changing mode. This can allow users to quickly correlate issues being reported in one component with another changing state.
    Finally, think about getting something on there to highlight struggling infrastructure e.g. low memory, high CPU, swap utilisation or servers running out of disk. This combined view will give a lot of different people a quick landing page, which they can use to check the health of any environment, and something you can add to as you go on. If you can, get each environment’s health page up on a big monitor that everyone can see, so that issues have even less chance of going unnoticed.
  • Don’t release with errors or exceptions being reported – What?! Seriously?? Yes! Now you’ve got your health page, make it part of QA’s test/release checklist. Releases should not go ahead while any component is spitting out errors or exceptions.

    Does this sound too strict? Yes, it may take you a little effort to tidy things up, but try it and see the quality of your product skyrocket! QA should regularly be checking your environment health page during testing and raising tickets against dev to clean up issues. Believe me, this does work. The bottom line is: if an application is logging an error, then either something is wrong and needs to be fixed, or it’s not, in which case why is it logging an error? Either way, it should be fixed before release.

    Once you’ve got errors and exceptions in check, bring warnings into the fold and start cleaning these up too. At the end of your initial clean-up process you will have succeeded in removing a lot of the noise from your logs, making them all the more valuable. Cleaning up your logs stops ‘can’t see the wood for the trees’ issues in prod, where true errors hide among known, expected & ignored ones.

  • Turn on garbage collection logs – Assuming you’re running applications that use garbage collection and you haven’t already… turn them on and configure them to be fairly verbose, (there’s a sketch of suitable JVM flags just after this list). Make this data available in your log analytics tool, so that it’s available for developers and DevOps teams to use.
    Being able to aggregate the GC statistics of 60 web servers under load is a wonderfully powerful thing! Developers that make use of such data will gain fresh insight into how their applications are performing, and only good things can come of that.
  • Build an ‘Application Memory’ page – To really leverage the information in those hot-off-the-press GC logs, build one or more pages to slice and dice the information being output. Go for at least a single overview page covering every component in your estate, (you may be surprised when this immediately flags up several components constantly running at high occupancy and spending most of their time GC’ing!). Next, look to build more customised pages around specific groupings of components e.g. a page covering all your web servers, another for your data grid, or service layer, etc.
  • Add instrumentation logging – You’ve now covered the basics, so it’s time to start turbocharging your logs and laying the foundations of your BAM solution: take your key business operations, e.g. ‘create user’, ‘process payment’, etc, and high-level technical tasks, e.g. ‘save x’, ‘load y’, or ‘transform z’, and put machine-friendly instrumentation logging around them. Instrumentation logging should have entries for when an operation starts, any key steps along the way, and when it finishes.
    Each log entry should include a short name for the task being performed, optionally one or more levels of categorisation, and key information about the request e.g. which user / component instigated the request, what were the parameters of the request, how long did it take, whether it was successful or not, and, if not, was there an error code or description? Then list any pertinent stats and KPIs about the operation e.g. number of objects, throughputs, rates, volumes, ratios, etc. Standardise the format of these logs across all your applications e.g. by building a shared instrumentation library, (see the helper sketch after this list). Done right, instrumentation logging can provide unparalleled insight into what large distributed applications are doing, and it’s this logging that is key to enabling you to build a lightweight BAM solution on top of your log data. This information can also be invaluable for spotting issues and investigating them when they happen.
  • Build a ‘tasks’ page – Really harness those instrumentation logs by building a generic cross-estate monitoring page that visualises the tasks/business operations running across your estate. You can visualise business and technical KPIs such as sales per hour, transaction throughputs, request latencies, aggregate processing times, success/failure ratios, failure reasons, etc – all sliced and diced by component, time and/or operation. Once up and running, you may soon identify where you want to create other, more specialised pages to drill into certain types of tasks, business units, features, etc, or to provide different views of the data to different types of users. After all, Operations probably wants a different view to your business stakeholders.
    Development teams can use these views to understand how load is being distributed across a server farm, how the profile of the KPIs changes under different scenarios, etc. QA teams can use such views to compare response times, throughput, failure rates, etc with previous runs and releases – allowing them to spot performance issues early. Operations teams will be thankful, and your stakeholders and managers will be amazed and transfixed by their real-time business process dashboards. Having the ability to aggregate processing time for tasks and operations can even be used to build up reports on the monetary cost of business operations or features: now wouldn’t that be a powerful thing!
  • Cross-pollinate your monitoring pages – Now you’ve got pages covering environmental issues, garbage collection, distributed processing, business operations, etc, start to take high-level overviews from one area into another, to allow users to readily and easily correlate cause and effect. For example, bring CPU and GC load graphs into your ‘tasks’ page, so that users can correlate their effect on latency, or indeed see whether specific tasks unduly affect the CPU load or GC activity. Now questions such as ‘Do we have one or two specific tasks that really hammer the GC and need profiling?’, or ‘Are one or more users hogging system resources to the detriment of others?’ can readily be answered.
  • Make your logs machine friendly – No one was looking at those application logs anyway, right? But now they are constantly being read and searched by your monitoring tool – so why not make it easier for your tools to do that? First, separate your logs: GC, instrumentation, 3rd-party library logs and your application’s own logs should all go to separate files, (the configuration sketch in the next section shows one way to wire this up) – you can always use your monitoring tool to splice these back together into a single timeline. In the meantime, it means your monitoring tool has less data to search when it knows where the data it wants is located.

    Standardise on certain conventions when logging data, both in the format used to log such things as key-value pairs, e.g. key=value or key:value, and in the names of those common keys. Better still, knock up some simple logging wrapper libraries and APIs to make it easy for developers to follow these conventions – this is particularly true of instrumentation logs, which should rely heavily on such conventions and patterns.

  • Keep your logs – Don’t delete them! They now hold useful information… finally! Archive them somewhere they can still be searched and indexed. Ideally you’ll be able to compress them, depending on your tool set. Such history can come in handy when trying to track down when a specific issue was introduced, when comparing performance runs, or for capacity planning and management.

    Aim to keep at least 18 months of log data. Yes, you read that right! You wouldn’t throw away historic customer data, so why throw away metrics on what your applications are, or were, doing? Storage is cheap in comparison to the man-days this data can save. If storage is an issue, look to thin out your logs as they age e.g. if you have a pool of identical servers, only keep logs from one or two, or keep logs from alternate days, etc. At least until you get a chance to buy some more storage, that is…

  • Configure alerting & reports – Now you have this wealth of information at your fingertips, don’t wait for someone to go and look at it! Set up appropriate alerts and reports, integrated with whatever notification channels you use, be it email, instant message, or some incident management software. But remember the golden rule with alerting: less is more! Only send alerts to a person if they’ve either asked for them or they’re needed to help resolve the issue. Don’t spam your users!
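
As promised, here are a few sketches to get you started. First, step 3’s GC logging: a minimal example of the sort of JVM flags involved, assuming a HotSpot JVM of the Java 7/8 vintage, (Java 9+ replaced these with the unified -Xlog:gc* option); the paths and application name are purely illustrative:

    java -Xloggc:/var/log/myapp/gc.log \
         -XX:+PrintGCDetails \
         -XX:+PrintGCDateStamps \
         -XX:+PrintTenuringDistribution \
         -XX:+UseGCLogFileRotation \
         -XX:NumberOfGCLogFiles=10 \
         -XX:GCLogFileSize=50M \
         -jar myapp.jar

The rotation flags stop the GC log eating the disk on long-running services, while the date stamps are what let your log analytics tool line GC pauses up against everything else on a single timeline.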
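
Next, the instrumentation logging of step 5. Here’s a minimal sketch of the kind of shared helper I have in mind – the class, logger name and key conventions are all hypothetical – the point being that start/end entries, timings and outcomes all come out in one standard, machine-friendly format:

    import java.util.function.Supplier;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public final class Instrumentation {

        // A dedicated logger, so instrumentation entries can be routed to their own file.
        private static final Logger LOG = LoggerFactory.getLogger("instrumentation");

        private Instrumentation() {}

        // Wraps a business operation, logging its start, end, duration and outcome
        // using standard key:value conventions.
        public static <T> T timed(String task, String user, Supplier<T> work) {
            long start = System.currentTimeMillis();
            LOG.info("task:{} phase:start user:{}", task, user);
            try {
                T result = work.get();
                LOG.info("task:{} phase:end user:{} success:true tookMs:{}",
                        task, user, System.currentTimeMillis() - start);
                return result;
            } catch (RuntimeException e) {
                LOG.info("task:{} phase:end user:{} success:false error:{} tookMs:{}",
                        task, user, e.getClass().getSimpleName(),
                        System.currentTimeMillis() - start);
                throw e;
            }
        }
    }

A call site then looks something like Instrumentation.timed("processPayment", userId, () -> payments.process(request)), and it’s every operation in the estate logging through one such helper that makes the ‘tasks’ page of step 6 possible.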

A word about log volumes

The natural reaction of many people to the above advice will be that so much logging is going to hinder application performance. It’s been my experience that this is simply not the case. Yes, you need to ensure you’re logging asynchronously, (though errors and above should obviously be synchronous and flush any pending logs), and of course you’re going to cause problems if you put oodles of logging around some small, critical piece of logic that’s called thousands of times a second – but that is not what’s being suggested: instrumentation logging is about higher-level operations, ones that generally involve systems collaborating and that, by their very nature, tend to take on the order of milliseconds or more to complete. Such operations can easily support a level of instrumentation logging without fear of hurting performance.
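
To make the separate-files and asynchronous-logging advice concrete, here’s a sketch of how it might look in logback, (other logging frameworks have equivalents; the file names and 18-month maxHistory are illustrative). The hypothetical ‘instrumentation’ logger from earlier gets its own rolling, compressed file, written via an async appender so application threads aren’t blocked on disk I/O:

    <configuration>
      <appender name="INSTRUMENTATION_FILE"
                class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/instrumentation.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
          <!-- the .gz suffix compresses rolled files; ~18 months are kept -->
          <fileNamePattern>logs/instrumentation.%d{yyyy-MM-dd}.log.gz</fileNamePattern>
          <maxHistory>550</maxHistory>
        </rollingPolicy>
        <encoder>
          <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %level %msg%n</pattern>
        </encoder>
      </appender>

      <!-- async wrapper: application threads just enqueue the log event -->
      <appender name="ASYNC_INSTRUMENTATION" class="ch.qos.logback.classic.AsyncAppender">
        <appender-ref ref="INSTRUMENTATION_FILE"/>
      </appender>

      <!-- additivity=false keeps instrumentation entries out of the main application log -->
      <logger name="instrumentation" level="INFO" additivity="false">
        <appender-ref ref="ASYNC_INSTRUMENTATION"/>
      </logger>
    </configuration>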

Of course, that’s not to say you shouldn’t measure the impact it has, so that you can make an informed decision and work around any issues or tune your logging as needed. Though personally, if I had the choice between having to add another server to my pool in order to maintain throughput, and not having any idea what my application was up to, I know which I’d choose.

That’s it

It really is possible to change the direction of even the most stuck-in-its-ways development team using these steps, because your developers will no longer see monitoring as a hassle or a ‘nice to have’, but as an essential and useful tool. Don’t shy away from the effort of adding instrumentation logging either: start small by adding it to new code and functionality, and you’ll soon see its worth and be itching to retrofit it into key business operations. With its introduction, there are a lot of previously unanswerable questions whose answers you’re only a few minutes away from. But what’s even cooler is that, as you build up experience of both your tools and your data, you’ll find yourself thinking up new and wonderful questions, and extracting answers that add value or save money in ways you never thought possible – and that’s when the penny will really have dropped!

Well, that’s it for another post. I hope you’ve found it useful and not too dry. If you have any questions, then please ask them via the comments below, and don’t forget to share the love, (and this post).

Over and out.

Logscape: what is it and why would I need it?

In some upcoming posts I’m going to be relying a lot on Logscape to visualise application log data, so it makes sense to start off with a quick introduction to this great product.

The problem: so many logs, so little data

It seems that whenever I start working with a new company, they will inevitably have a wide variety of applications deployed across many environments, on a selection of operating systems and infrastructure. Most, if not all, are happily spitting out log data to disk. Yet no one looks at these logs unless something goes ‘pop!’. Sound familiar?


Tapping into all that data can seem like a tall order

There is a wealth of data in these logs. This data can tell you what your software stack is doing! Or if it can’t, then you need to start questioning why it’s being logged in the first place. Often, with just a little tweaking, your log data can give you insights into your stack that you never previously thought possible: anything from indicators of latent bugs, performance bottlenecks or serious application failures, to trends indicating scrapers hitting your website, important inputs to your capacity management, or even tracking what your customers are doing and where they’re bailing out. On top of this, by leveraging your log data on a daily basis you naturally improve the quality, and hence value, of that data over time, thereby reducing the frequency of those ‘Oh @$#!, we don’t log that information!’ moments while investigating production issues.

The problem is the volume of log data you need to process, and its distributed nature: no longer do we live in a world where looking at a single application’s log can give you enough information to diagnose a production issue – applications are highly distributed, and the tools you use to view, collate and visualise their log data need to be too.

So what is Logscape? Distributed log analytics on a grand scale.

Put simply, it’s an application that allows you to run distributed searches across all the disparate log data your applications, containers, infrastructure and OS are producing. Searches run on demand, in a matter of seconds, (i.e. not the off-line batch processing you’d get from Hadoop), and the results can be visualised in a variety of ways.

Once you’ve found something interesting you can refine your search and interact with the data, zoom in or out to look at shorter or longer periods of time, or even drill right down to the matching log line highlighted in the full log file.

At a basic level you can look for occurrences of particular words, phrases or regular expressions within your log data, and then graph when, where and how often these occur. This is powerful enough in itself, but Logscape can do more! Given a log line such as:

2014:06:15 14:03:52,128 Loader INFO load complete! objectCount: 1102, took: 12

Using Logscape you can pull the metrics/fields out of the line using regular expressions, custom data types, (more on those later), or even Logscape’s in-built automatic field detection algorithms, which do the hard work for you. Once you’ve got access to these metrics you really can start to explore the wealth of information that’s just been sitting in your logs all those years.
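
To make that concrete, here’s roughly what such a field extraction boils down to, expressed as plain Java against the sample line above, (in Logscape itself you’d configure this rather than write code – this is just an illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FieldExtraction {
        public static void main(String[] args) {
            String line = "2014:06:15 14:03:52,128 Loader INFO load complete! "
                        + "objectCount: 1102, took: 12";
            // Capture the two numeric fields embedded in the message.
            Matcher m = Pattern.compile("objectCount: (\\d+), took: (\\d+)").matcher(line);
            if (m.find()) {
                int objectCount = Integer.parseInt(m.group(1)); // 1102
                int took = Integer.parseInt(m.group(2));        // 12
                System.out.println("objectCount=" + objectCount + ", took=" + took);
            }
        }
    }

Once fields like objectCount and took have been extracted, they can be aggregated, averaged and graphed over time – which is where the real value appears.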

Visualising queries running across a Coherence cluster

Visualising queries running across a server farm

On top of this powerful base functionality is the ability to set up alerts and reports, and to build ‘workspaces’ of related graphs, which can really help you visualise what is happening in your application ecosystem and correlate cause and effect.

Basically, Logscape is Big Data for your logs.

Having said that, Logscape’s not a silver bullet: you still need to know what your data means. While Logscape may often flag up unusual trends or unexpected correlations, you still need to bring your own smarts to the table.

Logscape is an alternative to the rather overpriced Splunk, a more mature and feature-rich option than open source offerings such as Logstash/Kibana, and a more real-time alternative to Hadoop.

Data Volumes: how much log data is too much?

In case you’re wondering what sort of data volumes we’re talking about: I’ve got environments with over 40 servers spitting out over 150GB of log data per day. Take that back over a month, two or more, (storage willing – luckily Logscape can look inside zips!), and you’re talking many terabytes of data at your fingertips.

Have you noticed that your application has got slower recently? Maybe you’re wondering which code change might be to blame. Well, if you’ve sensibly archived off your UAT environment’s log data, you can now plot response times over the past month or two and spot exactly which deployment changed the profile.

How does it work? Those clever little secret agents.

I’m not going to go into any details here, but from the 10K foot view there are two main deployment models: search your data in-place, using the resources of the local server, or forward the logs off to some dedicated kit – though you can mix’n’match the two. Both models use a small Java agent running on each server, to either index & search the logs locally, or forward them off, and a main Logscape server, which hosts the UI and coordinates everything.
Personally, I’m a fan of leveraging the spare capacity of my compute grids to analyse my log data. I know others who are more cautious about having the ‘monitoring solution hog the CPUs’, but with the agent suitably ‘niced’ down I’ve never experienced any issues. Plus, let’s face it – all those servers cost a lot, so why not use them?

Why do I need it? Root cause analysis made easy.

Well, to be honest, I’m kinda hoping that it’s becoming self-evident – but just in case it isn’t, let me run you through the basics one more time – Logscape allows you to take all the log data spat out by:

  • Operating Systems like Linux, Windows, etc
  • Infrastructure such as switches, load-balancers, and more
  • Containers such as web servers, application servers, to name but a few
  • And of course your own applications, or those you’ve purchased
  • Basically anything that outputs log or syslog data
  • Oh, and anything with an API you can query for data, too.

It allows you to take that data, search it, pull important fields from it, aggregate it, chop it, dice it, and display it in a variety of ways, alongside other related information – giving you insight into what your infrastructure and software stacks are doing now, or at any point in the past for which you still have your log data… you do still have your log data, don’t you?

Well, that’s it for my first real post. Topics I’m looking to cover shortly include ‘tracking what your distributed application is doing’, ‘how to GC-tune a 300-node Coherence cluster’, and probably more of an introduction to Logscape. If you’ve got any opinions on which I should do first, please add a comment. Given that this blog is only just starting, I’d appreciate anyone spreading the word – share and tweet to your heart’s content! Finally, remember, when you next look at the gigabytes of log data going unloved in your production and UAT environments:

“It’s not about what it is, it’s about what it can become.”

― Dr. Seuss, The Lorax

over & out…