Counters, gauges, timers, and more
The best-known library is probably Metrics from Dropwizard (http://metrics.dropwizard.io), but most libraries share more or less the same kind of API. The metrics are centered around a few important concepts (each illustrated in the sketch after this list):
- Gauges: These measure a value at a given point in time and are intended to build a time series. The best-known examples are CPU or memory usage.
- Counters: These are long values that you increment or decrement explicitly, often associated with a gauge in order to build a time series.
- Histograms: These compute statistics about the distribution of a value, for instance, the mean or the percentiles of request lengths.
- Timers: These are a bit like histograms; they derive other metrics from a single measure. Here, the goal is to get information about the duration and rate of an event.
- Health checks: These are less related to performance; they allow you to validate whether a resource (such as a database) is working. Health checks raise a warning/error if the resource isn't working.
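To make these concepts concrete, here is a minimal sketch using the Dropwizard Metrics API (the metric names are arbitrary examples, not names the library requires):

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.health.HealthCheck;
import com.codahale.metrics.health.HealthCheckRegistry;

public class MetricTypesSketch {
    public static void main(final String[] args) throws Exception {
        final MetricRegistry registry = new MetricRegistry();

        // Gauge: samples the used heap memory each time it is read
        registry.register("jvm.memory.used", (Gauge<Long>) () ->
                Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());

        // Counter: a long value you increment/decrement explicitly
        final Counter requests = registry.counter("http.requests");
        requests.inc();

        // Histogram: statistical distribution (mean, percentiles) of a value
        final Histogram sizes = registry.histogram("http.request.sizes");
        sizes.update(512);

        // Timer: a histogram of durations plus a rate (events per second)
        final Timer timer = registry.timer("http.request.duration");
        try (Timer.Context ignored = timer.time()) {
            TimeUnit.MILLISECONDS.sleep(10); // the code being measured
        }

        // Health check: validates that a resource is available
        final HealthCheckRegistry healthChecks = new HealthCheckRegistry();
        healthChecks.register("database", new HealthCheck() {
            @Override
            protected Result check() {
                return Result.healthy(); // or Result.unhealthy("connection refused")
            }
        });
    }
}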
All these libraries provide different ways to export/expose the collected data. Common outputs are JMX (through MBeans), Graphite, Elasticsearch, and so on, or simply the console/logger.
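For instance, with Dropwizard Metrics, exposing a registry over JMX or dumping it to the console takes one call each (a sketch; note that JmxReporter moved to the com.codahale.metrics.jmx package in Metrics 4.x, this assumes 3.x):

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;

public class ReporterSetup {
    public static void main(final String[] args) {
        final MetricRegistry registry = new MetricRegistry();

        // expose all metrics as MBeans, browsable with JConsole/VisualVM
        JmxReporter.forRegistry(registry).build().start();

        // print a report of all metrics to stdout every 10 seconds
        ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build()
                .start(10, TimeUnit.SECONDS);
    }
}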
How can these concepts be linked to performance? The most important features for us will be the gauges, the counters, and the timers. The gauges will enable us to make sure the server is doing well (for example, the CPU is not constantly at 100%, memory is properly released, and so on). The counters and timers will enable us to measure execution times. They will also enable us to export the data to an aggregated storage if you test against multiple instances, allowing you to detect potential side effects of one instance on another (if you have any clustering, for example).
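As an example of such a "server health" gauge, the JVM's OperatingSystemMXBean can feed a CPU load time series (a sketch; the metric name is an arbitrary choice):

import java.lang.management.ManagementFactory;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class CpuGauge {
    public static void main(final String[] args) {
        final MetricRegistry registry = new MetricRegistry();
        // system load average over the last minute; negative if not available
        registry.register("os.load.average", (Gauge<Double>) () ->
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
    }
}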
Concretely, we want to measure some important segments of our code. In the extreme case, where you don't know anything about the application, you will likely want to measure all parts of the code and then refine the instrumentation as you gain more knowledge of your application.
To be very concrete and illustrate what we are trying to achieve, we want to wrap application methods with this kind of pattern:
@GET
@Path("{id}")
public JsonQuote findById(@PathParam("id") final long id) {
    final Timer.Context metricsTimer = getMonitoringTimer("findById").time();
    try {
        return defaultImpl();
    } finally {
        metricsTimer.stop(); // records the elapsed time in the timer
    }
}
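In the preceding snippet, getMonitoringTimer is a helper the code assumes rather than part of the Metrics API; a minimal sketch, assuming a shared MetricRegistry, could be:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class QuoteResourceSupport {
    // in a real application, the registry would be shared or injected
    private static final MetricRegistry REGISTRY = new MetricRegistry();

    protected Timer getMonitoringTimer(final String operation) {
        // builds names such as "QuoteResourceSupport.findById"
        return REGISTRY.timer(MetricRegistry.name(getClass().getSimpleName(), operation));
    }
}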
In other words, we want to surround our business code with a timer to collect statistics about our execution time. One common poor man's solution you may be tempted to start with is to use loggers instead. It often looks as follows:
@GET
@Path("{id}")
public JsonQuote findById(@PathParam("id") final long id) {
    final long start = System.nanoTime();
    try {
        return defaultImpl();
    } finally {
        final long end = System.nanoTime();
        MONITORING_LOGGER.info("perf(findById) = "
                + TimeUnit.NANOSECONDS.toMillis(end - start) + "ms");
    }
}
The preceding code manually measures the execution time of the method and then dumps the result, with a description identifying the related code portion, into a specific logger.
The issue you will encounter in doing so is that you will not get any statistics about what you measure: you will need to preprocess all the data you collect before you can use the metrics to identify the hotspots of your application and work on them. This may not seem like a big issue, but as you are likely to repeat the exercise many times during a benchmark phase, you will not want to do it manually.
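By contrast, a Metrics Timer accumulates these statistics for you, so you can query them directly instead of post-processing logs (a sketch; the metric name and sample data are arbitrary):

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class TimerStats {
    public static void main(final String[] args) {
        final MetricRegistry registry = new MetricRegistry();
        final Timer timer = registry.timer("findById");

        timer.update(12, TimeUnit.MILLISECONDS); // sample measurement

        // durations in the snapshot are stored in nanoseconds
        final Snapshot snapshot = timer.getSnapshot();
        System.out.println("count  = " + timer.getCount());
        System.out.println("mean   = " + snapshot.getMean());
        System.out.println("median = " + snapshot.getMedian());
        System.out.println("p99    = " + snapshot.get99thPercentile());
        System.out.println("rate   = " + timer.getMeanRate() + " calls/s");
    }
}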
The other issue is that you need to add this sort of instrumentation code to every method you want to measure. You will therefore pollute your code with monitoring code, which is rarely worth it. The impact is even worse if you add it temporarily to get metrics and remove it later. This means you will want to avoid this kind of work as much as possible.
The final issue is that you can miss the server or library (dependency) data, as you don't own that code. This means you may spend hours and hours working on a code block that is, in fact, not the slowest one.