DevOps Monitoring Tools

Jürgen Cito – cito[at]ifi.uzh.ch

One of the pillars of CloudWave is to provide application developers with runtime information on how their applications perform when deployed on a cloud platform. This gives developers more insight into how the application is used in production and supports root-cause analysis of possible issues.

To evaluate the current state of the art in DevOps monitoring metrics and tools, we created a sample Ruby on Rails application and deployed it on a cloud platform (Heroku). We enabled several monitoring solutions, focusing on application-level metrics, and stress-tested the application to inspect the monitoring and measurement capabilities of each tool.

The following monitoring and logging tools were enabled during the evaluation period:

Monitoring

  • NewRelic
  • TraceView
  • Librato

Logging

  • FlyData
  • Logentries
  • Papertrail
  • Loggly

Monitoring Metrics

We briefly list and describe the metrics collected by these tools and show examples of how the evaluated monitoring and logging tools visualize them.

Throughput

Throughput measures the number of HTTP requests handled by a web application and is usually given in requests per minute (rpm). This metric indicates the volume of requests moving through an application and provides context for understanding how well the application handles load.
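To make the unit concrete, the following is a minimal sketch of how requests per minute could be derived from raw request timestamps. It assumes the timestamps are already available (for example, parsed from an access log); `throughput_rpm` and the sample times are hypothetical and do not reflect how any of the evaluated tools compute throughput internally.

```python
from collections import Counter
from datetime import datetime

def throughput_rpm(timestamps):
    """Return {minute: request count} for an iterable of request timestamps."""
    # Truncate each timestamp to the start of its minute, then count per minute.
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(per_minute.items()))

# Hypothetical request arrival times, e.g. parsed from an access log.
requests = [
    datetime(2014, 1, 1, 12, 0, 5),
    datetime(2014, 1, 1, 12, 0, 40),
    datetime(2014, 1, 1, 12, 1, 10),
]
print(throughput_rpm(requests))  # two requests in 12:00, one in 12:01
```

Minutes without any requests simply do not appear in the result; a dashboard would typically fill them with zero.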

Web Transaction Response Time

A web transaction is a web service method call that invokes further functionality within the application. Its response time is usually measured in milliseconds (ms) and can be broken down into:

  • Request Queuing
  • Server Processing Time
  • Database Calls
  • External Calls

Some of the tools also provided a breakdown table showing the performance of the method calls and external calls (database, HTTP) made within a web transaction.
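The breakdown above can be illustrated with a small sketch that records the four components for one transaction and reports each component's share of the total. It assumes the components are sequential and non-overlapping, so the response time is their sum; the `TransactionTiming` class and the Rails-style transaction name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TransactionTiming:
    """Hypothetical timing record for one web transaction (all values in ms)."""
    name: str
    request_queuing_ms: float
    server_processing_ms: float
    database_ms: float
    external_ms: float

    def components(self) -> dict:
        return {
            "request_queuing": self.request_queuing_ms,
            "server_processing": self.server_processing_ms,
            "database": self.database_ms,
            "external": self.external_ms,
        }

    def total_ms(self) -> float:
        # Assumes the components do not overlap, so the total is their sum.
        return sum(self.components().values())

    def breakdown_percent(self) -> dict:
        """Share of the total response time spent in each component."""
        total = self.total_ms()
        if total == 0:
            return {k: 0.0 for k in self.components()}
        return {k: 100.0 * v / total for k, v in self.components().items()}

t = TransactionTiming("PostsController#index", 5.0, 40.0, 30.0, 25.0)
print(t.total_ms())  # 100.0
print(t.breakdown_percent())
```

In practice, database and external calls may run concurrently with server processing, in which case a plain sum overstates the response time.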

Web Transaction Usage

Web transaction usage measures how often each web transaction is requested, usually expressed as a percentage of all requests.
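As a minimal sketch, usage can be computed by counting which transaction handled each request and dividing by the total. The input format (one transaction name per request) and the function name `transaction_usage` are assumptions for illustration.

```python
from collections import Counter

def transaction_usage(transaction_names):
    """Return each transaction's share of all requests, in percent."""
    counts = Counter(transaction_names)
    total = sum(counts.values())
    # most_common() orders transactions from most to least requested.
    return {name: 100.0 * n / total for name, n in counts.most_common()}

# Hypothetical request log reduced to the transaction handling each request.
log = ["PostsController#index"] * 3 + ["PostsController#show"]
print(transaction_usage(log))  # {'PostsController#index': 75.0, 'PostsController#show': 25.0}
```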

Database Frequency and Processing Time

Similar to the web transaction metrics, the tools measure how often queries are issued against each database table and the average processing time (in milliseconds) of SQL queries.
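The following sketch aggregates per-table query frequency and mean processing time. It assumes each query has already been reduced to a `(table, duration_ms)` pair; a real monitoring agent would have to extract the table from the SQL text, which is not shown here. The function name and sample values are hypothetical.

```python
from collections import defaultdict

def query_stats(queries):
    """Aggregate (table, duration_ms) pairs into per-table count and mean time."""
    durations = defaultdict(list)
    for table, duration_ms in queries:
        durations[table].append(duration_ms)
    return {
        table: {"count": len(ms), "avg_ms": sum(ms) / len(ms)}
        for table, ms in durations.items()
    }

# Hypothetical samples; the table name is assumed to be known for each query.
samples = [("posts", 4.0), ("posts", 6.0), ("users", 2.5)]
print(query_stats(samples))
# {'posts': {'count': 2, 'avg_ms': 5.0}, 'users': {'count': 1, 'avg_ms': 2.5}}
```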

Response Time of External Services

Any service whose data is obtained through an external HTTP call is considered an external service. Such services are potential bottlenecks, which makes the response time of these calls an important metric.
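A minimal way to capture this metric is to wrap each external call and record its wall-clock duration. The sketch below assumes a hypothetical helper `timed_call` that stores durations in a caller-supplied list; it records the duration even when the call raises, so slow failures such as timeouts are not lost. The commented HTTP example uses a placeholder URL and is not executed.

```python
import time

def timed_call(fn, samples_ms):
    """Call fn(), append its duration in ms to samples_ms, and return its result.

    The duration is recorded even if fn raises, so failed calls are measured too.
    """
    start = time.perf_counter()
    try:
        return fn()
    finally:
        samples_ms.append((time.perf_counter() - start) * 1000.0)

# Example with an external HTTP call (not run here; needs network access):
#   import urllib.request
#   def fetch():
#       with urllib.request.urlopen("https://example.com", timeout=5) as resp:
#           return resp.read()
#   samples = []
#   body = timed_call(fetch, samples)
```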

Logging

From a DevOps perspective, logging has not changed much. We identified a few features common to all the logging tools:

  • Log aggregation between different servers (if applicable)
  • Grouping of specific log items through tags/labels
  • Filtering of specific log items through pattern matching
  • Alerts and notifications triggered by patterns
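The four features listed above can be sketched in a few functions over a simple log-entry record. This is an illustration under assumptions, not the behavior of any of the evaluated services: `LogEntry`, the function names, and the sample entries are hypothetical, and aggregation assumes each server's stream is already ordered by timestamp.

```python
import heapq
import re
from collections import defaultdict
from typing import Iterable, NamedTuple

class LogEntry(NamedTuple):
    timestamp: float  # seconds since epoch
    host: str
    tag: str
    message: str

def aggregate(*streams: Iterable[LogEntry]) -> list:
    """Merge per-server streams (each time-ordered) into one time-ordered list."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))

def group_by_tag(entries) -> dict:
    """Group entries by their tag/label."""
    groups = defaultdict(list)
    for e in entries:
        groups[e.tag].append(e)
    return dict(groups)

def filter_entries(entries, pattern) -> list:
    """Keep entries whose message matches the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [e for e in entries if regex.search(e.message)]

def alert(entries, pattern, threshold) -> bool:
    """Return True when at least `threshold` entries match `pattern`."""
    return len(filter_entries(entries, pattern)) >= threshold

# Hypothetical logs from two servers.
web1 = [LogEntry(1.0, "web1", "app", "GET /posts 200"),
        LogEntry(3.0, "web1", "app", "Error: timeout")]
web2 = [LogEntry(2.0, "web2", "db", "slow query on posts")]
merged = aggregate(web1, web2)
print([e.host for e in merged])  # ['web1', 'web2', 'web1']
print(alert(merged, r"Error", 1))  # True
```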

In this post, we provided a brief summary of monitoring metrics that offer insight into the runtime behavior of cloud-based applications. In CloudWave, we want to expand on this idea and pair these application-level metrics with static code analysis to provide a richer software development experience and improve the feedback cycle.