Analytics - Initial implementation

Building

Checkout rhq-metrics branch with analytics work from github.com/mstruk:

git clone https://github.com/mstruk/rhq-metrics.git
git checkout liveoak
mvn clean install -DskipTests

Checkout liveoak branch with analytics work from github.com/mstruk:

git clone https://github.com/mstruk/liveoak.git
git checkout analytics
mvn clean install -DskipTests

Using

Run Liveoak:

dist/target/liveoak/bin/standalone.sh

Go to Liveoak Admin Console at http://localhost:8080/admin, and create a new project or import an example application or an existing project.

Notice log entries in terminal output where liveoak was started.

In admin console click Storage, and then Dashboard. Repeat this several times and notice how Dashboard numbers grow with usage.

Switch between dashboards of multiple applications and notice how each Dashboard only shows usage for its application.

How it works

RHQ-Metrics is used via REST to serve as storage for event log, and provide responses to event log queries. Event log is a concept added to RHQ-Metrics to provide analytics support suited to our needs. Each log event represents a single fulfilled client request - meaning that response was fully returned to the client. For each request we know the length of request head and body, the length of response head and body, response status code, and other request response info that is relevant for analytics purposes.

RHQ-Metrics is deployed as a .war during liveoak startup and bound to /rhq-metrics url. The extensions to rhq-metrics build on rhq-metrics version 0.2.4.

Analytics module in Liveoak provides a custom extension (installed via conf/extensions/analytics.json), configuration resources, and analytics component for collecting log events, and dispatching them to RHQ-Metrics endpoint.

Liveoak container is enhanced to be analytics aware - it installs extra handlers for the purposes of log events generation.

Analytics extension contains a few configuration options:

{
  module-id: 'io.liveoak.analytics',
  config: {
    enabled: true,
    rhq-metrics-url: 'http://${jboss.bind.address:localhost}:${jboss.http.port:8080}/rhq-metrics/log-metrics',
    apps-enabled-by-default: true
  }
}

There is one extension, and one AnalyticsService shared by all the applications.

By setting ‘enabled’ to false, the event logging can be turned off completely - in this case analytics handlers won’t be added to connection’s handler pipeline.

‘rhq-metrics-url’ specifies the default endpoint with rhq-metrics REST service with event log support. This could point to another host / port where rhq-metrics might be installed to minimize the risk of it interfering (e.g. OOME crashes) with more essential Liveoak container services.

‘apps-enabled-by-default’ setting means that analytics doesn’t have to be explicitly enabled in individual Liveoak applications as it is automatically included in log events collection. By setting this to false, an application would only collect log events if application specific analytics context is configured in its application.json. In this mode no log events are collected for /admin context for example.

Request logging is configured in standalone/configuration/standalone.xml. It can be adjusted via logger category io.liveoak.container.analytics.Analytics.log and console handler named CONSOLE-LOG.

Only those log events are printed to log that are collected by analytics. If analytics is disabled there won’t be any log events showing up in the log.

Event log REST API

Endpoint url is at: http://localhost:8080/rhq-metrics/event-log

Performing a GET without any parameters returns a collection of all logging events for the last minute.

Example of an event:

{
  userId: null,
  application: "admin",
  clientAddress: "127.0.0.1",
  uri: "/admin/applications/test/resources?fields=*(*)",
  requestBytes: 1871,
  responseBytes: 760,
  status: 200,
  method: "GET",
  timestamp: 1421329506521,
  duration: 26,
  apiRequest: false,
  tags: [
    "path:/admin",
    "path:/admin/applications",
    "path:/admin/applications/test",
    "path:/admin/applications/test/resources"
  ],
  totalBytes: 2631
}

We can filter events by time using query parameters start, and end. They take a timestamp expressed as current time in milliseconds. If value is negative, System.currentTimeMillis() will be added to make it relative to current time.

The following request would return all log events for the last 10 minutes:

GET /rhq-metrics/event-log?start=-600000

There are several event log metrics available that perform data filtering, aggregation, and collection over log events.

Metric id is specified via ‘id’ query parameter. Currently supported values are:

requests
unique-users
bandwidth
notifications
bandwidth-by-service
bandwidth-by-path-XXX (where XXX is url path we’re interested in)

Whenever metric id is specified, query parameter ‘buckets’ or ‘bucketWidthSeconds’ also has to be specified in order to control the number of datapoints generated.

Requests metric

Requests metric returns the number of requests matching additional filtering criteria, and segmented into one or more time intervals. For each time interval a 'count' is returned, and a 'speed' which expresses the average number of requests per millisecond for the interval. All intervals returned are of the same duration as specified by 'duration' attribute.

GET /rhq-metrics/event-log?id=requests&start=-3600000&buckets=1

will return a single aggregation for the interested time period - the last hour.

That may look something like:

{
  _default: [
  {
    timestamp: 1421328791678,
    value: 0,
    id: "requests",
    min: 0,
    max: 1,
    avg: 1,
    duration: 3600000,
    count: 6,
    speed: 0.0016666666666666666,
    sum: 6
  }]
}

We see that there were 6 requests in the last hour. If we want to know how they were distributed within an hour we would need more buckets. Let’s try 2:

{
  _default: [
  {
    timestamp: 1421329500951,
    value: 0,
    id: "requests",
    min: 0,
    max: 1,
    avg: 1,
    duration: 1800000,
    count: 3,
    speed: 0.0016666666666666666,
    sum: 3
  },
  {
    timestamp: 1421331300951,
    value: 0,
    id: "requests",
    min: 0,
    max: 1,
    avg: 1,
    duration: 1800000,
    count: 3,
    speed: 0.0016666666666666666,
    sum: 3
  }]
}

We see that the first half of an hour there were 3, and the last half of an hour there were also 3. We can segment the stats further by increasing the number of buckets.

In the example above we didn't request any filtering by application, resulting in combined stats for all applications, including the admin console itself.

The event-log endpoint allows targetting by a url prefix, which makes it possible to focus on specific segment / subset of resources available on the server.

The following request for example, would only take into account requests against an application deployed at http://localhost:8080/chat-html:

GET /rhq-metrics/event-log?id=requests&start=-3600000&buckets=1&tag=path:/chat-html

And if we are only interested in storage requests for chat-html we would use:

GET /rhq-metrics/event-log?id=requests&start=-3600000&buckets=1&tag=path:/chat-html/storage

Unique Users metric

Unique Users metric returns a number of unique users matching additional filtering criteria, and segmented into one or more time intervals. For each time interval a 'count' is returned, and a 'speed' which expresses the average number of unique users per millisecond for the interval. All intervals returned are of the same duration as specified by 'duration' attribute.

GET /rhq-metrics/event-log?id=unique-users&start=-3600000&buckets=1

will return a single aggregation for the interested time period - the last hour.

{
  _default: [
    {
      timestamp: 1421668783637,
      value: 0,
      id: "unique-users",
      min: 1,
      max: 1,
      avg: 1,
      duration: 3600000,
      count: 1,
      speed: 0.0002777777777777778,
      sum: 1
    }
  ]
}

When your application has no login mechanism, all requests will be non-authenticated, resulting in this metric always returning count: 1 for any time period where at least one request was performed.

Bandwidth metric

Bandwidth metric returns total bandwidth consumption (total bytes received, and sent) of the requests captured by analytics, and further filtered by query criteria. Not all requests fulfilled by LiveOak are captured by analytics - only those hitting container handlers through a socket connection are captured (local client in-memory requests like authorization queryies are omitted). Analytics extension can also be configured to only capture events directly attributable to specific LiveOak application, and only if application configuration enables analytics. Therefore bandwidth metrics don't necessarily represent a full request serving I/O load.

GET /rhq-metrics/event-log?tag=path:/test&start=-3600000&buckets=2&id=bandwidth

{
  _default: [
    {
      timestamp: 1421669184390,
      value: 0,
      id: "bandwidth",
      min: 0,
      max: 2132,
      avg: 2126,
      duration: 1800000,
      count: 2,
      speed: 2.362222222222222,
      sum: 4252
    },
    {
      timestamp: 1421670984390,
      value: 0,
      id: "bandwidth",
      min: 0,
      max: 2132,
      avg: 1516.6666666666667,
      duration: 1800000,
      count: 3,
      speed: 2.5277777777777777,
      sum: 4550
    }
  ]
}

In this example we can see for the first time that 'count' has a different value than 'sum', and that 'max', and 'avg' have values other than 1.

Here we can see metric segmented into two time intervals of 30 minutes each. There were two requests in the first interval, and three requests in the second. In each interval the biggest request (request + response byte count) - measured by 'max' - is the same, but an average request size - measured by 'average' - is smaller in the second interval. Nevertheless, total bandwidth - measured by 'sum' - is bigger in the second interval, and consequently the average speed - measured by 'speed' as bytes per millisecond - is greater in the second interval.

Notifications metrics

Notifications metric returns the number of STOMP notifications dispatched by server, matching additional filtering criteria, and segmented into one or more time intervals. For each time interval a 'count' is returned, and a 'speed' which expresses the average number of notifications per millisecond for the interval. All intervals returned are of the same duration as specified by 'duration' attribute.

GET /rhq-metrics/event-log?id=notifications&tag=path:/test&start=-1800000&buckets=1

will return a single aggregation interval for the interested time period - the last 30 minutes.

That may look something like:

{
  _default: [
    {
      timestamp: 1421862277817,
      value: 0,
      id: "notifications",
      min: 0,
      max: 1,
      avg: 1,
      duration: 1800000,
      count: 2,
      speed: 0.0011111111111111111,
      sum: 2
    }
  ]
}

Omitting 'buckets' parameter will return event log entries for notifications only:

GET /rhq-metrics/event-log?id=notifications&tag=path:/test&start=-1800000

Will return a list of items like that:

[
  {
    userId: null,
    application: "test",
    clientAddress: "127.0.0.1",
    uri: "/test/storage/books/ObjectId%28%2254bff119c2e6c46193093f37%22%29",
    requestBytes: 0,
    responseBytes: 0,
    status: 200,
    method: null,
    notification: "update",
    timestamp: 1421866048145,
    duration: 0,
    apiRequest: false,
    tags: [
      "path:/test/storage",
      "path:/test/storage/books/ObjectId%28%2254bff119c2e6c46193093f37%22%29",
      "path:/test/storage/books",
      "path:/test"
    ],
    totalBytes: 0
  },
  {
    userId: null,
    application: "test",
    clientAddress: "127.0.0.1",
    uri: "/test/storage/books/ObjectId%28%2254bff119c2e6c46193093f37%22%29",
    requestBytes: 0,
    responseBytes: 0,
    status: 200,
    method: null,
    notification: "update",
    timestamp: 1421866057390,
    duration: 0,
    apiRequest: false,
    tags: [
      "path:/test/storage",
      "path:/test/storage/books/ObjectId%28%2254bff119c2e6c46193093f37%22%29",
      "path:/test/storage/books",
      "path:/test"
    ],
    totalBytes: 0
  }
]

'notification' field is set to notification type. Many fields are zero or null for notifications - e.g. 'method' is null, since notification is a server-side event, 'requestBytes', 'responseBytes', 'totalBytes', and 'duration' are all 0.

Bandwidth-by-service metric

We may often be interested in breaking down stats based on uri. For example we may be interested in which of our application services represents the biggest network I/O burden on our server.

GET /rhq-metrics/event-log?tag=path:/test&start=-1800000&buckets=1&id=bandwidth-by-service

{
  : [
    {
      timestamp: 1421680754795,
      value: 0,
      id: "bandwidth-by-service",
      min: 0,
      max: 1195,
      avg: 1195,
      segment: "",
      duration: 1800000,
      count: 1,
      speed: 0.6638888888888889,
      sum: 1195
    }
  ],
  /test/storage: [
    {
      timestamp: 1421680754795,
      value: 0,
      id: "bandwidth-by-service",
      min: 0,
      max: 2132,
      avg: 2126,
      segment: "/test/storage",
      duration: 1800000,
      count: 2,
      speed: 2.362222222222222,
      sum: 4252
    }
  ],
  /test/client: [
    {
      timestamp: 1421680754795,
      value: 0,
      id: "bandwidth-by-service",
      min: 0,
      max: 23727,
      avg: 12460.5,
      segment: "/test/client",
      duration: 1800000,
      count: 2,
      speed: 13.844999999999999,
      sum: 24921
    }
  ]
}

Bandwidth-by-service metric groups log events by uri prefix, taking into account the first two uri segments. First one is used to identify an application, the second one is used to identify a service / resource type.

In the above example we can see that in the time period of the last 30 minutes, /test itself was requested once, there were two requests for uris starting with /test/storage, and also two requests for uris starting with /test/client.

Known Issues

When 'buckets' parameter is more than one for 'bandwidth-by-service', no metric info is returned for intervals that have no requests (log events).

mstruk/Analytics.md