@joerodriguez
Last active October 2, 2018 19:13
simplified indicators format
---
apiVersion: v0
product: healthwatch
version: v1.4
metadata:
  deployment: <%= spec.deployment %>
indicators:
- name: cf_cli_probe_availability_percentage
  promql: health_check_cliCommand_probe_available{source_id="healthwatch-forwarder",deployment="$deployment$"}
  thresholds:
  - level: critical
    lt: 1
  slo: 0.999
  documentation:
    title: CLI Health Test Availability
    description: |
      **Use**: Indicates that PCF Healthwatch is assessing the health of the Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism.
      **Loggregator Name**: health.check.cliCommand.probe.available
      **Firehose Origin**: healthwatch
      **Log Cache Source ID**: healthwatch-forwarder
      **Type**: gauge
      **Frequency**: 60s
    recommended_response: |
      1. Ensure the `cf-health-check` app is running in the `healthwatch` space of the `system` org.
      1. Check the app logs for any obvious errors.
- name: bosh_probe_rate
  promql: rate(health_check_bosh_director_probe_count{source_id="healthwatch-forwarder",deployment="$deployment$"}[10m])
  documentation:
    title: BOSH Director Health Probe Rate
    description: |
      Number of PCF Healthwatch BOSH Director Health probe assessments completed in the measured time interval.
      **Use**: For alerting purposes, Pivotal suggests using health.check.bosh.director.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.
      When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.
      In the default installation, these tests run every 10 minutes using 1 runner app.
      **Loggregator Name**: health.check.bosh.director.probe.count
      **Firehose Origin**: healthwatch
      **Log Cache Source ID**: healthwatch-forwarder
      **Type**: gauge
      **Frequency**: 60s
    recommended_response: |
      1. Ensure the `cf-health-check` app is running in the `healthwatch` space of the `system` org.
      1. Check the app logs for any obvious errors.
    threshold_note: These thresholds depend on many variables and may need to be adjusted to suit your deployment environment.
documentation:
  owner: PCF Healthwatch
  title: Monitoring PCF Healthwatch
  description: |
    This topic explains how to monitor the health of Pivotal Cloud Foundry (PCF) Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.
    For general information about monitoring PCF, see [Monitoring Pivotal Cloud Foundry](https://docs.pivotal.io/pivotalcf/monitoring/index.html).
  sections:
  - title: Service Level Indicators for PCF Healthwatch
    description: Service Level Indicators verify that key features of the PCF Healthwatch product are working as expected. These SLIs are the most important operational metrics emitted about Healthwatch itself, as they indicate the reliability of the assessments Healthwatch is making.
    indicators:
    - cf_cli_probe_availability_percentage
  - title: Other Metrics
    description: This section describes other metrics that you can use to monitor PCF Healthwatch.
    indicators:
    - bosh_probe_rate
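
The file above does not say how the `$deployment$` placeholder in each `promql` string is filled in, or how a threshold such as `lt: 1` is evaluated. A minimal Python sketch of one plausible reading follows. It assumes that every `$key$` token is replaced by the matching entry under `metadata`, and that an `lt` threshold is breached when the observed value falls below the bound. The function names `interpolate` and `breached` and the deployment name `cf-example` are hypothetical and are not part of any Healthwatch tooling.

```python
import re


def interpolate(promql, metadata):
    """Replace each $key$ token with metadata[key] (assumed semantics).

    Tokens whose key is missing from metadata are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        return metadata.get(key, match.group(0))

    return re.sub(r"\$(\w+)\$", substitute, promql)


def breached(value, threshold):
    """Return True when value violates an `lt` threshold (assumed: value < bound)."""
    return "lt" in threshold and value < threshold["lt"]


# Hypothetical rendered metadata; the real value comes from spec.deployment.
metadata = {"deployment": "cf-example"}

query = interpolate(
    'health_check_cliCommand_probe_available'
    '{source_id="healthwatch-forwarder",deployment="$deployment$"}',
    metadata,
)
print(query)

# The critical threshold from cf_cli_probe_availability_percentage.
critical = {"level": "critical", "lt": 1}
print(breached(0, critical))  # probe reported unavailable
print(breached(1, critical))  # probe reported available
```

Leaving unknown tokens in place, rather than raising an error, makes a misspelled metadata key visible in the rendered query instead of silently producing an empty label value.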