Last active
October 2, 2018 19:13
simplified indicators format
---
apiVersion: v0
product: healthwatch
version: v1.4
metadata:
  deployment: <%= spec.deployment %>
indicators:
- name: cf_cli_probe_availability_percentage
  promql: health_check_cliCommand_probe_available{source_id="healthwatch-forwarder",deployment="$deployment$"}
  thresholds:
  - level: critical
    lt: 1
  slo: 0.999
  documentation:
    title: CLI Health Test Availability
    description: |
      **Use**: Indicates that PCF Healthwatch is assessing the health of the Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism.
      **Loggregator Name**: health.check.cliCommand.probe.available
      **Firehose Origin**: healthwatch
      **Log Cache Source ID**: healthwatch-forwarder
      **Type**: gauge
      **Frequency**: 60s
    recommended_response: |
      1. Ensure the `cf-health-check` app is running in the `healthwatch` space of the `system` org.
      1. Check the app logs for any obvious errors.
- name: bosh_probe_rate
  promql: rate(health_check_bosh_director_probe_count{source_id="healthwatch-forwarder",deployment="$deployment$"}[10m])
  documentation:
    title: BOSH Director Health Probe Rate
    description: |
      Number of PCF Healthwatch BOSH Director Health probe assessments completed in the measured time interval.
      **Use**: For alerting purposes, Pivotal suggests using health.check.bosh.director.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.
      When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.
      In the default installation, these tests run every 10 minutes using 1 runner app.
      **Loggregator Name**: health.check.bosh.director.probe.count
      **Firehose Origin**: healthwatch
      **Log Cache Source ID**: healthwatch-forwarder
      **Type**: gauge
      **Frequency**: 60s
    recommended_response: |
      1. Ensure the `cf-health-check` app is running in the `healthwatch` space of the `system` org.
      1. Check the app logs for any obvious errors.
    threshold_note: These thresholds depend on many variables and may need to be adjusted to suit your deployment environment.
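  # Illustrative sketch only: the source defines no thresholds for bosh_probe_rate.
  # If secondary alerting on the probe rate is wanted, a threshold could be written
  # in the same syntax used for cf_cli_probe_availability_percentage above.
  # The level name and the value below are assumptions, not Pivotal recommendations;
  # derive the value from the normal rate observed in your own deployment.
  # thresholds:
  # - level: warning
  #   lt: 0.001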
documentation:
  owner: PCF Healthwatch
  title: Monitoring PCF Healthwatch
  description: |
    This topic explains how to monitor the health of Pivotal Cloud Foundry (PCF) Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.
    For general information about monitoring PCF, see [Monitoring Pivotal Cloud Foundry](https://docs.pivotal.io/pivotalcf/monitoring/index.html).
  sections:
  - title: Service Level Indicators for PCF Healthwatch
    description: Service Level Indicators monitor that key features of the PCF Healthwatch product are working as expected. These SLIs are the most important operational metrics emitted about Healthwatch itself, as they indicate the reliability of the assessments Healthwatch is making.
    indicators:
    - cf_cli_probe_availability_percentage
  - title: Other Metrics
    description: This section describes other metrics that you can use to monitor PCF Healthwatch.
    indicators:
    - bosh_probe_rate