joerodriguez · October 2, 2018 19:13
diff --git a/gistfile1.txt b/gistfile1.txt
 ---
 apiVersion: v0

 product: healthwatch
 version: v1.4

 metadata:
  deployment: <%= spec.deployment %>

 indicators:

 - name: cf_cli_probe_availability_percentage
  promql: health_check_cliCommand_probe_available{source_id="healthwatch-forwarder",deployment="$deployment$"}
  thresholds:
  - level: critical
    lt: 1
  slo: 0.999
  documentation:
    title: CLI Health Test Availability
    description: |
      **Use**: Indicates that PCF Healthwatch is assessing the health of the Cloud Foundry Command Line Interface (cf CLI) commands. If these continuous validation tests fail to make up-to-date assessments, they are no longer a reliable warning mechanism.

      **Loggregator Name**: health.check.cliCommand.probe.available
      **Firehose Origin**: healthwatch
      **Log Cache Source ID**: healthwatch-forwarder
      **Type**: gauge
      **Frequency**: 60s
    recommended_response: |
      1. Ensure the `cf-health-check` app is running in the `healthwatch` space of the `system` org.
      1. Check the app logs for any obvious errors.

 - name: bosh_probe_rate
  promql: rate(health_check_bosh_director_probe_count{source_id="healthwatch-forwarder",deployment="$deployment$"}[10m])
  documentation:
    title: BOSH Director Health Probe Rate
    description: |
      Number of PCF Healthwatch BOSH Director Health probe assessments completed in the measured time interval.

      **Use**: For alerting purposes, Pivotal suggests using health.check.bosh.director.probe.available instead. This metric is most helpful for additional diagnostics or secondary alerting.

      When monitoring this metric, the primary indicator of concern is an unexpected negative variance from the normal pattern of checks per test type. If an operator has not made changes that impact the number of checks being made, such as scaling the test runner or changing the frequency of the test, an unexpected variance from normal likely indicates problems in the test runner functionality.

      In the default installation, these tests run every 10 minutes using 1 runner app.

      **Loggregator Name**: health.check.bosh.director.probe.count
      **Firehose Origin**: healthwatch
      **Log Cache Source ID**: healthwatch-forwarder
      **Type**: gauge
      **Frequency**: 60s

    recommended_response: |
      1. Ensure the `cf-health-check` app is running in the `healthwatch` space of the `system` org.
      1. Check the app logs for any obvious errors.
    threshold_note: These thresholds depend on many variables and may need to be adjusted to suit your deployment environment.

 documentation:
  owner: PCF Healthwatch
  title: Monitoring PCF Healthwatch
  description: |
    This topic explains how to monitor the health of Pivotal Cloud Foundry (PCF) Healthwatch using the metrics and key performance indicators (KPIs) generated by the service.

    For general information about monitoring PCF, see [Monitoring Pivotal Cloud Foundry](https://docs.pivotal.io/pivotalcf/monitoring/index.html).

  sections:
  - title: Service Level Indicators for PCF Healthwatch
    description: Service Level Indicators (SLIs) verify that key features of the PCF Healthwatch product are working as expected. These SLIs are the most important operational metrics emitted about Healthwatch itself, because they indicate the reliability of the assessments Healthwatch makes.
    indicators:
    - cf_cli_probe_availability_percentage
  - title: Other Metrics
    description: This section describes other metrics that you can use to monitor PCF Healthwatch.
    indicators:
    - bosh_probe_rate