Enterprise CI / CD Release - Keepalive / Prime Site Portfolio

Keepalive Checks Script

This script is a wrapper to ensure that services are healthy and ready for code push and deployment tasks on Pantheon sites. It hardens the release process by adding:

  • Parallel execution of healthcheck and deployment tasks on many sites
  • Race condition handling for workflows and terminus commands
  • Automatic retry and wake of failed sites' services

To use this script, follow these steps:

  1. Clone the gist and make the scripts executable:

    git clone keepalive-checks && cd keepalive-checks && chmod +x && chmod +x backend-tasks && chmod +x healthcheck-tasks
  2. Set the $SITES variable in the wrapper script to the list of customer site UUIDs that should be primed for release. The easiest method is a terminus command, e.g.:

    Voya clone org, sites which are tagged with "deploy":

    SITES=$(terminus org:sites voya-financial-clone --field id --tag deploy --format=list)

    Voya live org, sites using drupal platform upstream:

    SITES=$(terminus org:sites voya-financial --upstream=b3d1c0a1-7aed-417e-95a6-510f0fe6ce34 --field=id --format=list)
  3. Set the $ENV variable in the wrapper script to the Pantheon environment to run against. This should be the target environment for the release.

  4. Ensure that you have installed and configured nspct, the CSE command line tool.

  5. To start the keepalive tasks against a customer site portfolio, execute the script in your cli:

    ❯ ./

A logs directory will be created in your working directory indicating the status of each task per site as well as any failures that are retried.
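
As a rough illustration of how the retry log can be inspected, the sketch below counts the unique failed sites per task tag. It assumes each line of the failed-sites file has the form "[healthcheck] <site>" or "[backend] <site>", which is what the task scripts append; the helper name summarize_failed_sites is hypothetical and not part of the gist.

```shell
# Hypothetical helper (not part of the gist): count failed sites per task tag.
# Assumes each line looks like "[healthcheck] <site>" or "[backend] <site>".
summarize_failed_sites() {
  local file=$1
  local tag count
  for tag in healthcheck backend; do
    # Count unique lines so a site that failed several retries is counted once.
    # A missing file is treated as "no failures".
    count=$(grep -F "[$tag] " "$file" 2>/dev/null | sort -u | wc -l | tr -d ' ')
    echo "$tag: $count"
  done
}
```

For example, `summarize_failed_sites logs/failed-sites-live.txt` would print one count line per tag for the live environment, assuming the script was run with ENV set to live.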

Tips / Tricks

  • Ensure that you have the required dependencies installed (e.g., terminus, parallel, etc.) before running the script.

  • Adjust the number of parallel jobs by modifying the --jobs flag in the run_healthcheck_tasks and run_backend_tasks functions.

  • Check the logs directory for execution logs and failed task retry information.

  • To run this script at scheduled intervals or for a specific amount of time, use crontab or execute the script in a loop as a background process, e.g.:

    while true
    do
        ./ &
        sleep 1200
    done

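
The tips above can be sketched as two small helpers: a dependency preflight and a time-bounded loop. Both function names (require_commands, run_for_duration) are hypothetical and not part of the gist. Unlike the loop above, this sketch runs the command in the foreground, so each run finishes before the pause starts.

```shell
# Hypothetical helper: report every required command that is not on PATH.
# Returns 0 when all commands are available, 1 otherwise.
require_commands() {
  local missing=0
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "Missing dependency: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: repeat a command until a time budget is spent.
# $1 - total budget in seconds
# $2 - pause between runs in seconds
# remaining arguments - the command to run
run_for_duration() {
  local budget=$1
  local pause=$2
  shift 2
  local deadline=$(( $(date +%s) + budget ))
  # A new run starts only while the deadline has not passed.
  while [ "$(date +%s)" -lt "$deadline" ]; do
    "$@"
    sleep "$pause"
  done
}
```

A possible usage, assuming the tools named in this README plus bc (which the task scripts call): `require_commands terminus parallel nspct bc || exit 1`, followed by `run_for_duration 7200 1200 <wrapper script>` to repeat the wrapper every 20 minutes for two hours.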
#!/usr/bin/env bash
# This script performs backend checks for a Drupal site on a specific Pantheon environment.
# It checks if Drupal is bootstrapped and retries up to 3 times if it fails.
# If the bootstrap fails after 3 retries, the site is considered failed and added to the failed-sites-<env>.txt file.
# The script logs the progress and duration of the backend tasks.
# Arguments:
# $1 - The Pantheon site name
# $2 - The environment (e.g., dev, test, live)
# $3 - The current directory
set +x

# Check if Drupal is bootstrapped, retrying up to 3 times if it fails.
# Returns 0 on success and 1 once all retries are exhausted.
# Relies on current_dir being set by the caller (backend_tasks) or exported by the wrapper script.
function check_bootstrap() {
  local retries=3
  local SITE=$1
  local ENV=$2
  local DRUPAL_BOOTSTRAPPED
  echo "Checking if Drupal is bootstrapped"
  while [ "$retries" -gt 0 ]; do
    terminus -n drush "${SITE}.${ENV}" -- status --field=bootstrap 2>&1
    # Capture the exit code of the drush call before another command overwrites $?
    # (assumed to be the intended source of DRUPAL_BOOTSTRAPPED)
    DRUPAL_BOOTSTRAPPED=$?
    if [[ "$DRUPAL_BOOTSTRAPPED" -eq 0 ]]; then
      echo "Drupal bootstrapped successfully."
      return 0
    fi
    echo "Drupal not bootstrapped, waiting 15 seconds and checking again."
    sleep 15
    retries=$((retries - 1))
    if [[ $retries -eq 0 ]]; then
      echo "Drupal not bootstrapped after 3 retries, exiting."
      # Backend task failed with a non-zero exit code, so add the site to the failed-sites-<env>.txt file
      echo "Backend tasks failed for $SITE.$ENV"
      echo "[backend] $SITE" >>"${current_dir}/logs/failed-sites-$ENV.txt"
      return 1
    fi
  done
}

# Perform backend tasks for a Drupal site
function backend_tasks() {
  local SITE=$1
  local ENV=$2
  local current_dir=$3
  local status
  # SECONDS is a bash builtin counter; record the start so the duration can be reported
  local START=$SECONDS
  echo "Starting backend tasks on ${SITE}.${ENV}" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
  check_bootstrap "$SITE" "$ENV" 2>&1 | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
  # $? would hold the exit code of sed, so read the status of check_bootstrap from PIPESTATUS
  status=${PIPESTATUS[0]}
  # If any of the functions above fail, the backend tasks have failed. Add the site to the failed-sites-<env>.txt file.
  if [[ $status -eq 0 ]]; then
    echo "Site backend tasks successful for $SITE.$ENV" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
    # If the site was previously marked as failed, remove it from the failed-sites-$ENV.txt file.
    if [[ -f "${current_dir}/logs/failed-sites-$ENV.txt" ]]; then
      sed -i "/$SITE/d" "${current_dir}/logs/failed-sites-$ENV.txt"
    fi
  else
    echo "Site backend tasks failed for $SITE.$ENV" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
    echo "[backend] $SITE" >>"${current_dir}/logs/failed-sites-$ENV.txt"
  fi
  # Report time to results.
  local DURATION=$((SECONDS - START))
  TIME_DIFF=$(bc <<<"scale=2; $DURATION / 60")
  MIN=$(printf "%.2f" "$TIME_DIFF")
  echo -e "Finished ${SITE}.${ENV} in ${MIN} minutes" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
}

# Call the backend_tasks function with the arguments passed to this script
backend_tasks "$1" "$2" "$3"

#!/usr/bin/env bash
set +x
# Function to perform healthchecks on a Pantheon site in a specific environment
# Arguments:
# $1 - The Pantheon site UUID
# $2 - The environment (e.g., dev, test, live)
# $3 - The current directory
# Output:
# Writes the execution logs to "${current_dir}/logs/keepalive-$ENV.log"
# Writes the failed site names to "${current_dir}/logs/failed-sites-$ENV.txt" if healthchecks fail
# Appends the execution time to the log
function healthcheck_tasks() {
  local SITE=$1
  local ENV=$2
  local current_dir=$3
  # SECONDS is a bash builtin counter; record the start so the duration can be reported
  local START=$SECONDS
  echo -e "Executing healthchecks on Pantheon site $SITE in $ENV environment" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
  nspct appserver:healthcheck --all --uuid="$SITE" --env="$ENV"
  # If the healthcheck fails with a non-zero exit code, add the site to the failed-sites-<env>.txt file
  if [[ "$?" -ne 0 ]]; then
    echo "Healthchecks failed for $SITE.$ENV"
    echo "[healthcheck] $SITE" >>"${current_dir}/logs/failed-sites-$ENV.txt"
  fi
  # Report time to results.
  local DURATION=$((SECONDS - START))
  TIME_DIFF=$(bc <<<"scale=2; $DURATION / 60")
  MIN=$(printf "%.2f" "$TIME_DIFF")
  echo -e "Finished ${SITE}.${ENV} in ${MIN} minutes" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
}

# Call the healthcheck_tasks function with the arguments passed to this script
healthcheck_tasks "$1" "$2" "$3"

#!/usr/bin/env bash
# Wrapper script to ensure that services are healthy and ready for code push and deployment tasks.
# This script hardens the release process by adding:
# - Parallel execution of healthcheck and deployment tasks on many sites
# - Race condition handling for workflows and terminus commands
# - Automatic retry / wake of failed sites' services
# To use this script, run:
# ./
# To run this script as a background process every 20 mins, use crontab or execute:
# while true
# do
# ./ &
# sleep 1200
# done
set +x
current_dir=$(dirname "$(realpath "${BASH_SOURCE[0]}")")
# Target Pantheon environment. No default is assumed: set ENV here or in the calling shell.
: "${ENV:?Set ENV to the target Pantheon environment (e.g., dev, test, live)}"
# Array of sites to execute (one site UUID per line of terminus output)
mapfile -t SITES < <(terminus org:sites voya-financial-clone --field id --tag deploy --format=list)
#mapfile -t SITES < <(terminus org:sites voya-financial --upstream=b3d1c0a1-7aed-417e-95a6-510f0fe6ce34 --field=id --format=list)
# Export vars to be used in other scripts.
# Bash cannot export arrays, so SITES is passed to the functions below as arguments instead.
export current_dir
export ENV
# Create logs directory if it does not exist
mkdir -p "${current_dir}/logs"
# Create file with a list of sites to retry in failed-sites-<env>.txt.
# A site will be added to this list if healthcheck or backend tasks fail.
touch "${current_dir}/logs/failed-sites-$ENV.txt"

# Run backend-tasks which ensures that Drupal is bootstrapped and ready to execute release tasks.
# Adjust the number of jobs by modifying the --jobs flag, e.g. --jobs 30
function run_backend_tasks() {
  local SITES=("$@")
  echo "Running backend tasks on Pantheon $ENV environment"
  # Run the backend-tasks script in parallel, one job per site
  printf '%s\n' "${SITES[@]}" | parallel --jobs 10 ./backend-tasks {} "$ENV" "$current_dir"
}

# Run healthcheck-tasks which ensures that services are healthy prior to attempting to execute backend deployment tasks.
# Adjust the number of jobs by modifying the --jobs flag, e.g. --jobs 30
function run_healthcheck_tasks() {
  local SITES=("$@")
  echo "Running healthcheck tasks on Pantheon $ENV environment"
  # Run the healthcheck-tasks script in parallel, one job per site
  printf '%s\n' "${SITES[@]}" | parallel --jobs 10 ./healthcheck-tasks {} "$ENV" "$current_dir"
}

# Retry failed tasks for up to 2 hours until all sites are validated to be healthy and ready for deployment.
# A site can be in the failed-sites-<env>.txt file if it fails to run healthcheck or backend tasks successfully.
# run_healthcheck_tasks is re-run for each failed site listed in failed-sites-<env>.txt with the [healthcheck] tag.
# run_backend_tasks is re-run for each failed site listed in failed-sites-<env>.txt with the [backend] tag.
# Update timeout to change the maximum time to retry failed sites.
function retry_failed_tasks() {
  echo "Retrying failed sites in the Pantheon $ENV environment" >>"${current_dir}/logs/keepalive-$ENV.log"
  local failed_file="${current_dir}/logs/failed-sites-$ENV.txt"
  local start_time current_time elapsed_time
  local retry_sites
  start_time=$(date +%s)
  local timeout=7200 # 2 hours
  while [ -s "$failed_file" ]; do
    current_time=$(date +%s)
    elapsed_time=$((current_time - start_time))
    if [ "$elapsed_time" -ge "$timeout" ]; then
      echo -e "Timeout reached without resolving all site failures." >>"${current_dir}/logs/keepalive-$ENV.log"
      return 1
    fi
    if grep -q '\[healthcheck\]' "$failed_file"; then
      # Strip the tag so only the site identifier is passed on; sort -u drops duplicate entries
      mapfile -t retry_sites < <(grep '\[healthcheck\]' "$failed_file" | sed 's/^\[healthcheck\] //' | sort -u)
      run_healthcheck_tasks "${retry_sites[@]}"
    fi
    if grep -q '\[backend\]' "$failed_file"; then
      mapfile -t retry_sites < <(grep '\[backend\]' "$failed_file" | sed 's/^\[backend\] //' | sort -u)
      run_backend_tasks "${retry_sites[@]}"
    fi
  done
  echo "Finished retrying failed sites" >>"${current_dir}/logs/keepalive-$ENV.log"
}

run_healthcheck_tasks "${SITES[@]}"
run_backend_tasks "${SITES[@]}"
retry_failed_tasks
