@lcatlett
Last active June 8, 2024 02:31
Enterprise CI / CD Release - Keepalive / Prime Site Portfolio

Keepalive Checks Script

This script is a wrapper to ensure that services are healthy and ready for code push and deployment tasks on Pantheon sites. It hardens the release process by adding:

  • Parallel execution of healthcheck and deployment tasks on many sites
  • Race condition handling for workflows and terminus commands
  • Automatic retry and wake of failed sites' services
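The parallel fan-out named in the first bullet can be illustrated without any Pantheon tooling. The snippet below is a minimal sketch, not the gist's implementation: it uses `xargs -P` as a stand-in for GNU parallel, and the inline `sh -c` body and the `fan_out` name are hypothetical placeholders for a real per-site task.

```bash
#!/usr/bin/env bash
# Run a placeholder per-site task for each ID on stdin, at most $1 at a time.
# The inline `sh -c` body is hypothetical; a real task would call a script
# such as ./healthcheck-tasks with the site ID, environment, and log directory.
fan_out() {
  local max_jobs=$1
  xargs -P "$max_jobs" -n 1 sh -c 'printf "checked %s\n" "$1"' _
}

# Output order is not deterministic across parallel jobs, so sort for display.
printf '%s\n' site-a site-b site-c | fan_out 2 | sort
# checked site-a
# checked site-b
# checked site-c
```

The job limit plays the same role as the `--jobs` flag passed to parallel in keepalive-checks.sh.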

Usage

To use this script, follow these steps:

  1. Clone the gist and make the scripts executable:

    git clone git@gist.github.com:c0216cea7acfc042dcc78785ac1d0e7c.git keepalive-checks && cd keepalive-checks && chmod +x keepalive-checks.sh && chmod +x backend-tasks && chmod +x healthcheck-tasks
  2. Set the $SITES variable in keepalive-checks.sh to the list of customer site UUIDs that should be primed for release. See https://gist.github.com/lcatlett/c0216cea7acfc042dcc78785ac1d0e7c#file-keepalive-checks-sh-L20-L22. The easiest way to build the list is with a terminus command, for example:

    Voya clone org, sites which are tagged with "deploy":

    SITES=$(terminus org:sites voya-financial-clone --field id --tag deploy --format=list)

    Voya live org, sites using drupal platform upstream:

    SITES=$(terminus org:sites voya-financial --upstream=b3d1c0a1-7aed-417e-95a6-510f0fe6ce34 --field=id --format=list)
  3. Set the $ENV variable in keepalive-checks.sh to the desired Pantheon environment to run against - this should be the target environment for the release. See https://gist.github.com/lcatlett/c0216cea7acfc042dcc78785ac1d0e7c#file-keepalive-checks-sh-L20-L22

  4. Ensure that you have installed and configured nspct, the CSE command line tool. See https://github.com/pantheon-systems/nspct and https://dev-nspct.pantheonsite.io/getting-started.

  5. To start the keepalive tasks against a customer site portfolio, run keepalive-checks.sh from the directory that contains it:

    ./keepalive-checks.sh
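Before the first run it can help to confirm that the command line tools the scripts call are on the PATH. The following is a minimal sketch under that assumption; `missing_commands` is a hypothetical helper, and the tool list (terminus, nspct, parallel, bc) is taken from the commands the scripts invoke.

```bash
#!/usr/bin/env bash
# Print the name of every requested command that is not on PATH.
# Returns non-zero if at least one is missing.
missing_commands() {
  local status=0
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      status=1
    fi
  done
  return "$status"
}

# Report anything missing on stderr; this sketch only warns and does not exit.
if ! missing=$(missing_commands terminus nspct parallel bc); then
  echo "Missing dependencies: $(echo "$missing" | tr '\n' ' ')" >&2
fi
```

A check like this only verifies that the tools exist; it does not confirm that terminus or nspct are authenticated.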

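A logs directory is created next to the scripts (the path is resolved from the location of keepalive-checks.sh, not from your working directory). It contains keepalive-<env>.log, which records the status of each task per site, and failed-sites-<env>.txt, which lists the sites queued for retry.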
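The failed-sites file holds one line per failure in the form `[backend] <site>` or `[healthcheck] <site>`. A quick way to see how many distinct sites are queued under each tag might look like the sketch below; `summarize_failures` is a hypothetical helper, not part of the gist.

```bash
#!/usr/bin/env bash
# Count unique failed sites per tag in a failed-sites file.
# Each line is expected to look like "[backend] <site>" or "[healthcheck] <site>".
summarize_failures() {
  local file=$1
  if [ ! -s "$file" ]; then
    echo "no failures recorded"
    return 0
  fi
  # sort -u drops duplicate entries before awk counts the first field (the tag).
  sort -u "$file" |
    awk '{ count[$1]++ } END { for (tag in count) print tag, count[tag] }' |
    sort
}

# Demo with a throwaway file; the site names are made up.
demo=$(mktemp)
printf '%s\n' '[backend] a' '[backend] a' '[healthcheck] b' '[backend] c' >"$demo"
summarize_failures "$demo"
# [backend] 2
# [healthcheck] 1
rm -f "$demo"
```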

Tips / Tricks

  • Ensure that you have the required dependencies installed (e.g., terminus, parallel, etc.) before running the script.

  • Adjust the number of parallel jobs by modifying the --jobs flag in the run_healthcheck_tasks and run_backend_tasks functions.

  • Check the logs directory for execution logs and failed task retry information.

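  • To run this script at scheduled intervals or for a specific amount of time, use crontab or execute the script in a loop as a background process. Each run below is started in the background, so runs can overlap if one takes longer than the 20-minute sleep. For example:

    while true
    do
        ./keepalive-checks.sh &
        sleep 1200
    done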
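Because a single run can retry failed sites for up to two hours, a scheduled loop may start a new run while the previous one is still working. One way to avoid that is a simple lock, sketched below. `run_exclusive`, the lock path, and the exit code 75 are assumptions made for illustration; the gist itself does no locking.

```bash
#!/usr/bin/env bash
# Run a command only if no other run holds the lock directory.
# mkdir is atomic, so two callers cannot both create the same directory.
run_exclusive() {
  local lock_dir=$1
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "previous run still active, skipping" >&2
    return 75 # arbitrary non-zero code meaning "try again later"
  fi
  "$@"
  local status=$?
  rmdir "$lock_dir"
  return "$status"
}

# Usage sketch (not executed here): one run every 20 minutes, never two at once.
# while true; do
#   run_exclusive /tmp/keepalive.lock ./keepalive-checks.sh
#   sleep 1200
# done
```

If a run is killed before it removes the lock directory, the directory has to be deleted by hand before the next run can start.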
    
backend-tasks

#!/bin/bash
# This script performs backend checks for a Drupal site on a specific Pantheon environment
# It checks if Drupal is bootstrapped and retries up to 3 times if it fails.
# If the bootstrap fails after 3 retries, the site is considered failed and added to the failed-sites-<env>.txt file.
# The script logs the progress and duration of the backend tasks.
# Arguments:
# $1 - The Pantheon site name
# $2 - The environment (e.g., dev, test, live)
# $3 - The current directory
set +x
SITE=$1
ENV=$2
current_dir=$3
START=$SECONDS
# Check if Drupal is bootstrapped, retrying up to 3 times if it fails
function check_bootstrap() {
    local retries=3
    local DRUPAL_BOOTSTRAPPED=1
    local SITE=$1
    local ENV=$2
    echo "Checking if Drupal is bootstrapped"
    while [ "$retries" -gt 0 ]; do
        terminus -n drush "${SITE}.${ENV}" -- status --field=bootstrap 2>&1
        DRUPAL_BOOTSTRAPPED="$?"
        if [[ "$DRUPAL_BOOTSTRAPPED" == 0 ]]; then
            echo "Drupal bootstrapped successfully."
            break
        fi
        retries=$((retries - 1))
        if [[ $retries -eq 0 ]]; then
            echo "Drupal not bootstrapped after 3 attempts, giving up."
            DRUPAL_BOOTSTRAPPED=1
            # Bootstrap never succeeded, so add the site to the failed-sites-<env>.txt file
            echo "Backend tasks failed for $SITE.$ENV"
            echo "[backend] $SITE" >>"${current_dir}/logs/failed-sites-$ENV.txt"
        else
            # Only wait when another attempt will follow
            echo "Drupal not bootstrapped, waiting 15 seconds and checking again."
            sleep 15
        fi
    done
    return $DRUPAL_BOOTSTRAPPED
}
# Perform backend tasks for a Drupal site
function backend_tasks() {
    local SITE=$1
    local ENV=$2
    local current_dir=$3
    local log_file="${current_dir}/logs/keepalive-$ENV.log"
    local failed_file="${current_dir}/logs/failed-sites-$ENV.txt"
    START=$SECONDS
    echo "Starting backend tasks on ${SITE}.${ENV}" | sed "s/^/[$SITE] /" >>"$log_file"
    check_bootstrap "$SITE" "$ENV" 2>&1 | sed "s/^/[$SITE] /" >>"$log_file"
    # $? would report sed's exit status here; PIPESTATUS[0] holds check_bootstrap's.
    if [[ ${PIPESTATUS[0]} -eq 0 ]]; then
        echo "Site backend tasks successful for $SITE.$ENV" | sed "s/^/[$SITE] /" >>"$log_file"
        # If the site was previously marked as failed, remove it from the failed-sites-$ENV.txt file.
        if [[ -f "$failed_file" ]]; then
            sed -i "/$SITE/d" "$failed_file"
        fi
    else
        echo "Site backend tasks failed for $SITE.$ENV" | sed "s/^/[$SITE] /" >>"$log_file"
        # check_bootstrap normally records the failure already; avoid a duplicate entry.
        grep -qxF "[backend] $SITE" "$failed_file" 2>/dev/null || echo "[backend] $SITE" >>"$failed_file"
    fi
    # Report time to results.
    DURATION=$((SECONDS - START))
    TIME_DIFF=$(bc <<<"scale=2; $DURATION / 60")
    MIN=$(printf "%.2f" "$TIME_DIFF")
    echo "Finished ${SITE}.${ENV} in ${MIN} minutes" | sed "s/^/[$SITE] /" >>"$log_file"
}
# Call the backend_tasks function with the provided arguments
backend_tasks "$SITE" "$ENV" "$current_dir"
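A note on the logging pattern used in backend-tasks: when a command is piped through sed to prefix its output, `$?` afterwards is the exit status of sed, not of the command. In bash, the `PIPESTATUS` array keeps the status of every stage. The sketch below shows the difference; `failing_task` and `run_labeled` are hypothetical names, and the label is assumed not to contain characters that are special to sed.

```bash
#!/usr/bin/env bash
# Hypothetical task that prints a line and then fails with status 3.
failing_task() {
  echo "task output"
  return 3
}

# Prefix each output line with a label, but return the task's own exit
# status rather than sed's.
run_labeled() {
  local label=$1
  shift
  "$@" 2>&1 | sed "s/^/[$label] /"
  return "${PIPESTATUS[0]}"
}

status=0
run_labeled site-a failing_task || status=$?
echo "exit status: $status"
# [site-a] task output
# exit status: 3
```

`PIPESTATUS` is overwritten by the next command, so it has to be read immediately after the pipeline.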
healthcheck-tasks

#!/bin/bash
set +x
SITE=$1
ENV=$2
current_dir=$3
START=$SECONDS
# Function to perform healthchecks on a Pantheon site in a specific environment
#
# Arguments:
# $1 - The Pantheon site name
# $2 - The environment (e.g., dev, test, live)
# $3 - The current directory
#
# Output:
# Writes the execution logs to "${current_dir}/logs/keepalive-$ENV.log"
# Writes the failed site names to "${current_dir}/logs/failed-sites-$ENV.txt" if healthchecks fail
# Appends the execution time to "${current_dir}/logs/keepalive-$ENV.log"
function healthcheck_tasks() {
    local SITE=$1
    local ENV=$2
    local current_dir=$3
    local failed_file="${current_dir}/logs/failed-sites-$ENV.txt"
    START=$SECONDS
    echo "Executing healthchecks on Pantheon site $SITE in $ENV environment" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
    # If the healthcheck exits non-zero, add the site to the failed-sites-<env>.txt file (once).
    # Otherwise clear any earlier [healthcheck] entry so the site is not retried again.
    if ! nspct appserver:healthcheck --all --uuid="$SITE" --env="$ENV"; then
        echo "Healthchecks failed for $SITE.$ENV"
        grep -qxF "[healthcheck] $SITE" "$failed_file" 2>/dev/null || echo "[healthcheck] $SITE" >>"$failed_file"
    elif [[ -f "$failed_file" ]]; then
        sed -i "/^\[healthcheck\] $SITE\$/d" "$failed_file"
    fi
    # Report time to results.
    DURATION=$((SECONDS - START))
    TIME_DIFF=$(bc <<<"scale=2; $DURATION / 60")
    MIN=$(printf "%.2f" "$TIME_DIFF")
    echo "Finished ${SITE}.${ENV} in ${MIN} minutes" | sed "s/^/[$SITE] /" >>"${current_dir}/logs/keepalive-$ENV.log"
}
# Call the healthcheck_tasks function with the provided arguments
healthcheck_tasks "$SITE" "$ENV" "$current_dir"
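Both task scripts maintain the same failed-sites file, so it can be useful to keep the add and remove operations idempotent. The helpers below are a minimal sketch of that idea; `mark_failed` and `clear_failed` are hypothetical names, and the sketch assumes only one process edits the file at a time (parallel jobs would need a lock around these calls).

```bash
#!/usr/bin/env bash
# Record "[tag] site" in the failure list unless that exact line already exists.
mark_failed() {
  local file=$1 tag=$2 site=$3
  touch "$file"
  grep -qxF "[$tag] $site" "$file" || printf '[%s] %s\n' "$tag" "$site" >>"$file"
}

# Remove "[tag] site" from the failure list, leaving other entries untouched.
clear_failed() {
  local file=$1 tag=$2 site=$3
  [ -f "$file" ] || return 0
  # grep -v exits non-zero when nothing is left to print, which is fine here.
  grep -vxF "[$tag] $site" "$file" >"$file.tmp" || true
  mv "$file.tmp" "$file"
}

# Demo with a throwaway file; the site name is made up.
list=$(mktemp)
mark_failed "$list" backend site-a
mark_failed "$list" backend site-a # second call adds nothing
mark_failed "$list" healthcheck site-a
clear_failed "$list" backend site-a
cat "$list"
# [healthcheck] site-a
rm -f "$list"
```

Matching whole lines with `grep -xF` avoids treating the site ID or the square brackets as a regular expression.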
keepalive-checks.sh

#!/bin/bash
# Wrapper script to ensure that services are healthy and ready for code push and deployment tasks.
# This script hardens the release process by adding:
# - Parallel execution of healthcheck and deployment tasks on many sites
# - Race condition handling for workflows and terminus commands
# - Automatic retry / wake of failed sites services
# To use this script, run:
# ./keepalive-checks.sh
# To run this script as a background process every 20 mins, use crontab or execute:
# while true
# do
# ./keepalive-checks.sh &
# sleep 1200
# done
set +x
current_dir=$(dirname "$(realpath "${BASH_SOURCE[0]}")")
ENV="dev"
# Array of sites to execute (one site UUID per element)
mapfile -t SITES < <(terminus org:sites voya-financial-clone --field id --tag deploy --format=list)
#mapfile -t SITES < <(terminus org:sites voya-financial --upstream=b3d1c0a1-7aed-417e-95a6-510f0fe6ce34 --field=id --format=list)
# Export vars to be used in other scripts. Bash cannot export arrays, so the
# site list is passed to the task functions as arguments instead.
export current_dir
export ENV
# Create logs directory if it does not exist
mkdir -p "${current_dir}/logs"
# Create file with a list of sites to retry in failed-sites-<env>.txt.
# A site will be added to this list if code push or deployment tasks fail.
touch "${current_dir}/logs/failed-sites-$ENV.txt"
# Run backend-tasks which ensures that Drupal is bootstrapped and ready to execute release tasks.
# Adjust the number of jobs by modifying the --jobs flag, e.g. --jobs 30
function run_backend_tasks() {
    local SITES=("$@")
    echo "Running backend tasks on Pantheon $ENV environment"
    # Call the task script by its full path so the working directory does not matter
    printf '%s\n' "${SITES[@]}" | parallel --jobs 10 "${current_dir}/backend-tasks" {} "$ENV" "$current_dir"
}
# Run healthcheck-tasks which ensures that services are healthy prior to attempting to execute backend deployment tasks.
# Adjust the number of jobs by modifying the --jobs flag, e.g. --jobs 30
function run_healthcheck_tasks() {
    local SITES=("$@")
    echo "Running healthcheck tasks on Pantheon $ENV environment"
    # Run the healthcheck-tasks script in parallel, one job per site
    printf '%s\n' "${SITES[@]}" | parallel --jobs 10 "${current_dir}/healthcheck-tasks" {} "$ENV" "$current_dir"
}
# Retry failed tasks for up to 2 hours until all sites are validated to be healthy and ready for deployment.
# A site can be in the failed-sites-<env>.txt file if it fails to run healthcheck or backend tasks successfully.
# run_healthcheck_tasks is re-run for each failed site listed in failed-sites-<env>.txt with the [healthcheck] tag.
# run_backend_tasks is re-run for each failed site listed in failed-sites-<env>.txt with the [backend] tag.
# Update timeout to change the maximum time to retry failed sites.
function retry_failed_tasks() {
    local failed_file="${current_dir}/logs/failed-sites-$ENV.txt"
    local log_file="${current_dir}/logs/keepalive-$ENV.log"
    echo "Retrying failed sites in the Pantheon $ENV environment" >>"$log_file"
    local start_time=$(date +%s)
    local timeout=7200 # 2 hours
    while [ -s "$failed_file" ]; do
        local current_time=$(date +%s)
        local elapsed_time=$((current_time - start_time))
        if [ "$elapsed_time" -ge "$timeout" ]; then
            echo "Timeout reached without resolving all site failures." >>"$log_file"
            break
        fi
        # Strip the tag so only the site ID is passed on, and clear the entries being
        # retried; a task that fails again adds its site back to the file.
        if grep -q '^\[healthcheck\]' "$failed_file"; then
            mapfile -t SITES < <(sed -n 's/^\[healthcheck\] //p' "$failed_file" | sort -u)
            sed -i '/^\[healthcheck\] /d' "$failed_file"
            run_healthcheck_tasks "${SITES[@]}"
        else
            mapfile -t SITES < <(sed -n 's/^\[backend\] //p' "$failed_file" | sort -u)
            sed -i '/^\[backend\] /d' "$failed_file"
            run_backend_tasks "${SITES[@]}"
        fi
    done
    echo "Finished retrying failed sites" >>"$log_file"
}
run_healthcheck_tasks "${SITES[@]}"
run_backend_tasks "${SITES[@]}"
retry_failed_tasks
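The retry step boils down to re-running work until it succeeds or a deadline passes. A generic version of that pattern, with a pause between attempts, might look like the sketch below. `retry_until` and its parameters are hypothetical; the gist's own loop has no pause and works on the failed-sites file rather than on a single command.

```bash
#!/usr/bin/env bash
# Re-run a command until it succeeds or a deadline passes.
# $1 = timeout in seconds, $2 = pause between attempts, rest = the command.
retry_until() {
  local timeout=$1 pause=$2
  shift 2
  local start now
  start=$(date +%s)
  until "$@"; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timeout after ${timeout}s" >&2
      return 1
    fi
    sleep "$pause"
  done
}

# Usage sketch (not executed here): retry a hypothetical check for up to 2 hours,
# waiting 60 seconds between attempts.
# retry_until 7200 60 ./healthcheck-tasks "$SITE" "$ENV" "$current_dir"
```

The deadline is only checked between attempts, so a single long-running attempt can push the total time past the timeout.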