Production Readiness Checklist

Production Readiness April 2018

Although the pace of change at Made has slowed over the last 12 months, we are still testing new techniques and re-examining best practices. It's useful to periodically reflect on what good practice looks like and make sure we're spreading that knowledge to our dev teams. This is a quick guide to what the Ops team need from developers in 2018. If your systems don't comply with these guidelines, you should chat to a friendly ops person so we can help you prioritise and fix issues.

Operators.md

Ops can't operate your system if they don't know how to work it. Every system should have an operators.md file in the root of the github repository that describes:

  • What the system does.
  • What the business impact of an outage is.
  • How to deploy the system.
  • How the system is backed up.
  • How to tell whether the system is running correctly.

The service should be added to the Service Matrix page on the intranet wiki.
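As a starting point, a minimal operators.md skeleton might look something like this (the headings are only a suggestion; cover whatever your service actually needs):

```markdown
# Operators Guide: <service name>

## What it does
One paragraph on the service's purpose and its main dependencies.

## Business impact of an outage
Who is affected and how, e.g. "customers cannot see delivery estimates".

## How to deploy
Link to the Jenkins job and the Tower playbook, plus any manual steps.

## Backups
What is backed up, where it lives, and how to restore it.

## How to tell it's running correctly
Link to the Grafana dashboard and the /_health endpoint, plus any known, expected ERRORs.
```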

Infrastructure Playbook

Sometimes bad things happen and we need to rebuild your systems. Every system should have an Ansible playbook that creates all the infrastructure needed by the project. This includes:

  • RDS, Elasticsearch, or ElastiCache databases.
  • DNS entries.
  • IAM groups, policies, and users.
  • Vault policies.
  • Load balancers.
  • EC2 instances.

The playbook should be described in your operators.md and set up in Tower with a survey so that developers and operators can build your system from scratch. Occasionally we manually configure part of our infrastructure, particularly when trialling new technologies. We should explicitly describe the manual configuration in the operators.md and plan to automate it as soon as possible.
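As a rough illustration, a playbook along these lines captures the idea; the hosts, resource names, and module arguments below are invented and trimmed for the example:

```yaml
# infrastructure.yml -- illustrative only; names and module arguments are placeholders
- hosts: localhost
  connection: local
  tasks:
    - name: Create the RDS instance
      rds:
        command: create
        instance_name: "myservice-{{ env }}"
        db_engine: postgres
        size: 20
        instance_type: db.t2.small
        username: "{{ db_user }}"
        password: "{{ db_password }}"

    - name: Create a DNS entry pointing at the load balancer
      route53:
        command: create
        zone: "example.internal"
        record: "myservice.{{ env }}.example.internal"
        type: CNAME
        value: "{{ elb_dns_name }}"
```

Running the same playbook through a Tower survey means variables like env, db_user, and db_password can be prompted for rather than hard-coded.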

Docker-Compose + Makefile

We use Docker to package our applications and to support our development workflows. It's helpful to developers if we have a standard way of running apps locally. Recently we've been standardising on a docker-compose.yml and a Makefile.

Applications should use Docker Compose to describe the images their system uses. In order to run on Jenkins, your docker-compose.yml must not bind to any static ports; instead, split port bindings and development volumes into a separate docker-compose.dev.yml. Hacienda and Emporio use this pattern.
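A sketch of the split (service names and ports are invented for the example): the base file defines the containers with no host ports, and the dev overlay adds the bindings and source mounts.

```yaml
# docker-compose.yml -- no host ports bound, so parallel Jenkins builds don't clash
version: "2"
services:
  app:
    build: .
    depends_on:
      - redis
  redis:
    image: redis:alpine
```

```yaml
# docker-compose.dev.yml -- local-only port bindings and source mounts
version: "2"
services:
  app:
    ports:
      - "8080:8080"
    volumes:
      - ./src:/app/src
```

Locally you run both files together with docker-compose -f docker-compose.yml -f docker-compose.dev.yml up, while Jenkins only ever uses the base file.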

Older applications tend to have shell scripts that perform initial setup, run containers, and execute test runs. We have started using Makefiles for this purpose. Using Make gives developers a standard way to set up and run applications.

Applications should provide a Makefile in the root of their repository. The default task should install any dependencies and run any unit tests. We use Make to wrap our calls to docker-compose and to perform build-time tasks. Good examples include Cancellation and Emporio.
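A minimal sketch of the pattern; only the default task is required by this guideline, and the other target names and the test command inside the container are just examples:

```make
# Makefile -- recipe lines must be indented with tabs
.PHONY: all build test up

all: build test   # default task: build the images and run the unit tests

build:
	docker-compose build

test:
	docker-compose run --rm app run-unit-tests

up:
	docker-compose -f docker-compose.yml -f docker-compose.dev.yml up
```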

Structured Logs to STDOUT

We can't manage your application in production if we can't see it running. We have a mature logging stack that can handle large volumes of log data; please use it thoughtfully.

Systems should log to STDOUT. Application logs should be json-formatted with @message and @timestamp fields. Ops will be introducing new standards for json logging schema soon. If your application handles HTTP traffic, we can also process nginx logs that are written to STDOUT, but consider whether you need this data, or whether ELB traffic logs will suffice.
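For example, a single log line might look like the following; only @message and @timestamp are required by the current convention, and the other fields are illustrative contextual identifiers:

```json
{
  "@timestamp": "2018-04-18T13:14:00.000Z",
  "@message": "Order placed",
  "level": "INFO",
  "correlation_id": "8f6c2d9e-0b1a-4c7d-9e2f-3a4b5c6d7e8f",
  "order_id": 123456
}
```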

Systems should ensure that their logs are written with a sensible verbosity, and include contextual identifiers for later correlation. There is a document on the Wiki that covers our recommendations for logging.

If you log an ERROR, we will forward it automatically to Slack. Please take the time to fix ERRORs, either by reducing the log level or fixing bugs. If an ERROR can't be fixed or reduced, make note of it in the operators.md so that ops and devs understand what it means. For example, Availability periodically experiences timeouts to the Redis server. This is an error situation, and if it persists it requires manual intervention to fix, but there's no bug that can be fixed to completely solve the problem. This ERROR should be described in the operators.md.

Jenkinsfile

Jenkins is our build server of necessity, if not choice. In recent months we have been writing Jenkinsfiles for our services. A Jenkinsfile describes the steps for a build on Jenkins 2. Using a Jenkinsfile to control the build means we can minimise the amount of configuration we need on the build server, and that developers are in control of their own build processes.

Services should have a Jenkinsfile in the root of their git repository. The build should result in a Docker image pushed to the registry. If possible, the build should push the new image out to the test environment. We should run unit tests and acceptance tests as part of our build processes. We should be able to tie a Docker image back to the build and commit that created it.

If your service has a Makefile, the Jenkinsfile can consist of calls to make. This means that developers can perform exactly the same steps locally in order to diagnose problems.
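A minimal declarative-pipeline sketch of this idea; the stage names and the non-default make targets (push, deploy-test) are invented for the example:

```groovy
// Jenkinsfile -- thin wrapper around make, so CI and local runs do the same thing
pipeline {
    agent any
    stages {
        stage('Build and unit test') {
            steps {
                sh 'make'              // default task: build and run unit tests
            }
        }
        stage('Publish image') {
            steps {
                sh 'make push'         // hypothetical target: tag and push the image to the registry
            }
        }
        stage('Deploy to test') {
            steps {
                sh 'make deploy-test'  // hypothetical target: roll the new image out to the test environment
            }
        }
    }
}
```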

Ops need to do some more work to fix git tag support in Jenkins for use in build numbers, and we may provide a DSL step to make calling Make from Jenkins more natural.

Availability and Comms are good examples of this pattern.

Record Application Metrics

When there is a problem with your application in production, the first thing ops want to check is your metrics.

Systems should record business-level metrics about their application. If you have a web app, you should record metrics for request throughput and latencies tagged by status. If you have an event consumer, you should ship metrics for the number and type of events you're processing. Along with this, consider what the primary aim of your system is. Are you taking payments? Record a payments_received metric. Are you selling furniture? Record orders_placed.
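As an illustration, assuming a statsd-style client (the Python statsd package here; the host, port, prefix, and metric names are invented), recording those business-level metrics is only a couple of lines:

```python
# Requires the `statsd` package; host, port, prefix, and metric names are illustrative.
import statsd

stats = statsd.StatsClient("localhost", 8125, prefix="checkout")

def place_order(order):
    with stats.timer("place_order.duration"):  # latency of the operation
        ...                                    # business logic goes here
    stats.incr("orders_placed")                # business-level counter
```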

Systems should have a dashboard available in Grafana that shows the most important health indicators. The dashboard should be linked in the operators.md file, and you should explain how to read the dashboard, and any common patterns to watch for.

We have more information and suggestions for metrics on our Wiki.

Expose a Health Endpoint

For loadbalancing and container provisioning, it's useful to have a healthcheck that tells us a system is up and running.

Services should expose a URI that we can call to test whether the system is running. By convention, we use /_health as the endpoint. Ideally, we would show some recent metrics at that location, as Availability does. It can be useful to return the currently deployed version number, like Cancellation. The only real requirement is that your health endpoint must return a 200 OK if and only if your system is up and able to serve requests. Please document your health endpoint in your operators.md with a description of the response payload.

If your health endpoint is public (like Availability's), you should take steps to ensure it can't be used as a denial-of-service vector, e.g. by caching the result for 30 seconds.
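A sketch of the pattern, using Flask purely for illustration; the version string and the dependency check are placeholders:

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)
VERSION = "1.2.3"                                # in practice, injected at build time
_cache = {"checked_at": 0.0, "healthy": False}

def dependencies_ok():
    # Replace with real checks: database ping, queue connection, etc.
    return True

@app.route("/_health")
def health():
    now = time.time()
    if now - _cache["checked_at"] > 30:          # re-check at most every 30 seconds
        _cache["healthy"] = dependencies_ok()
        _cache["checked_at"] = now
    status = 200 if _cache["healthy"] else 503   # 200 OK if and only if we can serve requests
    return jsonify(version=VERSION, healthy=_cache["healthy"]), status
```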

Use Alpine + S6

Over time we've built containers in a variety of ways. For some time, our recommendation has been to use Alpine as the base operating system, with S6 as a service manager. S6 is a lightweight process manager that runs as PID 1 inside a container and handles signals and zombie reaping.

Systems should deploy docker containers based on the latest alpine-s6 image from the ops docker registry. Ops can help you with structuring your docker container and running your app under S6.

It's helpful if we use the Filesystem Hierarchy Standard in containers. If your application needs deployment-time setup, e.g. creating Eventstore streams or running database migrations, we can host these in your container as part of your S6 initialisation stage.

Recently, we have preferred to lay out the container filesystem in a separate directory of the repository named root or docker-root, which simplifies the Dockerfile and keeps the repository clean. See, for example, El-Rec or Availability.
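A sketch of the layout, assuming the usual s6-overlay conventions of /etc/services.d/ for services and /etc/cont-init.d/ for init scripts; the base image path depends on the Ops registry and is a placeholder here:

```dockerfile
# Dockerfile -- the registry host below is a placeholder
FROM registry.example.internal/ops/alpine-s6:latest

# Copy the whole container filesystem in one step: S6 service definitions,
# init scripts (e.g. migrations), config, and application code all live under root/.
COPY root/ /

# S6 runs as PID 1 and supervises whatever was copied into /etc/services.d/.
```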

Move Config to Consul and Vault

When we first started deploying containers to AWS, we kept our configuration in yaml files and applied them to containers using environment variables. Over time we have been moving more of our configuration to Consul. This allows us to change configuration without redeploying code.

Systems should keep config in Consul and support reloading the application when configuration changes. Ops can provide guidance on how to perform service discovery and how to consume key/value data from Consul. Good examples of this pattern include Hacienda, Emporio, and Comms.
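A minimal sketch of reading and watching a key, assuming the python-consul client; the key path and the way the value is used are invented for the example:

```python
# Requires the `python-consul` package; the key path is a placeholder.
import consul

c = consul.Consul()  # talks to the local Consul agent on 127.0.0.1:8500 by default

def current_value(key="config/myservice/feature-flag"):
    index, item = c.kv.get(key)
    return item["Value"].decode() if item else None

def watch(key="config/myservice/feature-flag"):
    """Long-polls Consul and yields the new value each time the key changes."""
    index = None
    while True:
        index, item = c.kv.get(key, index=index)  # blocks until the key changes
        yield item["Value"].decode() if item else None
```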

Systems should use appropriate labels on their containers so that Consul adds them to the service catalogue.

We are slowly moving secrets, including AWS credentials, database passwords etc. to the Vault server. This allows developers to manage their own secrets in a secure way without relying on Dropbox and similar hacks.

Run Your Application on Nomad

Most of our applications run as stateless docker containers that consume events or expose an HTTP API. These services are suitable for deployment on our Nomad cluster.

Deploying apps to a Nomad cluster reduces the amount of hardware we need, and provides greater resilience. We are slowly rolling out Nomad to more systems. If you are starting a system from scratch, Nomad should be your default.

Developers should speak to Ops to get guidance on how to migrate their systems to Nomad and how to manage their deployments.
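For orientation, a Nomad job file for a stateless web service looks roughly like this; the image, counts, resources, and ports below are all invented for the example:

```hcl
# myservice.nomad -- illustrative only
job "myservice" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 2

    task "app" {
      driver = "docker"

      config {
        image = "registry.example.internal/myservice:latest"
        port_map {
          http = 8080
        }
      }

      resources {
        cpu    = 200   # MHz
        memory = 256   # MB
        network {
          port "http" {}
        }
      }

      # Registers the task in Consul's service catalogue and wires up the health check.
      service {
        name = "myservice"
        port = "http"
        check {
          type     = "http"
          path     = "/_health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```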
