The main goal of the Site Reliability Engineering (SRE) team is to create scalable, highly reliable software systems, together with feedback mechanisms that provide insight into how to optimize those systems.
The SRE team's contributions drive improvements across the following areas: software development, availability, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
- Help developers deliver more value to users faster.
- Create and maintain the CI/CD process and optimize operations.
- Release engineering for CE teams.
- Bring together the workflows and responsibilities of operations and CE teams.
- Use infrastructure as code to manage environments in a repeatable and more efficient manner.
- Collaboratively decide on a system's availability targets.
- Measure availability with input from engineers and product owners.
- Have a formula for balancing accidents and failures against new releases.
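One common way to express such a balancing formula is an error budget: the availability target implies how much downtime is tolerable per window, and risky releases can proceed while budget remains. A minimal sketch in Python; the SLO value, window length, and downtime figure are hypothetical:

```python
# Hypothetical error-budget check: given an availability SLO and the
# downtime observed so far, how much budget remains for new releases?
def error_budget_minutes(slo: float, window_minutes: int = 30 * 24 * 60) -> float:
    """Total allowed downtime in the window for a given SLO (e.g. 0.999)."""
    return (1.0 - slo) * window_minutes

def remaining_budget(slo: float, downtime_minutes: float) -> float:
    """Budget left after subtracting downtime already spent."""
    return error_budget_minutes(slo) - downtime_minutes

print(round(error_budget_minutes(0.999), 1))   # total budget for a 30-day window
print(round(remaining_budget(0.999, 10.0), 1)) # budget left after 10 min down
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime; spending that budget on incidents leaves less room for release-related risk.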
Setting the right expectations is critical for meeting deadlines and completing tasks.
Our principles:
- We emphasize that the application owners, not SREs, are directly responsible for making changes to an application.
- SRE engagement is for company-wide benefit. Any new automation or tooling should improve common tools and automation used across the company and avoid one-off script development.
- SREs should give the product development team a heads up about any new processes the engagement might introduce (for example, load testing).
Our common pattern when setting goals is to:
- Define the scope of the engagement.
- Identify the end-result success story and call it out explicitly.
To make sure our release processes meet business requirements, release engineers and SREs work together to develop strategies for gradually releasing new features, pushing out new releases without interrupting services, and rolling back features that demonstrate problems.
SREs use automation to implement progressive rollouts, detect problems quickly and accurately, and roll back changes safely when problems arise.
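For example, a progressive rollout might compare a canary's error rate against the stable baseline to decide when to roll back automatically. A minimal sketch; the tolerance threshold and traffic numbers are hypothetical:

```python
# Hypothetical rollback check used during a progressive rollout:
# compare the canary's error rate against the stable baseline's.
def should_roll_back(baseline_errors: int, baseline_total: int,
                     canary_errors: int, canary_total: int,
                     tolerance: float = 2.0) -> bool:
    """Roll back when the canary error rate exceeds `tolerance` x baseline."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate > tolerance * baseline_rate

# Canary at 4% errors vs a 0.05% baseline: clearly unhealthy.
print(should_roll_back(5, 10_000, 40, 1_000))
```

In practice a check like this would run continuously against monitoring data at each rollout stage, gating promotion to the next stage.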
In order to work at scale, teams must be self-sufficient. SREs develop best practices and tools that allow product development teams to control and run their own release processes.
Build tools must allow SREs to ensure consistency and repeatability. The build process must be self-contained: it must not rely on services external to the build environment, and it must be insensitive to the libraries and other software installed on the build machine.
We use Docker to create images and run those images as containers. Private images that are to be deployed in the clusters are expected to be pushed to the turner organization on quay.io.
The CI server builds an image tag with the following parts:

quay.io/company/name-of-package:1.0.0-jenkins.master.3795

- container_host: quay.io/company
- package_name: name-of-package
- semver: 1.0.0-jenkins.master.3795
The semver is a full version with build metadata, for example 1.0.0-jenkins.master.13280:

- package_version: 1.0.0
- build host: jenkins
- branch: master
- build number: 13280
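Putting the two breakdowns together, an image tag can be split back into its named parts. A minimal sketch in Python, assuming the tag layout described above:

```python
import re

# Regex for the image-tag layout described in this section:
#   <container_host>/<package_name>:<version>-<build_host>.<branch>.<build_number>
TAG_RE = re.compile(
    r"^(?P<container_host>[^/]+/[^/]+)"      # quay.io/company
    r"/(?P<package_name>[^:]+)"              # name-of-package
    r":(?P<package_version>\d+\.\d+\.\d+)"   # 1.0.0
    r"-(?P<build_host>[^.]+)"                # jenkins
    r"\.(?P<branch>[^.]+)"                   # master
    r"\.(?P<build_number>\d+)$"              # 13280
)

parts = TAG_RE.match("quay.io/company/name-of-package:1.0.0-jenkins.master.13280")
print(parts.groupdict())
```

A parser like this is handy in deploy tooling, for example to surface the branch and build number of whatever is currently running.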
The key to a true CI/CD workflow starts with the branching strategy. The three key branch types in this workflow are:
- Trunk
- Short-lived feature branches
- Release branches
This branching strategy is a key enabler of Continuous Integration (CI) and by extension Continuous Delivery (CD). All developers work on short-lived feature branches off of a trunk branch (master in this case). CI jobs that execute an automated test suite are run against every commit. If the build is green, a pull request can be merged pending proper approval.
title CD Workflow
participant Developer
participant Feature Branch (Git)
participant Pull Request
participant Feature Branch QA/UAT (AWS)
participant Master Branch
participant QA (AWS)
participant Release Tag (Git)
participant Staging (AWS)
participant Production (AWS)
Developer->Feature Branch (Git): Commit code
Feature Branch (Git)->Pull Request: Create pull request
Pull Request->Feature Branch QA/UAT (AWS): Deploy feature branch\nto <branch_name>.company.com
Pull Request->Developer: Request approval
Developer->Pull Request: Approve pull request
Pull Request->Master Branch: Merge feature branch into master
Master Branch->QA (AWS): Deploy master branch\nto qa.company.com
Master Branch->Release Tag (Git): Create tag v0.0.1
Release Tag (Git)->Staging (AWS): Deploy tag v0.0.1\nto staging.company.com
Release Tag (Git)->Production (AWS): Deploy tag v0.0.1\nto company.com
Each release should contain a changelog with a chronologically ordered list of notable changes. Because it lets SREs see exactly which changes are included in a new release of a project, the changelog can expedite troubleshooting when there are problems with a release.
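A changelog entry can be as simple as a rendered, newest-first list of dated changes. A minimal sketch in Python; the entries, categories, and version are hypothetical:

```python
from datetime import date

# Hypothetical changelog entries: (date, category, description).
changes = [
    (date(2020, 3, 1), "Added", "Progressive rollout flag for the deploy job"),
    (date(2020, 3, 2), "Fixed", "Retry failed health checks before restarting"),
]

def render_changelog(version: str, entries) -> str:
    """Render one release's changelog section, newest change first."""
    lines = [f"## {version}"]
    for day, category, text in sorted(entries, reverse=True):
        lines.append(f"- {day.isoformat()} [{category}] {text}")
    return "\n".join(lines)

print(render_changelog("v0.0.1", changes))
```

Generating this from tagged commits in CI keeps the changelog accurate without relying on anyone remembering to update it by hand.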
Kubernetes is an open source, declarative platform for running containers at scale. Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing.
These abstractions are represented by objects in the Kubernetes API. The basic Kubernetes objects include:
- Pods: The smallest schedulable unit in Kubernetes. A Pod is one or more containers running on the same host with the configuration defined in your Deployment. One Pod can be thought of as one instance of your application.
- Deployments: This is where the specification for your application lives. In other words, this is where you define how many Pods of your application will run, what environment variables they will have, how Kubernetes will health check those pods, etc.
- Service: The network name that can be addressed within the cluster to send traffic to the collection of pods configured in your Deployment. This is a simple DNS entry that round-robins traffic between all of your Deployment's Pods. You hook this up to an Ingress object to get traffic to your application from outside of the cluster.
- Ingress: This usually maps 1:1 to a load balancer in your cloud provider. In our current configuration, each Ingress object is used to configure an Application Load Balancer in AWS.
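To illustrate how these objects relate, here is a minimal sketch of a Deployment and the Service that fronts its Pods; the names, image, and ports are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend            # hypothetical application name
spec:
  replicas: 3                   # three Pods, i.e. three instances of the app
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web-frontend
          image: quay.io/company/web-frontend:1.0.0-jenkins.master.3795
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-frontend            # DNS name addressable inside the cluster
spec:
  selector:
    app: web-frontend           # routes to the Deployment's Pods above
  ports:
    - port: 80
      targetPort: 8080
```

An Ingress object would then reference the Service by name to expose it outside the cluster.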
kubectl is the command-line tool for all things Kubernetes. It's the official way to interact with a Kubernetes cluster.
EKS stands for Elastic Container Service for Kubernetes. It's Amazon's offering for managed Kubernetes as a service and what we're using for our production clusters.
Rancher is an open source project that provides a web GUI on top of Kubernetes. Most day-to-day tasks, like creating a new Deployment, upgrading an image, and viewing application logs, can all be done through the Rancher UI. In Rancher, the combination of a Deployment and its Services is called a Workload.
- Audit Logs should consist of:
- the action taken
- the identity that performed the action
- the source IP address from which the action was taken
- the resource(s) or record(s) upon which the action was taken
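The four fields above can be captured in one structured record, which keeps audit logs machine-parseable. A minimal sketch in Python; the field values and naming scheme are hypothetical:

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical audit-log record carrying the four fields listed above.
@dataclass
class AuditRecord:
    action: str                          # the action taken
    identity: str                        # who performed the action
    source_ip: str                       # where the action came from
    resources: list = field(default_factory=list)  # what the action touched

entry = AuditRecord(
    action="deployment.update",
    identity="alice@company.com",
    source_ip="10.0.4.17",
    resources=["deployments/web-frontend"],
)
# Emit one JSON object per line, ready for log aggregation.
print(json.dumps(asdict(entry)))
```

Emitting one JSON object per action makes the log easy to filter by identity, source IP, or resource during an investigation.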
To make sure that SREs are doing high-value work, we measure the impact of the engagement by conducting a point-in-time assessment with leads of the product engineering team before starting the engagement. After the engagement ends and the development team is performing on its own, we perform the assessment again to measure the value SRE added.