Let's take a look at how the service mesh protects your applications.
We already have an application deployed, and by default the service mesh protects it.
Let's look at this example.
k apply -f ./security/basic.yaml
This is the most basic example; it allows any connection to reach the api service.
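As a rough sketch, assuming this demo uses Consul's ServiceIntentions CRD, basic.yaml might look something like the following; the resource contents are illustrative, not taken from the repo:

```yaml
# Hypothetical sketch of ./security/basic.yaml (Consul ServiceIntentions CRD)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: api
spec:
  destination:
    name: api        # the service being protected
  sources:
    - name: "*"      # any source service in the mesh
      action: allow  # L4 intention: allow all connections
```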
We can also enforce intentions based on L7 attributes; let's add the rest of the basic examples so that you can see things working.
First the payments service
k apply -f ./security/payments.yaml
Then the gRPC currency service
k apply -f ./security/currency.yaml
Let's examine how this works. I am going to grab a shell on a service-mesh-enabled pod.
If I curl the local service, you can see that everything works as expected: the local service makes an outbound call to the data plane, which redirects it to the proxy at the other end.
curl localhost:9090
If I try to make that call directly
curl payments.default.svc
it resolves correctly; however, if I try to go straight to the pod, it will fail.
Let me just disable that rule in the control plane.
You can see the request failing,
and if we add the rule back, it works again.
Because we are using a software-defined network, we can actually be quite smart about what can access our services.
Consider this example: what if we only want to allow access to certain paths? Let's apply the following configuration and see what happens.
k apply -f ./security/payments_deny.yaml
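To give a feel for what a path-based L7 intention looks like, here is a hedged sketch of what payments_deny.yaml could contain, assuming Consul's ServiceIntentions CRD; the /admin path is purely hypothetical:

```yaml
# Hypothetical sketch of ./security/payments_deny.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: payments
spec:
  destination:
    name: payments
  sources:
    - name: api
      permissions:
        - action: deny           # block this path only
          http:
            pathPrefix: /admin   # hypothetical path, not from the repo
        - action: allow          # allow everything else
          http:
            pathPrefix: /
```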
This also works with gRPC. Let's see it in action: the currency service is a gRPC service, so the RBAC looks like the following example.
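As a sketch, assuming Consul's ServiceIntentions CRD, where gRPC methods are matched as HTTP/2 paths, the configuration could look like this; the exact method path depends on the service's proto definition:

```yaml
# Hypothetical sketch of method-level gRPC RBAC for the currency service
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: currency
spec:
  destination:
    name: currency
  sources:
    - name: api
      permissions:
        - action: allow
          http:
            # gRPC calls are HTTP/2 requests to /<Service>/<Method>
            pathExact: /FakeService/Handle
```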
This allows the endpoint to be accessed, but grpcurl still fails, because grpcurl relies on the gRPC reflection API, which is not permitted.
grpcurl -plaintext -d '{}' currency.ingress.shipyard.run:18443 FakeService.Handle
To add this, we can enable access to the ServerReflection service.
k apply -f ./security/currency_with_reflection.yaml
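The reflection service is just another gRPC service, so the extra rule might look like this sketch (the permissions list is illustrative, not taken from the repo):

```yaml
# Hypothetical sketch of ./security/currency_with_reflection.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: currency
spec:
  destination:
    name: currency
  sources:
    - name: api
      permissions:
        - action: allow
          http:
            pathExact: /FakeService/Handle
        - action: allow
          http:
            # the gRPC server reflection API is served under this prefix
            pathPrefix: /grpc.reflection.
```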
grpcurl -plaintext -d '{}' currency.ingress.shipyard.run:18443 FakeService.Handle
Just to sanity check this, let's disable the Handle method and only enable reflection.
grpcurl -plaintext -d '{}' currency.ingress.shipyard.run:18443 FakeService.Handle
grpcurl -plaintext currency.ingress.shipyard.run:18443 list
By default, when you configure a service with the service mesh, it will expose some default metrics based on the configured service type. Let's take a look at an HTTP service.
Let's build a quick dashboard for the first service in our chain, the API.
Metrics are going to be specific to your service mesh, but any mesh that uses Envoy (Consul, Istio, Kong, and others) should produce metrics that look like this.
envoy_listener_http_downstream_rq_xx{local_cluster="api", envoy_http_conn_manager_prefix="public_listener"}
Let's create a new dashboard.
This metric counts the responses sent by the service. It is a counter, so we need to apply the rate function to it.
rate(envoy_listener_http_downstream_rq_xx{local_cluster="api", envoy_http_conn_manager_prefix="public_listener"}[$__rate_interval])
We can also report the upstream calls from the API to the Payments service
rate(envoy_cluster_external_upstream_rq{consul_source_service="api", envoy_cluster_name="payments"}[$__rate_interval])
Let's add a few more metrics; we can report the duration of requests and also show the duration of the upstream service calls.
histogram_quantile(0.5, rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_service="api"}[$__rate_interval]))
Now that we have the charts, let's see how they can be made generic
First we add a variable
envoy_cluster_upstream_rq_time_bucket
Then we add a Regex for the values we would like to extract
/consul_destination_service="([^"]*).*/
Then we can make the chart dynamic
rate(envoy_listener_http_downstream_rq_xx{local_cluster="$service", envoy_http_conn_manager_prefix="public_listener"}[$__rate_interval])
Now we have some basic metrics for our service, let's look at some common problems with deployment reliability that often occurs and see what we can do about them with service mesh.
First let's modify one of our payment service versions to fail intermittently
kubectl apply -f ./reliability/failing_v2.yaml
Let's see what we can do about this. Because the service mesh lets you control a software-defined network, you can apply patterns such as retries without needing to change the application code.
kubectl apply -f ./reliability/retry.yaml
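As a sketch, assuming Consul's ServiceRouter CRD is used for retries, retry.yaml might look something like this; the retry count is an illustrative value:

```yaml
# Hypothetical sketch of ./reliability/retry.yaml (Consul ServiceRouter)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  name: payments
spec:
  routes:
    - match:
        http:
          pathPrefix: /
      destination:
        service: payments
        numRetries: 3               # retry up to 3 times
        retryOnStatusCodes: [501]   # retry only the failing status
```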
What you can immediately see is that the 501s from the downstream requests have disappeared. You can also see the latency increase; the reason for this increase is the retries.
Now, since we have default metrics, let's see how we can add those retries to our chart
rate(envoy_cluster_retry_upstream_rq_xx{consul_source_service="$service", consul_destination_service="$upstream"}[$__rate_interval])
First let's remove that retry
k delete -f ./reliability/retry.yaml
Now let's see how we can isolate the failing version of the service so we can test it
k apply -f ./reliability/isolate.yaml
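A hedged sketch of how a resolver could isolate the versions, assuming Consul's ServiceResolver CRD and that each deployment tags its version in the service metadata:

```yaml
# Hypothetical sketch of ./reliability/isolate.yaml (Consul ServiceResolver)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: payments
spec:
  defaultSubset: v1   # send all traffic to v1 unless routed otherwise
  subsets:
    v1:
      filter: 'Service.Meta.version == 1'   # assumes a version tag in service metadata
    v2:
      filter: 'Service.Meta.version == 2'
```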
We can now see that only the v1 service is being hit, as the errors have disappeared.
How can we test this manually? What we can do is use the mesh to route to this specific version.
k apply -f ./reliability/isolate.yaml
Next we add some specific routing.
k apply -f ./reliability/router.yaml
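Since the curl commands below select a version with a query string, router.yaml presumably matches on that parameter. A sketch, assuming Consul's ServiceRouter CRD and the subset names v1/v2:

```yaml
# Hypothetical sketch of ./reliability/router.yaml (Consul ServiceRouter)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  name: payments
spec:
  routes:
    - match:
        http:
          queryParam:
            - name: version
              exact: "1"
      destination:
        service: payments
        serviceSubset: v1   # route ?version=1 to the v1 subset
    - match:
        http:
          queryParam:
            - name: version
              exact: "2"
      destination:
        service: payments
        serviceSubset: v2   # route ?version=2 to the v2 subset
```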
We can select the individual endpoints
curl "payments.ingress.shipyard.run:18080/?version=1"
curl "payments.ingress.shipyard.run:18080/?version=2"
But what if you want to control the traffic splitting? For example, you might want to do a canary deployment.
k apply -f ./reliability/splitter.yaml
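A sketch of what a canary-style split could look like, assuming Consul's ServiceSplitter CRD; the 90/10 weights are illustrative:

```yaml
# Hypothetical sketch of ./reliability/splitter.yaml (Consul ServiceSplitter)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: payments
spec:
  splits:
    - weight: 90          # most traffic stays on the stable subset
      serviceSubset: v1
    - weight: 10          # a small canary slice goes to the new subset
      serviceSubset: v2
```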
Let's put all of this together now and start to see how we can use these techniques to automate a deployment
First let's clean up
k delete -f ./reliability
We are using a release controller, which automatically creates the configuration you have just seen built by hand. Tools like Flagger, Argo, the release controller for Consul, and other amazing open source tools all follow this pattern.
First we create the release
k apply -f ./reliability/release.yaml
Then we create a new deployment
k apply -f ./reliability/working_v2.yaml
What if the deployment was broken, though? Well, first we can create our retry
k apply -f ./reliability/retry.yaml
Now let's apply our broken service
k apply -f ./reliability/failing_v2.yaml
You can see that the traffic is being split, but no errors are being raised to the end user.