Let's take a look at how the service mesh protects your applications.
We already have an application deployed, and by default the service mesh protects it.
Let's look at this example.
k apply -f ./security/basic.yaml
This is the most basic example; it allows any connection to reach the api service.
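As a rough sketch, assuming this demo uses Consul's ServiceIntentions CRD, basic.yaml might look something like the following; the resource contents are illustrative, not taken from the repo:

```yaml
# Hypothetical sketch of ./security/basic.yaml (Consul ServiceIntentions CRD)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: api
spec:
  destination:
    name: api        # the service being protected
  sources:
    - name: "*"      # any source service in the mesh
      action: allow  # L4 intention: allow all connections
```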
We can also enforce intentions based on L7 attributes; let's add the rest of the basic examples so that you can see things working.
First the payments service
k apply -f ./security/payments.yaml
Then the gRPC currency service
k apply -f ./security/currency.yaml
Let's examine how this works. I am going to grab a shell on a service-mesh-enabled pod.
If I curl the local service, you can see that everything works as expected: the local service makes an outbound call to the data plane, which redirects it to the proxy at the other end.
curl localhost:9090
If I try to make that call directly
curl payments.default.svc
it resolves correctly; however, if I try to go straight to the pod, it will fail.
Let me just disable that rule in the control plane.
You can see the request failing,
and if we add the rule back, it works again.
Because we are using a software-defined network, we can actually be quite smart about what can access our services.
Consider this example: what if we only want to allow access to certain paths? Let's apply the following configuration and see what happens.
k apply -f ./security/payments_deny.yaml
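To give a feel for what a path-based L7 intention looks like, here is a hedged sketch of what payments_deny.yaml could contain, assuming Consul's ServiceIntentions CRD; the /admin path is purely hypothetical:

```yaml
# Hypothetical sketch of ./security/payments_deny.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: payments
spec:
  destination:
    name: payments
  sources:
    - name: api
      permissions:
        - action: deny           # block this path only
          http:
            pathPrefix: /admin   # hypothetical path, not from the repo
        - action: allow          # allow everything else
          http:
            pathPrefix: /
```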
This also works with gRPC. Let's see it in action: the currency service is a gRPC service, so the RBAC looks like the following example.
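As a sketch, assuming Consul's ServiceIntentions CRD, where gRPC methods are matched as HTTP/2 paths, the configuration could look like this; the exact method path depends on the service's proto definition:

```yaml
# Hypothetical sketch of method-level gRPC RBAC for the currency service
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: currency
spec:
  destination:
    name: currency
  sources:
    - name: api
      permissions:
        - action: allow
          http:
            # gRPC calls are HTTP/2 requests to /<Service>/<Method>
            pathExact: /FakeService/Handle
```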
This allows the endpoint to be accessed, but grpcurl still fails, because grpcurl relies on the gRPC reflection API, which is not permitted.
grpcurl -plaintext -d '{}' currency.ingress.shipyard.run:18443 FakeService.Handle
To add this, we can enable access to the ServerReflection service.
k apply -f ./security/currency_with_reflection.yaml
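The reflection service is just another gRPC service, so the extra rule might look like this sketch (the permissions list is illustrative, not taken from the repo):

```yaml
# Hypothetical sketch of ./security/currency_with_reflection.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: currency
spec:
  destination:
    name: currency
  sources:
    - name: api
      permissions:
        - action: allow
          http:
            pathExact: /FakeService/Handle
        - action: allow
          http:
            # the gRPC server reflection API is served under this prefix
            pathPrefix: /grpc.reflection.
```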
grpcurl -plaintext -d '{}' currency.ingress.shipyard.run:18443 FakeService.Handle
Just to sanity check this, let's disable the Handle method and only enable reflection.
grpcurl -plaintext -d '{}' currency.ingress.shipyard.run:18443 FakeService.Handle
grpcurl -plaintext currency.ingress.shipyard.run:18443 list
By default, when you configure a service with the service mesh, it will expose some default metrics based on the configured service type. Let's take a look at an HTTP service.
Let's build a quick dashboard for the first service in our chain, the API.
Metrics are going to be specific to your service mesh, but any mesh that uses Envoy (Consul, Istio, Kong, and others) should produce metrics that look like this.
envoy_listener_http_downstream_rq_xx{local_cluster="api", envoy_http_conn_manager_prefix="public_listener"}
Let's create a new dashboard.
This metric counts the responses sent by the service. It is a counter, so we need to apply the rate function to it.
rate(envoy_listener_http_downstream_rq_xx{local_cluster="api", envoy_http_conn_manager_prefix="public_listener"}[$__rate_interval])
We can also report the upstream calls from the API to the Payments service
rate(envoy_cluster_external_upstream_rq{consul_source_service="api", envoy_cluster_name="payments"}[$__rate_interval])
Let's add a few more metrics; we can report the duration of requests and also show the duration of the upstream service calls.
histogram_quantile(0.5, rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_service="api"}[$__rate_interval]))
Now that we have the charts, let's see how they can be made generic
First we add a variable
envoy_cluster_upstream_rq_time_bucket
Then we add a Regex for the values we would like to extract
/consul_destination_service="([^"]*).*/
Then we can make the chart dynamic
rate(envoy_listener_http_downstream_rq_xx{local_cluster="$service", envoy_http_conn_manager_prefix="public_listener"}[$__rate_interval])
Now we have some basic metrics for our service, let's look at some common problems with deployment reliability that often occurs and see what we can do about them with service mesh.
First let's modify one of our payment service versions to fail intermittently
kubectl apply -f ./reliability/failing_v2.yaml
Let's see what we can do about this. Because the service mesh lets you control a software-defined network, you can apply patterns such as retries without needing to change the application code.
kubectl apply -f ./reliability/retry.yaml
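As a sketch, assuming Consul's ServiceRouter CRD is used for retries, retry.yaml might look something like this; the retry count is an illustrative value:

```yaml
# Hypothetical sketch of ./reliability/retry.yaml (Consul ServiceRouter)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  name: payments
spec:
  routes:
    - match:
        http:
          pathPrefix: /
      destination:
        service: payments
        numRetries: 3               # retry up to 3 times
        retryOnStatusCodes: [501]   # retry only the failing status
```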
What you can immediately see is that the 501s from the downstream requests have disappeared. You can also see the latency increase; the reason for this increase is the retries.
Now, since we have default metrics, let's see how we can add those retries to our chart
rate(envoy_cluster_retry_upstream_rq_xx{consul_source_service="$service", consul_destination_service="$upstream"}[$__rate_interval])
First let's remove that retry
k delete -f ./reliability/retry.yaml
Now let's see how we can isolate the failing version of the service so we can test it
k apply -f ./reliability/isolate.yaml
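A hedged sketch of how a resolver could isolate the versions, assuming Consul's ServiceResolver CRD and that each deployment tags its version in the service metadata:

```yaml
# Hypothetical sketch of ./reliability/isolate.yaml (Consul ServiceResolver)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: payments
spec:
  defaultSubset: v1   # send all traffic to v1 unless routed otherwise
  subsets:
    v1:
      filter: 'Service.Meta.version == 1'   # assumes a version tag in service metadata
    v2:
      filter: 'Service.Meta.version == 2'
```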
We can now see that only the v1 service is being hit, as the errors have disappeared.
How can we test this manually? What we can do is use the mesh to route to this specific version.
k apply -f ./reliability/isolate.yaml
Next we add some specific routing.
k apply -f ./reliability/router.yaml
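Since the curl commands below select a version with a query string, router.yaml presumably matches on that parameter. A sketch, assuming Consul's ServiceRouter CRD and the subset names v1/v2:

```yaml
# Hypothetical sketch of ./reliability/router.yaml (Consul ServiceRouter)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  name: payments
spec:
  routes:
    - match:
        http:
          queryParam:
            - name: version
              exact: "1"
      destination:
        service: payments
        serviceSubset: v1   # route ?version=1 to the v1 subset
    - match:
        http:
          queryParam:
            - name: version
              exact: "2"
      destination:
        service: payments
        serviceSubset: v2   # route ?version=2 to the v2 subset
```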
We can select the individual endpoints
curl "payments.ingress.shipyard.run:18080/?version=1"
curl "payments.ingress.shipyard.run:18080/?version=2"
But what if you want to control the traffic splitting? For example, you might want to do a canary deployment.
k apply -f ./reliability/splitter.yaml
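A sketch of what a canary-style split could look like, assuming Consul's ServiceSplitter CRD; the 90/10 weights are illustrative:

```yaml
# Hypothetical sketch of ./reliability/splitter.yaml (Consul ServiceSplitter)
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: payments
spec:
  splits:
    - weight: 90          # most traffic stays on the stable subset
      serviceSubset: v1
    - weight: 10          # a small canary slice goes to the new subset
      serviceSubset: v2
```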
Let's put all of this together now and start to see how we can use these techniques to automate a deployment
First let's clean up
k delete -f ./reliability
We are using a release controller, which automatically creates the configuration you have just seen built by hand. Tools like Flagger, Argo, the release controller for Consul, and other amazing open source tools all follow this pattern.
First we create the release
k apply -f ./reliability/release.yaml
Then we create a new deployment
k apply -f ./reliability/working_v2.yaml
What if the deployment was broken, though? Well, first we can create our retry
k apply -f ./reliability/retry.yaml
Now let's apply our broken service
k apply -f ./reliability/failing_v2.yaml
You can see that the traffic is being split, but no errors are being raised to the end user.