While running a multi-cluster Istio service mesh can increase capacity and reliability, it also introduces new operational concerns. Removing a cluster from the mesh, whether temporarily or permanently, requires special consideration.
The easiest way to disconnect one cluster's workloads from the rest of the mesh is to delete the remote secret that allows the control plane to access that cluster's API server.
There are a few downsides to this:
- Deleting the remote secret will drop the endpoints, but not open connections. You will need to validate that these connections have been fully drained before considering the cluster "out of rotation".
- The cluster's share of new connections will immediately drop to zero. If load suddenly shifts elsewhere, there could be service degradation or other unpredictable consequences.
Deleting the remote secret is fine when experimenting, but for production clusters it should be the last step when removing a cluster from rotation.
First, find and delete the remote secret.
```shell
$ kubectl --context "${CTX_CLUSTER1}" -n istio-system get secrets
NAME                            TYPE     DATA   AGE
...
istio-remote-secret-cluster-2   Opaque   1      16m
...
$ kubectl --context "${CTX_CLUSTER1}" -n istio-system delete secret istio-remote-secret-cluster-2
```
After doing this, you should no longer see endpoints from `cluster-2` on proxies in `cluster-1`. You can verify the endpoints using `istioctl` and compare them with the Pod IPs in the remote cluster as given by `kubectl`.
```shell
$ istioctl --context "${CTX_CLUSTER1}" \
    proxy-config ep \
    $(kubectl --context "${CTX_CLUSTER1}" -n sample get po -lapp=sleep -ojsonpath='{.items[0].metadata.name}').sample
```
Verify the output doesn't contain the IPs in the remote cluster.
```shell
$ kubectl --context "${CTX_CLUSTER2}" get pods -n sample -owide
```
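To make this check mechanical, the two outputs can be compared directly. A small sketch, assuming the same `sleep` pod, `sample` namespace, and context variables as above (the script itself is an illustration, not part of the official tooling):

```shell
# Flag any cluster-2 Pod IP that still appears in a cluster-1 proxy's
# endpoint list.
SLEEP_POD=$(kubectl --context "${CTX_CLUSTER1}" -n sample get po -lapp=sleep \
  -ojsonpath='{.items[0].metadata.name}')
for ip in $(kubectl --context "${CTX_CLUSTER2}" -n sample get pods \
    -ojsonpath='{.items[*].status.podIP}'); do
  if istioctl --context "${CTX_CLUSTER1}" proxy-config ep "${SLEEP_POD}.sample" | grep -q "${ip}"; then
    echo "stale endpoint from cluster-2: ${ip}"
  fi
done
```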
In a production environment, more care must be taken when removing a cluster. Rather than immediately moving all traffic to the other clusters, you should shift it over gradually. This can be done with a simple traffic-shifting rule using the transparently added label `topology.istio.io/cluster`.
The following rule is based on the apps in the multi-cluster verification step.
We can shift most of our traffic over, and eventually change the weights to 100 and 0 for `cluster-1` and `cluster-2`, respectively.
NOTE: The rule will need to be adjusted for, and applied to, all clusters other than the one you are removing.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - helloworld
  http:
  - route:
    - destination:
        host: helloworld
        subset: cluster-1
      weight: 80
    - destination:
        host: helloworld
        subset: cluster-2
      weight: 20
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: helloworld
spec:
  host: helloworld
  subsets:
  - name: cluster-1
    labels:
      topology.istio.io/cluster: cluster-1
  - name: cluster-2
    labels:
      topology.istio.io/cluster: cluster-2
```
At each percentage of traffic shifting, monitor metrics and alerts in case the new load on `cluster-1` causes issues.
Once it is confirmed that things are stable, advance the shifting percentage until `cluster-2` has a weight of 0.
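The step-by-step shift can also be scripted. A minimal sketch, assuming the `helloworld` VirtualService above lives in the `sample` namespace and that this is run against each remaining cluster; the weight schedule and pause duration are purely illustrative:

```shell
# Walk the weights from 80/20 to 100/0, pausing between steps to watch
# metrics and alerts. The JSON patch paths assume the route order used
# in the rule above (cluster-1 first, cluster-2 second).
for w in 80 90 95 100; do
  kubectl --context "${CTX_CLUSTER1}" -n sample patch virtualservice helloworld --type=json -p "[
    {\"op\": \"replace\", \"path\": \"/spec/http/0/route/0/weight\", \"value\": $w},
    {\"op\": \"replace\", \"path\": \"/spec/http/0/route/1/weight\", \"value\": $((100 - w))}
  ]"
  sleep 300  # stand-in for real monitoring before advancing
done
```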
TODO: MCS will be much better
There are two options for stopping traffic from leaving the cluster:
- Using traffic-shifting rules similar to the above.
- Changing the mesh config's `serviceSettings`
While traffic-shifting rules must be applied per-service, they're a bit safer because they won't suddenly shift all traffic originating from within the mesh to endpoints within the local cluster.
A benefit of the mesh-config-based approach is that it can be applied to all services in the cluster at once, with the setting below:
```yaml
serviceSettings:
- settings:
    clusterLocal: true
  hosts:
  - "*"
```
The `hosts` field accepts wildcards, allowing you to enforce cluster-local rules at the service, namespace, or cluster scope.
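For example, the same setting can be scoped more narrowly; the hostnames below are illustrative:

```yaml
serviceSettings:
- settings:
    clusterLocal: true
  hosts:
  - "helloworld.sample.svc.cluster.local"  # a single service
  - "*.sample.svc.cluster.local"           # every service in a namespace
  - "*"                                    # every service in the mesh
```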
Keep in mind that, for a primary-remote setup, this rule will affect all clusters that receive config from the primary cluster where the `serviceSettings` are applied.
Once you have confirmed that the remaining clusters are stable, it is safe to remove the remote secret. Removing it is still necessary because new services added during this "maintenance period" may not have traffic rules in place to keep them from contacting the removed cluster.
TODO: find metrics that allow us to ensure cross-cluster connections have actually closed; even after removing secrets/endpoints, long-lived TCP connections can stay open.
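Until such metrics are identified, one rough spot-check is to look for established sockets inside a workload pod. A sketch, assuming the pod's image includes `netstat` (many minimal images do not) and that `10.20.` is the remote cluster's Pod CIDR prefix, which is purely an assumption:

```shell
# List established TCP connections from the sleep pod in cluster-1 and
# grep for the remote cluster's Pod network range (substitute your own
# cluster-2 CIDR for 10.20.).
kubectl --context "${CTX_CLUSTER1}" -n sample exec deploy/sleep -c sleep -- \
  netstat -tn | grep ESTABLISHED | grep '10\.20\.'
```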